[Exploratory Data Analysis] How to perform univariate analysis

Let’s assume that we are about to start working on a new classification problem. Before we start, it’s always a good idea to perform a univariate analysis on the target variable. This is helpful because it’s going to give us insights into the distribution of the target. What if we discover that the dataset is imbalanced? We might need to follow a different approach in our remaining work.

The example that I’m going to work on uses the Stroke Prediction dataset (11 clinical features for predicting stroke events), which can be found in the usual place: Kaggle.

Let’s load the dataset that we have downloaded into our project:

import pandas as pd

df = pd.read_csv('healthcare-dataset-stroke-data.csv')

First, we need to check whether the target has any missing values:

null_values = df['stroke'].isnull().sum()
print(f'Null values: {null_values}')

and the result is ‘Null values: 0’.

Now it is time to find out how many unique values the target has and what they are:

num_of_unique_values = df['stroke'].nunique()
unique_values = df['stroke'].unique()

print(f'Number of unique values {num_of_unique_values}: {unique_values}')

And the result is: ‘Number of unique values 2: [1 0]’.

Okay, it seems that we are facing a binary classification problem. Still, it is crucial to find out how often each value occurs, both as a count and as a proportion of the whole:

distribution_frequency = df['stroke'].value_counts()
print(f'Distribution of unique values:\n{distribution_frequency}')

percentage_of_distribution_frequency = df['stroke'].value_counts()/len(df)
print(f'Percentage of distribution of unique values:\n{percentage_of_distribution_frequency}')

and the result is:

Distribution of unique values:
0 4861
1 249
Name: stroke, dtype: int64
Percentage of distribution of unique values:
0 0.951272
1 0.048728
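
As a side note, the same proportions can be obtained directly with `value_counts(normalize=True)`. The sketch below is only an illustration: instead of reading the CSV, it rebuilds a stand-in `target` series from the counts printed above, and the `imbalance_ratio` name is my own, not something from the dataset.

```python
import pandas as pd

# Stand-in for df['stroke'], rebuilt from the counts printed above
# (4861 zeros, 249 ones) so the snippet runs without the CSV file.
target = pd.Series([0] * 4861 + [1] * 249, name="stroke")

# normalize=True returns proportions instead of raw counts
proportions = target.value_counts(normalize=True)
print(proportions.round(6))

# How many majority-class rows exist for each minority-class row
counts = target.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f} to 1")
```

With the real data, replacing `target` with `df['stroke']` should give the same numbers.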

That means that class 1 represents only about 4.9% of the target, so the dataset is heavily imbalanced. Let’s confirm it with a graph:

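import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(8, 4))
# Draw on the axes created above instead of overwriting the variable
sns.countplot(y="stroke", data=df, ax=ax)
plt.show()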

It’s clear now: the target is heavily imbalanced. At this point, the univariate analysis of the target is complete. We know what to expect!
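
To make "what to expect" a bit more concrete, here is a minimal sketch of why the imbalance matters. It assumes a hypothetical "model" that always predicts the most frequent class, and it again uses a stand-in `target` series rebuilt from the counts reported earlier rather than the real file. None of this is part of the original analysis; it is just a quick sanity check.

```python
import pandas as pd

# Stand-in for df['stroke'], rebuilt from the counts reported earlier.
target = pd.Series([0] * 4861 + [1] * 249, name="stroke")

# A naive baseline that always predicts the most frequent class
majority_class = target.mode()[0]
predictions = pd.Series(majority_class, index=target.index)

# Share of rows where the baseline happens to be right
accuracy = (predictions == target).mean()

# Recall on class 1: share of actual strokes predicted as strokes
recall_positive = (predictions[target == 1] == 1).mean()

print(f"Baseline accuracy: {accuracy:.3f}")
print(f"Recall on class 1: {recall_positive:.3f}")
```

The baseline looks accurate while never detecting a single stroke, which is why plain accuracy is probably not the metric to rely on for this dataset.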

References: Tutorials by Prashant Banerjee on Kaggle