[Exploratory Data Analysis] How to perform multivariate analysis

Let me clarify from the beginning that this is not an extensive or in-depth guide to multivariate analysis. Fortunately, we can find many online resources on that. It is more like a beginners’ friendly guide. In our adventure, we will use the Palmer Archipelago (Antarctica) penguin data dataset that comes as a kind of alternative to the famous Iris dataset. You can find it here.

So, let’s load the dataset in a data frame, separate numerical and categorical values, as usual:

df = pd.read_csv('penguins.csv')

col_names = df.columns

print(f'\nColumn names {col_names}')

# split categorical and numerical
numerical = [var for var in df if df[var].dtypes != object]
print(f'\nNumerical features {numerical}')

categorical = [var for var in df if df[var].dtypes == object]
print(f'\nCategorical features {categorical}')

and the output looks like this:

Numerical features ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
Categorical features ['species', 'island', 'sex']

Let’s start by exploring the categorical features:

# Explore categorical values
for col in df[categorical]:
    print(f'In column {col} there are {df[col].nunique()} values: {df[col].unique()}')
    print(f'\nwith distributions\n{df[col].value_counts()}\n')

getting an output like:

In column species there are 3 values: ['Adelie' 'Gentoo' 'Chinstrap']
with distributions
Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64

In column island there are 3 values: ['Torgersen' 'Biscoe' 'Dream']
with distributions
Biscoe       168
Dream        124
Torgersen     52
Name: island, dtype: int64

In column sex there are 2 values: ['male' 'female' nan]
with distributions
male      168
female    165
Name: sex, dtype: int64

In feature ‘sex’, we can see that we have some nan values and that the feature is balanced since we have almost 50% male and 50% female penguins. It will be nice to fill these nan gaps by using the SimpleImputer from scikit-learn and, as a strategy, select the “most_frequent”, then replace missing using the most frequent value along the column.

imr_c = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imr_c = imr_c.fit(df[['sex']])
df['sex'] = imr_c.transform(df[['sex']])

Now, let’s check the numerical features and see if there are any missing values.

# check for missing values
missing_values = df.isnull().sum()
print(f'\nMissing values in the dataset:\n{missing_values}')

# check the percentage of the for missing values
print(f'\nMissing values in percentages')
for col in df:
    if df[col].isnull().mean() > 0:
        print(col, round(df[col].isnull().mean(), 4))

and we get:

Missing values in the dateset:
species              0
island               0
bill_length_mm       2
bill_depth_mm        2
flipper_length_mm    2
body_mass_g          2
sex                  0
dtype: int64

Missing values in percentages:
bill_length_mm 0.0058
bill_depth_mm 0.0058
flipper_length_mm 0.0058
body_mass_g 0.0058

okay, we see that in the four numerical columns, we miss 2 values in each. So, it is good to have a look at the distribution shape of each feature. If we have many outliers, we prefer to impute the media value; if not, mean is also a valid option.

Boxplots and histograms always give a nice view of the distributions:

# draw boxplots to visualize outliers
plt.figure(figsize=(15, 10))

plt.subplot(2, 2, 1)
fig = df.boxplot(column='bill_length_mm')

plt.subplot(2, 2, 2)
fig = df.boxplot(column='bill_depth_mm')

plt.subplot(2, 2, 3)
fig = df.boxplot(column='flipper_length_mm')

plt.subplot(2, 2, 4)
fig = df.boxplot(column='body_mass_g')

# plot histogram to check distribution
plt.figure(figsize=(15, 10))

plt.subplot(2, 2, 1)
fig = df['bill_length_mm'].hist(bins=10)

plt.subplot(2, 2, 2)
fig = df['bill_depth_mm'].hist(bins=10)

plt.subplot(2, 2, 3)
fig = df['flipper_length_mm'].hist(bins=10)

plt.subplot(2, 2, 4)
fig = df['body_mass_g'].hist(bins=10)


There are no outliers, nor the distributions are highly skewed, so mean is a nice strategy for imputation of missing values.

imr_n = SimpleImputer(missing_values=np.nan, strategy='median')
imr_n = imr_n.fit(df[numerical])
df[numerical] = imr_n.transform(df[numerical])

At this point, it seems that housekeeping is over. The correlation matrix gives us information about the linear relationship of the variables. Let’s have a look:

correlation = df.corr()
plt.figure(figsize=(16, 12))
plt.title('Correlation Heatmap')
ax = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', linecolor='white')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=30)

It’s clear the flipper_length_mm and the body_mass_g features are positively correlated with a coefficient equal to 0.87

The pair plots will make it even more clear:

sns.pairplot(df[numerical], kind='scatter', diag_kind='hist', palette='Rainbow')

Another approach is to experiment with Principal Components Analysis (PCA). You can read a very nice document regarding that in this blog. Let’s first scale the data using the StandardScaler from Scikit-Learn, like that (for the numerical features):

# scale features in the data before applying PCA
x = StandardScaler().fit_transform(df[['flipper_length_mm', 'bill_length_mm', 'bill_depth_mm', 'body_mass_g']])
x = pd.DataFrame(x)

print(f"\nStatistics after normalisation")

and yes the values prove that this happened:

Statistics before normalisation
       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
count           342.0          342.0              342.0        342.0
mean             44.0           17.0              201.0       4202.0
std               5.0            2.0               14.0        802.0
min              32.0           13.0              172.0       2700.0
25%              39.0           16.0              190.0       3550.0
50%              44.0           17.0              197.0       4050.0
75%              48.0           19.0              213.0       4750.0
max              60.0           22.0              231.0       6300.0 2

Statistics after normalisation
                  0             1             2             3
count  3.420000e+02  3.420000e+02  3.420000e+02  3.420000e+02
mean  -8.725963e-16  1.662088e-16  3.739699e-16  4.155221e-17
std    1.001465e+00  1.001465e+00  1.001465e+00  1.001465e+00
min   -2.059320e+00 -2.168526e+00 -2.054446e+00 -1.875362e+00
25%   -7.773731e-01 -8.615697e-01 -7.866355e-01 -8.138982e-01
50%   -2.788381e-01  9.686524e-02  7.547549e-02 -1.895079e-01
75%    8.606705e-01  8.397670e-01  7.854492e-01  6.846384e-01
max    2.142618e+00  2.875868e+00  2.205397e+00  2.620248e+00

At this point, we are in a position to understand what is going on with the importance of data. Using the cumulative explained variance graph we can identify the components that needed to describe the dataset based on the required variance.

pca = PCA().fit(x)
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
# plt.show()


and the variance ration of each component:

[0.68837221 0.19313603 0.09136061 0.02713115]

Hmm, that means that using 3 components, we can capture approximately 97% of the variance.


Posts from Shubhi Saxena and Prashant Banerjee