Let me clarify from the beginning that this is not an extensive or in-depth guide to multivariate analysis; there are plenty of online resources for that. It is more of a beginner-friendly walkthrough. In our adventure, we will use the Palmer Archipelago (Antarctica) penguin dataset, which serves as a popular alternative to the famous Iris dataset. You can find it here.
So, let's load the dataset into a data frame and separate the numerical and categorical features, as usual:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv('penguins.csv')
col_names = df.columns
print(f'\nColumn names {col_names}')

# split categorical and numerical features
numerical = [var for var in df if df[var].dtypes != object]
print(f'\nNumerical features {numerical}')
categorical = [var for var in df if df[var].dtypes == object]
print(f'\nCategorical features {categorical}')
and the output looks like this:
Numerical features ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
Categorical features ['species', 'island', 'sex']
Let’s start by exploring the categorical features:
# Explore categorical values
for col in df[categorical]:
    print(f'In column {col} there are {df[col].nunique()} values: {df[col].unique()}')
    print(f'\nwith distributions\n{df[col].value_counts()}\n')
getting an output like:
In column species there are 3 values: ['Adelie' 'Gentoo' 'Chinstrap']

with distributions
Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64

In column island there are 3 values: ['Torgersen' 'Biscoe' 'Dream']

with distributions
Biscoe       168
Dream        124
Torgersen     52
Name: island, dtype: int64

In column sex there are 2 values: ['male' 'female' nan]

with distributions
male      168
female    165
Name: sex, dtype: int64
In the feature 'sex', we can see that there are some NaN values and that the feature is balanced, since we have roughly 50% male and 50% female penguins. It makes sense to fill these gaps with scikit-learn's SimpleImputer, using the 'most_frequent' strategy, which replaces missing entries with the most frequent value in the column.
imr_c = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imr_c = imr_c.fit(df[['sex']])
# transform() returns a 2D array, so flatten it before assigning it back to the column
df['sex'] = imr_c.transform(df[['sex']]).ravel()
Now, let’s check the numerical features and see if there are any missing values.
# check for missing values
missing_values = df.isnull().sum()
print(f'\nMissing values in the dataset:\n{missing_values}')

# check the percentage of missing values per column
print(f'\nMissing values in percentages:')
for col in df:
    if df[col].isnull().mean() > 0:
        print(col, round(df[col].isnull().mean(), 4))
and we get:
Missing values in the dataset:
species              0
island               0
bill_length_mm       2
bill_depth_mm        2
flipper_length_mm    2
body_mass_g          2
sex                  0
dtype: int64

Missing values in percentages:
bill_length_mm 0.0058
bill_depth_mm 0.0058
flipper_length_mm 0.0058
body_mass_g 0.0058
Okay, we see that each of the four numerical columns is missing 2 values. So, it is worth having a look at the shape of each feature's distribution: if there are many outliers, we prefer to impute the median value; if not, the mean is also a valid option.
Boxplots and histograms always give a nice view of the distributions:
# draw boxplots to visualize outliers
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical, start=1):
    plt.subplot(2, 2, i)
    fig = df.boxplot(column=col)
    fig.set_title('')
    fig.set_ylabel(col)
plt.show()

# plot histograms to check the distributions
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical, start=1):
    plt.subplot(2, 2, i)
    fig = df[col].hist(bins=10)
    fig.set_xlabel(col)
plt.show()
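To back up the visual impression with a number, we can also check the sample skewness of each numerical feature (values close to zero suggest a roughly symmetric distribution):

# sample skewness per numerical feature; values near 0 mean little asymmetry
print(df[numerical].skew())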
There are no outliers, nor are the distributions highly skewed, so either the mean or the median is a reasonable imputation strategy; the code below uses the median.
imr_n = SimpleImputer(missing_values=np.nan, strategy='median')
imr_n = imr_n.fit(df[numerical])
df[numerical] = imr_n.transform(df[numerical])
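If we want, a one-line sanity check confirms that no missing values remain after the two imputation steps:

# the imputers should have removed every NaN in the data frame
print(df.isnull().sum().sum())  # expected: 0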
At this point, the housekeeping is done. The correlation matrix gives us information about the linear relationships between the variables. Let's have a look:
# correlation of the numerical features
correlation = df[numerical].corr()
plt.figure(figsize=(16, 12))
plt.title('Correlation Heatmap')
ax = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', linecolor='white')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=30)
plt.show()
It's clear that flipper_length_mm and body_mass_g are positively correlated, with a coefficient of 0.87.
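If we only need that single number rather than the whole heatmap, we can read it directly out of the correlation matrix computed above:

# pull the single coefficient out of the correlation matrix
print(correlation.loc['flipper_length_mm', 'body_mass_g'])  # about 0.87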
The pair plots make it even clearer:
sns.pairplot(df[numerical], kind='scatter', diag_kind='hist', palette='Rainbow')
plt.show()
Another approach is to experiment with Principal Component Analysis (PCA). You can read a very nice document about it in this blog. Let's first scale the numerical features using the StandardScaler from Scikit-Learn, like this:
# scale features in the data before applying PCA
x = StandardScaler().fit_transform(df[['flipper_length_mm', 'bill_length_mm',
                                        'bill_depth_mm', 'body_mass_g']])
x = pd.DataFrame(x)
print(f"\nStatistics after normalisation")
print(x.describe())
and comparing the summary statistics before and after scaling confirms that it worked:
Statistics before normalisation
       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
count           342.0          342.0              342.0        342.0
mean             44.0           17.0              201.0       4202.0
std               5.0            2.0               14.0        802.0
min              32.0           13.0              172.0       2700.0
25%              39.0           16.0              190.0       3550.0
50%              44.0           17.0              197.0       4050.0
75%              48.0           19.0              213.0       4750.0
max              60.0           22.0              231.0       6300.0

Statistics after normalisation
                  0             1             2             3
count  3.420000e+02  3.420000e+02  3.420000e+02  3.420000e+02
mean  -8.725963e-16  1.662088e-16  3.739699e-16  4.155221e-17
std    1.001465e+00  1.001465e+00  1.001465e+00  1.001465e+00
min   -2.059320e+00 -2.168526e+00 -2.054446e+00 -1.875362e+00
25%   -7.773731e-01 -8.615697e-01 -7.866355e-01 -8.138982e-01
50%   -2.788381e-01  9.686524e-02  7.547549e-02 -1.895079e-01
75%    8.606705e-01  8.397670e-01  7.854492e-01  6.846384e-01
max    2.142618e+00  2.875868e+00  2.205397e+00  2.620248e+00
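A small optional tweak, sketched below: the scaled frame loses the original column names (they show up as 0 to 3 above), so if we prefer to keep them we can pass them to the DataFrame constructor:

# keep the original column names in the scaled data frame
scaled_cols = ['flipper_length_mm', 'bill_length_mm', 'bill_depth_mm', 'body_mass_g']
x = pd.DataFrame(StandardScaler().fit_transform(df[scaled_cols]), columns=scaled_cols)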
At this point, we are in a position to see how much information each principal component carries. Using the cumulative explained variance graph, we can identify how many components are needed to describe the dataset at a required level of variance.
pca = PCA().fit(x)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
# plt.show()
plt.savefig('pca.png')
print(pca.explained_variance_ratio_)
and the explained variance ratio of each component:
[0.68837221 0.19313603 0.09136061 0.02713115]
Hmm, that means that using 3 components, we can capture approximately 97% of the variance.
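To put that to use, we can fit PCA with three components and project the scaled data onto them; a minimal sketch (the pca_3 and x_reduced names are just for illustration):

# keep the first 3 principal components and project the scaled data onto them
pca_3 = PCA(n_components=3).fit(x)
x_reduced = pca_3.transform(x)
print(x_reduced.shape)                        # (n_samples, 3)
print(pca_3.explained_variance_ratio_.sum())  # roughly 0.97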
References:
Posts from Shubhi Saxena and Prashant Banerjee