This notebook explores three approaches to clustering datasets with mixed data types.
It uses the penguins dataset from seaborn and applies different distance strategies and models. Final results are evaluated using Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) against the known target variable (penguin species).
Clustering mixed-type data (numeric + categorical) requires special handling of distance metrics, since standard clustering methods like KMeans assume a numeric, Euclidean space. Here we use Gower distance, or specialized models like k-prototypes that support categorical features directly.
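As a sanity check, Gower distance can be computed by hand for a tiny mixed-type frame: each numeric feature contributes a range-normalized absolute difference, each categorical feature a 0/1 mismatch, and the distance is the average across features. This is a minimal sketch with made-up values; the gower package used below performs the same computation internally.

```python
import numpy as np
import pandas as pd

# Toy mixed-type data (made-up values for illustration)
toy = pd.DataFrame({
    "body_mass_g": [3750.0, 3800.0, 5400.0],
    "sex": ["Male", "Female", "Male"],
})

def gower_distance(df, i, j, num_cols, cat_cols):
    """Average of per-feature distances between rows i and j."""
    parts = []
    for col in num_cols:
        rng = df[col].max() - df[col].min()  # normalize by feature range
        parts.append(abs(df[col].iloc[i] - df[col].iloc[j]) / rng)
    for col in cat_cols:
        parts.append(0.0 if df[col].iloc[i] == df[col].iloc[j] else 1.0)
    return float(np.mean(parts))

d01 = gower_distance(toy, 0, 1, ["body_mass_g"], ["sex"])  # close mass, different sex
d02 = gower_distance(toy, 0, 2, ["body_mass_g"], ["sex"])  # far mass, same sex
print(d01, d02)
```

Note that the categorical mismatch dominates d01 even though the masses are nearly identical; this equal-weighting of features is a property of Gower distance worth keeping in mind.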
We standardize the numeric columns before building the k-prototypes model. We don't standardize the data used to create the Gower matrix, since Gower distance normalizes numeric features internally (by their range).
This notebook does the following:
Prepares the data by dropping rows with NAs and separating numeric and categorical columns
Standardizes numerical values for k-prototypes clustering and creates a Gower matrix for DBSCAN and agglomerative clustering
Builds three clustering models:
k-prototypes (from kmodes)
DBSCAN (with Gower distance)
Agglomerative clustering (with Gower distance)
Compares results using ARI and NMI scores
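The evaluation step works the same way for every model: compare the predicted cluster labels against the known species labels. A minimal sketch with made-up label vectors — both ARI and NMI are invariant to label permutation, so the cluster IDs don't need to match the species encoding:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Made-up labels: true species vs. predicted clusters (IDs permuted)
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 2, 2, 0, 0]  # same partition, different IDs

ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
print(ari, nmi)  # both 1.0: perfect agreement despite permuted IDs
```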
Note: Using display for HTML tables
print(summarize(df)) and print(df.head()) print tables as plain text. To get nicer HTML-formatted tables, use display() instead of print():
from IPython.display import display

display(df.head())

# Display summary
display(summarize(df))
Import and Check Data
import seaborn as sns
import pandas as pd
from minieda import summarize  # pip install git+https://github.com/dbolotov/minieda.git
from pprint import pprint
from kmodes.kprototypes import KPrototypes
from sklearn.preprocessing import StandardScaler, LabelEncoder
import gower
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

pd.set_option("display.width", 220)  # set display width for printed tables

# Load dataset and display first few rows
df = sns.load_dataset("penguins")
print("----- First Few Rows of Data -----\n")
print(df.head())

# Display summary
print("\n----- Data Summary -----\n")
print(summarize(df))
----- First Few Rows of Data -----
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
----- Data Summary -----
dtype count unique unique_perc missing missing_perc zero zero_perc top freq mean std min 50% max skew
bill_length_mm float64 342 164 47.67 2 0.58 0 0.0 43.92 5.46 32.1 44.45 59.6 0.05
bill_depth_mm float64 342 80 23.26 2 0.58 0 0.0 17.15 1.97 13.1 17.3 21.5 -0.14
flipper_length_mm float64 342 55 15.99 2 0.58 0 0.0 200.92 14.06 172.0 197.0 231.0 0.35
body_mass_g float64 342 94 27.33 2 0.58 0 0.0 4201.75 801.95 2700.0 4050.0 6300.0 0.47
species object 344 3 0.87 0 0.00 0 0.0 Adelie 152
island object 344 3 0.87 0 0.00 0 0.0 Biscoe 168
sex object 333 2 0.58 11 3.20 0 0.0 Male 168
Transform Data
# Drop rows with missing values
df = df.dropna().reset_index(drop=True)

# Define categorical and numeric columns
cat_cols = ['island', 'sex']
num_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
features = cat_cols + num_cols

# Make an explicit copy to avoid chained assignment warnings
X = df[features].copy()

# Ensure categorical columns are strings
for col in cat_cols:
    X[col] = X[col].astype(str)

# For k-prototypes: scale numeric columns
X_kproto = X.copy()
scaler = StandardScaler()
X_kproto[num_cols] = scaler.fit_transform(X_kproto[num_cols])
X_kproto_matrix = X_kproto.to_numpy()
categorical_idx = [X_kproto.columns.get_loc(col) for col in cat_cols]

# For Gower distance: create Gower matrix using raw data
gower_matrix = gower.gower_matrix(X)