Clustering with Mixed-Type Tabular Data

This notebook explores three approaches to clustering datasets with mixed data types.

It uses the penguins dataset from seaborn and applies different distance strategies and models. Results are evaluated against the known target variable (penguin species) using the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI).

Clustering mixed-type data (e.g. numeric + categorical) requires special handling of distance metrics. Standard clustering methods like KMeans assume numeric features in Euclidean space. Here we use either Gower distance (with DBSCAN and agglomerative clustering) or k-prototypes, a specialized model that supports categorical features directly.
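To make the idea of a mixed-type distance concrete, here is a minimal from-scratch sketch of the textbook Gower distance between two records: range-scaled absolute differences for numeric features, simple mismatch for categorical features, averaged over all features. The helper name `gower_distance`, the two records, and the column ranges are made up for illustration. The `gower` package used in this notebook may differ in implementation details.

```python
def gower_distance(a, b, num_ranges, cat_keys):
    """Gower distance between two records given as dicts.

    num_ranges maps each numeric column to its (max - min) range over the dataset;
    cat_keys lists the categorical columns.
    """
    parts = []
    for col, rng in num_ranges.items():
        # Numeric feature: absolute difference scaled by the column's range
        parts.append(abs(a[col] - b[col]) / rng if rng else 0.0)
    for col in cat_keys:
        # Categorical feature: 0 if the values match, 1 otherwise
        parts.append(0.0 if a[col] == b[col] else 1.0)
    # Every feature contributes a value in [0, 1]; the distance is their mean
    return sum(parts) / len(parts)


# Two made-up penguin-like records (illustrative values, not rows from the dataset)
p1 = {"island": "Biscoe", "sex": "Male", "bill_length_mm": 40.0, "body_mass_g": 3500.0}
p2 = {"island": "Dream", "sex": "Male", "bill_length_mm": 50.0, "body_mass_g": 4500.0}

# Assumed column ranges (max - min); in practice these come from the full dataset
ranges = {"bill_length_mm": 20.0, "body_mass_g": 4000.0}

d = gower_distance(p1, p2, ranges, ["island", "sex"])
print(d)
```

Because each feature is mapped to the [0, 1] interval before averaging, a column measured in grams cannot outweigh a column measured in millimetres or a categorical mismatch.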

We standardize the numeric columns before building the k-prototypes model. The data used to create the Gower matrix is not standardized, because Gower distance handles scaling of numeric features internally.
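The sketch below illustrates why scaling matters for k-prototypes. It assumes a Huang-style dissimilarity (squared Euclidean distance on numeric features plus a weight `gamma` times the number of categorical mismatches). The helper names, the means and standard deviations, and the value of `gamma` are hypothetical and chosen only for illustration; they are not taken from the `kmodes` implementation or from the dataset.

```python
def kproto_dissimilarity(x_num, y_num, x_cat, y_cat, gamma):
    """Huang-style mixed dissimilarity: squared Euclidean + gamma * categorical mismatches."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    categorical_part = sum(a != b for a, b in zip(x_cat, y_cat))
    return numeric_part + gamma * categorical_part


# Illustrative values: (bill_length_mm, body_mass_g) and (island, sex)
raw_a, raw_b = (40.0, 3500.0), (41.0, 3600.0)
cat_a, cat_b = ("Biscoe", "Male"), ("Dream", "Female")

# Hypothetical column means and standard deviations, for illustration only
means, stds = (44.0, 4200.0), (5.0, 800.0)


def standardize(row):
    """Z-score each value using the assumed column mean and standard deviation."""
    return tuple((v - m) / s for v, m, s in zip(row, means, stds))


gamma = 1.0  # hypothetical weight on the categorical part

raw_d = kproto_dissimilarity(raw_a, raw_b, cat_a, cat_b, gamma)
scaled_d = kproto_dissimilarity(standardize(raw_a), standardize(raw_b), cat_a, cat_b, gamma)

print(raw_d)     # the squared body-mass difference dominates everything else
print(scaled_d)  # after scaling, the two categorical mismatches carry most of the weight
```

On the raw values, a 100 g difference in body mass contributes 10,000 to the dissimilarity, so the categorical columns are effectively ignored. After standardization the numeric and categorical parts are on comparable scales.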

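This notebook does the following: imports and checks the data, transforms it, builds three models (k-prototypes, DBSCAN, and agglomerative clustering), and evaluates them.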

Note: print(summarize(df)) and print(df.head()) render tables as plain text. To get nicer HTML-formatted tables in a notebook, use display() instead of print():

from IPython.display import display
display(df.head())

# Display summary
display(summarize(df))

Import and Check Data

import seaborn as sns
import pandas as pd
from minieda import summarize # pip install git+https://github.com/dbolotov/minieda.git
from pprint import pprint

from kmodes.kprototypes import KPrototypes
from sklearn.preprocessing import StandardScaler, LabelEncoder
import gower
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

pd.set_option("display.width", 220) # set display width for printed tables

# Load dataset and display first few rows
df = sns.load_dataset("penguins")

print("----- First Few Rows of Data -----\n")
print(df.head())

# Display summary
print("\n----- Data Summary -----\n")
print(summarize(df))
----- First Few Rows of Data -----

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female

----- Data Summary -----

                     dtype  count  unique  unique_perc  missing  missing_perc  zero  zero_perc     top freq     mean     std     min     50%     max  skew
bill_length_mm     float64    342     164        47.67        2          0.58     0        0.0                 43.92    5.46    32.1   44.45    59.6  0.05
bill_depth_mm      float64    342      80        23.26        2          0.58     0        0.0                 17.15    1.97    13.1    17.3    21.5 -0.14
flipper_length_mm  float64    342      55        15.99        2          0.58     0        0.0                200.92   14.06   172.0   197.0   231.0  0.35
body_mass_g        float64    342      94        27.33        2          0.58     0        0.0               4201.75  801.95  2700.0  4050.0  6300.0  0.47
species             object    344       3         0.87        0          0.00     0        0.0  Adelie  152                                               
island              object    344       3         0.87        0          0.00     0        0.0  Biscoe  168                                               
sex                 object    333       2         0.58       11          3.20     0        0.0    Male  168                                               

Transform Data

# Drop rows with missing values
df = df.dropna().reset_index(drop=True)

# Define categorical and numeric columns
cat_cols = ['island', 'sex']
num_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
features = cat_cols + num_cols

# Make an explicit copy to avoid chained assignment warnings
X = df[features].copy()

# Ensure categorical columns are strings
for col in cat_cols:
    X[col] = X[col].astype(str)

# For k-prototypes: scale numeric columns
X_kproto = X.copy()
scaler = StandardScaler()
X_kproto[num_cols] = scaler.fit_transform(X_kproto[num_cols])
X_kproto_matrix = X_kproto.to_numpy()
categorical_idx = [X_kproto.columns.get_loc(col) for col in cat_cols]

# For Gower distance: create Gower matrix using raw data
gower_matrix = gower.gower_matrix(X)

Build Models - K-Prototypes, DBSCAN, AgglomerativeClustering

# Copy base DataFrame for clustering
df_clustered = df.copy()

# Create a results dictionary
results = {}

# K-Prototypes
kproto = KPrototypes(n_clusters=3, init='Huang', verbose=0, random_state=42)
kproto_labels = kproto.fit_predict(X_kproto_matrix, categorical=categorical_idx)
df_clustered['kproto'] = kproto_labels

results['kproto'] = {
    'ARI': adjusted_rand_score(df_clustered["species"], kproto_labels),
    'NMI': normalized_mutual_info_score(df_clustered["species"], kproto_labels)
}

# DBSCAN with Gower matrix

# Optional: tune eps manually
# for eps in [0.05, 0.1, 0.15, 0.17]:
#     model = DBSCAN(eps=eps, min_samples=5, metric='precomputed')
#     labels = model.fit_predict(gower_matrix)
#     print(f"eps={eps:.2f} → clusters: {len(set(labels)) - (1 if -1 in labels else 0)}, noise: {(labels == -1).sum()}")

dbscan_model = DBSCAN(eps=0.15, min_samples=5, metric='precomputed')
dbscan_labels = dbscan_model.fit_predict(gower_matrix)
df_clustered['gower_dbscan'] = dbscan_labels

results['gower_dbscan'] = {
    'ARI': adjusted_rand_score(df_clustered["species"], dbscan_labels),
    'NMI': normalized_mutual_info_score(df_clustered["species"], dbscan_labels)
}

# Agglomerative Clustering with Gower
agglo_model = AgglomerativeClustering(n_clusters=3, metric='precomputed', linkage='average')
agglo_labels = agglo_model.fit_predict(gower_matrix)
df_clustered['gower_aggl'] = agglo_labels

results['gower_aggl'] = {
    'ARI': adjusted_rand_score(df_clustered["species"], agglo_labels),
    'NMI': normalized_mutual_info_score(df_clustered["species"], agglo_labels)
}
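Each model above is scored against species with ARI, which compares how pairs of points are grouped rather than the label values themselves. That is why arbitrary cluster IDs (0, 1, 2) can be compared directly with species names. The following is a from-scratch sketch of the standard ARI formula on made-up labels; the function name `adjusted_rand_index` and the toy labels are illustrative, and the notebook itself relies on scikit-learn's `adjusted_rand_score`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from pair counts (needs at least two samples)."""
    n = len(labels_true)
    cell_counts = Counter(zip(labels_true, labels_pred))  # contingency table cells
    true_counts = Counter(labels_true)                    # row sums
    pred_counts = Counter(labels_pred)                    # column sums

    # Number of sample pairs placed together in both, in truth, and in prediction
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Made-up labels: cluster IDs do not need to match the species names
species = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"]
same_grouping = [2, 2, 2, 0, 0, 0]  # same partition, different label values
one_mistake = [2, 2, 0, 0, 0, 0]    # one Adelie placed with the Gentoos

ari_same = adjusted_rand_index(species, same_grouping)
ari_mistake = adjusted_rand_index(species, one_mistake)
print(ari_same, ari_mistake)
```

A partition identical to the true one scores 1 regardless of how its clusters are numbered, and the score drops as points are misplaced.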

Evaluate

# Evaluation summary table
evaluation_df = pd.DataFrame(results).T
print("----- ARI and NMI Summary -----\n")
print(evaluation_df)

# Label-encode true species values (kept for optional confusion matrices; not used below)
le = LabelEncoder()
true_labels = le.fit_transform(df_clustered["species"])

# Print cluster counts for each model
print("\n----- Per-Model Cluster Counts -----")
for model_name in ['kproto', 'gower_dbscan', 'gower_aggl']:
    print(f"\n----- {model_name} -----\n")
    
    cluster_counts = df_clustered[model_name].value_counts()
    print(cluster_counts)
----- ARI and NMI Summary -----

                   ARI       NMI
kproto        0.733731  0.739709
gower_dbscan  0.212538  0.375291
gower_aggl    0.542704  0.606129

----- Per-Model Cluster Counts -----

----- kproto -----

kproto
0    124
2    119
1     90
Name: count, dtype: int64

----- gower_dbscan -----

gower_dbscan
3    83
2    80
5    62
4    61
1    24
0    23
Name: count, dtype: int64

----- gower_aggl -----

gower_aggl
0    119
1    107
2    107
Name: count, dtype: int64
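
The cluster counts above show cluster sizes but not which species ended up in which cluster. One way to see that is a contingency table of species against cluster labels. The sketch below uses a small made-up DataFrame (`toy`) as a stand-in for df_clustered, so the numbers are illustrative and are not results from this notebook.

```python
import pandas as pd

# Toy stand-in for df_clustered: true species plus one column of cluster labels
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "cluster": [0, 0, 1, 1, 1, 2, 2],
})

# Rows are true species, columns are cluster IDs, cells are counts
table = pd.crosstab(toy["species"], toy["cluster"])
print(table)
```

In this notebook, the equivalent call inside the loop would be pd.crosstab(df_clustered["species"], df_clustered[model_name]).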