Distance and Similarity Measures

This page is a quick reference for common distance and similarity measures used in machine learning, clustering, and data matching. Each entry includes a brief description, the supported data types, and links to the Wikipedia article and Python implementations. The list is not exhaustive.

Measures explained

Distance and similarity measures quantify how alike or different two objects are. They are often used in:

Clustering algorithms (e.g. K-Means, DBSCAN)
Nearest neighbor search (e.g. KNN)
Deduplication or record linkage
String comparison and vector similarity

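A distance increases as objects differ more and is typically zero for identical objects. A similarity increases as objects become more alike, and many similarity measures are bounded (for example, between 0 and 1).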
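The opposite directions of the two kinds of measure can be illustrated with a minimal sketch. The helper names and the 1 / (1 + d) conversion below are assumptions chosen for illustration; they are one common convention, not the only way to turn a distance into a similarity.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_similarity(d):
    """Map a non-negative distance to a similarity in (0, 1].

    Assumed convention: 1 / (1 + d). Identical objects (d = 0) get 1.0,
    and the value shrinks toward 0 as the distance grows.
    """
    return 1.0 / (1.0 + d)


origin = [0.0, 0.0]
near = [1.0, 0.0]
far = [3.0, 4.0]

for point in (origin, near, far):
    d = euclidean_distance(origin, point)
    print(point, "distance:", d, "similarity:", distance_to_similarity(d))
```

As the point moves away from the origin, the distance rises while the derived similarity falls.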

Distance Measures

Name	Data	Python	Description
Canberra	num	skl, scipy	Weighted version of Manhattan: each absolute difference is divided by the sum of the absolute values of the two coordinates. More sensitive to small differences when values are near zero.
Euclidean	num	skl, scipy	Straight-line (L2) distance between numeric vectors. Sensitive to scale.
Gower	mixed	gower	Handles mixed data types by averaging scaled feature-wise dissimilarities.
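Hamming	bool, cat, string	skl, scipy	Counts the number of positions at which two equal-length sequences differ. Works on binary vectors, categorical vectors, and strings. SciPy's implementation returns the proportion of differing positions rather than the raw count.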
Levenshtein	string	textdistance	Edit distance that counts insertions, deletions, and substitutions. Strings can be different lengths.
Mahalanobis	num	skl, scipy	Accounts for correlation and scale among variables. Useful for multivariate data.
Manhattan	num	skl, scipy	Sum of absolute differences (L1 norm). Also called city-block distance.
Squared Euclidean	num	skl, scipy	Euclidean distance without the square root. Preserves the ranking of Euclidean distances while being cheaper to compute. K-Means minimizes the sum of squared Euclidean distances to cluster centers.
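Several of the distances in the table can be sketched in plain Python to make the definitions concrete. The function names are illustrative and not taken from any of the listed libraries; in practice, scipy.spatial.distance offers cityblock, euclidean, sqeuclidean, canberra, and hamming. Mahalanobis and Gower are left out because they need a covariance estimate or per-feature scaling.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the squared Euclidean distance (L2)."""
    return math.sqrt(squared_euclidean(a, b))


def canberra(a, b):
    """Each absolute difference is scaled by |x| + |y|."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:  # assumed convention: a 0/0 term contributes nothing
            total += abs(x - y) / denom
    return total


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(s, t))


def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print("manhattan:", manhattan(a, b))
print("euclidean:", euclidean(a, b))
print("squared euclidean:", squared_euclidean(a, b))
print("canberra:", canberra(a, b))
print("hamming:", hamming("karolin", "kathrin"))
print("levenshtein:", levenshtein("kitten", "sitting"))
```

Note that the Hamming sketch returns a count, whereas SciPy's hamming returns the fraction of differing positions.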

Similarity Measures

Name	Data	Python	Description
Cosine Similarity	num	skl, scipy	Cosine of the angle between two vectors; ignores magnitude. Common for text embeddings. SciPy's cosine function returns the distance (1 minus the similarity).
Dice Coefficient	bool, cat, string	skl	Similar to Jaccard but counts shared elements twice, which gives more weight to matches. Used in fuzzy matching and bioinformatics.
Jaccard Index	bool, cat	skl, scipy	Ratio of the size of the intersection to the size of the union. Used for comparing binary vectors or sets.
Simple Matching Coefficient	bool, cat, string	kmodes	Proportion of element-wise matches between two equal-length vectors. Treats 0-0 and 1-1 matches equally. Can be used with binary, categorical, or string data when comparing character by character.
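The similarity measures above can be sketched in plain Python as follows. The function names are illustrative only, and the handling of empty inputs is an assumed convention rather than the behavior of any listed library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)


def dice_coefficient(a, b):
    """Twice the intersection size divided by the sum of the set sizes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention, as for Jaccard
    return 2 * len(a & b) / (len(a) + len(b))


def simple_matching(u, v):
    """Proportion of positions where two equal-length sequences agree."""
    if len(u) != len(v):
        raise ValueError("sequences must have equal length")
    if not u:
        raise ValueError("sequences must not be empty")
    return sum(x == y for x, y in zip(u, v)) / len(u)


print("cosine:", cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print("jaccard:", jaccard_index({1, 2, 3}, {2, 3, 4}))
print("dice:", dice_coefficient({1, 2, 3}, {2, 3, 4}))
print("simple matching:", simple_matching([1, 0, 1, 0], [1, 1, 1, 0]))
```

On the same pair of sets, Dice is never smaller than Jaccard, which reflects the extra weight it gives to shared elements.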
