Distance and Similarity Measures

This page is a quick reference for common distance and similarity measures used in machine learning, clustering, and data matching. Each entry includes a brief description, supported data types, and links to Wikipedia articles and Python implementations. In the tables, num, bool, and cat abbreviate numeric, boolean, and categorical data, and skl refers to scikit-learn. This list is not exhaustive.

Measures explained

Distance and similarity measures quantify how alike or different two objects are. They are often used in:

  • Clustering algorithms (e.g. K-Means, DBSCAN)
  • Nearest neighbor search (e.g. KNN)
  • Deduplication or record linkage
  • String comparison and vector similarity

A distance increases as objects differ more. A similarity increases as objects become more alike.
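
The relationship between the two can be illustrated with a minimal sketch in plain Python. The helper names and the conversion `1 / (1 + d)` are assumptions chosen for illustration; it is one common way to turn a distance into a similarity, not the only one.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_similarity(d):
    """Illustrative conversion: maps a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)


reference = [1.0, 2.0, 3.0]
near = [1.0, 2.0, 4.0]   # differs from the reference in one coordinate
far = [5.0, 6.0, 9.0]    # differs from the reference in every coordinate

d_near = euclidean_distance(reference, near)
d_far = euclidean_distance(reference, far)

# The distance grows as the objects differ more ...
print(d_near < d_far)
# ... while the derived similarity shrinks.
print(distance_to_similarity(d_near) > distance_to_similarity(d_far))
```

Both comparisons print `True`: the farther vector has the larger distance and the smaller similarity.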

Distance Measures

| Name | Data | Python | Description |
|---|---|---|---|
| Canberra | num | skl, scipy | Weighted version of Manhattan. More sensitive to small differences when values are near zero. |
| Euclidean | num | skl, scipy | Straight-line (L2) distance between numeric vectors. Sensitive to scale. |
| Gower | mixed | gower | Handles mixed data types by averaging scaled feature-wise dissimilarities. |
| Hamming | bool, cat, string | skl, scipy | Counts number of differing positions in element-wise comparison. Works on binary vectors and strings of equal length. |
| Levenshtein | string | textdistance | Edit distance that counts insertions, deletions, and substitutions. Strings can be different lengths. |
| Mahalanobis | num | skl, scipy | Accounts for correlation and scale among variables. Useful for multivariate data. |
| Manhattan | num | skl, scipy | Sum of absolute differences (L1 norm). Also called city-block distance. |
| Squared Euclidean | num | skl, scipy | Same as Euclidean but without the square root. Useful for comparing relative closeness, especially in K-Means. |
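
To make the definitions concrete, here is a minimal sketch of several of these distances in plain Python. The function names are illustrative and do not come from any of the listed libraries; in practice the library implementations (for example `scipy.spatial.distance.cityblock`, `euclidean`, `sqeuclidean`, `canberra`, and `hamming`) are the better choice. Gower and Mahalanobis are left out because they need per-feature ranges and a covariance matrix respectively.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(squared_euclidean(a, b))


def canberra(a, b):
    """Each absolute difference is divided by |x| + |y|.

    Terms where both values are zero are skipped (treated as 0).
    """
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom != 0:
            total += abs(x - y) / denom
    return total


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


a = [1, 2, 3]
b = [4, 0, 3]

print(manhattan(a, b))                   # 5
print(squared_euclidean(a, b))           # 13
print(euclidean(a, b))                   # square root of 13
print(canberra(a, b))                    # 3/5 + 2/2 + 0/6
print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```

Note that this `hamming` returns a count, matching the description in the table, whereas `scipy.spatial.distance.hamming` returns the proportion of differing positions. Because the square root is monotonic, ranking neighbors by squared Euclidean distance gives the same order as ranking by Euclidean distance.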

Similarity Measures

| Name | Data | Python | Description |
|---|---|---|---|
| Cosine Similarity | num | skl, scipy | Measures angle (not magnitude) between vectors. Common for text embeddings. |
| Dice Coefficient | bool, cat, string | skl | Similar to Jaccard but gives more weight to matches. Used in fuzzy matching and bioinformatics. |
| Jaccard Index | bool, cat | skl, scipy | Ratio of intersection to union. Used for comparing binary vectors or sets. |
| Simple Matching Coefficient | bool, cat, string | kmodes | Proportion of element-wise matches between two equal-length vectors. Treats 0-0 and 1-1 matches equally. Can be used with binary, categorical, or string data when comparing character by character. |
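
The similarity measures above can be sketched in plain Python as follows. The function names are illustrative, and returning 1.0 for two empty sets is a convention assumed here rather than something the definitions dictate.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two numeric vectors (undefined for all-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_index(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(union)


def dice_coefficient(set_a, set_b):
    """Twice the intersection size divided by the sum of the set sizes."""
    total = len(set_a) + len(set_b)
    if total == 0:
        return 1.0  # assumed convention, as above
    return 2 * len(set_a & set_b) / total


def simple_matching(a, b):
    """Share of positions where two equal-length sequences agree (0-0 counts too)."""
    if len(a) != len(b):
        raise ValueError("simple matching needs sequences of equal length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


# Same direction, different magnitude: cosine similarity is (up to rounding) 1.
print(cosine_similarity([1, 0, 1], [2, 0, 2]))

# Two binary vectors, also viewed as sets of the positions that hold a 1.
u = [1, 0, 0, 1, 0, 0]
v = [1, 0, 1, 1, 0, 0]
u_set = {i for i, bit in enumerate(u) if bit}
v_set = {i for i, bit in enumerate(v) if bit}

print(jaccard_index(u_set, v_set))     # 2 shared / 3 in union
print(dice_coefficient(u_set, v_set))  # 2*2 / (2 + 3)
print(simple_matching(u, v))           # 5 of 6 positions agree
```

On these vectors the Simple Matching Coefficient is highest because it also rewards the shared zeros, while Jaccard and Dice ignore them. When switching to SciPy, keep in mind that `scipy.spatial.distance.cosine`, `jaccard`, and `dice` return dissimilarities, that is, one minus the similarity shown here.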