Distance and Similarity Measures
This page is a quick reference for common distance and similarity measures used in machine learning, clustering, and data matching. Each entry includes a brief description, supported data types, and links to Wikipedia articles and Python implementations. In the tables, data types are abbreviated as num (numeric), bool (boolean), and cat (categorical), and skl refers to scikit-learn. This list is not exhaustive.
Measures Explained
Distance and similarity measures quantify how alike or different two objects are. They are often used in:
- Clustering algorithms (e.g. K-Means, DBSCAN)
- Nearest neighbor search (e.g. KNN)
- Deduplication or record linkage
- String comparison and vector similarity
A distance increases as objects differ more. A similarity increases as objects become more alike.
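The relationship between the two kinds of measure can be illustrated with a minimal pure-Python sketch. It assumes the normalized (proportion) form of the Hamming distance, and the function names `hamming_distance` and `simple_matching` are chosen here for illustration only. For equal-length sequences the two values always sum to 1, so one rises exactly as the other falls.

```python
def hamming_distance(a, b):
    """Proportion of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def simple_matching(a, b):
    """Proportion of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


u = [1, 0, 1, 1, 0, 0, 1, 0]
near = [1, 0, 1, 1, 0, 0, 1, 1]  # differs from u in 1 of 8 positions
far = [0, 1, 0, 1, 1, 1, 0, 0]   # differs from u in 6 of 8 positions

# The more the objects differ, the larger the distance and the smaller the similarity.
print(hamming_distance(u, near), simple_matching(u, near))  # 0.125 0.875
print(hamming_distance(u, far), simple_matching(u, far))    # 0.75 0.25
```

Not every distance has such a direct similarity counterpart; this pairing is just a convenient case where the conversion is exact.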
Distance Measures
Name | Data | Python | Description |
---|---|---|---|
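Canberra | num | skl, scipy | Weighted version of Manhattan in which each absolute difference is divided by the sum of the absolute values of the two coordinates. More sensitive to small differences when values are near zero. |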
Euclidean | num | skl, scipy | Straight-line (L2) distance between numeric vectors. Sensitive to scale. |
Gower | mixed | gower | Handles mixed data types by averaging scaled feature-wise dissimilarities. |
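Hamming | bool, cat, string | skl, scipy | Counts the number of positions at which two sequences differ in an element-wise comparison. Works on binary vectors and strings of equal length. Note that scipy returns the proportion of differing positions rather than the raw count. |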
Levenshtein | string | textdistance | Edit distance that counts insertions, deletions, and substitutions. Strings can be different lengths. |
Mahalanobis | num | skl, scipy | Accounts for correlation and scale among variables. Useful for multivariate data. |
Manhattan | num | skl, scipy | Sum of absolute differences (L1 norm). Also called city-block distance. |
Squared Euclidean | num | skl, scipy | Same as Euclidean but without the square root. Preserves the ordering of Euclidean distances, so it is useful for comparing relative closeness, especially in K-Means. |
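To make the definitions in the table concrete, here is a hedged pure-Python sketch of several of these distances. It is for illustration only and is not the implementation used by the libraries listed above, which are optimized and should be preferred in practice. The function names are chosen here. Treating a 0/0 term in Canberra as zero is an assumed convention, and Gower and Mahalanobis are omitted because they need feature scaling and a covariance matrix respectively.

```python
import math


def manhattan(u, v):
    """Sum of absolute coordinate differences (L1 norm)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def squared_euclidean(u, v):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def euclidean(u, v):
    """Straight-line (L2) distance: square root of the squared distance."""
    return math.sqrt(squared_euclidean(u, v))


def canberra(u, v):
    """Manhattan-like sum where each term is divided by |a| + |b|."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom:  # assumption: a 0/0 term contributes nothing
            total += abs(a - b) / denom
    return total


def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


u, v = [1, 2], [4, 6]
print(manhattan(u, v))          # 7
print(squared_euclidean(u, v))  # 25
print(euclidean(u, v))          # 5.0
print(canberra(u, v))           # approximately 1.1 (3/5 + 4/8)
print(levenshtein("kitten", "sitting"))  # 3
```

Because the square root is monotonic, ranking points by `squared_euclidean` gives the same order as ranking by `euclidean`, which is why the squared form is enough when only relative closeness matters.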
Similarity Measures
Name | Data | Python | Description |
---|---|---|---|
Cosine Similarity | num | skl, scipy | Measures angle (not magnitude) between vectors. Common for text embeddings. |
Dice Coefficient | bool, cat, string | skl | Similar to Jaccard but gives more weight to matches: twice the size of the intersection divided by the sum of the two set sizes. Used in fuzzy matching and bioinformatics. |
Jaccard Index | bool, cat | skl, scipy | Ratio of intersection to union. Used for comparing binary vectors or sets. |
Simple Matching Coefficient | bool, cat, string | kmodes | Proportion of element-wise matches between two equal-length vectors. Treats 0-0 and 1-1 matches equally. Can be used with binary, categorical, or string data when comparing character by character. |
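The similarity measures above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code of the listed libraries. The function names, the helper `ones`, and the choice to return 1.0 when both sets are empty are all assumptions made here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def dice_coefficient(a, b):
    """Twice the intersection size divided by the sum of the set sizes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def simple_matching(u, v):
    """Proportion of positions at which two equal-length vectors agree."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(x == y for x, y in zip(u, v)) / len(u)


def ones(bits):
    """Hypothetical helper: indices of the 1-entries of a binary vector."""
    return {i for i, bit in enumerate(bits) if bit}


print(cosine_similarity([1, 0], [0, 1]))           # 0.0 (orthogonal vectors)
print(jaccard_index({1, 2, 3}, {1, 2, 3, 4, 5}))   # 0.6
print(dice_coefficient({1, 2, 3}, {1, 2, 3, 4, 5}))  # 0.75

# 0-0 matches raise the simple matching coefficient but are ignored by Jaccard.
x = [1, 0, 0, 0, 1]
y = [1, 0, 0, 0, 0]
print(simple_matching(x, y))            # 0.8
print(jaccard_index(ones(x), ones(y)))  # 0.5
```

The last two lines show why the choice matters for sparse binary data: shared absences dominate the simple matching coefficient, while Jaccard looks only at positions where at least one vector has a 1.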