Data Science and Machine Learning Model Metrics
This page is a quick reference for common model evaluation metrics across tasks like classification, regression, and clustering. Each entry includes a definition, a link to an explanation (mostly from Wikipedia), and a link to relevant Python package documentation. This list is not meant to be exhaustive.
Note: This page focuses on performance metrics, not distance/similarity metrics (see the Distance and Similarity page).
Metrics explained
A metric is a number that summarizes model performance, typically how well predictions match actual outcomes.
- In supervised learning, metrics evaluate prediction quality (e.g. RMSE, F1).
- In unsupervised learning, they assess structure or similarity (e.g. silhouette score).
- During training, metrics guide choices like model selection and early stopping.
Classification
Binary
Metric | Python | Details |
---|---|---|
Accuracy | skl | Accuracy. Fraction of predictions that are correct. Simple and intuitive. Can be very misleading for imbalanced classes. |
Precision | skl | Precision. True Positives / (True Positives + False Positives). How many predicted positives are correct. |
Recall | skl | Recall. True Positives / (True Positives + False Negatives). How many actual positives were captured. |
F1 | skl | F1 Score. Harmonic mean of precision and recall. Good for imbalanced classes. |
ROC AUC | skl | ROC AUC. Area under the ROC curve. Evaluates ranking performance across thresholds. |
PR AUC | skl | PR AUC. Area under the Precision-Recall curve. Better than ROC AUC for rare positives. |
Log Loss | skl | Logarithmic Loss. Penalizes confident wrong predictions. Common in probabilistic classifiers. |
Balanced Acc | skl | Balanced Accuracy. Mean recall across classes. Helps with imbalanced classes. |
MCC | skl | Matthews Correlation Coefficient. Balanced score even with class imbalance. |
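A minimal sketch of how these can be computed with scikit-learn. The arrays below are made-up predictions, and `y_score` stands in for a model's predicted probability of the positive class:

```python
# Common binary classification metrics with scikit-learn (toy data).
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, log_loss,
    balanced_accuracy_score, matthews_corrcoef,
)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth labels
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard class predictions
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # predicted P(class = 1)

print("Accuracy:     ", accuracy_score(y_true, y_pred))
print("Precision:    ", precision_score(y_true, y_pred))
print("Recall:       ", recall_score(y_true, y_pred))
print("F1:           ", f1_score(y_true, y_pred))
print("ROC AUC:      ", roc_auc_score(y_true, y_score))          # needs scores, not labels
print("PR AUC:       ", average_precision_score(y_true, y_score))
print("Log loss:     ", log_loss(y_true, y_score))
print("Balanced acc: ", balanced_accuracy_score(y_true, y_pred))
print("MCC:          ", matthews_corrcoef(y_true, y_pred))
```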
Multiclass
Many classification metrics (like Precision, Recall, and F1) can be extended to multiclass problems by treating the data as a collection of binary classification tasks, one for each class, and then averaging over the results. Common averaging methods include:
- macro: unweighted average across classes
- micro: aggregates contributions of all classes (useful for class imbalance)
- weighted: like macro, but weighted by class support
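For illustration, here is how the averaging option changes a single metric call in scikit-learn, using toy labels for three classes:

```python
# Same F1 computation under different averaging strategies.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

print("macro:   ", f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print("micro:   ", f1_score(y_true, y_pred, average="micro"))     # global TP/FP/FN counts
print("weighted:", f1_score(y_true, y_pred, average="weighted"))  # mean weighted by class support
```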
Below are additional metrics or tools commonly used in multiclass settings:
Metric | Python | Details |
---|---|---|
Top-k Accuracy | skl | Top-k Accuracy. Fraction of times the true label is among the top k predicted classes. Useful when there are many classes or class labels are ambiguous. |
Cohen’s Kappa | skl | Cohen’s Kappa. Measures agreement between predicted and true labels, adjusted for chance. Range: -1 (complete disagreement) to 1 (perfect agreement). |
Averaging Methods | skl | Averaging (macro / micro / weighted). Affects how metrics like precision and recall are calculated across multiple classes. |
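A short sketch of the two metrics above with scikit-learn, using made-up scores for three classes:

```python
# Top-k accuracy and Cohen's kappa on toy multiclass data.
from sklearn.metrics import top_k_accuracy_score, cohen_kappa_score

y_true  = [0, 1, 2, 2]
# One row of class scores per sample (3 classes).
y_score = [[0.5, 0.3, 0.2],
           [0.2, 0.5, 0.3],
           [0.2, 0.3, 0.5],
           [0.5, 0.3, 0.2]]
y_pred  = [0, 1, 2, 0]   # argmax of each score row

print("Top-2 accuracy:", top_k_accuracy_score(y_true, y_score, k=2))
print("Cohen's kappa: ", cohen_kappa_score(y_true, y_pred))
```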
Clustering
External Metrics
These metrics evaluate how well the clusters match ground truth labels. Used when true labels are available.
Metric | Python | Details |
---|---|---|
ARI | skl | Adjusted Rand Index. Adjusted for chance. Values range from -1 to 1. Use to compare cluster labels with ground truth, even if labels are permuted. |
NMI | skl | Normalized Mutual Information. Measures shared information between clusters and labels. Ranges from 0 to 1. Good for imbalanced classes. |
FMI | skl | Fowlkes-Mallows Index. Geometric mean of precision and recall over pairwise cluster assignments. Range is 0 to 1. High when clusters and labels match well. |
Homogeneity | skl | Score is 1.0 when all clusters contain only members of a single class. Use when you want each cluster to be “pure.” |
Completeness | skl | Score is 1.0 when all members of a class are assigned to the same cluster. Use when you want to avoid splitting true classes. |
V-measure | skl | Harmonic mean of homogeneity and completeness. Range is 0 to 1. Balanced metric for external evaluation. |
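A minimal scikit-learn sketch, assuming you have ground-truth labels `labels_true` and cluster assignments `labels_pred` (cluster ids need not match class ids):

```python
# External clustering metrics: compare cluster assignments to known labels.
from sklearn.metrics import (
    adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score,
    homogeneity_score, completeness_score, v_measure_score,
)

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 0, 0, 0, 0]   # permuted, imperfect clustering

print("ARI:         ", adjusted_rand_score(labels_true, labels_pred))
print("NMI:         ", normalized_mutual_info_score(labels_true, labels_pred))
print("FMI:         ", fowlkes_mallows_score(labels_true, labels_pred))
print("Homogeneity: ", homogeneity_score(labels_true, labels_pred))
print("Completeness:", completeness_score(labels_true, labels_pred))
print("V-measure:   ", v_measure_score(labels_true, labels_pred))
```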
Internal Metrics
These metrics evaluate clustering structure without using ground truth labels. They rely on intra-cluster cohesion and inter-cluster separation to assess clustering quality.
Metric | Python | Details |
---|---|---|
Silhouette | skl | Silhouette Score. Measures how similar points are to their own cluster vs others. Range is -1 to 1. Higher is better. |
Davies–Bouldin | skl | Davies–Bouldin Index. Lower values indicate better clustering. Sensitive to cluster overlap. |
Calinski–Harabasz | skl | Calinski–Harabasz Index. Ratio of between-cluster dispersion to within-cluster dispersion. Higher is better. |
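A minimal sketch using scikit-learn; KMeans on synthetic blobs is only a stand-in for whatever clustering model you actually fit:

```python
# Internal clustering metrics: computed from the features X and cluster labels,
# no ground truth required.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score, davies_bouldin_score, calinski_harabasz_score,
)

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:       ", silhouette_score(X, labels))          # higher is better
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))      # lower is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))   # higher is better
```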
Regression
Metric | Python | Details |
---|---|---|
MSE | skl | Mean Squared Error. Average squared difference between predictions and true values. Penalizes larger errors more; sensitive to outliers. Not in original units. |
RMSE | skl | Root Mean Squared Error. Same as MSE but in the original unit scale; easier to interpret. Still sensitive to outliers. |
MAE | skl | Mean Absolute Error. Average absolute difference between predictions and actual values. More robust to outliers than MSE. |
R² | skl | Coefficient of Determination. Measures proportion of variance explained by the model. Can be negative. |
Adj R² | | Adjusted R². Like R² but penalizes for additional predictors. |
MSLE | skl | Mean Squared Log Error. MSE on log-transformed targets. Good for targets spanning orders of magnitude. |
MAPE | skl | Mean Absolute Percentage Error. Average of absolute percentage errors. Can blow up if targets are near zero. |
SMAPE | | Symmetric MAPE. Like MAPE but less sensitive to small denominators. Often used in time series. |
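A sketch of these with scikit-learn, plus hand-rolled formulas for RMSE, adjusted R², and one common SMAPE variant. The toy arrays and the `p = 2` predictor count are made up for illustration:

```python
# Regression metrics on toy data.
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_squared_log_error, mean_absolute_percentage_error,
)

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.5, 5.5, 2.0, 8.0, 4.5])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
n, p = len(y_true), 2   # p = number of predictors in the (hypothetical) model

print("MSE:   ", mse)
print("RMSE:  ", np.sqrt(mse))
print("MAE:   ", mean_absolute_error(y_true, y_pred))
print("R2:    ", r2)
print("Adj R2:", 1 - (1 - r2) * (n - 1) / (n - p - 1))
print("MSLE:  ", mean_squared_log_error(y_true, y_pred))
print("MAPE:  ", mean_absolute_percentage_error(y_true, y_pred))
# One common SMAPE variant: symmetric denominator (|y| + |y_hat|).
print("SMAPE: ", np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred))))
```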
Time Series Forecasting
Many standard regression metrics are also commonly used in forecasting, including RMSE, MAE, and MAPE (described above).
The metrics below are more specific to time series forecasting and help address issues like scale or seasonality.
Metric | Python | Details |
---|---|---|
MASE | darts, sktime | Mean Absolute Scaled Error. Scales absolute error using a naive seasonal forecast. Interpretable across datasets. Value of 1.0 means same accuracy as naive baseline. |
WAPE | darts | Weighted Absolute Percentage Error. Like MAPE but weighted by actual values. Less sensitive to small denominators. Often used in retail and demand forecasting. |
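A minimal NumPy sketch of the definitions; the darts/sktime implementations referenced above handle seasonality, multivariate series, and edge cases. The series below are made up, and the naive baseline uses lag 1 (use the seasonal period instead for seasonal data):

```python
import numpy as np

y_train = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0])  # in-sample history
y_true  = np.array([13.0, 15.0, 14.0])                     # actuals over the forecast horizon
y_pred  = np.array([12.0, 14.5, 15.0])                     # model forecast

# MASE: MAE of the forecast, scaled by the in-sample MAE of a naive
# one-step-ahead forecast. A value of 1.0 matches the naive baseline.
naive_mae = np.mean(np.abs(y_train[1:] - y_train[:-1]))
mase = np.mean(np.abs(y_true - y_pred)) / naive_mae

# WAPE: total absolute error divided by total actuals.
wape = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

print("MASE:", mase)
print("WAPE:", wape)
```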
Anomaly Detection - Tabular Data
Supervised (labels available)
When labels are available for anomalies, the problem becomes a binary classification task, and models can be evaluated with the same metrics used for binary classification (covered above): Precision, Recall, F1 Score, ROC AUC, PR AUC.
Below are additional metrics used in anomaly detection:
Metric | Python | Details |
---|---|---|
MCC | skl | Matthews Correlation Coefficient. Balanced score even for imbalanced classes. Range: -1 (total disagreement) to 0 (no better than random) to 1 (perfect agreement). |
Balanced Accuracy | skl | Balanced Accuracy. Average of recall for each class. |
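A small illustration of why these matter for rare anomalies: with 5% labeled anomalies, a detector that never flags anything still scores 0.95 on plain accuracy, while balanced accuracy and MCC expose it as having no skill:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, matthews_corrcoef

y_true = [0] * 95 + [1] * 5      # 5% labeled anomalies
y_pred = [0] * 100               # degenerate detector: never flags anything

print("Accuracy:         ", accuracy_score(y_true, y_pred))           # 0.95, looks great
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # 0.5, chance level
print("MCC:              ", matthews_corrcoef(y_true, y_pred))        # 0.0, no skill
```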
Unsupervised (no labels)
When no ground truth labels are available, you can’t directly compute classification metrics like precision or recall.
Instead, the focus shifts to methods that evaluate cluster structure, separation, or reconstruction error, depending on the model used. Here are some options:
- Silhouette Score: Measures how similar each point is to its own cluster vs others. Useful for density-based methods like DBSCAN.
- Reconstruction Error: Used in autoencoder-based anomaly detection. High reconstruction error may signal an anomaly.
- Distance from Cluster Centers: In k-means or GMMs, outliers may be far from centroids (see the sketch after this list).
- Top-n Scoring: Treat top N scored anomalies as flagged items, and assess with human validation or use domain-specific thresholds.
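A minimal sketch of the centroid-distance plus top-n idea, using KMeans on synthetic data; the number of flagged points `n` is an arbitrary placeholder for a domain-specific review budget:

```python
# Score each point by its distance to the nearest KMeans centroid and flag
# the n most distant points as anomaly candidates for human review.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# kmeans.transform(X) gives distances to every centroid; take the nearest one.
distances = kmeans.transform(X).min(axis=1)

n = 10                                  # review budget (domain-dependent)
flagged = np.argsort(distances)[-n:]    # indices of the top-n most distant points
print("Flagged indices:", flagged)
```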
Anomaly Detection - Time Series
In time series anomaly detection, evaluation includes not just individual points, but also how well the method identifies anomalous time windows.
Some of the metrics used include:
Metric | Python | Details |
---|---|---|
Precision / Recall / F1 (windowed) | | Window-based Precision / Recall / F1. Standard classification metrics applied over labeled anomaly windows instead of pointwise labels. Partial overlap typically counted as correct. |
NAB Score | nab | Numenta Anomaly Benchmark (NAB) Score. Measures early and accurate detection within labeled windows. Penalizes late and false detections. |
Time-aware F1 | | Time-aware F1 Score. Variant of F1 that allows some time tolerance around labeled anomalies. Useful when slight delay is acceptable. |
These metrics typically require:
- A ground truth label set with time ranges of anomalies
- Evaluation logic that accounts for early/late detection, duration, and false positives
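There is no single standard implementation of these window-based metrics; exact counting rules vary between benchmarks. As a rough sketch, window-based precision and recall can be computed by checking which labeled windows contain at least one detection, and which detections fall inside some window:

```python
# Window-based precision/recall sketch. Ground-truth anomalies are (start, end)
# index ranges; detections are time indices flagged by the detector. Any hit
# inside a window counts as a match (a simple convention, not a standard).
def windowed_precision_recall(windows, detections):
    hit_windows = sum(
        any(start <= d <= end for d in detections) for start, end in windows
    )
    true_dets = sum(
        any(start <= d <= end for start, end in windows) for d in detections
    )
    recall = hit_windows / len(windows) if windows else 0.0
    precision = true_dets / len(detections) if detections else 0.0
    return precision, recall

windows = [(10, 20), (50, 60)]     # labeled anomalous time ranges
detections = [12, 35, 55, 58]      # flagged time indices

precision, recall = windowed_precision_recall(windows, detections)
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(precision, recall, f1)       # 0.75, 1.0, ~0.857
```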