Data Science and Machine Learning Model Metrics

This page is a quick reference for common model evaluation metrics across tasks like classification, regression, and clustering. Each entry includes a definition, a link to an explanation (mostly from Wikipedia), and a link to the relevant Python package documentation. This list is not meant to be exhaustive.

Note: This page focuses on performance metrics, not distance/similarity metrics (see the Distance and Similarity page).

Metrics explained

A metric is a number that measures model performance, or how well predictions match actual outcomes.

  • In supervised learning, metrics evaluate prediction quality (e.g. RMSE, F1).
  • In unsupervised learning, they assess structure or similarity (e.g. silhouette score).
  • During training, metrics guide choices like model selection and early stopping.

Classification

Binary

| Metric | Python | Details |
|---|---|---|
| Accuracy | skl | Accuracy. Proportion of correct predictions out of all predictions. Simple and intuitive. Can be very misleading for imbalanced classes. |
| Precision | skl | Precision. True Positives / (True Positives + False Positives). How many predicted positives are correct. |
| Recall | skl | Recall. True Positives / (True Positives + False Negatives). How many actual positives were captured. |
| F1 | skl | F1 Score. Harmonic mean of precision and recall. Good for imbalanced classes. |
| ROC AUC | skl | ROC AUC. Area under the ROC curve. Evaluates ranking performance across thresholds. |
| PR AUC | skl | PR AUC. Area under the Precision-Recall curve. Better than ROC AUC for rare positives. |
| Log Loss | skl | Logarithmic Loss. Penalizes confident wrong predictions. Common in probabilistic classifiers. |
| Balanced Acc | skl | Balanced Accuracy. Mean recall across classes. Helps with imbalanced classes. |
| MCC | skl | Matthews Correlation Coefficient. Balanced score even with imbalanced classes. |
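
To make these concrete, here is a minimal scikit-learn sketch on toy labels (the arrays below are assumed for illustration; `average_precision_score` is used as the usual stand-in for PR AUC):

```python
# Minimal sketch: binary classification metrics with scikit-learn.
# y_true are ground-truth labels, y_pred are hard predictions,
# y_prob are predicted probabilities for the positive class.
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, log_loss,
    balanced_accuracy_score, matthews_corrcoef,
)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

print("accuracy:     ", accuracy_score(y_true, y_pred))
print("precision:    ", precision_score(y_true, y_pred))
print("recall:       ", recall_score(y_true, y_pred))
print("f1:           ", f1_score(y_true, y_pred))
print("roc auc:      ", roc_auc_score(y_true, y_prob))
print("pr auc (avg precision):", average_precision_score(y_true, y_prob))
print("log loss:     ", log_loss(y_true, y_prob))
print("balanced acc: ", balanced_accuracy_score(y_true, y_pred))
print("mcc:          ", matthews_corrcoef(y_true, y_pred))
```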

Multiclass

Many classification metrics (like Precision, Recall, and F1) can be extended to multiclass problems by treating the problem as a collection of one-vs-rest binary classification tasks, one per class, and then averaging the results. Common averaging methods include:

  • macro: unweighted average across classes
  • micro: aggregates contributions of all classes (useful for class imbalance)
  • weighted: like macro, but weighted by class support
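
A minimal sketch of how the averaging choice changes a multiclass score, using toy labels assumed for illustration:

```python
# Minimal sketch: macro / micro / weighted averaging for multiclass F1.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

for avg in ("macro", "micro", "weighted"):
    print(avg, f1_score(y_true, y_pred, average=avg))
```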

Below are additional metrics or tools commonly used in multiclass settings:

| Metric | Python | Details |
|---|---|---|
| Top-k Accuracy | skl | Top-k Accuracy. Fraction of times the true label is among the top k predicted classes. Useful when there are many classes or class labels are ambiguous. |
| Cohen’s Kappa | skl | Cohen’s Kappa. Measures agreement between predicted and true labels, adjusted for chance. Range: -1 (complete disagreement) to 1 (perfect agreement). |
| Averaging Methods | skl | Averaging (macro / micro / weighted). Affects how metrics like precision and recall are calculated across multiple classes. |
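
A minimal sketch of the multiclass metrics above, using toy labels and class scores assumed for illustration:

```python
# Minimal sketch: top-k accuracy and Cohen's kappa with scikit-learn.
from sklearn.metrics import top_k_accuracy_score, cohen_kappa_score

y_true = [0, 1, 2, 2]
# Predicted class scores (one column per class), assumed for illustration.
y_score = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.4, 0.4, 0.2],
]
print("top-2 accuracy:", top_k_accuracy_score(y_true, y_score, k=2))

y_pred = [0, 1, 2, 0]  # hard predictions (argmax of y_score)
print("cohen's kappa: ", cohen_kappa_score(y_true, y_pred))
```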

Clustering

External Metrics

These metrics evaluate how well the clusters match ground truth labels. Used when true labels are available.

| Metric | Python | Details |
|---|---|---|
| ARI | skl | Adjusted Rand Index. Adjusted for chance. Values range from -1 to 1. Use to compare cluster labels with ground truth, even if labels are permuted. |
| NMI | skl | Normalized Mutual Information. Measures shared information between clusters and labels. Ranges from 0 to 1. Good for imbalanced classes. |
| FMI | skl | Fowlkes-Mallows Index. Geometric mean of precision and recall over pairwise cluster assignments. Range is 0 to 1. High when clusters and labels match well. |
| Homogeneity | skl | Score is 1.0 when all clusters contain only members of a single class. Use when you want each cluster to be “pure.” |
| Completeness | skl | Score is 1.0 when all members of a class are assigned to the same cluster. Use when you want to avoid splitting true classes. |
| V-measure | skl | Harmonic mean of homogeneity and completeness. Range is 0 to 1. Balanced metric for external evaluation. |
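
A minimal sketch of the external metrics above, with toy ground-truth classes and cluster assignments assumed for illustration:

```python
# Minimal sketch: external clustering metrics with scikit-learn.
# labels_true are ground-truth classes, labels_pred are cluster assignments.
from sklearn.metrics import (
    adjusted_rand_score, normalized_mutual_info_score,
    fowlkes_mallows_score, homogeneity_score,
    completeness_score, v_measure_score,
)

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 0, 0, 0, 0]  # cluster ids need not match class ids

print("ARI:         ", adjusted_rand_score(labels_true, labels_pred))
print("NMI:         ", normalized_mutual_info_score(labels_true, labels_pred))
print("FMI:         ", fowlkes_mallows_score(labels_true, labels_pred))
print("Homogeneity: ", homogeneity_score(labels_true, labels_pred))
print("Completeness:", completeness_score(labels_true, labels_pred))
print("V-measure:   ", v_measure_score(labels_true, labels_pred))
```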

Internal Metrics

These metrics evaluate clustering structure without using ground truth labels. They rely on intra-cluster cohesion and inter-cluster separation to assess clustering quality.

| Metric | Python | Details |
|---|---|---|
| Silhouette | skl | Silhouette Score. Measures how similar points are to their own cluster vs others. Range is -1 to 1. Higher is better. |
| Davies–Bouldin | skl | Davies–Bouldin Index. Lower values indicate better clustering. Sensitive to cluster overlap. |
| Calinski–Harabasz | skl | Calinski–Harabasz Index. Ratio of between-cluster dispersion to within-cluster dispersion. Higher is better. |
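
A minimal sketch of the internal metrics above, clustering synthetic blob data with k-means (data and cluster count assumed for illustration):

```python
# Minimal sketch: internal clustering metrics on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score, davies_bouldin_score, calinski_harabasz_score,
)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:       ", silhouette_score(X, labels))
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))
```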

Regression

| Metric | Python | Details |
|---|---|---|
| MSE | skl | Mean Squared Error. Average squared difference between predictions and true values. Penalizes larger errors more; sensitive to outliers. Not in original units. |
| RMSE | skl | Root Mean Squared Error. Same as MSE but in the original unit scale; easier to interpret. Still sensitive to outliers. |
| MAE | skl | Mean Absolute Error. Average absolute difference between predictions and actual values. More robust to outliers than MSE. |
| R² | skl | Coefficient of Determination. Measures proportion of variance explained by the model. Can be negative. |
| Adj R² | | Adjusted R². Like R² but penalizes for additional predictors. |
| MSLE | skl | Mean Squared Log Error. MSE on log-transformed targets. Good for targets spanning orders of magnitude. |
| MAPE | skl | Mean Absolute Percentage Error. Average of absolute percentage errors. Can blow up if targets are near zero. |
| SMAPE | | Symmetric MAPE. Like MAPE but less sensitive to small denominators. Often used in time series. |
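
A minimal sketch of the regression metrics above on toy values. Adjusted R² and SMAPE have no scikit-learn built-ins, so they are computed by hand here; the number of predictors `p` and the SMAPE formulation are assumptions for illustration (several SMAPE variants exist):

```python
# Minimal sketch: common regression metrics, plus hand-rolled
# adjusted R^2 and SMAPE (no scikit-learn built-ins for those two).
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_squared_log_error, mean_absolute_percentage_error,
)

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 4.2])

mse = mean_squared_error(y_true, y_pred)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("MAE: ", mean_absolute_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print("R^2: ", r2)

# Adjusted R^2: penalize for the number of predictors p given n samples.
n, p = len(y_true), 2  # p = number of features, assumed for illustration
print("Adj R^2:", 1 - (1 - r2) * (n - 1) / (n - p - 1))

print("MSLE:", mean_squared_log_error(y_true, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))

# SMAPE: one common definition among several in use.
smape = np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))
print("SMAPE:", smape)
```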

Time Series Forecasting

Many standard regression metrics are also commonly used in forecasting, including RMSE, MAE, and MAPE (described above).

The metrics below are more specific to time series forecasting and help address issues like scale or seasonality.

| Metric | Python | Details |
|---|---|---|
| MASE | darts, sktime | Mean Absolute Scaled Error. Scales absolute error using a naive seasonal forecast. Interpretable across datasets. Value of 1.0 means same accuracy as naive baseline. |
| WAPE | darts | Weighted Absolute Percentage Error. Like MAPE but weighted by actual values. Less sensitive to small denominators. Often used in retail and demand forecasting. |
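
Because the darts and sktime APIs for these metrics differ slightly, here is a plain NumPy sketch of the underlying formulas, with toy series assumed for illustration:

```python
# Minimal NumPy sketch of MASE and WAPE.
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """MAE of the forecast scaled by the in-sample MAE of a
    seasonal-naive forecast with seasonal period m."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

def wape(y_true, y_pred):
    """Sum of absolute errors divided by sum of absolute actuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

y_train = [10, 12, 11, 13, 12, 14]   # history used to fit the model
y_true  = [13, 15, 14]               # held-out actuals
y_pred  = [14, 14, 15]               # model forecasts

print("MASE:", mase(y_true, y_pred, y_train))
print("WAPE:", wape(y_true, y_pred))
```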

Anomaly Detection - Tabular Data

Supervised (labels available)

When labels are available for anomalies, the problem becomes a binary classification task, and the model can be evaluated with many of the same metrics used for binary classification (covered above): Precision, Recall, F1 Score, ROC AUC, and PR AUC.

Below are additional metrics used in anomaly detection:

| Metric | Python | Details |
|---|---|---|
| MCC | skl | Matthews Correlation Coefficient. Balanced score even for imbalanced classes. Range: -1 (total disagreement) to 0 (no better than random) to 1 (perfect agreement). |
| Balanced Accuracy | skl | Balanced Accuracy. Average of recall for each class. |
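
A minimal sketch showing why plain accuracy misleads when anomalies are rare, using a toy detector that never flags anything:

```python
# Minimal sketch: with rare anomalies, accuracy looks great even for a
# useless detector, while balanced accuracy and MCC do not.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, matthews_corrcoef

y_true = [0] * 95 + [1] * 5          # 5% anomalies
y_pred = [0] * 100                   # detector that never flags anything

print("accuracy:         ", accuracy_score(y_true, y_pred))           # 0.95
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # 0.50
print("MCC:              ", matthews_corrcoef(y_true, y_pred))        # 0.0
```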

Unsupervised (no labels)

When no ground truth labels are available, you can’t directly compute classification metrics like precision or recall.

Instead, the focus shifts to methods that evaluate cluster structure, separation, or reconstruction error, depending on the model used. Here are some options:

  • Silhouette Score: Measures how similar each point is to its own cluster vs others. Useful for density-based methods like DBSCAN.
  • Reconstruction Error: Used in autoencoder-based anomaly detection. High reconstruction error may signal an anomaly.
  • Distance from Cluster Centers: In k-means or GMMs, outliers may be far from centroids.
  • Top-n Scoring: Treat the top N highest-scoring points as flagged anomalies, and assess them with human validation or domain-specific thresholds (see the sketch after this list).
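
A minimal sketch of top-n scoring with an Isolation Forest, on synthetic data with a few injected outliers (data, model choice, and N are assumptions for illustration):

```python
# Minimal sketch: top-n scoring with an Isolation Forest (no labels).
# score_samples returns higher values for "normal" points, so the lowest
# scores are the strongest anomaly candidates.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),      # bulk of the data
    rng.normal(6, 0.5, size=(5, 2)),      # a few injected outliers
])

model = IsolationForest(random_state=0).fit(X)
scores = model.score_samples(X)

n = 5
top_n_idx = np.argsort(scores)[:n]        # n lowest-scoring points
print("flagged indices:", top_n_idx)      # review manually or via domain rules
```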

Anomaly Detection - Time Series

In time series anomaly detection, evaluation looks not only at individual points but also at how well the method identifies anomalous time windows.

Some of the metrics used include:

| Metric | Python | Details |
|---|---|---|
| Precision / Recall / F1 (windowed) | | Window-based Precision / Recall / F1. Standard classification metrics applied over labeled anomaly windows instead of pointwise labels. Partial overlap is typically counted as correct. |
| NAB Score | nab | Numenta Anomaly Benchmark (NAB) Score. Measures early and accurate detection within labeled windows. Penalizes late and false detections. |
| Time-aware F1 | | Time-aware F1 Score. Variant of F1 that allows some time tolerance around labeled anomalies. Useful when a slight delay is acceptable. |

These metrics typically require:

  • A ground truth label set with time ranges of anomalies
  • Evaluation logic that accounts for early/late detection, duration, and false positives
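
As a concrete illustration of window-based evaluation, here is a minimal sketch of windowed recall, with toy anomaly windows and predicted timestamps assumed for illustration:

```python
# Minimal sketch of window-based recall: a labeled anomaly window counts
# as detected if at least one predicted anomaly timestamp falls inside it.

def windowed_recall(true_windows, pred_points):
    detected = sum(
        any(start <= t <= end for t in pred_points)
        for start, end in true_windows
    )
    return detected / len(true_windows)

true_windows = [(10, 20), (50, 60), (90, 95)]   # (start, end) of labeled anomalies
pred_points = [12, 55, 70]                      # timestamps flagged by the detector

print("windowed recall:", windowed_recall(true_windows, pred_points))  # 2 of 3 detected
```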