Data Science and Machine Learning Model Metrics

This page is a quick reference for common model evaluation metrics across tasks like classification, regression, and clustering. Each entry includes a definition, a link to an explanation (mostly from Wikipedia), and a link to the relevant Python package documentation. This list is not meant to be exhaustive.

Note: This page focuses on performance metrics, not distance/similarity metrics (see the Distance and Similarity page).

Metrics explained

A metric is a number that measures model performance, or how well predictions match actual outcomes.

  • In supervised learning, metrics evaluate prediction quality (e.g. RMSE, F1).
  • In unsupervised learning, they assess structure or similarity (e.g. silhouette score).
  • During training, metrics guide choices like model selection and early stopping.

Classification

Binary

| Metric | Python | Details |
|---|---|---|
| Accuracy | skl | Accuracy. Proportion of correct predictions out of all predictions. Simple and intuitive. Can be very misleading for imbalanced classes. |
| Precision | skl | Precision. True Positives / (True Positives + False Positives). How many predicted positives are correct. |
| Recall | skl | Recall. True Positives / (True Positives + False Negatives). How many actual positives were captured. |
| F1 | skl | F1 Score. Harmonic mean of precision and recall. Good for imbalanced classes. |
| ROC AUC | skl | ROC AUC. Area under the ROC curve. Evaluates ranking performance across thresholds. |
| PR AUC | skl | PR AUC. Area under the Precision-Recall curve. Better than ROC AUC for rare positives. |
| Log Loss | skl | Logarithmic Loss. Penalizes confident wrong predictions. Common in probabilistic classifiers. |
| Balanced Acc | skl | Balanced Accuracy. Mean recall across classes. Helps with imbalanced classes. |
| MCC | skl | Matthews Correlation Coefficient. Balanced score even with imbalanced classes. |
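
To make these concrete, here is a minimal scikit-learn sketch on toy labels (the arrays below are assumed for illustration; `average_precision_score` is used as the usual stand-in for PR AUC):

```python
# Minimal sketch: binary classification metrics with scikit-learn.
# y_true are ground-truth labels, y_pred are hard predictions,
# y_prob are predicted probabilities for the positive class.
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, log_loss,
    balanced_accuracy_score, matthews_corrcoef,
)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

print("accuracy:     ", accuracy_score(y_true, y_pred))
print("precision:    ", precision_score(y_true, y_pred))
print("recall:       ", recall_score(y_true, y_pred))
print("f1:           ", f1_score(y_true, y_pred))
print("roc auc:      ", roc_auc_score(y_true, y_prob))
print("pr auc (avg precision):", average_precision_score(y_true, y_prob))
print("log loss:     ", log_loss(y_true, y_prob))
print("balanced acc: ", balanced_accuracy_score(y_true, y_pred))
print("mcc:          ", matthews_corrcoef(y_true, y_pred))
```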

Multiclass

Many classification metrics (like Precision, Recall, and F1) can be extended to multiclass problems by treating the problem as a collection of one-vs-rest binary classification tasks, one per class, and then averaging the results. Common averaging methods include:

  • macro: unweighted average across classes
  • micro: aggregates contributions of all classes (useful for class imbalance)
  • weighted: like macro, but weighted by class support
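
A minimal sketch of how the averaging choice changes a multiclass score, using toy labels assumed for illustration:

```python
# Minimal sketch: macro / micro / weighted averaging for multiclass F1.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

for avg in ("macro", "micro", "weighted"):
    print(avg, f1_score(y_true, y_pred, average=avg))
```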

Below are additional metrics or tools commonly used in multiclass settings:

| Metric | Python | Details |
|---|---|---|
| Top-k Accuracy | skl | Top-k Accuracy. Fraction of times the true label is among the top k predicted classes. Useful when there are many classes or class labels are ambiguous. |
| Cohen’s Kappa | skl | Cohen’s Kappa. Measures agreement between predicted and true labels, adjusted for chance. Range: -1 (complete disagreement) to 1 (perfect agreement). |
| Averaging Methods | skl | Averaging (macro / micro / weighted). Affects how metrics like precision and recall are calculated across multiple classes. |
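
A minimal sketch of the multiclass metrics above, using toy labels and class scores assumed for illustration:

```python
# Minimal sketch: top-k accuracy and Cohen's kappa with scikit-learn.
from sklearn.metrics import top_k_accuracy_score, cohen_kappa_score

y_true = [0, 1, 2, 2]
# Predicted class scores (one column per class), assumed for illustration.
y_score = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.4, 0.4, 0.2],
]
print("top-2 accuracy:", top_k_accuracy_score(y_true, y_score, k=2))

y_pred = [0, 1, 2, 0]  # hard predictions (argmax of y_score)
print("cohen's kappa: ", cohen_kappa_score(y_true, y_pred))
```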

Clustering

External Metrics

These metrics evaluate how well the clusters match ground truth labels. Used when true labels are available.

| Metric | Python | Details |
|---|---|---|
| ARI | skl | Adjusted Rand Index. Adjusted for chance. Values range from -1 to 1. Use to compare cluster labels with ground truth, even if labels are permuted. |
| NMI | skl | Normalized Mutual Information. Measures shared information between clusters and labels. Ranges from 0 to 1. Good for imbalanced classes. |
| FMI | skl | Fowlkes-Mallows Index. Geometric mean of precision and recall over pairwise cluster assignments. Range is 0 to 1. High when clusters and labels match well. |
| Homogeneity | skl | Score is 1.0 when all clusters contain only members of a single class. Use when you want each cluster to be “pure.” |
| Completeness | skl | Score is 1.0 when all members of a class are assigned to the same cluster. Use when you want to avoid splitting true classes. |
| V-measure | skl | Harmonic mean of homogeneity and completeness. Range is 0 to 1. Balanced metric for external evaluation. |
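
A minimal sketch of the external metrics above, with toy ground-truth classes and cluster assignments assumed for illustration:

```python
# Minimal sketch: external clustering metrics with scikit-learn.
# labels_true are ground-truth classes, labels_pred are cluster assignments.
from sklearn.metrics import (
    adjusted_rand_score, normalized_mutual_info_score,
    fowlkes_mallows_score, homogeneity_score,
    completeness_score, v_measure_score,
)

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 0, 0, 0, 0]  # cluster ids need not match class ids

print("ARI:         ", adjusted_rand_score(labels_true, labels_pred))
print("NMI:         ", normalized_mutual_info_score(labels_true, labels_pred))
print("FMI:         ", fowlkes_mallows_score(labels_true, labels_pred))
print("Homogeneity: ", homogeneity_score(labels_true, labels_pred))
print("Completeness:", completeness_score(labels_true, labels_pred))
print("V-measure:   ", v_measure_score(labels_true, labels_pred))
```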

Internal Metrics

These metrics evaluate clustering structure without using ground truth labels. They rely on intra-cluster cohesion and inter-cluster separation to assess clustering quality.

| Metric | Python | Details |
|---|---|---|
| Silhouette | skl | Silhouette Score. Measures how similar points are to their own cluster vs others. Range is -1 to 1. Higher is better. |
| Davies–Bouldin | skl | Davies–Bouldin Index. Lower values indicate better clustering. Sensitive to cluster overlap. |
| Calinski–Harabasz | skl | Calinski–Harabasz Index. Ratio of between-cluster dispersion to within-cluster dispersion. Higher is better. |
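
A minimal sketch of the internal metrics above, clustering synthetic blob data with k-means (data and cluster count assumed for illustration):

```python
# Minimal sketch: internal clustering metrics on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score, davies_bouldin_score, calinski_harabasz_score,
)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:       ", silhouette_score(X, labels))
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))
```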

Regression

| Metric | Python | Details |
|---|---|---|
| MSE | skl | Mean Squared Error. Average squared difference between predictions and true values. Penalizes larger errors more; sensitive to outliers. Not in original units. |
| RMSE | skl | Root Mean Squared Error. Same as MSE but in the original unit scale; easier to interpret. Still sensitive to outliers. |
| MAE | skl | Mean Absolute Error. Average absolute difference between predictions and actual values. More robust to outliers than MSE. |
| R² | skl | Coefficient of Determination. Measures proportion of variance explained by the model. Can be negative. |
| Adj R² | | Adjusted R². Like R² but penalizes for additional predictors. |
| MSLE | skl | Mean Squared Log Error. MSE on log-transformed targets. Good for targets spanning orders of magnitude. |
| MAPE | skl | Mean Absolute Percentage Error. Average of absolute percentage errors. Can blow up if targets are near zero. |
| SMAPE | | Symmetric MAPE. Like MAPE but less sensitive to small denominators. Often used in time series. |
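
A minimal sketch of the regression metrics above on toy values. Adjusted R² and SMAPE have no scikit-learn built-ins, so they are computed by hand here; the number of predictors `p` and the SMAPE formulation are assumptions for illustration (several SMAPE variants exist):

```python
# Minimal sketch: common regression metrics, plus hand-rolled
# adjusted R^2 and SMAPE (no scikit-learn built-ins for those two).
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_squared_log_error, mean_absolute_percentage_error,
)

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 4.2])

mse = mean_squared_error(y_true, y_pred)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("MAE: ", mean_absolute_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print("R^2: ", r2)

# Adjusted R^2: penalize for the number of predictors p given n samples.
n, p = len(y_true), 2  # p = number of features, assumed for illustration
print("Adj R^2:", 1 - (1 - r2) * (n - 1) / (n - p - 1))

print("MSLE:", mean_squared_log_error(y_true, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))

# SMAPE: one common definition among several in use.
smape = np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))
print("SMAPE:", smape)
```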

Time Series Forecasting

Many standard regression metrics are also commonly used in forecasting, including RMSE, MAE, and MAPE (described above).

The metrics below are more specific to time series forecasting and help address issues like scale or seasonality.

| Metric | Python | Details |
|---|---|---|
| MASE | darts, sktime | Mean Absolute Scaled Error. Scales absolute error using a naive seasonal forecast. Interpretable across datasets. Value of 1.0 means same accuracy as naive baseline. |
| WAPE | darts | Weighted Absolute Percentage Error. Like MAPE but weighted by actual values. Less sensitive to small denominators. Often used in retail and demand forecasting. |
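
Because the darts and sktime APIs for these metrics differ slightly, here is a plain NumPy sketch of the underlying formulas, with toy series assumed for illustration:

```python
# Minimal NumPy sketch of MASE and WAPE.
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """MAE of the forecast scaled by the in-sample MAE of a
    seasonal-naive forecast with seasonal period m."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

def wape(y_true, y_pred):
    """Sum of absolute errors divided by sum of absolute actuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

y_train = [10, 12, 11, 13, 12, 14]   # history used to fit the model
y_true  = [13, 15, 14]               # held-out actuals
y_pred  = [14, 14, 15]               # model forecasts

print("MASE:", mase(y_true, y_pred, y_train))
print("WAPE:", wape(y_true, y_pred))
```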

Anomaly Detection - Tabular Data

Supervised (labels available)

When labels are available for anomalies, the problem becomes a binary classification task, and the model can be evaluated with many of the same metrics used for binary classification (covered above): Precision, Recall, F1 Score, ROC AUC, and PR AUC.

Below are additional metrics used in anomaly detection:

| Metric | Python | Details |
|---|---|---|
| MCC | skl | Matthews Correlation Coefficient. Balanced score even for imbalanced classes. Range: -1 (total disagreement) to 0 (no better than random) to 1 (perfect agreement). |
| Balanced Accuracy | skl | Balanced Accuracy. Average of recall for each class. |
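
A minimal sketch showing why plain accuracy misleads when anomalies are rare, using a toy detector that never flags anything:

```python
# Minimal sketch: with rare anomalies, accuracy looks great even for a
# useless detector, while balanced accuracy and MCC do not.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, matthews_corrcoef

y_true = [0] * 95 + [1] * 5          # 5% anomalies
y_pred = [0] * 100                   # detector that never flags anything

print("accuracy:         ", accuracy_score(y_true, y_pred))           # 0.95
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # 0.50
print("MCC:              ", matthews_corrcoef(y_true, y_pred))        # 0.0
```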

Unsupervised (no labels)

When no ground truth labels are available, you can’t directly compute classification metrics like precision or recall.

Instead, the focus shifts to methods that evaluate cluster structure, separation, or reconstruction error, depending on the model used. Here are some options:

  • Silhouette Score: Measures how similar each point is to its own cluster vs others. Useful for density-based methods like DBSCAN.
  • Reconstruction Error: Used in autoencoder-based anomaly detection. High reconstruction error may signal an anomaly.
  • Distance from Cluster Centers: In k-means or GMMs, outliers may be far from centroids.
  • Top-n Scoring: Treat the top N highest-scoring points as flagged anomalies, and assess them with human validation or domain-specific thresholds (see the sketch after this list).
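
A minimal sketch of top-n scoring with an Isolation Forest, on synthetic data with a few injected outliers (data, model choice, and N are assumptions for illustration):

```python
# Minimal sketch: top-n scoring with an Isolation Forest (no labels).
# score_samples returns higher values for "normal" points, so the lowest
# scores are the strongest anomaly candidates.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),      # bulk of the data
    rng.normal(6, 0.5, size=(5, 2)),      # a few injected outliers
])

model = IsolationForest(random_state=0).fit(X)
scores = model.score_samples(X)

n = 5
top_n_idx = np.argsort(scores)[:n]        # n lowest-scoring points
print("flagged indices:", top_n_idx)      # review manually or via domain rules
```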

Anomaly Detection - Time Series

In time series anomaly detection, evaluation looks not only at individual points but also at how well the method identifies anomalous time windows.

Some of the metrics used include:

| Metric | Python | Details |
|---|---|---|
| Precision / Recall / F1 (windowed) | | Window-based Precision / Recall / F1. Standard classification metrics applied over labeled anomaly windows instead of pointwise labels. Partial overlap is typically counted as correct. |
| NAB Score | nab | Numenta Anomaly Benchmark (NAB) Score. Measures early and accurate detection within labeled windows. Penalizes late and false detections. |
| Time-aware F1 | | Time-aware F1 Score. Variant of F1 that allows some time tolerance around labeled anomalies. Useful when a slight delay is acceptable. |

These metrics typically require:

  • A ground truth label set with time ranges of anomalies
  • Evaluation logic that accounts for early/late detection, duration, and false positives
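
As a concrete illustration of window-based evaluation, here is a minimal sketch of windowed recall, with toy anomaly windows and predicted timestamps assumed for illustration:

```python
# Minimal sketch of window-based recall: a labeled anomaly window counts
# as detected if at least one predicted anomaly timestamp falls inside it.

def windowed_recall(true_windows, pred_points):
    detected = sum(
        any(start <= t <= end for t in pred_points)
        for start, end in true_windows
    )
    return detected / len(true_windows)

true_windows = [(10, 20), (50, 60), (90, 95)]   # (start, end) of labeled anomalies
pred_points = [12, 55, 70]                      # timestamps flagged by the detector

print("windowed recall:", windowed_recall(true_windows, pred_points))  # 2 of 3 detected
```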