Common Problems in Data Science

Data

Problem Solution
Not enough data Use data augmentation or synthetic sampling (e.g. SMOTE, SDV)
Data not representative of the distribution Reassess how data was collected; consider stratified sampling
Imbalanced classes Try resampling, adjusting class weights, or adding synthetic data with SDV
Too much data (examples) Subsample or use mini-batch training; profile before full-scale training
Too many features / high dimensionality Apply feature selection or dimensionality reduction (e.g. PCA)
Data has extreme values, outliers, or anomalies Use robust statistics, or find such values using outlier/anomaly detection methods. Consider removing examples.
Data may have been faked Check for duplicate rows, unnatural distributions, and value repetition
Data leakage Review data sources and pipeline; make sure target isn’t leaking into features

Modeling

Problem Solution
Model performs well on training, terrible on test (overfitting) Reduce model complexity, add regularization, or get more data
Model performs poorly on training AND test data (underfitting) Use a more complex model, add better features, reduce regularization.
Classification model worse than a random guess or worse than majority class guess Investigate data quality, imbalanced classes.
Model performs unusually well on train and test data Check for data leakage; the model may have access to information it shouldn’t.

Workflow

Problem Solution
Jupyter notebook is too large Avoid storing large Plotly outputs; clean outputs or split the notebook
Model training takes too long Use smaller subsets for tuning; simplify the model or parallelize training