Common Problems in Data Science
Data
Problem | Solution |
---|---|
Not enough data | Use data augmentation or synthetic sampling (e.g. SMOTE, SDV) |
Data not representative of the distribution | Reassess how data was collected; consider stratified sampling |
Imbalanced classes | Try resampling, adjusting class weights, or adding synthetic data with SDV |
Too much data (too many examples) | Subsample or use mini-batch training; profile on a small subset before full-scale training |
Too many features / high dimensionality | Apply feature selection or dimensionality reduction (e.g. PCA) |
Data has extreme values, outliers, or anomalies | Use robust statistics, or flag such values with outlier/anomaly detection methods; consider removing the affected examples |
Data may have been faked | Check for duplicate rows, unnatural distributions, and value repetition |
Data leakage | Review data sources and the pipeline; make sure the target isn’t leaking into the features |
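
As an illustration of the "robust statistics" suggestion for outliers, here is a minimal sketch that flags extreme values with a modified z-score built on the median and the median absolute deviation (MAD). The function name `flag_outliers`, the sample `readings`, and the 3.5 cutoff are assumptions chosen for the example (3.5 is a commonly cited rule of thumb, not a universal threshold).

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half of the values equal the median, so the score is
        # undefined; this sketch simply reports no outliers in that case.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are roughly normal.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Made-up sensor readings with one obviously extreme value at index 5.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]
outlier_idx = flag_outliers(readings)
print(outlier_idx)
```

Because the median and MAD are barely affected by the extreme value itself, this tends to be more reliable than a mean/standard-deviation rule on small samples. Whether a flagged row should be removed is still a judgment call.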
Modeling
Problem | Solution |
---|---|
Model performs well on training data but poorly on test data (overfitting) | Reduce model complexity, add regularization, or get more data |
Model performs poorly on both training and test data (underfitting) | Use a more complex model, add better features, or reduce regularization |
Classification model performs worse than random guessing or a majority-class baseline | Investigate data quality and class imbalance |
Model performs unusually well on training and test data | Check for data leakage; the model may have access to information it shouldn’t |
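
To make the majority-class comparison concrete, the sketch below computes the accuracy of always predicting the most frequent class and checks whether a model beats it. The helper names and the toy labels are hypothetical, and accuracy is used only because it is the simplest metric to show; on imbalanced data other metrics may be more informative.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(y_true)
    return counts.most_common(1)[0][1] / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels: 80% of examples belong to class 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]

baseline = majority_baseline_accuracy(y_true)
model_acc = accuracy(y_true, y_pred)
beats_baseline = model_acc > baseline

print(f"baseline={baseline:.2f} model={model_acc:.2f} beats={beats_baseline}")
```

In this toy case the model scores below the baseline, which is the signal to look at data quality and class imbalance before tuning the model further.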
Workflow
Problem | Solution |
---|---|
Jupyter notebook is too large | Avoid storing large Plotly outputs; clear outputs or split the notebook |
Model training takes too long | Use smaller subsets for tuning; simplify the model or parallelize training |
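
One way to clear notebook outputs programmatically is sketched below. It assumes the notebook uses the standard version 4 JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); `strip_outputs` and `strip_file` are hypothetical helper names, not part of Jupyter.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with all code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_file(path):
    """Rewrite the .ipynb file at `path` in place without outputs."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1)


# In-memory example with a made-up notebook; no file is touched here.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "source": ["fig.show()"], "metadata": {},
         "execution_count": 3, "outputs": [{"output_type": "display_data"}]},
        {"cell_type": "markdown", "source": ["# Notes"], "metadata": {}},
    ],
}
cleaned = strip_outputs(example)
print(len(json.dumps(example)), "->", len(json.dumps(cleaned)))
```

The same result can usually be had from the command line with `jupyter nbconvert --clear-output --inplace <notebook>`; the function form is handy inside a pre-commit hook or a cleanup script.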