Common Problems in Data Science
Data
| Problem | Solution |
|---|---|
| Not enough data | Use data augmentation or synthetic sampling (e.g. SMOTE, SDV) |
| Data not representative of the target population | Reassess how the data was collected; consider stratified sampling |
| Imbalanced classes | Try resampling, adjusting class weights, or adding synthetic data with SDV |
| Too much data (examples) | Subsample or use mini-batch training; profile before full-scale training |
| Too many features / high dimensionality | Apply feature selection or dimensionality reduction (e.g. PCA) |
| Data has extreme values, outliers, or anomalies | Use robust statistics (e.g. median instead of mean), or flag such values with outlier/anomaly detection methods; consider removing the affected examples |
| Data may have been faked | Check for duplicate rows, unnatural distributions, and value repetition |
| Data leakage | Review data sources and pipeline; make sure target isn’t leaking into features |
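
As one illustration of the "robust statistics" suggestion for outliers, the sketch below flags extreme values with a modified z-score built from the median and the median absolute deviation (MAD). The function name `mad_outliers`, the sample `readings`, and the cutoff of 3.5 (a commonly quoted default) are assumptions made for this example, not part of the table above.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * (x - median) / MAD, which stays stable
    even when a few extreme values would distort the mean and standard deviation.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # At least half the values equal the median, so the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Hypothetical sensor readings with one obviously extreme value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]
outlier_indices = mad_outliers(readings)
print(outlier_indices)  # index 5 (the value 55.0) is flagged
```

Flagged rows are candidates for inspection rather than automatic deletion; whether to drop, cap, or keep them depends on how the data was produced.
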
Modeling
| Problem | Solution |
|---|---|
| Model performs well on training data but poorly on test data (overfitting) | Reduce model complexity, add regularization, or get more data |
| Model performs poorly on both training and test data (underfitting) | Use a more complex model, add better features, or reduce regularization |
| Classification model is worse than a random guess or a majority-class guess | Investigate data quality and class imbalance |
| Model performs unusually well on both training and test data | Check for data leakage; the model may have access to information it shouldn’t |
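
The rows above can be turned into a rough first-pass check by comparing training and test accuracy against a majority-class baseline. This is a minimal sketch: the function names, the returned labels, and the `gap` and `suspicious` thresholds are hypothetical choices for illustration and should be tuned to the problem at hand.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


def diagnose(train_acc, test_acc, baseline_acc, gap=0.10, suspicious=0.99):
    """Map accuracy figures to one of the problems in the table (heuristic only)."""
    if test_acc < baseline_acc:
        return "below baseline"      # check data quality and class imbalance
    if train_acc >= suspicious and test_acc >= suspicious:
        return "possible leakage"    # scores look too good to be true
    if train_acc - test_acc > gap:
        return "overfitting"         # large train/test gap
    if train_acc < baseline_acc + gap:
        return "underfitting"        # barely better than the baseline anywhere
    return "no obvious problem"


# Hypothetical labels: 3 of 4 examples belong to class 0.
baseline = majority_baseline_accuracy([0, 0, 0, 1])
print(baseline)
print(diagnose(train_acc=0.98, test_acc=0.71, baseline_acc=0.60))
```

Accuracy is used here only because it makes the baseline easy to state; on heavily imbalanced data a metric such as F1 or balanced accuracy is usually a better basis for the same comparison.
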
Workflow
| Problem | Solution |
|---|---|
| Jupyter notebook is too large | Avoid storing large Plotly outputs; clean outputs or split the notebook |
| Model training takes too long | Use smaller subsets for tuning; simplify the model or parallelize training |
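
For the oversized-notebook case, one way to "clean outputs" is to edit the notebook's JSON directly. The sketch below assumes the standard version 4 `.ipynb` layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); the helper name `strip_outputs` and the in-memory example notebook are hypothetical. Dedicated tools such as `nbstripout` do the same job more thoroughly.

```python
import json


def strip_outputs(notebook):
    """Remove outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


# Hypothetical in-memory notebook with one bulky output.
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["fig.show()"],
            "execution_count": 3,
            "outputs": [
                {"output_type": "display_data", "metadata": {},
                 "data": {"text/plain": ["x" * 10_000]}}
            ],
        },
    ],
}

size_before = len(json.dumps(nb))
strip_outputs(nb)
size_after = len(json.dumps(nb))
print(size_before, size_after)

# To apply this to a file, load it with json.load, call strip_outputs,
# and write the result back with json.dump(nb, f, indent=1).
```
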