Common Problems in Data Science

Data

Problem	Solution
Not enough data	Use data augmentation or synthetic sampling (e.g. SMOTE, SDV)
Data not representative of the real-world distribution	Reassess how data was collected; consider stratified sampling
Imbalanced classes	Try resampling, adjusting class weights, or adding synthetic data with SDV
Too much data (too many examples)	Subsample or use mini-batch training; profile before full-scale training
Too many features / high dimensionality	Apply feature selection or dimensionality reduction (e.g. PCA)
Data has extreme values, outliers, or anomalies	Use robust statistics, or flag such values with outlier/anomaly detection methods; remove examples only when they are clearly errors
Data may have been faked	Check for duplicate rows, unnatural distributions, and value repetition
Data leakage	Review data sources and the pipeline; make sure the target isn’t leaking into the features
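
Several of the checks above are simple enough to sketch in plain Python: finding duplicate rows, flagging outliers with a robust statistic, and computing class weights for imbalanced classes. This is an illustrative sketch only. The function names, the 3.5 cutoff on the modified z-score (a common rule of thumb), and the "balanced" weighting formula n_samples / (n_classes * class_count) are assumptions made for the example, not something the table prescribes.

```python
from collections import Counter
from statistics import median


def duplicate_rows(rows):
    """Return each row that appears more than once, with its count."""
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds `threshold`.

    Uses the median and the median absolute deviation (MAD), which are
    far less sensitive to extreme values than the mean and standard deviation.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so the score is undefined
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical toy data, made up for this example.
rows = [(1, "x"), (2, "y"), (1, "x")]
values = [10, 11, 9, 10, 12, 10, 95]
labels = ["a"] * 8 + ["b"] * 2

print(duplicate_rows(rows))
print(mad_outliers(values))
print(balanced_class_weights(labels))
```

A flagged index is a prompt to look at the example, not proof that it should be dropped. The weights can be passed to any learner that accepts per-class weights.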

Models

Problem	Solution
Model performs well on training data, poorly on test data (overfitting)	Reduce model complexity, add regularization, or get more data
Model performs poorly on training AND test data (underfitting)	Use a more complex model, add better features, or reduce regularization
Classification model performs worse than random guessing or than always predicting the majority class	Investigate data quality and class imbalance
Model performs unusually well on train and test data	Check for data leakage; the model may have access to information it shouldn’t
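
The four rows above can be read as a small decision procedure over train accuracy, test accuracy, and a majority-class baseline. The sketch below is one hedged way to write that down. The function names and every threshold (`gap`, `near_perfect`, `target_acc`) are hypothetical values chosen for illustration and should be tuned to the problem at hand.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return sum(y == majority for y in y_test) / len(y_test)


def diagnose(train_acc, test_acc, baseline_acc,
             gap=0.10, near_perfect=0.99, target_acc=0.80):
    """Map accuracy numbers to one of the problems in the table.

    All thresholds are illustrative assumptions, not established cutoffs.
    """
    if test_acc <= baseline_acc:
        return "below_baseline"    # investigate data quality, class imbalance
    if train_acc >= near_perfect and test_acc >= near_perfect:
        return "possible_leakage"  # unusually good on both: check for leakage
    if train_acc - test_acc > gap:
        return "overfitting"       # good on train, poor on test
    if train_acc < target_acc:
        return "underfitting"      # poor on train and test alike
    return "ok"


# Hypothetical labels: 90% of examples belong to the "neg" class.
y_train = ["neg"] * 90 + ["pos"] * 10
y_test = ["neg"] * 9 + ["pos"] * 1
baseline = majority_baseline_accuracy(y_train, y_test)

print(baseline)
print(diagnose(train_acc=0.99, test_acc=0.93, baseline_acc=baseline))
print(diagnose(train_acc=0.85, test_acc=0.88, baseline_acc=baseline))
```

With heavily imbalanced classes the baseline is already high, so a model can reach high accuracy and still be no better than guessing the majority class.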

Tools and workflow

Problem	Solution
Jupyter notebook is too large	Avoid storing large Plotly outputs; clear outputs or split the notebook
Model training takes too long	Use smaller subsets for tuning; simplify the model or parallelize training
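
Both fixes above can be sketched with the standard library alone. A notebook file is JSON, so stored outputs can be cleared by editing the parsed document, and a reproducible random subset is enough for quick tuning runs. The sketch assumes the version 4 notebook layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"). The helper names and the toy notebook are hypothetical.

```python
import copy
import json
import random


def strip_outputs(notebook):
    """Return a copy of a parsed notebook with all code-cell outputs cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def subsample(rows, fraction, seed=0):
    """Return a reproducible random subset for faster tuning runs."""
    k = max(1, int(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)


# Hypothetical in-memory notebook with one large stored output.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["fig.show()"],
            "outputs": [{"output_type": "stream", "name": "stdout",
                         "text": ["x" * 10_000]}],
        },
    ],
}

cleaned = strip_outputs(notebook)
print(len(json.dumps(notebook)), "->", len(json.dumps(cleaned)))

tuning_rows = subsample(list(range(200)), fraction=0.25)
print(len(tuning_rows))
```

For a file on disk, load it with `json.load`, pass the result through `strip_outputs`, and write it back with `json.dump`. Keep a backup until the cleaned notebook has been opened successfully.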