DS QuickRef

Glossary

Oversampling: A technique to balance class distribution in imbalanced datasets by duplicating (or synthesizing) examples from the minority class.
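A minimal sketch of random oversampling, using only the standard library. The helper name `random_oversample` and the toy data are made up for illustration; this duplicates minority rows by sampling with replacement until every class matches the largest one.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Hypothetical helper: duplicate minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - n)  # sample with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data (assumed): six majority rows, two minority rows.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [9.0], [9.5]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(rows, labels)
print(Counter(y_bal))
```

Oversampling is usually applied to the training split only, so duplicated rows cannot end up in both train and test sets.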
Overfitting and underfitting
- Overfitting: the model memorizes training data and fails to generalize.
- Underfitting: the model is too simple to capture patterns in the data.
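The two failure modes above can be sketched with a toy k-nearest-neighbors regressor on synthetic data (a noisy sine curve, chosen here purely for illustration). With k=1 the model memorizes the training set; with k equal to the training set size it predicts the same average everywhere. All names and settings are assumptions for this sketch.

```python
import math
import random

rng = random.Random(0)

def make_data(n):
    # Synthetic data (assumed): y = sin(x) + Gaussian noise.
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def knn_predict(train_x, train_y, x, k):
    # Average the targets of the k training points closest to x.
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

train_x, train_y = make_data(100)
test_x, test_y = make_data(200)

results = {}
for name, k in [("k=1 (overfit)", 1), ("k=10 (balanced)", 10), ("k=100 (underfit)", 100)]:
    train_err = mse(train_x, train_y, train_x, train_y, k)
    test_err = mse(train_x, train_y, test_x, test_y, k)
    results[name] = (train_err, test_err)
    print(f"{name}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

A large gap between training and test error suggests overfitting; high error on both suggests underfitting.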
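Data leakage: When information that would not be available at prediction time (for example, from the test set or from the target) influences training, leading to overly optimistic performance estimates.
- Example: Imputing missing values with statistics computed on the full dataset before splitting it into train/test sets.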
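One way to avoid the imputation leak in the example above is to split first, then learn the fill value from the training rows only. The sketch below uses a single toy column and hypothetical helpers (`fit_mean`, `impute`); it is an illustration, not a full preprocessing pipeline.

```python
import random
import statistics

rng = random.Random(0)

# Toy column (assumed) with two missing entries marked as None.
values = [rng.gauss(50, 10) for _ in range(20)]
values[3] = None
values[15] = None

# Split FIRST, then learn the imputation statistic.
train, test = values[:15], values[15:]

def fit_mean(column):
    # Mean of the observed (non-missing) entries.
    return statistics.mean([v for v in column if v is not None])

def impute(column, fill):
    return [fill if v is None else v for v in column]

train_mean = fit_mean(train)             # learned from training rows only
train_filled = impute(train, train_mean)
test_filled = impute(test, train_mean)   # test rows reuse the training statistic

# Anti-pattern, for comparison: this mean has "seen" the test rows.
leaky_mean = fit_mean(values)
print(f"train-only mean: {train_mean:.2f}, leaky mean: {leaky_mean:.2f}")
```

The same rule applies to scaling, encoding, and feature selection: fit on the training split, then apply to the test split.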
Types of learning
- Supervised learning: learning from labeled data.
- Unsupervised learning: finding patterns in unlabeled data.
- Semi-supervised learning: learning from a mix of labeled and unlabeled data.
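- Deep learning: a subset of ML using multi-layer neural networks. It is a model family rather than a separate paradigm and can be combined with any of the other types listed here.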
- Reinforcement learning: learning by trial and error to maximize a reward signal.
Regularization: Methods to prevent overfitting by penalizing complexity or limiting model flexibility.
- L1 regularization (Lasso): Can shrink some coefficients to zero, effectively removing features. Useful for feature selection.
- L2 regularization (Ridge): Shrinks all coefficients toward zero but doesn’t eliminate any. Helps when many features contribute a little.
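The difference between the two penalties is easiest to see in a simplified special case. Assuming orthonormal feature columns and the objectives 0.5*||y - Xb||^2 + lam*||b||_1 (L1) and 0.5*||y - Xb||^2 + 0.5*lam*||b||^2 (L2), each penalized coefficient is a simple function of the unpenalized least-squares coefficient. The coefficients and `lam` below are made-up values.

```python
def lasso_shrink(beta, lam):
    """Soft-thresholding: the L1 solution per coefficient under the orthonormal assumption."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(beta, lam):
    """Proportional shrinkage: the L2 solution per coefficient under the same assumption."""
    return beta / (1.0 + lam)

ols_coefs = [3.0, -1.5, 0.4, -0.2]  # hypothetical unpenalized coefficients
lam = 0.5

l1_coefs = [lasso_shrink(b, lam) for b in ols_coefs]
l2_coefs = [ridge_shrink(b, lam) for b in ols_coefs]

print("L1:", l1_coefs)
print("L2:", l2_coefs)
```

With correlated features there is no such closed form for L1, and solvers are used instead, but the qualitative behavior (exact zeros for L1, uniform shrinkage for L2) carries over.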
Bias-variance tradeoff: The balance between underfitting (high bias, model too simple) and overfitting (high variance, model too sensitive to the training data). Good models strike a balance between the two. A key concept in model performance tuning.
Model variance: Error due to the model reacting too strongly to small fluctuations in the training data. Leads to poor generalization.
Model bias: Systematic error that leads a model to make consistently inaccurate predictions in a specific direction. Often caused by overly simple assumptions or biased data.
- Sources of bias in data:
  - Label bias: Training labels are inaccurate or inconsistent.
  - Sampling bias: The dataset isn’t representative of the real-world population.
  - Measurement bias: Inputs are recorded in a flawed or inconsistent way.
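The model bias and model variance entries above can be made concrete with a small simulation. The sketch below estimates a known mean from many simulated training sets using two estimators: the plain sample mean, and a deliberately shrunken version (multiplied by 0.5). All constants are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)
TRUE_MEAN, NOISE_SD, N, TRIALS = 10.0, 2.0, 10, 2000  # assumed settings

# One sample mean per simulated training set.
sample_means = [
    statistics.mean([rng.gauss(TRUE_MEAN, NOISE_SD) for _ in range(N)])
    for _ in range(TRIALS)
]

def bias_and_variance(shrink):
    # Estimator under study: shrink * sample mean.
    estimates = [shrink * m for m in sample_means]
    bias = statistics.mean(estimates) - TRUE_MEAN   # systematic error
    variance = statistics.pvariance(estimates)      # sensitivity to the sample drawn
    return bias, variance

bias_plain, var_plain = bias_and_variance(1.0)
bias_shrunk, var_shrunk = bias_and_variance(0.5)

print(f"plain mean:  bias {bias_plain:+.3f}, variance {var_plain:.3f}")
print(f"shrunk mean: bias {bias_shrunk:+.3f}, variance {var_shrunk:.3f}")
```

The shrunken estimator varies less from sample to sample but is consistently too low. Expected squared error is bias squared plus variance (plus irreducible noise when predicting new observations), so a lower variance only pays off if the added bias is small.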
Bootstrapping: A resampling method that draws repeated samples (with replacement) from the data, recomputing a statistic on each, to estimate its uncertainty (for example, standard errors or confidence intervals).
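A minimal sketch of a percentile bootstrap interval using the standard library. The function name `bootstrap_ci`, the data, and the number of resamples are illustrative assumptions; the percentile method shown is one of several bootstrap interval variants.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Hypothetical helper: percentile bootstrap interval for `stat`."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, same size as the original data, and recompute the statistic.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up measurements.
data = [12.1, 9.8, 11.4, 10.2, 13.5, 9.1, 10.9, 12.7, 8.6, 11.8, 10.4, 12.0]

low, high = bootstrap_ci(data)
print(f"sample mean: {statistics.mean(data):.2f}")
print(f"95% bootstrap interval: ({low:.2f}, {high:.2f})")
```

Passing `stat=statistics.median` gives an interval for the median, which is where bootstrapping is most useful: statistics without a convenient closed-form standard error.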
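Confidence intervals: A range of values, derived from sample data, produced by a procedure that captures the true population parameter a stated fraction of the time (e.g., 95% of intervals built this way would contain it). Often interpreted (loosely) as: “We’re 95% confident the true value lies in this range.”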
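The "95%" describes how often the interval-building procedure succeeds over many samples. A rough simulation can show this, assuming normally distributed data and a normal-approximation interval for the mean (a t quantile would be more accurate for small samples). All constants are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(0)
TRUE_MEAN, SD, N, TRIALS = 5.0, 2.0, 50, 2000  # assumed settings
z = statistics.NormalDist().inv_cdf(0.975)     # about 1.96 for a 95% interval

covered = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, SD) for _ in range(N)]
    mean = statistics.mean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(N)
    # Does this sample's interval contain the true mean?
    if mean - half_width <= TRUE_MEAN <= mean + half_width:
        covered += 1

coverage = covered / TRIALS
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land near 0.95, slightly below it here because the normal quantile is used in place of the t quantile.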
Prediction intervals: A range that likely contains a future individual observation, not just the average. Wider than confidence intervals because they include both the uncertainty in the fitted model and the random variation of individual data points.
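For the simplest possible model (predicting every new value with the sample mean), the two interval types can be compared directly. This sketch assumes independent, roughly normal data and uses a normal quantile for brevity; with a sample this small a t quantile would normally be used. The data are made up.

```python
import math
import statistics

# Made-up measurements.
data = [12.1, 9.8, 11.4, 10.2, 13.5, 9.1, 10.9, 12.7, 8.6, 11.8, 10.4, 12.0]

n = len(data)
mean = statistics.mean(data)
s = statistics.stdev(data)
z = statistics.NormalDist().inv_cdf(0.975)  # normal approximation (assumption)

# Confidence interval: uncertainty about the mean itself.
ci_half = z * s / math.sqrt(n)

# Prediction interval: uncertainty about the mean PLUS the spread of one new observation.
pi_half = z * s * math.sqrt(1 + 1 / n)

print(f"95% CI for the mean:      {mean - ci_half:.2f} to {mean + ci_half:.2f}")
print(f"95% PI for a new value:   {mean - pi_half:.2f} to {mean + pi_half:.2f}")
```

As n grows, the confidence interval shrinks toward zero width, while the prediction interval levels off at roughly plus or minus z times s, because individual observations stay noisy no matter how well the mean is known.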