Regularization
Regularization is a set of techniques for preventing overfitting by discouraging models from becoming too complex, typically by adding a penalty term to the loss function during training. It helps models generalize to new data, reduces variance, and often improves interpretability.
L1 Regularization (Lasso):
- Adds the sum of absolute values of the weights to the loss: \(\text{Loss} = \text{MSE} + \lambda \sum_i |w_i|\)
- Encourages sparse models by setting some coefficients to exactly zero
- Good for feature selection
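A minimal sketch of the L1 penalty using scikit-learn's `Lasso`; the synthetic data and the value `alpha=0.1` are illustrative assumptions, with `alpha` playing the role of \(\lambda\):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (illustration only): 10 features, but only features 0 and 3 matter.
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)

# alpha corresponds to lambda in the penalized loss above.
lasso = Lasso(alpha=0.1).fit(X, y)

# Most coefficients are driven to exactly zero -- the sparsity / feature-selection effect.
print(lasso.coef_)
```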
L2 Regularization (Ridge):
- Adds the sum of squared weights to the loss: \(\text{Loss} = \text{MSE} + \lambda \sum_i w_i^2\)
- Shrinks all weights but keeps them nonzero
- Useful when many small/medium-sized coefficients are expected
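A similar sketch with scikit-learn's `Ridge` on made-up data, comparing the weight norm with and without the penalty (the `alpha` value is an arbitrary illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data for illustration only.
X = rng.normal(size=(100, 10))
y = X @ rng.normal(scale=0.5, size=10) + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha plays the role of lambda

# Ridge shrinks the weights toward zero but leaves them nonzero.
print("OLS   weight norm:", np.linalg.norm(ols.coef_))
print("Ridge weight norm:", np.linalg.norm(ridge.coef_))
print("Nonzero ridge coefficients:", np.count_nonzero(ridge.coef_))
```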
Elastic Net (Combined L1 + L2):
- Mix of L1 and L2 penalties
- Used when both sparsity and stability are needed
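A short Elastic Net sketch with scikit-learn; the `alpha` and `l1_ratio` values are arbitrary and only illustrate how the two penalties are mixed:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Synthetic data for illustration only.
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)

# alpha scales the overall penalty; l1_ratio mixes the two parts
# (l1_ratio=1.0 is pure L1/Lasso, 0.0 is pure L2/Ridge).
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)
```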
Regularization in Specific Model Types
Neural Networks
- Dropout: randomly disables neurons during training
- Weight decay: adds L2 penalty on network weights
- Batch normalization: stabilizes training and can reduce overfitting indirectly
- Early stopping: monitors validation performance and stops training when improvement stalls
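A toy PyTorch training loop combining three of the techniques above: dropout, weight decay (an L2 penalty applied inside the optimizer), and a hand-rolled early-stopping check. The architecture, data, and hyperparameters are placeholder assumptions, not a recommended setup:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression data, purely for illustration.
X = torch.randn(256, 20)
y = X[:, :1] * 2.0 + 0.1 * torch.randn(256, 1)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropout: randomly zeroes activations during training
    nn.Linear(64, 1),
)

# weight_decay adds an L2 penalty on the weights at each optimizer step.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(500):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    # Early stopping: quit when validation loss stops improving.
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```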
Tree-Based Models
XGBoost supports L1 (alpha) and L2 (lambda) penalties, plus shrinkage (learning rate), tree pruning, and early stopping.
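A sketch using XGBoost's native API, assuming its usual parameter names (`alpha`, `lambda`, `eta`, `gamma`); the data and values are illustrative only, and the scikit-learn wrapper spells some of these differently (e.g. `reg_alpha`, `reg_lambda`):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Synthetic data for illustration only.
X = rng.normal(size=(500, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X[:400], label=y[:400])
dvalid = xgb.DMatrix(X[400:], label=y[400:])

params = {
    "objective": "reg:squarederror",
    "alpha": 0.1,     # L1 penalty on leaf weights
    "lambda": 1.0,    # L2 penalty on leaf weights
    "eta": 0.1,       # shrinkage / learning rate
    "gamma": 1.0,     # minimum loss reduction required to split (pruning)
    "max_depth": 4,
}

# Early stopping halts boosting when the validation metric stops improving.
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dvalid, "valid")],
    early_stopping_rounds=20,
)
```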
Random Forest uses structural constraints instead:
- Bagging and feature subsampling
- Max depth / min leaf size as built-in regularizers
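A corresponding scikit-learn `RandomForestRegressor` configuration; the specific limits below are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data for illustration only.
X = rng.normal(size=(300, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=300)

# Regularization via structure rather than an explicit penalty term.
rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=8,            # cap tree depth
    min_samples_leaf=5,     # minimum samples per leaf
    max_features="sqrt",    # random feature subset at each split
    bootstrap=True,         # bagging: each tree sees a bootstrap sample
    random_state=0,
).fit(X, y)
```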
Notes on Terminology
L1 norm: \(\|w\|_1 = \sum_i |w_i|\)
L2 norm: \(\|w\|_2 = \sqrt{\sum_i w_i^2}\); in practice the penalty uses the squared norm \(\|w\|_2^2 = \sum_i w_i^2\)
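A quick numeric check of the two norms on a small example vector:

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0, 1.0])

l1 = np.sum(np.abs(w))        # ||w||_1 = 8.0
l2 = np.sqrt(np.sum(w ** 2))  # ||w||_2 = sqrt(26) ~ 5.10
l2_sq = np.sum(w ** 2)        # squared norm, the quantity ridge actually penalizes: 26.0

print(l1, l2, l2_sq)
```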
“Lasso” stands for Least Absolute Shrinkage and Selection Operator (Tibshirani, 1996).
“Ridge” comes from ridge traces, plots of the coefficients against the penalty strength (Hoerl & Kennard, 1970).