CHAPTER 12
Intermediate
Ridge Regression and Lasso Regression
Updated: May 16, 2026
1. Introduction
In standard Linear Regression, the algorithm has one goal: minimize the error on the training data. It will do *anything* to achieve this, including assigning massive, erratic weights (coefficients) to useless features, which can result in extreme Overfitting. To fix this, statisticians developed Regularization, a mathematical penalty applied to the algorithm that essentially says: *"Minimize the error, but keep your coefficients as small and simple as possible."* In this chapter, we explore two of the most widely used regularized models: Ridge and Lasso Regression.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Explain the concept of Regularization.
- Understand the Alpha ($\alpha$) penalty parameter.
- Implement Ridge Regression (L2 Regularization).
- Implement Lasso Regression (L1 Regularization).
- Use Lasso for automatic Feature Selection.
3. The Problem: Exploding Coefficients
When a model overfits (for example, a Degree-15 Polynomial), its coefficients ($m$) often become massive numbers (e.g., $m_1 = 4{,}500{,}000$, $m_2 = -4{,}499{,}000$). The model is highly unstable: a tiny change in the input data causes the prediction to swing wildly. We must force these coefficients closer to zero.
4. Ridge Regression (L2 Regularization)
Ridge Regression adds a penalty to the Loss Function equal to alpha ($\alpha$) times the sum of the squared coefficients.
- How it works: It forces the algorithm to shrink all coefficients closer to zero. In general it will not force them to *exactly* zero, but it makes them very small.
- When to use it: When you have a dataset with many features, and you believe *all* of them contribute slightly to the prediction. It prevents any single feature from dominating the model.
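The shrinking effect can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes scikit-learn and NumPy are installed, and the noisy sine data, the random seed, and `alpha=1.0` are all made up for the demonstration. In scikit-learn's formulation, Ridge minimizes $\sum_i (y_i - \hat{y}_i)^2 + \alpha \sum_j m_j^2$. The helper name `fit_poly` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, size=30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)


def fit_poly(regressor):
    """Fit a Degree-15 polynomial model and return its coefficients."""
    model = make_pipeline(
        PolynomialFeatures(degree=15, include_bias=False),
        StandardScaler(),  # put every polynomial term on the same scale
        regressor,
    )
    model.fit(X, y)
    return model[-1].coef_  # coefficients of the final regression step


ols_coefs = fit_poly(LinearRegression())
ridge_coefs = fit_poly(Ridge(alpha=1.0))

print("Largest |coefficient|, plain Linear Regression:", np.abs(ols_coefs).max())
print("Largest |coefficient|, Ridge (alpha=1.0):      ", np.abs(ridge_coefs).max())
```

Both models see exactly the same 15 polynomial features; the only difference is the penalty, so any gap between the two printed numbers comes from regularization alone.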
5. Lasso Regression (L1 Regularization)
Lasso (Least Absolute Shrinkage and Selection Operator) adds a penalty equal to alpha ($\alpha$) times the sum of the absolute values of the coefficients.
- How it works: Unlike Ridge, Lasso has a unique mathematical property: it can force the coefficients of useless features to become exactly 0.0.
- When to use it: When you have a dataset with 500 features, and you suspect 400 of them are useless noise. Lasso will automatically delete the useless features by turning their weights to 0! It is an algorithm that performs its own Feature Selection.
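The feature-selection behavior can be sketched on a small synthetic dataset. The example assumes scikit-learn and NumPy; the data (20 features, only 3 of them informative), the seed, and `alpha=0.2` are illustrative choices, not values from this chapter. scikit-learn's Lasso minimizes $\frac{1}{2n} \sum_i (y_i - \hat{y}_i)^2 + \alpha \sum_j |m_j|$, where $n$ is the number of samples.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 20 features, of which
# only the first 3 actually influence the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
y = (4.0 * X[:, 0] - 3.0 * X[:, 1] + 2.0 * X[:, 2]
     + rng.normal(scale=0.5, size=200))

model = make_pipeline(StandardScaler(), Lasso(alpha=0.2))
model.fit(X, y)

coefs = model[-1].coef_
kept = np.flatnonzero(coefs)  # indices of features with a non-zero weight

print("Non-zero coefficients:", len(kept), "of", len(coefs))
print("Indices of features kept:", kept)
```

Reading off the non-zero entries of `coef_` is what turns Lasso into a feature selector: the features whose weights were driven to zero no longer affect the prediction at all.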
6. The Critical Importance of Scaling
WARNING: You MUST scale your features (using StandardScaler from Chapter 9) before using Ridge or Lasso.
Regularization penalizes large coefficients, and the size of a coefficient depends on the units of its feature. If Square_Feet is in the thousands and Bedrooms is 1 to 5, the raw coefficient for Square_Feet will naturally be tiny, while the coefficient for Bedrooms will be much larger. Ridge/Lasso will therefore penalize Bedrooms unfairly. Scaling puts every feature on the same scale, so the penalty reflects each feature's predictive power rather than its raw numerical size.
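One way to see that the penalty depends on units is to refit the same model after changing only the unit of one feature. The sketch below assumes scikit-learn and NumPy; the housing numbers, the seed, `alpha=100.0`, and the helper name `predictions` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical housing data: Square_Feet in the thousands, Bedrooms from 1 to 5.
rng = np.random.default_rng(1)
square_feet = rng.uniform(800, 4000, size=300)
bedrooms = rng.integers(1, 6, size=300).astype(float)
price = (150 * square_feet + 10_000 * bedrooms
         + rng.normal(scale=20_000, size=300))

X_feet = np.column_stack([square_feet, bedrooms])
# Same information, different unit: size in thousands of square feet.
X_thousands = np.column_stack([square_feet / 1000, bedrooms])


def predictions(model, X):
    """Fit the model on X and return its predictions for the same rows."""
    return model.fit(X, price).predict(X)


# Without scaling, changing the unit changes the fitted model.
raw_same = np.allclose(
    predictions(Ridge(alpha=100.0), X_feet),
    predictions(Ridge(alpha=100.0), X_thousands),
)

# With StandardScaler in the pipeline, the unit no longer matters.
scaled_same = np.allclose(
    predictions(make_pipeline(StandardScaler(), Ridge(alpha=100.0)), X_feet),
    predictions(make_pipeline(StandardScaler(), Ridge(alpha=100.0)), X_thousands),
)

print("Unscaled predictions identical across units:", raw_same)
print("Scaled predictions identical across units:  ", scaled_same)
```

Placing the scaler inside a pipeline also means it is fitted on the training data only, which keeps the scaling step consistent when the model is later applied to new rows.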
7. Tuning the Alpha ($\alpha$) Parameter
The alpha parameter (often called $\lambda$ or lambda in statistics) is a hyperparameter you must choose.
- If alpha is too low (e.g., 0.0001): The penalty is too weak. The model acts like standard Linear Regression and overfits.
- If alpha is too high (e.g., 1000): The penalty is too strong. The model shrinks all coefficients to near zero and completely Underfits.
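The effect of alpha, and one common way to choose it, can be sketched as below. This assumes scikit-learn and NumPy; the synthetic data, the seed, and the candidate list `alphas` are illustrative assumptions. Cross-validation via `RidgeCV` is one reasonable selection method, not the only one.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

# Synthetic data (an assumption for illustration). The features are drawn
# on a common scale, so no separate scaling step is used here.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=1.0, size=100)

alphas = [0.0001, 0.01, 1.0, 100.0, 1000.0]

# Larger alpha -> stronger penalty -> smaller coefficients overall.
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in alphas]
for a, n in zip(alphas, norms):
    print(f"alpha={a:<8} coefficient norm={n:.4f}")

# RidgeCV evaluates each candidate alpha with 5-fold cross-validation
# and keeps the one with the best validation score.
search = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("Alpha chosen by cross-validation:", search.alpha_)
```

For Lasso, scikit-learn offers the analogous `LassoCV`, which searches over alpha in the same spirit.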
8. Common Mistakes
- Using Lasso on highly correlated features: If you have two features that carry the same information (e.g., Size in SqFt and Size in SqMeters), Lasso will tend to keep one of them, more or less arbitrarily, and force the other to zero. Ridge will tend to divide the weight between them (equally, once both are scaled).
- Forgetting to scale: As mentioned, running Ridge or Lasso on raw, unscaled CSV data makes the penalty depend on each feature's units rather than its usefulness, so the resulting coefficients cannot be trusted.
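The correlated-features behavior can be sketched with the same house size recorded in two units. The example assumes scikit-learn and NumPy; the data, the seed, and the alpha values are hypothetical. Which of the two columns Lasso keeps depends on solver details, so treat the specific choice as incidental.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical example: the same house size recorded in two units.
rng = np.random.default_rng(3)
size_sqft = rng.uniform(800, 4000, size=200)
size_sqm = size_sqft * 0.092903  # square feet -> square meters
X = StandardScaler().fit_transform(np.column_stack([size_sqft, size_sqm]))

# Target in arbitrary units, driven by size plus noise (an assumption).
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso typically concentrates the weight on one column;
# Ridge spreads it across both.
print("Lasso coefficients [SqFt, SqMeters]:", lasso.coef_)
print("Ridge coefficients [SqFt, SqMeters]:", ridge.coef_)
```

If the zeroed column matters for interpretation, it is usually better to drop one of the duplicated features by hand than to let the solver decide.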
9. Best Practices
- Default to Ridge: If you don't know which to choose, start with Ridge Regression. It is usually a safe first improvement over standard Linear Regression for preventing overfitting, because it stabilizes the coefficients without discarding any features.
10. Exercises
- 1. Which regression algorithm (Ridge or Lasso) is capable of reducing a feature's coefficient to exactly 0.0?
- 2. If you increase the alpha parameter in a Ridge model from 1.0 to 50.0, what happens to the size of the model's coefficients?
11. MCQ Quiz with Answers
Question 1
What is the primary purpose of applying Regularization (Ridge/Lasso) to a regression model?
Question 2
Why is Lasso Regression often considered a "Feature Selection" tool?
12. Interview Questions
- Q: Explain the mathematical difference between the penalty applied in Ridge Regression (L2) versus Lasso Regression (L1).
- Q: Why is Feature Scaling (Standardization) a mandatory prerequisite before training a Ridge or Lasso model?
13. FAQs
Q: Can I combine both Ridge and Lasso together? A: Yes! What if you want the robust shrinkage of Ridge, but the automatic feature deletion of Lasso? You combine them! This is called Elastic Net Regression, which we will cover in the very next chapter.
14. Summary
When a Linear or Polynomial model overfits, it generates massive, unstable coefficients. Regularization acts as a mathematical leash. By applying an alpha penalty on the squared coefficients, Ridge Regression shrinks all weights to create a highly stable, robust model. Lasso Regression penalizes the absolute values of the coefficients instead, which can drive the weights of useless features to exactly zero, cleaning up noisy datasets automatically.