CHAPTER 12 Intermediate

Ridge Regression and Lasso Regression

Updated: May 16, 2026


1. Introduction

In standard Linear Regression, the algorithm has one goal: minimize the error on the training data. Nothing stops it from assigning massive, erratic weights (coefficients) to useless features if doing so lowers that error, and the result can be severe Overfitting. To fix this, statisticians developed Regularization: a mathematical penalty added to the loss function that essentially says: *"Minimize the error, but keep your coefficients as small and simple as possible."* In this chapter, we explore two of the most widely used regularized models: Ridge and Lasso Regression.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain the concept of Regularization.
  • Understand the Alpha ($\alpha$) penalty parameter.
  • Implement Ridge Regression (L2 Regularization).
  • Implement Lasso Regression (L1 Regularization).
  • Use Lasso for automatic Feature Selection.

3. The Problem: Exploding Coefficients

When a model overfits (like a Degree-15 Polynomial), its coefficients ($m$) are often massive numbers that nearly cancel each other out (e.g., $m_1 = 4{,}500{,}000$, $m_2 = -4{,}499{,}000$). The model is highly unstable: a tiny change in the input data causes the prediction to swing wildly. We must force these coefficients closer to zero.
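The effect can be reproduced on a small scale. The sketch below is only an illustration under stated assumptions: the data is a hypothetical noisy sine wave, and `largest_coefficient` is a helper name invented here. It fits an unregularized polynomial of degree 1 and of degree 15 to the same 20 points and reports the largest absolute coefficient of each fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data: 20 noisy samples of a sine wave on [0, 1].
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=20)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0.0, 0.1, size=20)


def largest_coefficient(degree):
    """Fit an unregularized polynomial and return its largest |coefficient|."""
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(x, y)
    # model[-1] is the fitted LinearRegression step of the pipeline.
    return float(np.abs(model[-1].coef_).max())


max_coef_deg1 = largest_coefficient(1)
max_coef_deg15 = largest_coefficient(15)
print(f"Largest |coefficient|, degree 1:  {max_coef_deg1:.2f}")
print(f"Largest |coefficient|, degree 15: {max_coef_deg15:.2e}")
```

The exact numbers depend on the random seed, but the degree-15 fit should show coefficients far larger than the straight-line fit.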

4. Ridge Regression (L2 Regularization)

Ridge Regression adds a penalty to the Loss Function equal to alpha times the sum of the squared coefficients.
  • How it works: It forces the algorithm to shrink all coefficients closer to zero. In practice it does not push them to *exactly* zero, but it makes them very small.
  • When to use it: When you have a dataset with many features, and you believe *all* of them contribute slightly to the prediction. It prevents any single feature from dominating the model.
```python
from sklearn.linear_model import Ridge

# Assume X_train and y_train are preloaded and SCALED

# alpha is the penalty strength.
# alpha=0 is standard Linear Regression. alpha=100 is a massive penalty.
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

print(f"Ridge Coefficients: {ridge_model.coef_}")
```
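To see the shrinkage on concrete numbers, the following sketch compares Ridge against plain Linear Regression. It assumes hypothetical synthetic data (60 rows, 10 features that all contribute) and an arbitrary `alpha=10.0`; none of these values come from the chapter's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: every feature has a real effect on y.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=60)
X_scaled = StandardScaler().fit_transform(X)

ols_model = LinearRegression().fit(X_scaled, y)
ridge_model = Ridge(alpha=10.0).fit(X_scaled, y)

# Overall coefficient size (Euclidean norm) for each model.
ols_norm = float(np.linalg.norm(ols_model.coef_))
ridge_norm = float(np.linalg.norm(ridge_model.coef_))
# Number of Ridge coefficients that are not exactly zero.
ridge_nonzero = int(np.count_nonzero(ridge_model.coef_))

print(f"Linear Regression coefficient norm: {ols_norm:.3f}")
print(f"Ridge coefficient norm:             {ridge_norm:.3f}")
print(f"Non-zero Ridge coefficients:        {ridge_nonzero} of 10")
```

The Ridge coefficients come out smaller overall, yet all ten features keep a non-zero weight.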

5. Lasso Regression (L1 Regularization)

Lasso (Least Absolute Shrinkage and Selection Operator) adds a penalty equal to alpha times the sum of the absolute values of the coefficients.
  • How it works: Unlike Ridge, Lasso can push coefficients all the way to exactly 0.0, and the features that contribute least are the first to be zeroed out.
  • When to use it: When you have a dataset with 500 features, and you suspect 400 of them are useless noise. Lasso can effectively delete the useless features by turning their weights to 0. It is an algorithm that performs its own Feature Selection.
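Written out in the chapter's notation, the two penalties differ only in how they measure coefficient size. This is a sketch: $n$ (number of training rows), $p$ (number of features), and $\hat{y}_i$ (the model's prediction for row $i$) are symbols introduced here for illustration.

$$
\text{Ridge (L2):}\quad \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} m_j^2
$$

$$
\text{Lasso (L1):}\quad \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |m_j|
$$

The squared penalty weakens as a coefficient approaches zero, so Ridge has little reason to go all the way. The absolute-value penalty keeps the same pressure at any size, which is why Lasso can reach exactly zero. Note that scikit-learn's `Lasso` also divides the error term by $2n$, so the same numeric `alpha` does not mean the same penalty strength in `Ridge` and `Lasso`.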
```python
from sklearn.linear_model import Lasso

# alpha is the penalty strength. Higher alpha = more coefficients turned to 0.
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)

print(f"Lasso Coefficients: {lasso_model.coef_}")
# Output might look like: [4.5, 0.0, 0.0, 1.2, 0.0] -> It deleted 3 features!
```
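The feature-selection behaviour can be checked on data where the useful features are known in advance. The sketch below assumes hypothetical synthetic data with 20 features, of which only the first three affect the target. The coefficients `5.0, -3.0, 2.0` and `alpha=0.5` are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 20 features, only features 0, 1 and 2 matter.
rng = np.random.default_rng(7)
n_rows, n_features = 200, 20
X = rng.normal(size=(n_rows, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [5.0, -3.0, 2.0]
y = X @ true_coef + rng.normal(scale=1.0, size=n_rows)
X_scaled = StandardScaler().fit_transform(X)

lasso_model = Lasso(alpha=0.5).fit(X_scaled, y)

# Indices of the features whose coefficient is not exactly zero.
kept = np.flatnonzero(lasso_model.coef_)
print(f"Features kept by Lasso: {kept.tolist()}")
print(f"Features zeroed out:    {n_features - kept.size}")
```

On real data the split between useful and useless features is rarely this clean, so the surviving set should be treated as a suggestion rather than a verdict.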

6. The Critical Importance of Scaling

WARNING: You MUST scale your features (using StandardScaler from Chapter 9) before using Ridge or Lasso. Regularization penalizes large coefficients, and the size of a coefficient depends on the units of its feature. If Square_Feet is in the thousands and Bedrooms ranges from 1 to 5, the raw coefficient for Square_Feet will naturally be tiny and the coefficient for Bedrooms much larger, so Ridge/Lasso will penalize Bedrooms far more heavily simply because of its units. Scaling puts every feature on the same footing, so the penalty reflects true predictive power rather than raw numerical size.
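One common way to make scaling hard to forget is to bundle it with the model in a scikit-learn pipeline. The sketch below assumes a hypothetical housing dataset in which price depends mostly on square footage. The column values and the price formula are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical housing data: Square_Feet in the thousands, Bedrooms from 1 to 5.
rng = np.random.default_rng(1)
square_feet = rng.uniform(800.0, 4000.0, size=100)
bedrooms = rng.integers(1, 6, size=100)
X = np.column_stack([square_feet, bedrooms])
price = 150 * square_feet + 10_000 * bedrooms + rng.normal(0.0, 5_000.0, size=100)

# Ridge on raw data: coefficient sizes mostly reflect each feature's units.
raw_coef = Ridge(alpha=1.0).fit(X, price).coef_

# Ridge inside a pipeline: the scaler is fitted and applied automatically.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, price)
scaled_coef = model[-1].coef_

print(f"Raw coefficients    [Square_Feet, Bedrooms]: {raw_coef}")
print(f"Scaled coefficients [Square_Feet, Bedrooms]: {scaled_coef}")
```

On the raw data the Bedrooms coefficient is the larger one and therefore attracts most of the penalty. After scaling, Square_Feet carries the larger coefficient, which matches how the hypothetical prices were generated.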

7. Tuning the Alpha ($\alpha$) Parameter

The alpha (often called $\lambda$ or lambda in statistics) is a hyperparameter you must choose.
  • If alpha is too low (e.g., 0.0001): The penalty is too weak. The model acts like standard Linear Regression and can still overfit.
  • If alpha is too high (e.g., 1000): The penalty is too strong. The model shrinks all coefficients to near zero and underfits.
*(We will learn how to automate the search for a good alpha in Chapter 18 using GridSearchCV).*
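Both extremes can be observed by refitting the same model with different alpha values. The sketch below assumes hypothetical synthetic data and uses the three alpha values 0.0001, 1.0 and 1000.0 purely as illustrations. It records the overall coefficient size and the training R² for each fit.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 60 rows, 10 informative features.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=60)
X_scaled = StandardScaler().fit_transform(X)

alphas = [0.0001, 1.0, 1000.0]
coef_norms = []
train_scores = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_scaled, y)
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    # score() returns R^2, here measured on the training data itself.
    train_scores.append(float(model.score(X_scaled, y)))
    print(f"alpha={alpha:>9}: coefficient norm={coef_norms[-1]:.3f}, "
          f"training R^2={train_scores[-1]:.3f}")
```

The training score alone cannot tell you which alpha is best, since it always favors the weakest penalty. That choice needs held-out data.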

8. Common Mistakes

  • Using Lasso on highly correlated features: If two features carry the same information (e.g., Size in SqFt and Size in SqMeters), Lasso tends to keep one of them, chosen arbitrarily, and force the other to zero, so which feature survives tells you little. Ridge instead spreads the weight roughly equally between them.
  • Forgetting to scale: As mentioned, running Ridge or Lasso on raw, unscaled CSV data makes the penalty depend on each feature's units, so coefficients are shrunk unevenly and the results become unreliable.
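The correlated-feature behaviour can be illustrated with the same size measured in two units. The sketch below assumes hypothetical data: the size range, the target formula, and both alpha values are invented for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the same size expressed in square feet and square meters.
rng = np.random.default_rng(5)
size_sqft = rng.uniform(800.0, 4000.0, size=100)
size_sqm = size_sqft * 0.092903  # same information, different units
X = np.column_stack([size_sqft, size_sqm])
X_scaled = StandardScaler().fit_transform(X)

# Hypothetical target driven by size plus a little noise.
y = 4.0 * X_scaled[:, 0] + rng.normal(scale=0.5, size=100)

lasso_coef = Lasso(alpha=0.1).fit(X_scaled, y).coef_
ridge_coef = Ridge(alpha=1.0).fit(X_scaled, y).coef_

print(f"Lasso coefficients [SqFt, SqMeters]: {lasso_coef}")
print(f"Ridge coefficients [SqFt, SqMeters]: {ridge_coef}")
```

In this setup Lasso puts essentially all of the weight on one column, while Ridge gives the two columns nearly identical weights. Which column Lasso keeps depends on solver details such as column order, not on which unit is "better".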

9. Best Practices

  • Default to Ridge: If you don't know which to choose, start with Ridge Regression. It is usually a safe first step for reducing overfitting compared with standard Linear Regression. Switch to Lasso when you suspect that many features are useless noise.

10. Exercises

  1. Which regression algorithm (Ridge or Lasso) is capable of reducing a feature's coefficient to exactly 0.0?
  2. If you increase the alpha parameter in a Ridge model from 1.0 to 50.0, what happens to the size of the model's coefficients?

11. MCQ Quiz with Answers

Question 1

What is the primary purpose of applying Regularization (Ridge/Lasso) to a regression model?

Question 2

Why is Lasso Regression often considered a "Feature Selection" tool?

12. Interview Questions

  • Q: Explain the mathematical difference between the penalty applied in Ridge Regression (L2) versus Lasso Regression (L1).
  • Q: Why is Feature Scaling (Standardization) an absolute mandatory prerequisite before training a Ridge or Lasso model?

13. FAQs

Q: Can I combine both Ridge and Lasso together? A: Yes. If you want the robust shrinkage of Ridge together with the automatic feature deletion of Lasso, you can combine the two penalties. This is called Elastic Net Regression, which we will cover in the very next chapter.

14. Summary

When a Linear or Polynomial model overfits, it often produces massive, unstable coefficients. Regularization acts as a mathematical leash. By applying an alpha penalty, Ridge Regression shrinks all weights to create a more stable, robust model. Lasso Regression applies a different kind of penalty (L1 instead of L2) that can drive the coefficients of useless features to exactly zero, cleaning up noisy datasets automatically.

15. Next Chapter Recommendation

Ridge is great for stability. Lasso is great for feature deletion. Why choose between them? In Chapter 13: Elastic Net Regression, we will learn how to blend both L1 and L2 regularization into the ultimate linear model.
