Regression Models
CHAPTER 18 Intermediate

Hyperparameter Tuning and Cross Validation

Updated: May 16, 2026

1. Introduction

When you instantiate RandomForestRegressor(), it uses 100 trees (n_estimators=100) by default. But what if your specific dataset needs 300 trees to reach its best accuracy? These configuration settings are called Hyperparameters. Guessing them by hand is slow and unreliable, so we automate the search. Furthermore, if we test our model on just one "Test Set," the score may be unusually good or unusually bad purely by chance. In this chapter, we will learn how to systematically optimize our models using Cross-Validation and Grid Search.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain the concept of K-Fold Cross Validation.
  • Differentiate between Parameters and Hyperparameters.
  • Define a Hyperparameter Grid.
  • Implement GridSearchCV to automate model tuning.
  • Extract the "Best Model" from the search results.

3. The Flaw of the Train/Test Split

Normally, we split data 80% for Training and 20% for Testing. *The Problem:* What if, purely by chance, all the most expensive houses end up in the 20% Test Set? The model will score terribly, not because the algorithm is bad, but because it got unlucky with the split. The opposite can happen too: an easy Test Set makes a mediocre model look great. We solve this using K-Fold Cross Validation.
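
To see how much a single split can swing the score, here is a minimal sketch. It assumes a small synthetic dataset from make_regression and a plain LinearRegression as stand-ins (neither comes from this chapter's housing example). The model and the data stay the same; only the random_state of the split changes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: an assumption for illustration only
X, y = make_regression(n_samples=60, n_features=5, noise=25.0, random_state=0)

scores = []
for seed in range(10):
    # Same data, same algorithm; only the random 80/20 split changes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))  # R^2 on the held-out 20%

print(f"R^2 ranges from {min(scores):.3f} to {max(scores):.3f}")
print(f"Spread (std) across splits: {np.std(scores):.3f}")
```

If the range printed is wide, a single Train/Test score on this data would tell you more about the split than about the model.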

4. K-Fold Cross Validation

Instead of doing one split, we chop the dataset into K equal chunks (Folds). With K = 5:
  1. We train on Folds 1, 2, 3, 4. We test on Fold 5. Record the score.
  2. We train on Folds 1, 2, 3, 5. We test on Fold 4. Record the score.
  3. We repeat this until every fold has been used as the Test Set exactly once.
  4. We take the Average of the 5 scores.
Every row is used for testing exactly once, so the averaged score depends far less on one lucky or unlucky split. It is still an estimate, not a guarantee, so look at the spread of the 5 scores as well as their average.
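
The steps above can be sketched by hand with scikit-learn's KFold, then compared with the one-line helper cross_val_score. The synthetic dataset and the LinearRegression model are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: an assumption for illustration only
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# 5 folds, shuffled once with a fixed seed so the folds are reproducible
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Train on 4 folds, test on the remaining fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # R^2

print("Per-fold R^2:", np.round(fold_scores, 3))
print("Mean R^2:", round(float(np.mean(fold_scores)), 3))

# The same procedure in a single call
auto_scores = cross_val_score(LinearRegression(), X, y, cv=kfold)
```

Both approaches use the same folds, so fold_scores and auto_scores should agree. In practice you would normally reach for cross_val_score; the manual loop is only there to show what happens inside.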

5. What is a Hyperparameter?

  • Parameters: The numbers the model calculates on its own during training (like the slope $m$ and intercept $b$). You do not set these by hand; the training algorithm learns them from the data.
  • Hyperparameters: The "knobs and dials" you set *before* training begins (like max_depth in a Tree, or alpha in Ridge Regression).
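
A short sketch of the difference, assuming a Ridge model on synthetic one-feature data (both are illustrative choices): alpha is passed in before training, while coef_ and intercept_ only exist after fit().

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic one-feature data: an assumption for illustration only
X, y = make_regression(n_samples=50, n_features=1, noise=5.0, random_state=0)

# Hyperparameter: chosen by you, before training
model = Ridge(alpha=1.0)
hyperparams = model.get_params()
print("Hyperparameter alpha:", hyperparams["alpha"])

# Parameters: learned from the data during fit()
model.fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
print(f"Learned slope m = {slope:.3f}, intercept b = {intercept:.3f}")
```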

6. Mini Project: Automating the Search (GridSearchCV)

Let's tune a Random Forest. Does it want 50, 100, or 200 trees? Does it want a max_depth of 5, 10, or None? We will use GridSearchCV (Grid Search Cross Validation) to test every single combination automatically!
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Assume X_train and y_train are preloaded datasets

# 1. Initialize the base model (random_state makes the results reproducible)
base_model = RandomForestRegressor(random_state=42)

# 2. Define the "Grid" of Hyperparameters to test
# The dictionary keys MUST match the exact parameter names in scikit-learn
param_grid = {
    'n_estimators': [50, 100, 200],   # Test 3 different forest sizes
    'max_depth': [None, 5, 10],       # Test 3 different depth limits
    'min_samples_split': [2, 5]       # Test 2 splitting rules
}
# Total Combinations = 3 x 3 x 2 = 18 combinations.

# 3. Setup GridSearchCV
# cv=5 means use 5-Fold Cross Validation for EVERY combination.
# Total fits = 18 combinations * 5 folds = 90, plus 1 final refit of the winner.
grid_search = GridSearchCV(
    estimator=base_model,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',  # scikit-learn maximizes scores, so MSE is negated
    n_jobs=-1,                         # use all available CPU cores
)

# 4. Start the automated search!
print("Searching for optimal hyperparameters. This may take a moment...")
grid_search.fit(X_train, y_train)

# 5. Extract the results
print(f"The best hyperparameter combination is: {grid_search.best_params_}")
print(f"Best cross-validated MSE: {-grid_search.best_score_:.3f}")

# The best model, already refit on all of X_train, is available directly
best_model = grid_search.best_estimator_
```
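
If you want to check the combination count before starting a long search, scikit-learn's ParameterGrid can do the arithmetic for you. This is a small sketch that reuses the grid from the mini project; the variable names n_combinations and n_fits are just illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
}

# ParameterGrid expands the dictionary into every combination
n_combinations = len(ParameterGrid(param_grid))
cv_folds = 5
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 18 90
```

With the default refit=True, GridSearchCV performs one extra fit at the end on the full training data, on top of these cross-validation fits.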

7. RandomizedSearchCV (The Faster Alternative)

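If you have 10 hyperparameters with 5 options each, the grid contains 5^10 = 9,765,625 combinations. With cv=5, Grid Search would have to train almost 49 million models, which is not feasible on an ordinary computer. RandomizedSearchCV is the solution. You give it the same grid (or a probability distribution for each hyperparameter), but you tell it: *"Only test 50 random combinations and give me the best one"* (n_iter=50). The cost is now fixed at 50 combinations x 5 folds = 250 fits no matter how large the search space is. In practice it often finds a combination that scores close to what a full Grid Search would find, although this is not guaranteed.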
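
Here is a minimal sketch of RandomizedSearchCV. The synthetic data, the value ranges, and the deliberately tiny n_iter=8 and cv=3 are assumptions chosen so the example runs in seconds; a real search would use larger values. Note that the argument is called param_distributions, and each entry can be either a list or a scipy.stats distribution.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic training data: an assumption for illustration only
X_train, y_train = make_regression(
    n_samples=120, n_features=6, noise=10.0, random_state=0
)

param_distributions = {
    "n_estimators": randint(10, 60),      # random integers from 10 to 59
    "max_depth": [None, 3, 5, 10],        # lists are sampled uniformly
    "min_samples_split": randint(2, 11),  # random integers from 2 to 10
}

random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,          # number of random combinations to try
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,   # makes the sampled combinations reproducible
)
random_search.fit(X_train, y_train)

print("Best combination found:", random_search.best_params_)
print("Combinations tried:", len(random_search.cv_results_["params"]))
```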

8. Common Mistakes

  • Data Leakage in CV: If you scale your entire dataset (StandardScaler.fit_transform(X)) *before* putting it into GridSearchCV, information leaks across the folds. You must pass a Pipeline (containing the scaler and the model) into GridSearchCV to ensure scaling happens independently inside each fold!
  • Over-tuning: Testing n_estimators from 1 to 1000 in increments of 1 is a waste of computing power. Trees don't care about the difference between 101 and 102 trees. Jump by large logical chunks.
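
A minimal sketch of the leakage-safe setup follows. The step names "scaler" and "model", the Ridge estimator, and the synthetic data are assumptions for illustration. Because the scaler lives inside the Pipeline, it is re-fit on the training folds only in every round of cross validation. Hyperparameter names must be prefixed with the step name and a double underscore.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data: an assumption for illustration only
X_train, y_train = make_regression(
    n_samples=100, n_features=5, noise=10.0, random_state=0
)

# Scaling is a pipeline step, so it is learned inside each fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# "<step name>__<parameter name>" addresses a parameter inside the pipeline
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}

grid_search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="neg_mean_squared_error"
)
grid_search.fit(X_train, y_train)  # pass the raw, unscaled data

print(grid_search.best_params_)
```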

9. Best Practices

  • Use n_jobs=-1: In the GridSearchCV constructor, setting n_jobs=-1 tells scikit-learn to use all available CPU cores to train the 90 models in parallel. On a multi-core machine this usually speeds up the search substantially.

10. Exercises

  1. If your param_grid contains 4 options for alpha and 5 options for l1_ratio, and you set cv=3, exactly how many models will GridSearchCV train?
  2. Explain why a Cross-Validated score is more trustworthy than a single Train/Test split score.

11. MCQ Quiz with Answers

Question 1

What is the difference between a Parameter and a Hyperparameter in Machine Learning?

Question 2

What is the primary purpose of K-Fold Cross Validation?

12. Interview Questions

  • Q: Explain the mechanics of a 5-Fold Cross Validation process step-by-step.
  • Q: If a Grid Search is taking 48 hours to run, what specific scikit-learn alternative class would you use to drastically reduce compute time while maintaining accuracy?

13. FAQs

Q: Does GridSearchCV automatically retrain the model on the full dataset at the end? A: Yes! By default, once it finds the best combination of hyperparameters, it automatically retrains a final model using those exact settings on 100% of the training data you provided.
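
This behavior is controlled by the refit argument, which defaults to True. The sketch below, which assumes a Ridge model and synthetic data purely for illustration, checks that best_estimator_ matches a model trained by hand on the full training data with the winning settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic training data: an assumption for illustration only
X_train, y_train = make_regression(
    n_samples=80, n_features=4, noise=5.0, random_state=0
)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print("refit setting:", search.refit)

# The automatically refit model...
best_model = search.best_estimator_
# ...versus a model trained by hand on all of X_train with the best settings
manual_model = Ridge(**search.best_params_).fit(X_train, y_train)

same = np.allclose(best_model.coef_, manual_model.coef_)
print("Same coefficients:", same)
```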

14. Summary

A data scientist does not guess settings; they automate the search. By leveraging K-Fold Cross Validation, we get metrics that depend far less on one lucky split. By utilizing GridSearchCV, we make the computer iterate through every combination in a hyperparameter grid and keep the one with the best cross-validated score, giving us a model tuned to our dataset within the limits of the grid we chose.

15. Next Chapter Recommendation

You have tuned your model, trained it, and estimated its accuracy. But right now, it only exists in your laptop's memory. If you close Python, it is deleted. In Chapter 19: Saving, Deploying, and Using Regression Models, we will learn how to serialize models and build an API so the world can use your AI.
