CHAPTER 18
Intermediate
Hyperparameter Tuning and Cross Validation
Updated: May 16, 2026
1. Introduction
When you instantiate RandomForestRegressor(), it uses 100 trees (n_estimators=100) by default. But what if your specific dataset needs 300 trees to perform at its best? These configuration settings are called Hyperparameters. Guessing them by hand is slow and unreliable, so we automate the search. Furthermore, if we evaluate our model on just one "Test Set," a good score might simply be luck. In this chapter, we will learn how to systematically optimize our models using Cross-Validation and Grid Search.
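To see these defaults for yourself, the short sketch below inspects a freshly created model. It assumes scikit-learn 0.22 or later, where the default number of trees is 100; the variable names are illustrative only.

```python
from sklearn.ensemble import RandomForestRegressor

# Instantiate with no arguments so every hyperparameter takes its default value.
model = RandomForestRegressor()

# get_params() returns the current hyperparameters as a plain dict.
params = model.get_params()

print("n_estimators:", params["n_estimators"])
print("max_depth:   ", params["max_depth"])
```

Every key in that dict is a setting you could tune instead of accepting the default.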
2. Learning Objectives
By the end of this chapter, you will be able to:
- Explain the concept of K-Fold Cross Validation.
- Differentiate between Parameters and Hyperparameters.
- Define a Hyperparameter Grid.
- Implement GridSearchCV to automate model tuning.
- Extract the "Best Model" from the search results.
3. The Flaw of the Train/Test Split
Normally, we split data 80% for Training and 20% for Testing. *The Problem:* What if, purely by chance, all the most expensive houses end up in the 20% Test Set? The model will perform terribly, not because the algorithm is bad, but because it got unlucky with the split. We solve this using K-Fold Cross Validation.
4. K-Fold Cross Validation
Instead of doing one split, we chop the dataset into 5 equal chunks (Folds).
- 1. We train on Folds 1, 2, 3, 4. We test on Fold 5. Record the score.
- 2. We train on Folds 1, 2, 3, 5. We test on Fold 4. Record the score.
- 3. We repeat this until every fold has been used as the Test Set exactly once.
- 4. We take the Average of the 5 scores.
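The four steps above can be sketched with scikit-learn's KFold and cross_val_score. This is a minimal illustration, not the chapter's original code: the synthetic dataset from make_regression and the choice of LinearRegression are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: swap in your own X and y).
X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=42)

# 5 folds, shuffled once with a fixed seed so the result is reproducible.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Each score comes from training on 4 folds and testing on the remaining one.
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")

print("Score per fold:", np.round(scores, 3))
print("Average score: ", round(scores.mean(), 3))
```

The spread between the five scores is itself useful: a wide spread suggests that a single train/test split could have been misleading.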
5. What is a Hyperparameter?
- Parameters: The numbers the model calculates on its own during training (like the slope $m$ and intercept $b$). You do not set these by hand; the training algorithm learns them from the data.
- Hyperparameters: The "knobs and dials" you set *before* training begins (like max_depth in a Tree, or alpha in Ridge Regression).
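A small sketch can make the distinction concrete. It uses Ridge Regression, where alpha is the hyperparameter and the slope and intercept are the learned parameters. The five data points are made up for illustration and roughly follow y = 2x + 1.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Tiny made-up dataset (illustrative values only).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Hyperparameter: chosen by us BEFORE training.
model = Ridge(alpha=1.0)

# Parameters: learned by the model DURING training.
model.fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_

print("alpha (hyperparameter): ", model.alpha)
print("slope m (parameter):    ", round(slope, 3))
print("intercept b (parameter):", round(intercept, 3))
```

Changing alpha and refitting gives a different slope and intercept, which is exactly why hyperparameters are worth tuning.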
6. Mini Project: Automating the Search (GridSearchCV)
Let's tune a Random Forest. Does it want 50, 100, or 200 trees? Does it want a max_depth of 5, 10, or None? We will use GridSearchCV (Grid Search Cross Validation) to test every one of these 3 x 3 = 9 combinations automatically!
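The sketch below shows one way this search might look. It is a reconstruction under stated assumptions rather than the chapter's original listing: the data is synthetic (make_regression), and cv=5, scoring="r2", and the random_state values are choices made here for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: replace with your own features and target).
X, y = make_regression(n_samples=120, n_features=5, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The grid from the text: 3 values x 3 values = 9 combinations.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
}

# cv=5 is an assumption: 9 combinations x 5 folds = 45 fits, plus one final refit.
grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="r2",
)
grid_search.fit(X_train, y_train)

print("Best hyperparameters:", grid_search.best_params_)
print("Best mean CV score:  ", round(grid_search.best_score_, 3))

# best_estimator_ is the "Best Model", already refit on all of X_train.
best_model = grid_search.best_estimator_
print("Held-out test R^2:   ", round(best_model.score(X_test, y_test), 3))
```

Note that the test set is touched only once, at the very end. All tuning decisions are made from the cross-validated scores on the training data.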
7. RandomizedSearchCV (The Faster Alternative)
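As a quick preview of the class this section describes, here is a minimal sketch of a randomized search. The dataset, the parameter ranges, and the small n_iter=10 budget are assumptions chosen to keep the example fast; plain lists of options work as well as the scipy distributions used here.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: replace with your own X and y).
X, y = make_regression(n_samples=150, n_features=5, noise=15.0, random_state=0)

# Lists are sampled uniformly; scipy distributions are sampled via their rvs() method.
param_distributions = {
    "n_estimators": randint(20, 101),    # integers from 20 to 100
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": randint(1, 6),   # integers from 1 to 5
}

random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,        # try only 10 random combinations
    cv=3,
    random_state=0,   # makes the sampled combinations reproducible
)
random_search.fit(X, y)

print("Combinations tried:  ", len(random_search.cv_results_["params"]))
print("Best hyperparameters:", random_search.best_params_)
```

The cost is controlled by n_iter rather than by the size of the search space, which is what makes this approach practical for large grids.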
If you have 10 hyperparameters with 5 options each, Grid Search must evaluate 5^10 = 9,765,625 combinations, and each one is trained once per fold. Your computer would be busy for days or longer. RandomizedSearchCV is the solution. You give it the same grid (passed as param_distributions), but you tell it, via n_iter=50: *"Only test 50 random combinations and give me the best one."* It trains only a tiny fraction of the models and often finds a combination that scores close to what a full Grid Search would find, although that is not guaranteed.
8. Common Mistakes
- Data Leakage in CV: If you scale your entire dataset (StandardScaler().fit_transform(X)) *before* putting it into GridSearchCV, information leaks across the folds. You must pass a Pipeline (containing the scaler and the model) into GridSearchCV to ensure the scaler is fit only on the training portion of each fold!
- Over-tuning: Testing n_estimators from 1 to 1000 in increments of 1 is a waste of computing power. A forest's score barely changes between 101 and 102 trees. Jump by large logical chunks (for example 50, 100, 200).
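The leakage fix from the first mistake can be sketched as follows. The step names "scaler" and "ridge", the alpha values, and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replace with your own X and y).
X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=42)

# Scaler and model travel together, so the scaler is fit on the training folds only.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Pipeline hyperparameters are addressed as <step name>__<parameter name>.
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)  # note: X is passed in unscaled

print("Best alpha:", search.best_params_["ridge__alpha"])
```

Because the scaler lives inside the pipeline, each fold computes its own mean and standard deviation from its training portion, and the held-out portion never influences them.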
9. Best Practices
- Use n_jobs=-1: In the GridSearchCV constructor, setting n_jobs=-1 tells scikit-learn to use all available CPU cores to train the candidate models in parallel. This can speed up the search dramatically!
10. Exercises
- 1. If your param_grid contains 4 options for alpha and 5 options for l1_ratio, and you set cv=3, exactly how many models will GridSearchCV train?
- 2. Explain why a Cross-Validated score is more trustworthy than a single Train/Test split score.
11. MCQ Quiz with Answers
Question 1
What is the difference between a Parameter and a Hyperparameter in Machine Learning?
Question 2
What is the primary purpose of K-Fold Cross Validation?
12. Interview Questions
- Q: Explain the mechanics of a 5-Fold Cross Validation process step-by-step.
- Q: If a Grid Search is taking 48 hours to run, what specific scikit-learn alternative class would you use to drastically reduce compute time while maintaining accuracy?
13. FAQs
Q: Does GridSearchCV automatically retrain the model on the full dataset at the end?
A: Yes! By default (refit=True), once it finds the best combination of hyperparameters, it automatically retrains a final model using those exact settings on 100% of the training data you provided.
14. Summary
A data scientist does not guess settings; they automate the search. By leveraging K-Fold Cross Validation, we get metrics that depend far less on one lucky or unlucky split. By utilizing GridSearchCV, we make the computer iterate through large grids of hyperparameter combinations, resulting in a model tuned to the best combination in the grid for our exact dataset.