CHAPTER 22 Beginner

Data Preprocessing for Machine Learning

Updated: May 18, 2026

1. Chapter Introduction

Machine learning algorithms are, at their core, mathematical equations. They cannot do arithmetic on the word "Female", and most Scikit-Learn estimators raise an error when they encounter a missing (NaN) value. Furthermore, if you train an algorithm on all of your data, you have no data left to test whether it actually works. This chapter covers preprocessing: the steps required to transform a clean Pandas DataFrame into a format Scikit-Learn can ingest.
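To make the point about text and NaN values concrete, here is a minimal sketch. It assumes `LinearRegression` as a stand-in estimator, and the helper `try_fit` and the tiny DataFrames are hypothetical names made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def try_fit(X, y):
    """Return None if fitting succeeds, otherwise the error message."""
    try:
        LinearRegression().fit(X, y)
        return None
    except (ValueError, TypeError) as err:
        return str(err)


y = [1.0, 2.0, 3.0, 4.0]

with_text = pd.DataFrame({"Gender": ["Female", "Male", "Female", "Male"]})
with_nan = pd.DataFrame({"Age": [25.0, np.nan, 35.0, 40.0]})
clean = pd.DataFrame({"Age": [25.0, 30.0, 35.0, 40.0]})

print("text column:", try_fit(with_text, y))  # an error message
print("NaN column: ", try_fit(with_nan, y))   # an error message
print("clean data: ", try_fit(clean, y))      # None, the fit succeeded
```

Only the clean, fully numeric DataFrame fits without an error, which is why the steps in this chapter come before any call to `fit`.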

2. The Train-Test Split

If you give a student the answers to a specific test and then hand them that exact same test, an A+ doesn't mean they are smart; it means they memorized the answers.

In ML, we must hide a portion of our data (usually 20%) to test the model later.

python

from sklearn.model_selection import train_test_split
import pandas as pd

# Assume 'df' is our clean DataFrame
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],  # plain integers; "50k" is not valid Python
    'Bought_Product': [0, 1, 0, 1, 1]
})

# 1. Separate Features (X) and Target (y)
X = df[['Age', 'Salary']]
y = df['Bought_Product']

# 2. Split the data (80% Train, 20% Test)
# random_state ensures we get the same split every time we run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training rows: {len(X_train)} | Testing rows: {len(X_test)}")
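The comment about `random_state` above can be checked directly. The following is a minimal sketch that repeats the split twice with the same seed and compares which rows land in the test set; the variable names `X_test_a` and `X_test_b` are made up for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 70000, 80000, 90000],
    "Bought_Product": [0, 1, 0, 1, 1],
})
X = df[["Age", "Salary"]]
y = df["Bought_Product"]

# Two calls with the same random_state select the same test rows.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

same_rows = list(X_test_a.index) == list(X_test_b.index)
print(same_rows)      # True
print(len(X_test_a))  # 1, because 20% of 5 rows is 1 row
```

With only five rows the test set holds a single row, so any score measured on it would be very noisy. Real datasets should be much larger than this toy example.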

3. Encoding Categorical Variables

Algorithms only understand numbers. If you have a column named City with values "NY", "LA", and "SF", you must convert them to numbers.

One-Hot Encoding (pd.get_dummies): This creates a new binary (1 or 0) column for every unique category.

python

data = pd.DataFrame({'City': ['NY', 'LA', 'NY', 'SF']})

# Pandas converts 'City' into 3 separate columns: City_LA, City_NY, City_SF
# dtype=int forces 1/0 output; pandas 2.0+ returns True/False by default
encoded_data = pd.get_dummies(data, columns=['City'], dtype=int)

print(encoded_data)
# Output:
#    City_LA  City_NY  City_SF
# 0        0        1        0
# 1        1        0        0
# 2        0        1        0
# 3        0        0        1
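A variant worth knowing is `drop_first=True`, which the mini project in this chapter uses. The sketch below applies it to the same City data: one column is dropped because it can be inferred from the others (a row with 0 in every remaining column must belong to the dropped category). The name `reduced` is made up for this illustration.

```python
import pandas as pd

data = pd.DataFrame({"City": ["NY", "LA", "NY", "SF"]})

# drop_first=True drops the first category in sorted order (here City_LA).
# dtype=int keeps the output as 1/0 rather than True/False.
reduced = pd.get_dummies(data, columns=["City"], drop_first=True, dtype=int)

print(list(reduced.columns))     # ['City_NY', 'City_SF']
print(reduced.iloc[1].tolist())  # [0, 0] -> this row is the dropped category, LA
```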

4. Feature Scaling (Standardization)

Imagine predicting house prices using Bedrooms (range 1-5) and Square_Footage (range 1000-5000). Many algorithms, especially those based on distances or gradient descent, may treat Square_Footage as far more important simply because its numbers are roughly 1000 times larger.

We must scale the features so they are on a level playing field (usually a mean of 0 and a standard deviation of 1).

python

from sklearn.preprocessing import StandardScaler

# 1. Initialize Scaler
scaler = StandardScaler()

# 2. Fit the scaler on the TRAINING data, and transform it
X_train_scaled = scaler.fit_transform(X_train)

# 3. Transform the TEST data (DO NOT fit it again!)
X_test_scaled = scaler.transform(X_test)
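Under the hood, standardization applies z = (x - mean) / std to each column. The following sketch reproduces the scaler's output by hand with NumPy; the small Bedrooms and Square_Footage array is hypothetical data made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: [Bedrooms, Square_Footage]
X_train = np.array([[1, 1000], [3, 3000], [5, 5000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Same calculation by hand: z = (x - mean) / std, column by column.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std().
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_scaled, manual))    # True
print(X_scaled.mean(axis=0).round(6))   # each column now has mean 0
print(X_scaled.std(axis=0).round(6))    # and standard deviation 1
```

After scaling, both columns span the same range of values, so neither dominates just because of its units.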

5. Mini Project: The ML Preprocessing Pipeline

Let's prep a messy dataset for Machine Learning.

python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

raw = pd.DataFrame({
    'Age': [22, 38, None, 55],
    'Gender': ['M', 'F', 'F', 'M'],
    'Income': [45000, 80000, 60000, 120000],
    'Approved': [0, 1, 1, 1] # Target
})

# 1. Clean NaNs (most Scikit-Learn estimators reject NaNs)
# Note: for brevity this fills with the mean of the whole column; to be strict
# about leakage, compute the mean from the training rows after splitting.
raw['Age'] = raw['Age'].fillna(raw['Age'].mean())

# 2. Encode Categories
# drop_first=True drops one redundant dummy column, which avoids perfect multicollinearity
encoded = pd.get_dummies(raw, columns=['Gender'], drop_first=True)

# 3. Define X and y
X = encoded.drop(columns=['Approved'])
y = encoded['Approved']

# 4. Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Data is 100% ready for model.fit()!")
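The mini project fills the missing Age before splitting, so the fill value is computed from every row, including the one that ends up in the test set. A stricter variant, sketched here under the same toy data, splits first and learns the mean from the training rows only. The name `train_age_mean` is made up for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

raw = pd.DataFrame({
    "Age": [22, 38, None, 55],
    "Gender": ["M", "F", "F", "M"],
    "Income": [45000, 80000, 60000, 120000],
    "Approved": [0, 1, 1, 1],  # Target
})

encoded = pd.get_dummies(raw, columns=["Gender"], drop_first=True)
X = encoded.drop(columns=["Approved"])
y = encoded["Approved"]

# Split BEFORE imputing, so test rows cannot influence the fill value.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the mean from the training rows only, then reuse it for both sets.
train_age_mean = X_train["Age"].mean()
X_train = X_train.fillna({"Age": train_age_mean})
X_test = X_test.fillna({"Age": train_age_mean})

print(X_train.isna().sum().sum(), X_test.isna().sum().sum())  # 0 0
```

Scaling would then follow exactly as in step 5 above: fit on `X_train`, transform both sets.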

6. Common Mistakes

Data Leakage: Fitting the StandardScaler on the *entire* dataset before splitting it. The scaler then computes its mean and standard deviation from rows that include the test data, leaking information about the "hidden" test set to the model. Always split first, then call fit_transform on the training data and only transform on the test data.

Forgetting random_state: If you don't set this, train_test_split shuffles the data differently every time you run the code, so scores vary from run to run and it becomes hard to tell whether your model is actually getting better.
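The data leakage mistake described above can be made visible with a small sketch. It fits one scaler on every row and another on the training rows only, then compares the means they learned. The single-feature array (with one deliberate outlier) and the names `leaky` and `correct` are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature; the last value is an outlier.
X = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

leaky = StandardScaler().fit(X)           # wrong: sees every row, test included
correct = StandardScaler().fit(X_train)   # right: sees training rows only

print("mean learned from all rows:     ", leaky.mean_[0])
print("mean learned from training rows:", correct.mean_[0])
```

The two means differ, which shows that the leaky scaler has absorbed information from the row that was supposed to stay hidden.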

7. MCQs

Question 1

Why do we perform a Train-Test split?

Question 2

What is the standard ratio for a Train-Test split?

Question 3

What parameter in `train_test_split` ensures reproducible results?

Question 4

What is the process of converting text categories (like "Red", "Blue") into numerical columns (1s and 0s)?

Question 5

Why is Feature Scaling (like `StandardScaler`) necessary?

Question 6

When using `StandardScaler`, how should you handle the Test data?

Question 7

What happens if you pass a DataFrame containing `NaN` values into a Scikit-Learn model?

Question 8

What does `pd.get_dummies(df)` do?

Question 9

In the preprocessing pipeline, what is Step 1?

Question 10

What is Data Leakage?

8. Interview Questions

Q: Explain the concept of Data Leakage. How does fitting a Scaler before doing a Train-Test split cause it?

Q: Why do algorithms require One-Hot Encoding for categorical variables?

9. Summary

Preprocessing bridges the gap between Pandas and Scikit-Learn. You must clean NaNs, convert text categories to binary numbers (pd.get_dummies), and separate your data into Training and Testing sets (train_test_split). Finally, apply StandardScaler to ensure large numbers (like Salaries) don't overpower small numbers (like Age) during the model's mathematical calculations.

10. Next Chapter Recommendation

In Chapter 23: Regression Algorithms, we will finally train our first machine learning model! We will use Linear Regression to predict continuous numbers, such as house prices, based on our preprocessed features.
