Data Preprocessing for Machine Learning
# CHAPTER 22
1. Chapter Introduction
Machine Learning algorithms are mathematical equations. They cannot do math on the word "Female", nor can they understand a missing (NaN) value. Furthermore, if you train an algorithm on all your data, you have no data left to test whether it actually works. This chapter covers Preprocessing: the mandatory steps required to transform a clean Pandas DataFrame into a format Scikit-Learn can actually ingest.
2. The Train-Test Split
If you teach a student the answers to a specific test, and then give them that exact same test, an A+ doesn't mean they are smart—it means they memorized it.
In ML, we must hide a portion of our data (usually 20%) to test the model later.
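A minimal sketch of the split, assuming Scikit-Learn's train_test_split and a small made-up DataFrame (the column names hours_studied and passed are illustrative, not from a real dataset). Setting test_size=0.2 holds back 20% of the rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 students, one feature and one label.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "passed":        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
})

X = df[["hours_studied"]]   # features (2-D DataFrame)
y = df["passed"]            # target (1-D Series)

# Hold back 20% of the rows; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 8 training rows, 2 test rows
```

The model would only ever be fitted on X_train and y_train; X_test and y_test stay hidden until evaluation.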
3. Encoding Categorical Variables
Algorithms only understand numbers. If you have a column named City with values "NY", "LA", and "SF", you must convert them to numbers.
One-Hot Encoding (pd.get_dummies): This creates a new binary (1 or 0) column for every unique category.
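As a small illustration, the City column described above could be encoded as follows. The four-row DataFrame is made up for the example. Depending on the pandas version, the default output may be True/False rather than 1/0, so the sketch passes dtype=int to get integers explicitly.

```python
import pandas as pd

# Hypothetical data: one categorical column with three unique values.
df = pd.DataFrame({"City": ["NY", "LA", "SF", "NY"]})

# One new column per unique city; dtype=int requests 1/0 values.
encoded = pd.get_dummies(df, columns=["City"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the new City_LA, City_NY, and City_SF columns, and the original City column is gone.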
4. Feature Scaling (Standardization)
Imagine predicting house prices using Bedrooms (range 1-5) and Square_Footage (range 1000-5000). Many algorithms, especially those based on distances or gradients, may treat Square_Footage as roughly 1000x more important just because its numbers are bigger.
We must Scale the features so they are on a similar playing field (usually a mean of 0 and standard deviation of 1).
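A minimal sketch of standardization with Scikit-Learn's StandardScaler, using a made-up five-row table with the two columns named above. For brevity it scales the whole table at once; in a real workflow the scaler would be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: two features on very different scales.
houses = pd.DataFrame({
    "Bedrooms":       [1, 2, 3, 4, 5],
    "Square_Footage": [1000, 2000, 3000, 4000, 5000],
})

scaler = StandardScaler()
# For each column: z = (x - column mean) / column standard deviation
scaled = scaler.fit_transform(houses)   # returns a NumPy array

print(scaled.mean(axis=0))  # approximately 0 for both columns
print(scaled.std(axis=0))   # 1 for both columns
```

After scaling, both columns hold the same values in this toy table, so neither dominates because of its units.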
5. Mini Project: The ML Preprocessing Pipeline
Let's prep a messy dataset for Machine Learning.
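The chapter's own dataset is not reproduced here, so the sketch below uses a hypothetical table (Age, Salary, City, Purchased) invented for illustration. The step order follows the Summary: clean NaNs, encode categories, split, then scale. Dropping incomplete rows is just one simple way to handle missing values and is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical messy data: missing values, a text column, mixed scales.
df = pd.DataFrame({
    "Age":       [25, 32, np.nan, 41, 29, 50, 38, 45, 23, 36, 31, 48],
    "Salary":    [40000, 52000, 61000, np.nan, 48000, 90000,
                  75000, 82000, 39000, 67000, 58000, 88000],
    "City":      ["NY", "LA", "SF", "NY", "LA", "SF",
                  "NY", "LA", "SF", "NY", "LA", "SF"],
    "Purchased": [0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1],
})

# Step 1: handle missing values (here: drop rows containing NaN).
df = df.dropna().reset_index(drop=True)

# Step 2: one-hot encode the text column into 1/0 columns.
df = pd.get_dummies(df, columns=["City"], dtype=int)

# Step 3: separate features from the target, then split 80/20.
X = df.drop(columns="Purchased")
y = df["Purchased"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: scale the numeric columns. Fit on the training rows only,
# then apply the same learned mean/std to the test rows.
num_cols = ["Age", "Salary"]
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train.shape, X_test.shape)
```

At this point X_train and y_train contain only numbers and no NaN values, which is the form a Scikit-Learn model expects.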
6. Common Mistakes
- Data Leakage: Fitting the StandardScaler on the *entire* dataset before splitting it. The scaler will learn the average of the test data, leaking information about the "hidden" test set to the model. Always split first, then fit_transform the training data, and only transform the test data.
- Forgetting random_state: If you don't set this, train_test_split shuffles the data differently every time you run the code, making it impossible to tell if your model is actually getting better.
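Both mistakes can be illustrated with a short sketch on a made-up single-column array. It shows that a fixed random_state reproduces the same split, and contrasts a scaler fitted on all rows (leaky) with one fitted on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 20 rows, 1 column.
X = np.arange(20, dtype=float).reshape(-1, 1)

# Same random_state -> identical split on every call.
a_train, a_test = train_test_split(X, test_size=0.2, random_state=0)
b_train, b_test = train_test_split(X, test_size=0.2, random_state=0)
print(np.array_equal(a_test, b_test))   # True

# Leaky: this scaler's mean and std are computed from every row,
# including the rows that are supposed to stay hidden.
leaky_scaler = StandardScaler().fit(X)

# Correct: learn mean and std from the training rows only,
# then reuse those same statistics on the test rows.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(a_train)
test_scaled = scaler.transform(a_test)

# The two scalers generally learn different statistics.
print(leaky_scaler.mean_, scaler.mean_)
```

Note that the test rows are never passed to fit or fit_transform in the correct version, so their scaled mean is usually not exactly 0, and that is expected.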
7. MCQs
Why do we perform a Train-Test split?
What is the standard ratio for a Train-Test split?
What parameter in train_test_split ensures reproducible results?
What is the process of converting text categories (like "Red", "Blue") into numerical columns (1s and 0s)?
Why is Feature Scaling (like StandardScaler) necessary?
When using StandardScaler, how should you handle the Test data?
What happens if you pass a DataFrame containing NaN values into a Scikit-Learn model?
What does pd.get_dummies(df) do?
In the preprocessing pipeline, what is Step 1?
What is Data Leakage?
8. Interview Questions
- Q: Explain the concept of Data Leakage. How does fitting a Scaler before doing a Train-Test split cause it?
- Q: Why do algorithms require One-Hot Encoding for categorical variables?
9. Summary
Preprocessing bridges the gap between Pandas and Scikit-Learn. You must clean NaNs, convert text categories to binary numbers (pd.get_dummies), and separate your data into Training and Testing sets (train_test_split). Finally, apply StandardScaler to ensure large numbers (like Salaries) don't overpower small numbers (like Age) during the model's mathematical calculations.