CHAPTER 04
Intermediate
NumPy, Pandas, and Data Preparation
Updated: May 16, 2026
1. Introduction
A classification algorithm like Logistic Regression is essentially a giant mathematical equation. It cannot read Excel spreadsheets, and it crashes if it encounters blank cells. Before we can train a model, we must load, organize, and sanitize our data. NumPy provides the blazing-fast matrices required for the math, while Pandas acts as a programmable spreadsheet to wrangle the data. In this chapter, we will prepare our data for machine learning.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Create and manipulate multidimensional NumPy arrays.
- Load datasets (CSVs) into Pandas DataFrames.
- Explore dataset structure using `.info()` and `.value_counts()`.
- Filter rows based on specific conditions.
- Handle missing data (NaN) without crashing your models.
3. NumPy Basics
NumPy (Numerical Python) is the foundation of most of the Python data science stack. Its core object is the ndarray (N-Dimensional Array), which stores values of a single type in one contiguous block and is typically much faster than a standard Python list for numerical work.
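As a minimal sketch of what working with an ndarray looks like (the values below are made up for illustration), the following creates a small 2D array and applies a few common operations:

```python
import numpy as np

# A 2D array (matrix) with 2 rows and 3 columns; the values are illustrative.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.shape)  # (rows, columns)
print(matrix.ndim)   # number of dimensions

# Vectorized arithmetic applies to every element without a Python loop.
doubled = matrix * 2

# axis=0 collapses the rows, giving one sum per column.
column_sums = matrix.sum(axis=0)

# reshape reinterprets the same six values as 3 rows by 2 columns.
reshaped = matrix.reshape(3, 2)
print(doubled, column_sums, reshaped, sep="\n")
```

The speed advantage comes from operations like `matrix * 2` running in compiled code over the whole array rather than element by element in Python.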
4. Pandas Basics and DataFrames
NumPy handles the numerical work, but a plain array has no column names. Pandas builds on NumPy to provide the DataFrame: a 2D table with labeled rows and named columns.
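A small sketch of building a DataFrame by hand may help here. The loan-style columns `Age` and `Income` and all values are hypothetical; only the `Approved` column name echoes the one used later in this chapter:

```python
import pandas as pd

# Hypothetical loan-application data, invented for this example.
df = pd.DataFrame({
    "Age": [25, 40, 33],
    "Income": [32000, 85000, 54000],
    "Approved": ["No", "Yes", "Yes"],
})

print(df)
print(df.columns.tolist())  # the column names NumPy alone does not track

# Selecting one column by name returns a Series.
income = df["Income"]

# The numeric columns can be pulled back out as a plain NumPy array.
values = df[["Age", "Income"]].to_numpy()
print(values.shape)
```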
5. Reading CSV Files and Exploration
In practice, you will usually load datasets from files such as CSVs, and they are often far too large to inspect by eye. Once the data is loaded, you must explore it to understand the classification task.
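The sketch below shows one way the load-then-explore step might look. To stay self-contained it reads CSV text from memory through `io.StringIO`; with a real file you would pass its path to `pd.read_csv` instead. The column names and values are assumptions made for illustration:

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (hypothetical data).
# With a real file: pd.read_csv("path/to/your_file.csv")
csv_text = """Age,Income,Approved
25,32000,No
40,85000,Yes
33,,Yes
51,61000,No
29,47000,Yes
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first rows, for a quick visual check
print(df.shape)   # (number of rows, number of columns)
df.info()         # column dtypes and non-null counts

# How many rows fall into each class of the target column.
class_counts = df["Approved"].value_counts()
print(class_counts)
```

Note that the blank `Income` cell in the third data row shows up in `.info()` as a lower non-null count for that column.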
6. Data Filtering
Pandas allows you to query your data to find specific subgroups.
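Filtering is usually done with boolean masks. The following is a minimal sketch on hypothetical data (column names and thresholds are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "Age": [25, 40, 33, 51],
    "Income": [32000, 85000, 54000, 61000],
    "Approved": ["No", "Yes", "Yes", "No"],
})

# A comparison on a column yields a boolean Series (a mask);
# indexing the DataFrame with it keeps only the rows marked True.
high_income = df[df["Income"] > 50000]

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
approved_over_35 = df[(df["Approved"] == "Yes") & (df["Age"] > 35)]

print(high_income)
print(approved_over_35)
```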
7. Handling Missing Data
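If a CSV cell is blank, Pandas loads it as NaN (Not a Number). Most scikit-learn estimators, including Logistic Regression, raise an error as soon as they encounter NaN in the input, so missing values must be dropped or filled before training.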
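Two common ways to deal with NaN in Pandas are dropping the affected rows and filling the gaps with a substitute value. The sketch below uses hypothetical data, and the choice of the column median as the fill value is an assumption for illustration, not a universal recommendation:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing cells.
df = pd.DataFrame({
    "Age": [25, 40, np.nan, 51],
    "Income": [32000, np.nan, 54000, 61000],
    "Approved": ["No", "Yes", "Yes", "No"],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Strategy 1: drop every row that contains at least one NaN.
dropped = df.dropna()

# Strategy 2: fill each numeric column with its own median.
filled = df.fillna({
    "Age": df["Age"].median(),
    "Income": df["Income"].median(),
})

print(dropped)
print(filled)
```

Dropping is simple but discards data; filling keeps every row at the cost of inserting values that were never observed.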
8. Mini Project: Prepare Dataset for ML
Before training, we must split our DataFrame into two pieces: the input features (X), which are the columns the model learns from, and the target label (y), which is the column it learns to predict.
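A minimal sketch of the split, assuming a hypothetical DataFrame whose target column is named `Approved` (the other columns and all values are invented for illustration):

```python
import pandas as pd

# Hypothetical, already-cleaned data.
df = pd.DataFrame({
    "Age": [25, 40, 33, 51],
    "Income": [32000, 85000, 54000, 61000],
    "Approved": ["No", "Yes", "Yes", "No"],
})

# X: every column except the target. drop returns a new DataFrame
# and leaves df itself unchanged.
X = df.drop(columns=["Approved"])

# y: the target column only, as a Series.
y = df["Approved"]

print(X.shape)  # (rows, feature columns)
print(y.shape)  # (rows,)
```

X and y must keep the same row order and the same number of rows, since row i of X is paired with element i of y during training.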
*Your data is now mathematically separated and ready to be fed into a Classification algorithm!*
9. Common Mistakes
- Confusing `loc` and `iloc`: In Pandas, `iloc` selects a row by its integer position, and positions are zero-based, so the 5th row is `df.iloc[4]`. `df.loc[5]` instead searches for a row whose index label is 5, which may be a different row or may not exist at all.
- Forgetting `axis=1`: When dropping a column with `df.drop('Approved')`, Pandas defaults to looking for a *row* labeled 'Approved' and raises a KeyError. You must specify `axis=1` to tell it to drop a column, or use the equivalent `df.drop(columns='Approved')`.
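Both mistakes are easy to reproduce on a tiny DataFrame. In this sketch the data is hypothetical and the index labels are deliberately set so that they differ from the row positions:

```python
import pandas as pd

# Index labels (5, 10, 15) deliberately differ from positions (0, 1, 2).
df = pd.DataFrame(
    {"Income": [32000, 85000, 54000], "Approved": ["No", "Yes", "Yes"]},
    index=[5, 10, 15],
)

by_position = df.iloc[0]  # first row, selected by position
by_label = df.loc[5]      # row whose index label is 5 (the same row here)

# Without axis=1, drop looks for a row labeled "Approved" and fails.
try:
    df.drop("Approved")
    drop_failed = False
except KeyError:
    drop_failed = True

# Two equivalent ways to drop the column.
without_target = df.drop("Approved", axis=1)
same_result = df.drop(columns="Approved")

print(by_position, by_label, drop_failed, without_target, sep="\n")
```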
10. Best Practices
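- Always check `value_counts()`: In Classification, knowing whether your target variable is balanced is critical. If your dataset has 990 "Not Fraud" rows and only 10 "Fraud" rows, a model that always predicts "Not Fraud" scores 99% accuracy while catching no fraud at all, so standard algorithms and plain accuracy can be badly misleading. (We dedicate Chapter 14 entirely to fixing this issue!)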
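The balance check itself takes one line. The sketch below rebuilds the 990 versus 10 split from the example as a standalone Series; the Series name `label` is an assumption for illustration:

```python
import pandas as pd

# Recreate the imbalanced target from the example: 990 vs. 10 rows.
labels = pd.Series(["Not Fraud"] * 990 + ["Fraud"] * 10, name="label")

# Absolute number of rows per class.
counts = labels.value_counts()

# normalize=True returns each class's share of the rows instead.
shares = labels.value_counts(normalize=True)

# Accuracy of a "model" that always predicts the most common class.
baseline_accuracy = shares.max()

print(counts)
print(shares)
print(baseline_accuracy)
```

Comparing any trained model against `baseline_accuracy` shows quickly whether it has learned anything beyond the class imbalance.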
11. Exercises
1. If you load a DataFrame and `df.shape` returns `(5000, 12)`, what does that tell you about the dataset?
2. Write the Pandas code to find out how many rows contain missing values (`NaN`) in the entire DataFrame.
12. MCQ Quiz with Answers
Question 1
What is the primary difference in functionality between NumPy and Pandas?
Question 2
When preparing data for a Classification model, what does the y variable represent?
13. Interview Questions
- Q: Explain why feeding `NaN` (missing values) directly into a standard machine learning model causes it to crash, and describe two strategies to handle them using Pandas.
- Q: What is the purpose of separating a dataset into `X` and `y` matrices?
14. FAQs
Q: Do I need to convert my Pandas DataFrames into NumPy arrays manually before training?
A: Usually not. scikit-learn algorithms accept Pandas DataFrames directly and convert them to NumPy arrays internally. If you ever need the array yourself, `df.to_numpy()` returns it.
15. Summary
Data wrangling is the unglamorous but vital reality of Data Science. By mastering Pandas DataFrames, exploring class distributions, handling missing (NaN) values safely, and isolating your X features from your y target, you make sure that your algorithms receive clean, structured data.