CHAPTER 04 Intermediate

NumPy, Pandas, and Data Preparation

Updated: May 16, 2026

1. Introduction

A classification algorithm like Logistic Regression is essentially a giant mathematical equation. It cannot read Excel spreadsheets, and it crashes if it encounters blank cells. Before we can train a model, we must load, organize, and sanitize our data. NumPy provides the blazing-fast matrices required for the math, while Pandas acts as a programmable spreadsheet to wrangle the data. In this chapter, we will prepare our data for machine learning.

2. Learning Objectives

By the end of this chapter, you will be able to:

Create and manipulate multidimensional NumPy arrays.

Load datasets (CSVs) into Pandas DataFrames.

Explore dataset structure using .info() and .value_counts().

Filter rows based on specific conditions.

Handle missing data (NaN) without crashing your models.

3. NumPy Basics

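NumPy (Numerical Python) is the foundation of most Python data science libraries. Its core object is the ndarray (N-Dimensional Array), which stores values of a single type in one contiguous block of memory and runs arithmetic in compiled code. For numerical work on large datasets, that makes it far faster than looping over a standard Python list.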

python

import numpy as np

# 1D Array (Vector) - e.g., a single column of ages
ages = np.array([25, 30, 35, 40])

# 2D Array (Matrix) - e.g., A dataset of patients
# Columns: [Age, Heart_Rate]
patient_data = np.array([
    [45, 80],
    [50, 85],
    [35, 70]
])

print(f"Matrix Shape: {patient_data.shape}")
# Output: Matrix Shape: (3, 2) -> 3 rows (patients), 2 columns (features)
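To make the speed claim concrete, here is a minimal sketch of the vectorized style that NumPy enables. It reuses the patient matrix from above; the derived variable names (heart_rates, ages_in_months, column_means) are illustrative only.

```python
import numpy as np

# Same patient matrix as above: columns are [Age, Heart_Rate]
patient_data = np.array([
    [45, 80],
    [50, 85],
    [35, 70]
])

# Slice out one column: every row (:), column index 1
heart_rates = patient_data[:, 1]

# Vectorized arithmetic: one expression applies to every element, no loop needed
ages_in_months = patient_data[:, 0] * 12

# Aggregate down the rows (axis=0) to get one mean per column
column_means = patient_data.mean(axis=0)

print(heart_rates)      # [80 85 70]
print(ages_in_months)   # [540 600 420]
print(column_means)     # roughly 43.33 for Age and 78.33 for Heart_Rate
```

The same multiplication on a Python list would need an explicit loop or list comprehension; with an ndarray the loop runs inside NumPy's compiled code.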

4. Pandas Basics and DataFrames

NumPy handles the numerical work, but a plain array carries no column names. Pandas builds on top of NumPy to provide the DataFrame: a 2D table with labeled rows and named columns.

python

import pandas as pd

# Creating a DataFrame manually
data = {
    "Age": [25, 45, 30],
    "Income": [50000, 80000, 60000],
    "Defaulted": [0, 1, 0] # 1=Defaulted on loan, 0=Paid
}

df = pd.DataFrame(data)
print(df)
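As a quick sketch of how the DataFrame relates to NumPy, the snippet below rebuilds the same table and inspects it. The variable names income and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 45, 30],
    "Income": [50000, 80000, 60000],
    "Defaulted": [0, 1, 0],
})

print(df.shape)            # (3, 3)
print(list(df.columns))    # ['Age', 'Income', 'Defaulted']

# A single column is a pandas Series
income = df["Income"]
print(type(income).__name__)                  # Series

# to_numpy() exposes the table's values as a plain NumPy array
values = df.to_numpy()
print(type(values).__name__, values.shape)    # ndarray (3, 3)
```

Notice that the NumPy array keeps the numbers but loses the column names, which is exactly the gap Pandas fills.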

5. Reading CSV Files and Exploration

In reality, you will load massive datasets from CSV files. Once loaded, you must explore the data to understand the classification task.

python

# Load a CSV file (Assuming 'bank_data.csv' exists)
# df = pd.read_csv("bank_data.csv")

# View the first 5 rows
print(df.head())

# View dataset information (Column names, data types)
# info() prints its report directly, so no print() is needed
df.info()

# CRITICAL FOR CLASSIFICATION: Check class balance!
# How many people Defaulted vs Paid?
print(df['Defaulted'].value_counts())
# Output might be: 
# 0    850
# 1    150
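Since 'bank_data.csv' may not exist on your machine, here is a self-contained sketch of the same load-and-explore flow. The CSV text and the names csv_text, bank_df, and balance are made up for illustration; io.StringIO simply lets read_csv parse a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'bank_data.csv' so the example runs without a file
csv_text = """Age,Income,Defaulted
25,50000,0
45,80000,1
30,60000,0
52,,0
"""

bank_df = pd.read_csv(io.StringIO(csv_text))
print(bank_df.shape)                      # (4, 3)

# The blank Income cell in the last row is loaded as NaN
print(bank_df["Income"].isna().sum())     # 1

# normalize=True reports class proportions instead of raw counts
balance = bank_df["Defaulted"].value_counts(normalize=True)
print(balance.loc[0], balance.loc[1])     # 0.75 0.25
```

Proportions are often easier to read than raw counts when you compare datasets of different sizes.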

6. Data Filtering

Pandas allows you to query your data to find specific subgroups.

python

# Get a single column
incomes = df['Income']

# Filter rows: Find all customers older than 40
older_customers = df[df['Age'] > 40]

# Multiple conditions: Older than 40 AND Defaulted
high_risk = df[(df['Age'] > 40) & (df['Defaulted'] == 1)]
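The AND filter above has two siblings worth knowing: | for OR and ~ for NOT. The sketch below uses a small made-up table; the thresholds and variable names are illustrative. Each condition needs its own parentheses, because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 45, 30, 52],
    "Income": [50000, 80000, 60000, 40000],
    "Defaulted": [0, 1, 0, 1],
})

# OR: younger than 30, or income below 45,000
young_or_low_income = df[(df["Age"] < 30) | (df["Income"] < 45000)]

# NOT: everyone who did not default
paid = df[~(df["Defaulted"] == 1)]

# .loc filters rows and selects columns in one step
defaulter_details = df.loc[df["Defaulted"] == 1, ["Age", "Income"]]

print(len(young_or_low_income))          # 2
print(len(paid))                         # 2
print(defaulter_details["Age"].tolist()) # [45, 52]
```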

7. Handling Missing Data

If a CSV cell is blank, Pandas loads it as NaN (Not a Number). Most Scikit-learn estimators, Logistic Regression included, refuse to train on NaN and raise a ValueError, so missing values must be handled first.

python

# Check how many missing values are in each column
print(df.isnull().sum())

# Strategy 1: Drop any row containing NaN (Quick, but loses data)
df_clean = df.dropna()

# Strategy 2: Imputation (Fill missing values with the column's mean)
mean_income = df['Income'].mean()
# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Income'] = df['Income'].fillna(mean_income)
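The two strategies above can be made more targeted. The sketch below, on made-up data, drops rows only when one particular column is missing and imputes with the median rather than the mean. The median is a swapped-in alternative to the chapter's mean imputation, shown because a single extreme value (the 1,000,000 income here) pulls the mean far more than the median.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 45, np.nan, 52],
    "Income": [50000, np.nan, 60000, 1000000],
    "Defaulted": [0, 1, 0, 1],
})

# Drop a row only if its Age is missing; other gaps are left alone
has_age = df.dropna(subset=["Age"])

# Median imputation for Income; passing a dict fills only the named column
median_income = df["Income"].median()
df_filled = df.fillna({"Income": median_income})

print(len(has_age))                        # 3
print(median_income)                       # 60000.0
print(df_filled["Income"].isna().sum())    # 0
print(df_filled["Age"].isna().sum())       # 1 (Age was not imputed)
```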

8. Mini Project: Prepare Dataset for ML

Before training, we must split our DataFrame into two exact pieces: The Input Features (X) and the Target Label (y).

python

import pandas as pd

# Mock Data
df = pd.DataFrame({
    "Age": [22, 45, 30, 55, 28],
    "Credit_Score": [600, 750, 680, 800, 590],
    "Approved": [0, 1, 1, 1, 0] # 1=Approved for credit, 0=Denied
})

# 1. Isolate the Features (X)
# We drop the Target column, keeping ONLY the inputs
X = df.drop("Approved", axis=1)

# 2. Isolate the Target Label (y)
# This is the single column the algorithm must learn to predict
y = df["Approved"]

print("Features (X) shape:", X.shape) # Output: (5, 2)
print("Target (y) shape:", y.shape)   # Output: (5,)

*Your data is now mathematically separated and ready to be fed into a Classification algorithm!*
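If you repeat this split across several datasets, it can be wrapped in a small function. The helper below, split_features_target, is a hypothetical name introduced here for illustration, not part of Pandas; it is a sketch that also checks the target column exists before splitting.

```python
import pandas as pd

def split_features_target(df, target):
    """Hypothetical helper: return (X, y) for the given target column."""
    if target not in df.columns:
        raise KeyError(f"Target column {target!r} not found")
    X = df.drop(columns=[target])   # every column except the target
    y = df[target]                  # the single column to predict
    return X, y

df = pd.DataFrame({
    "Age": [22, 45, 30, 55, 28],
    "Credit_Score": [600, 750, 680, 800, 590],
    "Approved": [0, 1, 1, 1, 0],
})

X, y = split_features_target(df, "Approved")

print(list(X.columns))   # ['Age', 'Credit_Score']
print(y.tolist())        # [0, 1, 1, 1, 0]
```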

9. Common Mistakes

Confusing loc and iloc: iloc selects by integer position, counting from 0, so the 5th row is df.iloc[4]. loc selects by index label, so df.loc[5] searches for the row whose index label is 5, wherever that row sits in the table.

Forgetting axis=1: drop() defaults to axis=0 (rows), so df.drop('Approved') looks for a *row* labeled 'Approved' and raises a KeyError. Specify axis=1, or use df.drop(columns='Approved'), to drop a column.
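Both mistakes are easy to reproduce on a tiny table. In this sketch the data and index labels are made up, and the labels are deliberately out of step with the row positions so that loc and iloc give different answers.

```python
import pandas as pd

# Hypothetical data: index labels deliberately differ from row positions
df = pd.DataFrame(
    {"Age": [22, 45, 30], "Approved": [0, 1, 1]},
    index=[10, 20, 5],
)

# iloc: by position (0-based), so this is the first row
first_row_age = df.iloc[0]["Age"]

# loc: by label, so this is the row labeled 5 (the last row here)
label_5_age = df.loc[5, "Age"]

print(first_row_age)   # 22
print(label_5_age)     # 30

# drop() without axis=1 looks for a row labeled "Approved" and fails
try:
    df.drop("Approved")
    drop_failed = False
except KeyError:
    drop_failed = True

# columns= states the intent explicitly and avoids the axis argument
X = df.drop(columns="Approved")

print(drop_failed)        # True
print(list(X.columns))    # ['Age']
```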

10. Best Practices

Always check value_counts(): In Classification, knowing whether your target variable is balanced is critical. If your dataset has 990 "Not Fraud" rows and only 10 "Fraud" rows, a model can reach 99% accuracy by predicting "Not Fraud" every time while catching no fraud at all. (We dedicate Chapter 14 entirely to fixing this issue!).
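To see why imbalance is dangerous, the sketch below builds the 990-versus-10 label column from the example and scores a "model" that always predicts the majority class. The Series name Is_Fraud and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical labels: 990 "Not Fraud" (0) and 10 "Fraud" (1)
y = pd.Series([0] * 990 + [1] * 10, name="Is_Fraud")

counts = y.value_counts()
majority_class = counts.idxmax()

# A "model" that ignores its inputs and always predicts the majority class
predictions = pd.Series(majority_class, index=y.index)

baseline_accuracy = (predictions == y).mean()
fraud_caught = int(((predictions == 1) & (y == 1)).sum())

print(counts.to_dict())     # {0: 990, 1: 10}
print(baseline_accuracy)    # 0.99
print(fraud_caught)         # 0
```

An accuracy of 0.99 looks excellent until you notice that not a single fraud case was detected, which is why the class counts should be checked before any model is trained.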

11. Exercises

1. If you load a DataFrame and df.shape returns (5000, 12), what does that tell you about the dataset?

2. Write the Pandas code to find out how many rows contain missing values (NaN) in the entire DataFrame.

12. MCQ Quiz with Answers

Question 1

What is the primary difference in functionality between NumPy and Pandas?

Question 2

When preparing data for a Classification model, what does the `y` variable represent?

13. Interview Questions

Q: Explain why feeding NaN (missing values) directly into a standard machine learning model causes it to crash, and describe two strategies to handle them using Pandas.

Q: What is the purpose of separating a dataset into X and y matrices?

14. FAQs

Q: Do I need to convert my Pandas DataFrames into NumPy arrays manually before training? A: Usually not. Scikit-learn estimators accept Pandas DataFrames directly and convert them to NumPy arrays internally. If you ever need the raw array yourself, call df.to_numpy().

15. Summary

Data wrangling is the unglamorous but vital reality of Data Science. By mastering Pandas DataFrames, exploring class distributions, handling missing (NaN) values safely, and separating your X features from your y target, you make sure your algorithms receive clean, structured data.

16. Next Chapter Recommendation

We have the tools and the clean data. But how does an algorithm actually separate "Spam" from "Not Spam"? In Chapter 5: Understanding Classification Fundamentals, we will dive into the core intuition behind drawing decision boundaries and the battle of Overfitting vs Underfitting.
