Classification Algorithms
CHAPTER 04 Intermediate

NumPy, Pandas, and Data Preparation

Updated: May 16, 2026


1. Introduction

A classification algorithm like Logistic Regression is essentially a mathematical equation: it expects a table of numbers, not a raw spreadsheet, and most implementations raise an error when they encounter missing values. Before we can train a model, we must load, organize, and clean our data. NumPy provides the fast array operations required for the math, while Pandas acts as a programmable spreadsheet for wrangling the data. In this chapter, we will prepare our data for machine learning.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Create and manipulate multidimensional NumPy arrays.
  • Load datasets (CSVs) into Pandas DataFrames.
  • Explore dataset structure using .info() and .value_counts().
  • Filter rows based on specific conditions.
  • Handle missing data (NaN) without crashing your models.

3. NumPy Basics

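NumPy (Numerical Python) is the foundation of most of the Python data science stack. Its core object is the ndarray (N-dimensional array), which stores values of a single type in contiguous memory and is therefore much faster than a standard Python list for numerical work on large datasets.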
python
import numpy as np

# 1D Array (Vector) - e.g., a single column of ages
ages = np.array([25, 30, 35, 40])

# 2D Array (Matrix) - e.g., A dataset of patients
# Columns: [Age, Heart_Rate]
patient_data = np.array([
    [45, 80],
    [50, 85],
    [35, 70]
])

print(f"Matrix Shape: {patient_data.shape}")
# Output: Matrix Shape: (3, 2) -> 3 rows (patients), 2 columns (features)
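
The speed advantage comes from vectorized operations, where one expression is applied to a whole array at once instead of looping in Python. The following is a minimal sketch reusing the patient matrix above; the variable names `ages_in_months` and `column_means` are illustrative only.

```python
import numpy as np

# Same patient matrix as above: columns are [Age, Heart_Rate]
patient_data = np.array([
    [45, 80],
    [50, 85],
    [35, 70],
])

# Slice out a single column: all rows, column 0
ages = patient_data[:, 0]

# Vectorized arithmetic: one expression applies to every element
ages_in_months = ages * 12
print(ages_in_months)  # [540 600 420]

# Column-wise mean (axis=0 collapses the rows, leaving one value per column)
column_means = patient_data.mean(axis=0)
print(column_means.shape)  # (2,)
```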

4. Pandas Basics and DataFrames

While NumPy handles the math, a plain array has no column names. Pandas builds on NumPy to provide the DataFrame: a 2D table with labeled rows and named columns.
python
import pandas as pd

# Creating a DataFrame manually
data = {
    "Age": [25, 45, 30],
    "Income": [50000, 80000, 60000],
    "Defaulted": [0, 1, 0] # 1=Defaulted on loan, 0=Paid
}

df = pd.DataFrame(data)
print(df)
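
To make the relationship between the two libraries concrete, the sketch below rebuilds the same DataFrame, selects columns by name, and then converts them to a raw NumPy array with `to_numpy()`. The variable name `features` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 45, 30],
    "Income": [50000, 80000, 60000],
    "Defaulted": [0, 1, 0],
})

# Named columns are what Pandas adds on top of the raw numbers
print(df.columns.tolist())  # ['Age', 'Income', 'Defaulted']
print(df.shape)             # (3, 3)

# Select columns by name, then drop down to the underlying NumPy array
features = df[["Age", "Income"]].to_numpy()
print(type(features).__name__)  # ndarray
print(features.shape)           # (3, 2)
```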

5. Reading CSV Files and Exploration

In reality, you will load massive datasets from CSV files. Once loaded, you must explore the data to understand the classification task.
python
# Load a CSV file (Assuming 'bank_data.csv' exists)
# df = pd.read_csv("bank_data.csv")

# View the first 5 rows
print(df.head())

# View dataset information (column names, non-null counts, data types)
# df.info() prints its report directly and returns None, so no print() is needed
df.info()

# CRITICAL FOR CLASSIFICATION: Check class balance!
# How many people Defaulted vs Paid?
print(df['Defaulted'].value_counts())
# Output might be: 
# 0    850
# 1    150
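
If you do not have a CSV file at hand, the loading step can still be tried end to end. The sketch below assumes a small, made-up CSV held in a string (`csv_text` is hypothetical data, not the bank dataset) and reads it with `pd.read_csv` through `io.StringIO`. It also shows `value_counts(normalize=True)`, which reports class balance as proportions instead of raw counts.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a file on disk
csv_text = """Age,Income,Defaulted
25,50000,0
45,80000,1
30,60000,0
52,,0
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (4, 3)

# The blank Income cell in the last row was loaded as NaN
print(df["Income"].isna().sum())  # 1

# Class balance as proportions rather than raw counts
balance = df["Defaulted"].value_counts(normalize=True)
print(balance)
```

Proportions are often easier to compare across datasets of different sizes than raw counts.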

6. Data Filtering

Pandas allows you to query your data to find specific subgroups.
python
# Get a single column
incomes = df['Income']

# Filter rows: Find all customers older than 40
older_customers = df[df['Age'] > 40]

# Multiple conditions: Older than 40 AND Defaulted
high_risk = df[(df['Age'] > 40) & (df['Defaulted'] == 1)]
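
Conditions can also be combined with `|` (OR) and negated with `~` (NOT). The sketch below uses the small DataFrame from Section 4; the variable names are illustrative. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparison operators in Python.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 45, 30],
    "Income": [50000, 80000, 60000],
    "Defaulted": [0, 1, 0],
})

# OR: younger than 30 OR income of at least 80,000
young_or_wealthy = df[(df["Age"] < 30) | (df["Income"] >= 80000)]
print(len(young_or_wealthy))  # 2

# NOT: everyone who did not default
paid = df[~(df["Defaulted"] == 1)]
print(len(paid))  # 2

# Filter rows and pick columns in one step with .loc
older_incomes = df.loc[df["Age"] > 40, ["Age", "Income"]]
print(older_incomes)
```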

7. Handling Missing Data

If a CSV cell is blank, Pandas loads it as NaN (Not a Number). Most scikit-learn estimators reject input containing NaN and raise a ValueError as soon as you call fit, so missing values must be handled first.
python
# Check how many missing values are in each column
print(df.isnull().sum())

# Strategy 1: Drop any row containing NaN (Quick, but loses data)
df_clean = df.dropna()

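# Strategy 2: Imputation (fill missing values with the column's mean)
mean_income = df['Income'].mean()
# Assign the result back instead of using inplace=True on a selected column,
# which recent pandas versions warn about and which may not update df
df['Income'] = df['Income'].fillna(mean_income)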
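
Two common variations on these strategies are dropping rows only when a particular column is missing, and imputing with the median instead of the mean. The sketch below uses made-up values, including a deliberately extreme income, to show why the median can be the safer choice when a column contains outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values and one extreme income
df = pd.DataFrame({
    "Age": [25, 45, np.nan, 52],
    "Income": [50000, np.nan, 60000, 500000],
    "Defaulted": [0, 1, 0, 0],
})

# Drop rows only when a specific column is missing
df_age_known = df.dropna(subset=["Age"])
print(len(df_age_known))  # 3

# Median imputation: less sensitive to the 500,000 outlier than the mean
median_income = df["Income"].median()
print(median_income)  # 60000.0

# assign() returns a new DataFrame and leaves the original untouched
df_filled = df.assign(Income=df["Income"].fillna(median_income))
print(df_filled["Income"].isna().sum())  # 0
```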

8. Mini Project: Prepare Dataset for ML

Before training, we must split our DataFrame into two pieces: the input features (X) and the target label (y).
python
import pandas as pd

# Mock Data
df = pd.DataFrame({
    "Age": [22, 45, 30, 55, 28],
    "Credit_Score": [600, 750, 680, 800, 590],
    "Approved": [0, 1, 1, 1, 0] # 1=Approved for credit, 0=Denied
})

# 1. Isolate the Features (X)
# We drop the Target column, keeping ONLY the inputs
X = df.drop("Approved", axis=1)

# 2. Isolate the Target Label (y)
# This is the single column the algorithm must learn to predict
y = df["Approved"]

print("Features (X) shape:", X.shape) # Output: (5, 2)
print("Target (y) shape:", y.shape)   # Output: (5,)

*Your data is now mathematically separated and ready to be fed into a Classification algorithm!*
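
Before handing X and y to a model, a few cheap checks can catch the most common preparation mistakes. This is a sketch of one possible set of checks, not a required step; it rebuilds the same mock data and uses `drop(columns=...)`, which is equivalent to `axis=1`.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 45, 30, 55, 28],
    "Credit_Score": [600, 750, 680, 800, 590],
    "Approved": [0, 1, 1, 1, 0],
})

X = df.drop(columns="Approved")  # same result as df.drop("Approved", axis=1)
y = df["Approved"]

# Every feature row must have exactly one label
assert len(X) == len(y)

# The target must not leak into the features
assert "Approved" not in X.columns

# No missing values may remain in either piece
assert not X.isna().any().any()
assert not y.isna().any()

print("X and y passed the basic checks")
```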

9. Common Mistakes

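  • Confusing loc and iloc: To select a row by its integer position, use iloc, which counts from zero, so the 5th row is df.iloc[4]. By contrast, df.loc[5] looks up the row whose index label is 5, wherever that row sits in the table. The two agree only while the DataFrame keeps its default 0, 1, 2, ... index; after filtering or sorting they can return different rows.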
  • Forgetting axis=1: By default, df.drop('Approved') looks for a *row* labeled 'Approved' and raises a KeyError when none exists. Pass axis=1, or use the equivalent df.drop(columns='Approved'), to drop a column.
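
Both mistakes are easy to reproduce. The sketch below assumes a tiny DataFrame whose index labels (10, 20, 30) deliberately differ from the row positions (0, 1, 2), so that loc and iloc cannot be confused.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [22, 45, 30]},
    index=[10, 20, 30],  # index labels deliberately differ from positions
)

# iloc selects by position: the first row is position 0
print(df.iloc[0]["Age"])  # 22

# loc selects by label: the same row carries the label 10
print(df.loc[10]["Age"])  # 22

# df.loc[0] would raise a KeyError here, because no row is labeled 0

# Dropping a column: both spellings are equivalent
without_age_a = df.drop("Age", axis=1)
without_age_b = df.drop(columns="Age")
print(without_age_a.shape)  # (3, 0)
```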

10. Best Practices

  • Always check value_counts(): In classification, knowing whether your target variable is balanced is critical. If your dataset has 990 "Not Fraud" rows and only 10 "Fraud" rows, a model can reach 99% accuracy by predicting "Not Fraud" every time while catching no fraud at all. (We dedicate Chapter 14 entirely to fixing this issue!)
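
The danger of imbalance can be made concrete with a baseline that always predicts the majority class. The sketch below builds hypothetical labels matching the 990/10 split above (0 for "Not Fraud", 1 for "Fraud"); no real model is trained.

```python
import pandas as pd

# Hypothetical labels: 990 "Not Fraud" (0) and 10 "Fraud" (1)
y = pd.Series([0] * 990 + [1] * 10, name="Fraud")

counts = y.value_counts()
majority_class = counts.idxmax()
print(majority_class)  # 0

# A "model" that ignores its input and always predicts the majority class
predictions = pd.Series(majority_class, index=y.index)

# Looks excellent on paper...
baseline_accuracy = (predictions == y).mean()
print(baseline_accuracy)  # 0.99

# ...yet it never flags a single fraud case
fraud_caught = int(((predictions == 1) & (y == 1)).sum())
print(fraud_caught)  # 0
```

Any real classifier on this data should be judged against that 99% baseline rather than against 50%.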

11. Exercises

  1. If you load a DataFrame and df.shape returns (5000, 12), what does that tell you about the dataset?
  2. Write the Pandas code to find out how many rows contain missing values (NaN) in the entire DataFrame.

12. MCQ Quiz with Answers

Question 1

What is the primary difference in functionality between NumPy and Pandas?

Question 2

When preparing data for a Classification model, what does the y variable represent?

13. Interview Questions

  • Q: Explain why feeding NaN (missing values) directly into a standard machine learning model causes it to crash, and describe two strategies to handle them using Pandas.
  • Q: What is the purpose of separating a dataset into X and y matrices?

14. FAQs

Q: Do I need to convert my Pandas DataFrames into NumPy arrays manually before training? A: Usually not. scikit-learn estimators accept Pandas DataFrames directly and convert them to NumPy arrays internally. If another library requires a plain array, call df.to_numpy().

15. Summary

Data wrangling is the unglamorous but vital reality of data science. By mastering Pandas DataFrames, exploring class distributions, handling missing (NaN) values safely, and separating your X features from your y target, you give your algorithms the clean, structured data they need.

16. Next Chapter Recommendation

We have the tools and the clean data. But how does an algorithm actually separate "Spam" from "Not Spam"? In Chapter 5: Understanding Classification Fundamentals, we will dive into the core intuition behind drawing decision boundaries and the battle of Overfitting vs Underfitting.
