Skip to main content
Data Cleaning
CHAPTER 18 Beginner

Performance Optimization for Large Datasets

Updated: May 18, 2026
5 min read

# CHAPTER 18

Performance Optimization for Large Datasets

1. Chapter Introduction

When your dataset has 10,000 rows, any cleaning code will run instantly. When your dataset has 50 million rows, inefficient code will crash your computer with an OutOfMemory error, or take 4 hours to run. This chapter teaches you how to optimize Pandas for speed and memory, enabling you to clean massive datasets on a standard laptop.

2. The Problem: Memory Limits

Pandas loads data entirely into your computer's RAM. If you have 8GB of RAM, and you try to load a 10GB CSV file using pd.read_csv(), your kernel will crash. Furthermore, Pandas usually requires 2x to 3x the size of the dataset in RAM to perform operations (like sorting or merging).

3. Optimization 1: Data Type Downcasting

By default, Pandas assigns the largest possible memory bucket to numbers (int64 and float64). If your column represents "Age" (0 to 120), using a 64-bit integer is a massive waste of memory. An 8-bit integer (int8) can store numbers up to 127 and uses 8x less memory.

python
1234567891011121314151617181920212223242526
import pandas as pd
import numpy as np

# Create a sample large dataset
df = pd.DataFrame({
    'id': np.random.randint(1, 1000000, 1000000), # Large numbers
    'age': np.random.randint(1, 100, 1000000),    # Small numbers (max 100)
    'status': np.random.choice(['Active', 'Pending', 'Closed'], 1000000) # Text
})

print("=== MEMORY BEFORE OPTIMIZATION ===")
df.info(memory_usage='deep')
# You will see memory usage is around 68 MB.

# Downcast Integers
df['age'] = pd.to_numeric(df['age'], downcast='integer')
# Pandas automatically chooses int8 because max value is < 127

# Convert low-cardinality text to Categorical
# Instead of storing the word "Active" 300,000 times, it stores it once in a dictionary 
# and replaces the column with tiny integers (0, 1, 2)
df[&#039;status'] = df['status'].astype('category')

print("\n=== MEMORY AFTER OPTIMIZATION ===")
df.info(memory_usage=&#039;deep')
# Memory drops significantly (e.g., from 68 MB to 9 MB!)

4. Optimization 2: Reading Data in Chunks

If the file is simply too big to fit in RAM, you must read it in chunks. You load 100,000 rows, clean them, append the results to a new file, and repeat.

python
12345678910111213141516171819202122232425
# 1. Prepare an empty output file with headers
# (Pseudo-code for the process)

# 2. Create a chunk iterator
chunk_size = 100000
chunk_iterator = pd.read_csv(&#039;massive_file.csv', chunksize=chunk_size)

for i, chunk in enumerate(chunk_iterator):
    print(f"Processing chunk {i}...")
    
    # --- Clean the chunk ---
    # Drop rows missing critical IDs
    clean_chunk = chunk.dropna(subset=[&#039;customer_id'])
    
    # Standardize strings
    clean_chunk[&#039;name'] = clean_chunk['name'].str.upper()
    
    # --- Append the clean chunk to the new file ---
    # mode='a' means append. header=False prevents writing the header 100 times.
    if i == 0:
        clean_chunk.to_csv(&#039;cleaned_massive_file.csv', mode='w', index=False)
    else:
        clean_chunk.to_csv(&#039;cleaned_massive_file.csv', mode='a', index=False, header=False)

print("Batch processing complete.")

5. Optimization 3: Only Load What You Need

If a CSV has 50 columns, but you only need 3 columns for your analysis, do not load the whole file and then drop the columns. Load only the 3 columns.

python
1234
# Load ONLY the 'id', 'date', and 'amount' columns
# This immediately saves massive amounts of RAM
cols_to_keep = [&#039;id', 'date', 'amount']
df_lean = pd.read_csv(&#039;data.csv', usecols=cols_to_keep)

6. Optimization 4: Avoid Loops (Vectorization)

Python for loops are incredibly slow. Pandas is built on C code. Always use vectorized Pandas functions instead of iterating through rows.

python
12345678910111213141516
import time

# Dataset with 1 million rows
df = pd.DataFrame({&#039;price': np.random.rand(1000000) * 100})

# SLOW WAY (Iterating with a for loop)
start = time.time()
discounted_slow = []
for index, row in df.iterrows():
    discounted_slow.append(row[&#039;price'] * 0.9)
print(f"Loop time: {time.time() - start:.2f} seconds") # Might take 50+ seconds

# FAST WAY (Vectorization)
start = time.time()
df[&#039;discounted_fast'] = df['price'] * 0.9
print(f"Vectorized time: {time.time() - start:.5f} seconds") # Takes 0.005 seconds

7. Common Mistakes

  • Using apply() for everything: Beginners discover .apply(lambda x: ...) and use it for everything. apply() is essentially a hidden for loop. It is slow. Use built-in vectorized .str or .dt accessors whenever possible.
  • Not using the category data type: If you have 10 million rows of US States, storing them as strings is a massive waste of RAM. Converting to .astype('category') provides the fastest and easiest memory reduction.

8. MCQs

Question 1

What happens if you try to read a 15GB CSV file into Pandas on a laptop with 8GB of RAM?

Question 2

Converting an int64 column that only contains values between 0 and 100 to int8 does what?

Question 3

How do you convert a repetitive string column (like "Status") to save memory?

Question 4

To process a file that is larger than your RAM, you should use?

Question 5

When writing cleaned chunks back to a CSV, what parameter ensures you don't overwrite the previous chunk?

Question 6

What parameter prevents to_csv from writing the header row 50 times during chunk processing?

Question 7

How do you load only 3 specific columns from a 50-column CSV to save memory?

Question 8

Why should you avoid for index, row in df.iterrows(): for mathematical operations?

Question 9

Vectorized operations in Pandas are fast because they are implemented in?

Question 10

How do you check the true memory usage of a DataFrame, including strings?

9. Interview Questions

  • Q: You have a 20GB dataset but only 8GB of RAM. How do you calculate the total revenue across all rows?
  • Q: Explain 3 ways to optimize memory usage in a Pandas DataFrame.

10. Summary

Large datasets require performance optimization. Read massive files using chunksize and process them iteratively. Save memory instantly on load by using usecols. Once loaded, downcast numeric columns to int8/int32 and convert text columns with low unique values to category. Finally, never use for loops or apply() for math or simple string operations; always use Pandas' built-in, C-optimized vectorized functions.

11. Next Chapter Recommendation

In Chapter 19: Data Cleaning Interview Preparation, we compile the most common technical questions, Python coding tests, and SQL debugging scenarios you will face in Data Analyst and Data Engineering interviews.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·