Skip to main content
Pandas & NumPy
CHAPTER 24 Beginner

Working with Large Datasets

Updated: May 18, 2026
5 min read

# CHAPTER 24

Working with Large Datasets

1. Chapter Introduction

When datasets exceed available RAM, standard Pandas fails. This chapter covers chunked processing, dtype optimization, efficient data formats, and Dask — enabling analysis of GB-scale datasets on any machine.

2. Memory Assessment

python
12345678910111213141516171819202122232425262728
import pandas as pd
import numpy as np

def memory_report(df, name="DataFrame"):
    """Show memory usage per column and total."""
    mem = df.memory_usage(deep=True)
    mem_mb = mem / 1024**2
    total_mb = mem_mb.sum()
    print(f"\n{name} Memory Report:")
    for col in df.columns:
        pct = mem_mb[col] / total_mb * 100
        print(f"  {col:<20} {mem_mb[col]:.2f} MB ({pct:.1f}%)")
    print(f"  {&#039;TOTAL':<20} {total_mb:.2f} MB")
    return total_mb

# Simulate large DataFrame
n = 1_000_000
df = pd.DataFrame({
    &#039;user_id': np.random.randint(1, 100000, n),
    &#039;product_id': np.random.randint(1, 10000, n),
    &#039;quantity': np.random.randint(1, 20, n),
    &#039;price': np.random.uniform(1, 500, n),
    &#039;category': np.random.choice(['Electronics','Furniture','Clothing','Food'], n),
    &#039;region': np.random.choice(['North','South','East','West'], n),
    &#039;date': pd.date_range('2024-01-01', periods=n, freq='s')
})

original_mb = memory_report(df, "Original (1M rows)")

3. Dtype Optimization

python
12345678910111213141516171819202122232425262728
def optimize_dtypes(df):
    """Automatically reduce memory by optimizing dtypes."""
    df_opt = df.copy()

    for col in df_opt.select_dtypes(include=[&#039;int64']).columns:
        col_min = df_opt[col].min()
        col_max = df_opt[col].max()
        if col_min >= 0:
            if col_max < 255: df_opt[col] = df_opt[col].astype(np.uint8)
            elif col_max < 65535: df_opt[col] = df_opt[col].astype(np.uint16)
            elif col_max < 4294967295: df_opt[col] = df_opt[col].astype(np.uint32)
        else:
            if col_min > -128 and col_max < 127: df_opt[col] = df_opt[col].astype(np.int8)
            elif col_min > -32768 and col_max < 32767: df_opt[col] = df_opt[col].astype(np.int16)
            elif col_min > -2147483648 and col_max < 2147483647: df_opt[col] = df_opt[col].astype(np.int32)

    for col in df_opt.select_dtypes(include=[&#039;float64']).columns:
        df_opt[col] = df_opt[col].astype(np.float32)

    for col in df_opt.select_dtypes(include=[&#039;object']).columns:
        if df_opt[col].nunique() / len(df_opt) < 0.5:  # Less than 50% unique
            df_opt[col] = df_opt[col].astype(&#039;category')

    return df_opt

df_opt = optimize_dtypes(df)
optimized_mb = memory_report(df_opt, "Optimized")
print(f"\nMemory reduction: {(1 - optimized_mb/original_mb)*100:.1f}%")

4. Chunked Processing

python
12345678910111213141516171819202122232425262728
# Process large CSV in chunks without loading entire file
chunk_size = 100_000  # Process 100K rows at a time

# Save our large df to CSV first
df.to_csv(&#039;large_data.csv', index=False)

# Pattern 1: Collect results from each chunk
chunk_results = []

for chunk in pd.read_csv(&#039;large_data.csv', chunksize=chunk_size):
    # Process each chunk
    chunk_summary = chunk.groupby(&#039;category')['price'].agg(['sum', 'count'])
    chunk_results.append(chunk_summary)

# Combine chunk results
final = pd.concat(chunk_results).groupby(level=0).sum()
final[&#039;avg_price'] = final['sum'] / final['count']
print("Category Revenue (chunked processing):")
print(final)

# Pattern 2: Filter and collect
high_value_chunks = []
for chunk in pd.read_csv(&#039;large_data.csv', chunksize=chunk_size):
    high_value = chunk[chunk[&#039;price'] > 400]
    high_value_chunks.append(high_value)

high_value_df = pd.concat(high_value_chunks, ignore_index=True)
print(f"\nHigh-value transactions: {len(high_value_df):,}")

5. Efficient File Formats

python
1234567891011121314151617181920212223242526272829
# Parquet — columnar format, much faster than CSV for analytics
df_sample = df.head(100_000)

# Write and read comparison
import time

# CSV
t = time.time()
df_sample.to_csv(&#039;data.csv', index=False)
csv_write = time.time() - t

t = time.time()
pd.read_csv(&#039;data.csv')
csv_read = time.time() - t

# Parquet
t = time.time()
df_sample.to_parquet(&#039;data.parquet', index=False)
parq_write = time.time() - t

t = time.time()
pd.read_parquet(&#039;data.parquet')
parq_read = time.time() - t

print(f"CSV:     write={csv_write:.2f}s, read={csv_read:.2f}s")
print(f"Parquet: write={parq_write:.2f}s, read={parq_read:.2f}s")

# Parquet with column selection (only loads needed columns)
df_partial = pd.read_parquet(&#039;data.parquet', columns=['category', 'price'])

6. Common Mistakes

  • Loading entire CSV into memory: Always inspect file size before loading. Use nrows=1000 first to check structure.
  • Using CSV for repeated analysis: CSV re-parses from scratch each time. Save to Parquet for 5-10x faster subsequent loads.

7. MCQs

Question 1

pd.read_csv(chunksize=10000) returns?

Question 2

Parquet format advantage over CSV?

Question 3

uint8 dtype stores integers?

Question 4

Categorical dtype saves memory because?

Question 5

df.memory_usage(deep=True) includes?

Question 6

Best practice when loading an unknown large CSV?

Question 7

Parquet column selection columns=['A','B'] loads?

Question 8

float32 vs float64 precision?

Question 9

pd.concat(chunk_results) after chunked processing?

Question 10

Memory reduction through dtype optimization typically achieves?

8. Interview Questions

  • Q: How do you process a 10GB CSV file on a machine with only 8GB RAM?
  • Q: Why is Parquet preferred over CSV for analytical workloads?

9. Summary

Large dataset strategies: dtype optimization (50-80% memory savings), chunked processing via chunksize, Parquet format (5-10x faster I/O), and column selection. The combination enables GB-scale analysis on standard hardware without Dask.

10. Next Chapter Recommendation

In Chapter 25: Pandas with SQL Databases, we integrate Pandas with MySQL and SQLite for database-driven analysis.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·