Skip to main content
Pandas & NumPy
CHAPTER 15 Beginner

Data Transformation and Manipulation

Updated: May 18, 2026
5 min read

# CHAPTER 15

Data Transformation and Manipulation

1. Chapter Introduction

Raw data rarely comes in the exact form needed for analysis. Transformation — sorting, applying custom logic, mapping values, creating derived features — converts raw data into analysis-ready datasets.

2. Sorting

python
1234567891011121314151617181920212223
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    'Dept': ['Eng', 'Mkt', 'Eng', 'Sales', 'HR'],
    'Salary': [85000, 62000, 91000, 55000, 61000],
    'Experience': [5, 3, 8, 2, 3]
})

# Sort by single column
df_sorted = df.sort_values('Salary', ascending=False)
print(df_sorted)

# Sort by multiple columns
df_multi = df.sort_values(['Dept', 'Salary'], ascending=[True, False])
print(df_multi)

# Sort by index
df_idx = df.sort_index(ascending=False)

# Rank values
df['Salary_Rank'] = df['Salary'].rank(ascending=False, method='min').astype(int)
print(df[['Name', 'Salary', 'Salary_Rank']])

3. apply() — Custom Functions

python
1234567891011121314151617181920
# apply on a Series
df['Salary_K'] = df['Salary'].apply(lambda x: f"${x/1000:.0f}K")

# apply with named function
def categorize_salary(salary):
    if salary >= 85000: return 'High'
    elif salary >= 65000: return 'Mid'
    else: return 'Low'

df['Salary_Cat'] = df['Salary'].apply(categorize_salary)

# apply on a DataFrame row (axis=1)
def performance_score(row):
    exp_bonus = row['Experience'] * 2000
    if row['Dept'] == 'Eng':
        return (row['Salary'] + exp_bonus) * 1.1
    return row['Salary'] + exp_bonus

df['Score'] = df.apply(performance_score, axis=1)
print(df)

4. map() — Value Mapping

python
1234567891011121314
# map — replace values using dict or function (Series only)
dept_full = {'Eng': 'Engineering', 'Mkt': 'Marketing',
             'Sales': 'Sales', 'HR': 'Human Resources'}
df['Department'] = df['Dept'].map(dept_full)

# map with function
df['Salary_Log'] = df['Salary'].map(lambda x: round(x / 1000, 1))

# replace() — map with fallback for unmatched values
df['Dept_Code'] = df['Dept'].replace({'Eng': 'E', 'Mkt': 'M', 'Sales': 'S', 'HR': 'H'})

# Note: map() returns NaN for unmapped values
# replace() keeps original value for unmapped
print(df[['Name', 'Dept', 'Department', 'Dept_Code']])

5. String Transformations

python
123456789101112131415161718192021
df = pd.DataFrame({
    'Product': ['  Laptop PRO ', 'PHONE x12', 'desk chair', 'Monitor 27"'],
    'Code': ['LP-001', 'PH-012', 'DC-003', 'MN-027'],
    'Description': ['High-end laptop; 16GB RAM; SSD', 'Smartphone; 128GB; 5G',
                     'Ergonomic chair; adjustable', 'Wide screen; 4K; HDMI']
})

# String cleaning
df['Product'] = df['Product'].str.strip().str.title()

# Extract first word
df['Category'] = df['Product'].str.split().str[0]

# String operations
df['Code_Upper'] = df['Code'].str.upper()
df['Code_Num'] = df['Code'].str.extract(r'(\d+)').astype(int)

# Split into multiple columns
df[['Desc1', 'Desc2', 'Desc3']] = df['Description'].str.split(';', expand=True)

print(df)

6. Applying Multiple Transformations (Pipeline)

python
123456789101112131415161718192021
# Method chaining for readable transformation pipelines
result = (
    pd.DataFrame({
        'Name':   ['  alice SMITH', 'BOB jones ', 'CAROL white'],
        'Salary': ['$55,000', '$72,000', '$88,000'],
        'Age':    ['28', '35', '42']
    })
    .assign(
        Name=lambda df: df['Name'].str.strip().str.title(),
        Salary=lambda df: df['Salary'].str.replace('[$,]', '', regex=True).astype(float),
        Age=lambda df: df['Age'].astype(int),
        Tax=lambda df: df['Salary'] * 0.25,
        Net=lambda df: df['Salary'] * 0.75,
        Level=lambda df: pd.cut(df['Salary'],
                                bins=[0, 60000, 80000, float('inf')],
                                labels=['Junior', 'Mid', 'Senior'])
    )
    .sort_values('Salary', ascending=False)
    .reset_index(drop=True)
)
print(result)

7. Common Mistakes

  • map() vs apply() on DataFrame: map() works on Series only. For row/column operations on a DataFrame, use apply().
  • apply() with axis=1 is slow: For large DataFrames, vectorized operations are much faster. Use apply(axis=1) only when vectorization isn't possible.

8. MCQs

Question 1

df['col'].apply(lambda x: x*2) applies?

Question 2

df.apply(func, axis=1) applies func to?

Question 3

map({'A': 1, 'B': 2}) for unmapped values returns?

Question 4

replace({'A': 1}) for unmapped values returns?

Question 5

sort_values(['A','B'], ascending=[True, False])?

Question 6

str.split(';', expand=True) returns?

Question 7

str.extract(r'(\d+)') returns?

Question 8

assign() in method chaining is for?

Question 9

df['col'].rank(method='min') for ties?

Question 10

Method chaining advantage?

9. Interview Questions

  • Q: What is the difference between map(), apply(), and applymap() in Pandas?
  • Q: How do you chain multiple transformations in Pandas?

10. Summary

Transformation toolkit: sort_values() for ordering, apply() for custom functions row/column-wise, map() for value replacement, string accessor str.* for text operations, assign() for clean method chaining. Always prefer vectorized operations over apply(axis=1) for performance.

11. Next Chapter Recommendation

In Chapter 16: GroupBy and Aggregation, we master SQL-style GROUP BY operations, aggregations, and pivot-style analytics.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·