CHAPTER 15 Beginner

Data Transformation and Manipulation

Updated: May 18, 2026

5 min read

# CHAPTER 15

Data Transformation and Manipulation

1. Chapter Introduction

Raw data rarely comes in the exact form needed for analysis. Transformation — sorting, applying custom logic, mapping values, creating derived features — converts raw data into analysis-ready datasets.

2. Sorting

python

1234567891011121314151617181920212223

import pandas as pd

df = pd.DataFrame({
    &#039;Name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    &#039;Dept': ['Eng', 'Mkt', 'Eng', 'Sales', 'HR'],
    &#039;Salary': [85000, 62000, 91000, 55000, 61000],
    &#039;Experience': [5, 3, 8, 2, 3]
})

# Sort by single column
df_sorted = df.sort_values(&#039;Salary', ascending=False)
print(df_sorted)

# Sort by multiple columns
df_multi = df.sort_values([&#039;Dept', 'Salary'], ascending=[True, False])
print(df_multi)

# Sort by index
df_idx = df.sort_index(ascending=False)

# Rank values
df[&#039;Salary_Rank'] = df['Salary'].rank(ascending=False, method='min').astype(int)
print(df[[&#039;Name', 'Salary', 'Salary_Rank']])

3. apply() — Custom Functions

python

1234567891011121314151617181920

# apply on a Series
df[&#039;Salary_K'] = df['Salary'].apply(lambda x: f"${x/1000:.0f}K")

# apply with named function
def categorize_salary(salary):
    if salary >= 85000: return &#039;High'
    elif salary >= 65000: return &#039;Mid'
    else: return &#039;Low'

df[&#039;Salary_Cat'] = df['Salary'].apply(categorize_salary)

# apply on a DataFrame row (axis=1)
def performance_score(row):
    exp_bonus = row[&#039;Experience'] * 2000
    if row[&#039;Dept'] == 'Eng':
        return (row[&#039;Salary'] + exp_bonus) * 1.1
    return row[&#039;Salary'] + exp_bonus

df[&#039;Score'] = df.apply(performance_score, axis=1)
print(df)

4. map() — Value Mapping

python

1234567891011121314

# map — replace values using dict or function (Series only)
dept_full = {&#039;Eng': 'Engineering', 'Mkt': 'Marketing',
             &#039;Sales': 'Sales', 'HR': 'Human Resources'}
df[&#039;Department'] = df['Dept'].map(dept_full)

# map with function
df[&#039;Salary_Log'] = df['Salary'].map(lambda x: round(x / 1000, 1))

# replace() — map with fallback for unmatched values
df[&#039;Dept_Code'] = df['Dept'].replace({'Eng': 'E', 'Mkt': 'M', 'Sales': 'S', 'HR': 'H'})

# Note: map() returns NaN for unmapped values
# replace() keeps original value for unmapped
print(df[[&#039;Name', 'Dept', 'Department', 'Dept_Code']])

5. String Transformations

python

123456789101112131415161718192021

df = pd.DataFrame({
    &#039;Product': ['  Laptop PRO ', 'PHONE x12', 'desk chair', 'Monitor 27"'],
    &#039;Code': ['LP-001', 'PH-012', 'DC-003', 'MN-027'],
    &#039;Description': ['High-end laptop; 16GB RAM; SSD', 'Smartphone; 128GB; 5G',
                     &#039;Ergonomic chair; adjustable', 'Wide screen; 4K; HDMI']
})

# String cleaning
df[&#039;Product'] = df['Product'].str.strip().str.title()

# Extract first word
df[&#039;Category'] = df['Product'].str.split().str[0]

# String operations
df[&#039;Code_Upper'] = df['Code'].str.upper()
df[&#039;Code_Num'] = df['Code'].str.extract(r'(\d+)').astype(int)

# Split into multiple columns
df[[&#039;Desc1', 'Desc2', 'Desc3']] = df['Description'].str.split(';', expand=True)

print(df)

6. Applying Multiple Transformations (Pipeline)

python

123456789101112131415161718192021

# Method chaining for readable transformation pipelines
result = (
    pd.DataFrame({
        &#039;Name':   ['  alice SMITH', 'BOB jones ', 'CAROL white'],
        &#039;Salary': ['$55,000', '$72,000', '$88,000'],
        &#039;Age':    ['28', '35', '42']
    })
    .assign(
        Name=lambda df: df[&#039;Name'].str.strip().str.title(),
        Salary=lambda df: df[&#039;Salary'].str.replace('[$,]', '', regex=True).astype(float),
        Age=lambda df: df[&#039;Age'].astype(int),
        Tax=lambda df: df[&#039;Salary'] * 0.25,
        Net=lambda df: df[&#039;Salary'] * 0.75,
        Level=lambda df: pd.cut(df[&#039;Salary'],
                                bins=[0, 60000, 80000, float(&#039;inf')],
                                labels=[&#039;Junior', 'Mid', 'Senior'])
    )
    .sort_values(&#039;Salary', ascending=False)
    .reset_index(drop=True)
)
print(result)

7. Common Mistakes

map() vs apply() on DataFrame: map() works on Series only. For row/column operations on a DataFrame, use apply().

apply() with axis=1 is slow: For large DataFrames, vectorized operations are much faster. Use apply(axis=1) only when vectorization isn't possible.

8. MCQs

Question 1

`df['col'].apply(lambda x: x*2)` applies?

Question 2

`df.apply(func, axis=1)` applies func to?

Question 3

`map({'A': 1, 'B': 2})` for unmapped values returns?

Question 4

`replace({'A': 1})` for unmapped values returns?

Question 5

`sort_values(['A','B'], ascending=[True, False])`?

Question 6

`str.split(';', expand=True)` returns?

Question 7

`str.extract(r'(\d+)')` returns?

Question 8

`assign()` in method chaining is for?

Question 9

`df['col'].rank(method='min')` for ties?

Question 10

Method chaining advantage?

9. Interview Questions

Q: What is the difference between map(), apply(), and applymap() in Pandas?

Q: How do you chain multiple transformations in Pandas?

10. Summary

Transformation toolkit: sort_values() for ordering, apply() for custom functions row/column-wise, map() for value replacement, string accessor str.* for text operations, assign() for clean method chaining. Always prefer vectorized operations over apply(axis=1) for performance.

11. Next Chapter Recommendation

In Chapter 16: GroupBy and Aggregation, we master SQL-style GROUP BY operations, aggregations, and pivot-style analytics.

Browse All 21+ Subject Areas

Quick Links

Visual Algorithm Labs

Frontend Dev

The Future of Web Architecture in 2026

Practice Quizzes

Data Transformation and Manipulation #

1. Chapter Introduction #

2. Sorting #

3. apply() — Custom Functions #

4. map() — Value Mapping #

5. String Transformations #

6. Applying Multiple Transformations (Pipeline) #

7. Common Mistakes #

8. MCQs #

df['col'].apply(lambda x: x*2) applies?

df.apply(func, axis=1) applies func to?

map({'A': 1, 'B': 2}) for unmapped values returns?

replace({'A': 1}) for unmapped values returns?

sort_values(['A','B'], ascending=[True, False])?

str.split(';', expand=True) returns?

str.extract(r'(\d+)') returns?

assign() in method chaining is for?

df['col'].rank(method='min') for ties?

Method chaining advantage?

9. Interview Questions #

10. Summary #

11. Next Chapter Recommendation #

Finish this Chapter

Discussion

Explore More

📖 Related Tutorials 5

Send Feedback / Bug

Feedback Submitted!

Data Transformation and Manipulation

1. Chapter Introduction

2. Sorting

3. apply() — Custom Functions

4. map() — Value Mapping

5. String Transformations

6. Applying Multiple Transformations (Pipeline)

7. Common Mistakes

8. MCQs

`df['col'].apply(lambda x: x*2)` applies?

`df.apply(func, axis=1)` applies func to?

`map({'A': 1, 'B': 2})` for unmapped values returns?

`replace({'A': 1})` for unmapped values returns?

`sort_values(['A','B'], ascending=[True, False])`?

`str.split(';', expand=True)` returns?

`str.extract(r'(\d+)')` returns?

`assign()` in method chaining is for?

`df['col'].rank(method='min')` for ties?

9. Interview Questions

10. Summary

11. Next Chapter Recommendation