CHAPTER 05 Beginner

Data Types and Data Formatting

Updated: May 18, 2026

1. Chapter Introduction

A column containing salaries might look like numbers, but if someone typed "$50,000", Pandas sees it as text. You cannot calculate an average or train a machine learning model on text strings. This chapter teaches you how to diagnose incorrect data types, clean formatted strings, and convert them into usable numeric, boolean, or categorical formats.

2. Understanding Pandas Data Types (dtypes)

Pandas uses specific terms for data types:

object or O: String/Text (or mixed types). Newer Pandas versions may report a dedicated string dtype for text columns instead.

int64: Integer (whole numbers)

float64: Floating point (decimals, also used when integers have missing values)

bool: Boolean (True/False)

datetime64: Dates and times

category: Categorical data (limited unique values, like "Small", "Medium", "Large")
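As a minimal sketch of how two of these dtypes show up in practice (the values are made up for illustration), the same whole numbers are stored as int64 until a missing value appears, at which point Pandas falls back to float64:

```python
import pandas as pd

# Whole numbers with no gaps are stored as int64
complete = pd.Series([1, 2, 3])

# The same kind of data with one missing value is upcast to float64,
# because the missing entry is represented as NaN (a float)
with_gap = pd.Series([1, None, 3])

print(complete.dtype)
print(with_gap.dtype)
```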

```python
import pandas as pd
import numpy as np

# Sample data with bad types
df = pd.DataFrame({
    'id': ['1', '2', '3'],                   # Should be int
    'price': ['$1,200.50', 'Free', '$300'],  # Should be float
    'is_active': ['Y', 'N', 'Yes'],          # Should be bool
    'category': ['A', 'B', 'A']              # Should be category
})

print("=== INITIAL DTYPES ===")
print(df.dtypes)
# Everything is 'object' (string) because of the formatting!
```

3. Converting to Numeric Data

The most common task is removing currency symbols and commas so Pandas can convert text to numbers.

```python
# 1. Clean the string (remove $ and commas)
# Using regex to replace everything that isn't a digit, minus, or period
df['price_clean'] = df['price'].str.replace(r'[^\d.-]', '', regex=True)

# 2. Convert to numeric
# errors='coerce' forces invalid parsing (like "Free") to become NaN
df['price_numeric'] = pd.to_numeric(df['price_clean'], errors='coerce')

print("\n=== AFTER NUMERIC CONVERSION ===")
print(df[['price', 'price_clean', 'price_numeric']])

# Converting simple strings to integers
df['id'] = df['id'].astype(int)
```
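Because errors='coerce' silently turns unparseable values into NaN, it can help to check which original values were lost. The sketch below is one possible way to do that; the variable names (prices, cleaned, numeric, failed) are illustrative and not part of the chapter's DataFrame.

```python
import pandas as pd

# Illustrative prices; 'Free' cannot be read as a number
prices = pd.Series(['$1,200.50', 'Free', '$300'])

# Same two steps as the conversion above: strip formatting, then convert
cleaned = prices.str.replace(r'[^\d.-]', '', regex=True)
numeric = pd.to_numeric(cleaned, errors='coerce')

# Rows that had a value originally but ended up as NaN after conversion
failed = prices[numeric.isna() & prices.notna()]

print(failed.tolist())
```

Reviewing the failed values before dropping or filling them makes it easier to decide whether "Free" should become 0, stay missing, or be handled separately.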

4. Handling Boolean Values

Databases and users represent True/False in many ways (1/0, Y/N, Yes/No, T/F).

```python
# Standardizing boolean indicators
bool_mapping = {
    'Y': True,
    'Yes': True,
    'N': False,
    'No': False,
    '1': True,
    '0': False
}

# Use map() to apply the dictionary
# Values that are not keys in the dictionary become NaN
df['is_active_bool'] = df['is_active'].map(bool_mapping)

print("\n=== AFTER BOOLEAN CONVERSION ===")
print(df[['is_active', 'is_active_bool']])
```
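Real input often varies in case and whitespace, and may contain values the dictionary does not cover. The following sketch, using made-up input and a hypothetical lowercase mapping, normalizes the text first and then stores the result in the nullable 'boolean' dtype so that unmapped values stay missing instead of being guessed:

```python
import pandas as pd

# Made-up input with inconsistent case, stray spaces, and an unknown value
raw = pd.Series(['Y', ' yes ', 'N', 'no', '1', 'maybe', None])

# Normalize before mapping so one lowercase dictionary covers every variant
normalized = raw.str.strip().str.lower()

# Hypothetical mapping for this example; extend it to match your own data
mapping = {'y': True, 'yes': True, 'n': False, 'no': False,
           '1': True, '0': False, 't': True, 'f': False}

# Unmapped values ('maybe') and missing values become <NA> in the 'boolean' dtype
flags = normalized.map(mapping).astype('boolean')

print(flags)
```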

5. Categorical Data

If a string column has only a few distinct values that repeat many times (e.g., Status, Country, Grade), converting it to category usually saves memory and can speed up operations such as grouping and sorting.

```python
# Check memory usage
print("\nMemory before category:", df['category'].memory_usage(deep=True))

# Convert to category
df['category'] = df['category'].astype('category')

print("Memory after category: ", df['category'].memory_usage(deep=True))
# On a tiny sample like this one the difference may be small or even negative.
# The larger the dataset, the bigger the memory savings!
```
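To see the effect on a column with many repeated values, here is a small sketch with synthetic data (300,000 rows drawn from three labels). Exact byte counts depend on the Pandas version and platform, so none are shown here.

```python
import pandas as pd

# 300,000 rows built from only three distinct labels (made-up data)
sizes = pd.Series(['Small', 'Medium', 'Large'] * 100_000)

# Memory used when every row stores its own text value
as_text = sizes.memory_usage(deep=True)

# Memory used when rows store small integer codes pointing at three categories
as_category = sizes.astype('category').memory_usage(deep=True)

print(f"text:     {as_text:,} bytes")
print(f"category: {as_category:,} bytes")
```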

6. The astype() Function vs Helper Functions

Use astype() when the data is already clean but stored as the wrong type (e.g., converting integer 1 to float 1.0).

Use pd.to_numeric() or pd.to_datetime() when the data needs smarter parsing or may contain invalid values that you want to handle with errors='coerce'.

```python
# astype() will FAIL if it hits bad data:
# df['price'].astype(float) -> ValueError: could not convert string to float: '$1,200.50'

# to_numeric with coerce handles it gracefully:
# pd.to_numeric('bad_data', errors='coerce') -> NaN
```
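The same errors='coerce' option applies to pd.to_datetime(). A minimal sketch with made-up date strings, one of which is an impossible calendar date and one of which is not a date at all:

```python
import pandas as pd

# Made-up input: one valid date, one impossible date, one non-date string
dates = pd.Series(['2024-01-15', '2024-02-30', 'not a date'])

# Invalid entries become NaT (the datetime equivalent of NaN) instead of raising
parsed = pd.to_datetime(dates, errors='coerce')

print(parsed)
print(parsed.isna().tolist())
```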

7. Common Mistakes

Using astype(int) when there are missing values (NaN): In Pandas, NaN is technically a float. You cannot convert a column with NaNs to int64. You must either fill the NaNs first, leave it as float64, or use the newer nullable integer type astype('Int64').
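The three options for a column with NaNs can be sketched as follows; the ratings Series is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value; stored as float64 because of the NaN
ratings = pd.Series([1.0, np.nan, 3.0])

# Option 0 (the mistake): a plain integer cast raises because NaN has no int form
try:
    ratings.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Option 1: fill the gaps first, then cast to a regular integer type
filled = ratings.fillna(0).astype(int)

# Option 2: keep the gap by using the nullable integer type (capital "I")
nullable = ratings.astype('Int64')

print(cast_failed)
print(filled.tolist())
print(nullable)
```

Filling with 0 changes the meaning of the data, so the nullable Int64 type is often the safer choice when "missing" and "zero" are different things.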

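Forgetting regex=True in str.replace(): Since Pandas 2.0, str.replace() treats the pattern as a literal string by default (regex=False). df['price'].str.replace('$', '') therefore removes only the literal dollar sign, and a pattern such as r'[^\d.-]' is matched character for character instead of as a regular expression unless you pass regex=True.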
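Passing regex explicitly avoids depending on the default of the installed version. A short sketch with made-up prices, showing a literal replacement next to a pattern replacement:

```python
import pandas as pd

prices = pd.Series(['$1,200.50', '$300'])

# Literal replacement: only the '$' character is removed, commas remain
literal = prices.str.replace('$', '', regex=False)

# Pattern replacement: the character class removes both '$' and ','
pattern = prices.str.replace(r'[$,]', '', regex=True)

print(literal.tolist())
print(pattern.tolist())
```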

8. MCQs

Question 1

What data type does Pandas use for text/strings?

Question 2

You have a column of prices like "$1,000". Why does `df['price'].astype(float)` fail?

Question 3

What does `errors='coerce'` do in `pd.to_numeric()`?

Question 4

Which method is best for standardizing 'Y'/'N'/'Yes'/'No' to True/False?

Question 5

Why convert text columns to the `category` data type?

Question 6

What happens if you try to `astype(int)` on a column containing `NaN`?

Question 7

To remove all non-numeric characters from a string column, you should use?

Question 8

What data type is `np.nan` internally?

Question 9

Which function safely parses dates in Pandas?

Question 10

How do you check the data types of all columns in a DataFrame?

9. Interview Questions

Q: How do you handle a column that contains both numbers as integers and numbers as text strings (e.g., 50 and "50")?

Q: Why would you use pd.to_numeric(errors='coerce') instead of astype(float)?
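One possible way to approach the first question, sketched with made-up values: inspect which Python types are actually present in the column, then let pd.to_numeric() bring them to a single numeric dtype. This is an illustration, not the only acceptable answer.

```python
import pandas as pd

# Made-up column mixing real numbers with numbers stored as text
mixed = pd.Series([50, "50", 12.5, "abc"])

# Count how many values of each Python type the column holds
type_counts = mixed.map(lambda value: type(value).__name__).value_counts()

# Convert everything to one numeric dtype; unparseable text becomes NaN
unified = pd.to_numeric(mixed, errors='coerce')

print(type_counts)
print(unified)
```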

10. Summary

Incorrect data types prevent mathematical operations and machine learning. Use str.replace() to clean strings, then pd.to_numeric() to safely cast them to numbers. Use dictionaries with map() to standardize booleans. Use astype('category') to save memory on repetitive text columns. Always check df.dtypes before and after conversion.

11. Next Chapter Recommendation

In Chapter 6: Handling Missing Values, we tackle one of the most common and complex data cleaning tasks: what to do when data is simply not there.
