Data Cleaning
CHAPTER 05 Beginner

Data Types and Data Formatting

Updated: May 18, 2026
5 min read

1. Chapter Introduction

A column containing salaries might look like numbers, but if someone typed "$50,000", Pandas sees it as text. You cannot calculate an average or train a machine learning model on text strings. This chapter teaches you how to diagnose incorrect data types, clean formatted strings, and convert them into usable numeric, boolean, or categorical formats.

2. Understanding Pandas Data Types (dtypes)

Pandas uses specific terms for data types:

  • object or O: String/Text (or mixed types). Newer pandas releases (3.0 and later) may instead report a dedicated str dtype for text columns.
  • int64: Integer (whole numbers)
  • float64: Floating point (decimals, also used when integers have missing values)
  • bool: Boolean (True/False)
  • datetime64: Dates and times
  • category: Categorical data (limited unique values, like "Small", "Medium", "Large")

python
import pandas as pd
import numpy as np

# Sample data with bad types
df = pd.DataFrame({
    'id': ['1', '2', '3'],                   # Should be int
    'price': ['$1,200.50', 'Free', '$300'],  # Should be float
    'is_active': ['Y', 'N', 'Yes'],          # Should be bool
    'category': ['A', 'B', 'A']              # Should be category
})

print("=== INITIAL DTYPES ===")
print(df.dtypes)
# Every column is 'object' (text) because each value was entered as a string
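
Beyond printing df.dtypes, it can help to measure how much of each text column would already parse as a number. The sketch below is one possible way to do that; the extra qty column and the names numeric_cols, text_cols, and parse_rate are illustrative assumptions, not part of the chapter's dataset.

```python
import pandas as pd

# Hypothetical sample: the chapter's 'id' and 'price' columns plus an
# assumed 'qty' column that is already numeric
df = pd.DataFrame({
    'id': ['1', '2', '3'],
    'price': ['$1,200.50', 'Free', '$300'],
    'qty': [2, 5, 1],
})

# Columns pandas already treats as numeric
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# Everything else is a candidate for cleaning
text_cols = [c for c in df.columns if c not in numeric_cols]

# Fraction of values in each text column that parse as numbers without cleaning
parse_rate = {
    c: pd.to_numeric(df[c], errors='coerce').notna().mean()
    for c in text_cols
}

print(numeric_cols)
print(parse_rate)
```

A rate of 1.0 suggests a simple cast is enough, while a rate near 0 points to formatting characters that need cleaning first.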

3. Converting to Numeric Data

The most common task is removing currency symbols and commas so Pandas can convert text to numbers.

python
# 1. Clean the string (remove $ and commas)
# Using regex to replace everything that isn't a digit, minus, or period
df['price_clean'] = df['price'].str.replace(r'[^\d.-]', '', regex=True)

# 2. Convert to numeric
# errors='coerce' turns anything that cannot be parsed into NaN
# (here "Free", which the regex above reduced to an empty string)
df['price_numeric'] = pd.to_numeric(df['price_clean'], errors='coerce')

print("\n=== AFTER NUMERIC CONVERSION ===")
print(df[['price', 'price_clean', 'price_numeric']])

# Converting simple strings to integers
df['id'] = df['id'].astype(int)
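
Because errors='coerce' silently replaces bad values with NaN, it is worth checking which original values were lost. The following is a minimal sketch of that check, assuming a standalone Series named prices with one genuinely missing entry; the variable names are illustrative.

```python
import pandas as pd

# Assumed sample: three price strings and one genuinely missing value
prices = pd.Series(['$1,200.50', 'Free', '$300', None], name='price')

# Same cleaning steps as in the chapter
cleaned = prices.str.replace(r'[^\d.-]', '', regex=True)
numeric = pd.to_numeric(cleaned, errors='coerce')

# Values that existed in the raw data but became NaN during conversion
failed = prices[prices.notna() & numeric.isna()]

print(failed.tolist())
```

Reviewing failed before dropping or filling NaNs helps separate values that were never recorded from values that were recorded in an unexpected format.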

4. Handling Boolean Values

Databases and users represent True/False in many ways (1/0, Y/N, Yes/No, T/F).

python
# Standardizing boolean indicators
bool_mapping = {
    'Y': True,
    'Yes': True,
    'N': False,
    'No': False,
    '1': True,
    '0': False
}

# Use map() to apply the dictionary; values missing from the dictionary become NaN
df['is_active_bool'] = df['is_active'].map(bool_mapping)

print("\n=== AFTER BOOLEAN CONVERSION ===")
print(df[['is_active', 'is_active_bool']])
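
Real-world flags often differ in case or carry stray spaces, which a dictionary lookup will not match. One possible refinement, sketched below under the assumption that unknown labels should stay missing, is to normalize the text first and store the result in the nullable 'boolean' dtype. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Assumed messy input: mixed case, stray spaces, an unknown label, a missing value
raw = pd.Series([' y', 'NO', 'Yes ', 'maybe', None])

# Lowercase keys only, because the text is normalized before the lookup
bool_mapping = {
    'y': True, 'yes': True, '1': True,
    'n': False, 'no': False, '0': False,
}

# Strip whitespace and lowercase, then map; unmapped values become missing
normalized = raw.str.strip().str.lower()
flags = normalized.map(bool_mapping).astype('boolean')

print(flags)
```

The nullable 'boolean' dtype keeps True, False, and missing in one column, whereas a plain bool column cannot hold missing values.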

5. Categorical Data

If a string column has only a few distinct values that repeat many times (e.g., Status, Country, Grade), converting it to category usually saves memory and can speed up operations such as grouping.

python
# Check memory usage
print("\nMemory before category:", df['category'].memory_usage(deep=True))

# Convert to category
df['category'] = df['category'].astype('category')

print("Memory after category: ", df['category'].memory_usage(deep=True))
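# On this 3-row sample the two numbers may barely differ (or even grow) because of fixed overhead.
# The savings appear on large columns with many repeated values.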
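
To see the effect on a column large enough for it to matter, the sketch below builds an assumed 30,000-row Series of size labels and compares memory before and after. It also shows an ordered categorical, which lets pandas compare labels by rank rather than alphabetically. The data and names are illustrative.

```python
import pandas as pd

# Assumed larger sample: three labels repeated 10,000 times each
sizes = pd.Series(['Small', 'Medium', 'Large'] * 10_000)

as_category = sizes.astype('category')
before = sizes.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print("bytes before:", before, "bytes after:", after)

# An ordered categorical records that Small < Medium < Large
size_type = pd.CategoricalDtype(['Small', 'Medium', 'Large'], ordered=True)
ordered = sizes.astype(size_type)

print(ordered.min(), ordered.max())
print((ordered > 'Small').sum())  # rows ranked above 'Small'
```

Exact byte counts depend on the pandas version and string storage, so the comparison between the two numbers is what matters.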

6. The astype() Function vs Helper Functions

  • Use astype() when the data is already clean but has the wrong type (e.g., converting integer 1 to float 1.0).
  • Use pd.to_numeric() or pd.to_datetime() when the data needs smart parsing or you need to handle errors (using errors='coerce').

python
# Astype will FAIL if it hits bad data:
# df['price'].astype(float) -> ValueError: could not convert string to float: '$1,200.50'

# to_numeric with coerce handles it gracefully:
# pd.to_numeric('bad_data', errors='coerce') -> NaN
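
The same pattern applies to dates. As a rough sketch, assuming ISO-style date strings in a Series named dates, pd.to_datetime() with errors='coerce' turns unparseable or impossible dates into NaT instead of raising an error.

```python
import pandas as pd

# Assumed sample: one valid date, one impossible date, one placeholder string
dates = pd.Series(['2024-01-15', '2024-02-30', 'unknown'])

# An explicit format keeps parsing strict; bad values become NaT
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

print(parsed)
print("unparsed rows:", parsed.isna().sum())
```

February 30 does not exist, so it is coerced to NaT along with the placeholder text.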

7. Common Mistakes

  • Using astype(int) when there are missing values (NaN): In Pandas, NaN is technically a float. You cannot convert a column with NaNs to int64. You must either fill the NaNs first, leave it as float64, or use the newer nullable integer type astype('Int64').
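  • Forgetting regex=True in str.replace(): In pandas 2.0 and later, str.replace() defaults to regex=False, so the pattern is matched as literal text. df['price'].str.replace('$', '') therefore removes only the dollar sign, and a pattern such as r'[^\d.-]' matches nothing unless you pass regex=True. Older versions defaulted to regex=True, so passing regex explicitly keeps the behavior the same across versions.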
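
Both mistakes can be reproduced in a few lines. The sketch below uses made-up sample Series (counts and prices) and passes regex explicitly so the result does not depend on the installed pandas version.

```python
import numpy as np
import pandas as pd

# Mistake 1: casting a column that contains NaN to a plain integer type
counts = pd.Series([1.0, 2.0, np.nan])

failed_cast = False
try:
    counts.astype(int)
except ValueError as exc:
    failed_cast = True
    print("astype(int) failed:", exc)

# The nullable integer type keeps the missing value as <NA>
nullable = counts.astype('Int64')
print(nullable)

# Mistake 2: relying on the default value of regex in str.replace()
prices = pd.Series(['$1,200.50', '$300'])

literal = prices.str.replace('$', '', regex=False)        # removes only the "$"
pattern = prices.str.replace(r'[$,]', '', regex=True)     # removes "$" and ","
unchanged = prices.str.replace(r'[$,]', '', regex=False)  # literal text, no match

print(literal.tolist(), pattern.tolist(), unchanged.tolist())
```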

8. MCQs

Question 1

What data type does Pandas use for text/strings?

Question 2

You have a column of prices like "$1,000". Why does df['price'].astype(float) fail?

Question 3

What does errors='coerce' do in pd.to_numeric()?

Question 4

Which method is best for standardizing 'Y'/'N'/'Yes'/'No' to True/False?

Question 5

Why convert text columns to the category data type?

Question 6

What happens if you call astype(int) on a column containing NaN?

Question 7

To remove all non-numeric characters from a string column, which method should you use?

Question 8

What data type is np.nan internally?

Question 9

Which function safely parses dates in Pandas?

Question 10

How do you check the data types of all columns in a DataFrame?

9. Interview Questions

  • Q: How do you handle a column that contains both integers and numbers stored as text strings (e.g., 50 and "50")?
  • Q: Why would you use pd.to_numeric(errors='coerce') instead of astype(float)?
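
For the first question, one possible approach (a sketch, not the only acceptable answer) is to inspect which Python types the column actually holds, convert everything to text, trim it, and then parse with errors='coerce'. The sample Series mixed is an assumption made for illustration.

```python
import pandas as pd

# Assumed sample: integers, numeric strings, a padded string, a placeholder, a float
mixed = pd.Series([50, '50', ' 75 ', 'n/a', 12.5])

# Step 1: see which Python types are hiding in the column
type_counts = mixed.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Step 2: put every value in one representation (text), then trim whitespace
as_text = mixed.astype(str).str.strip()

# Step 3: parse; anything that is not a number becomes NaN
numeric = pd.to_numeric(as_text, errors='coerce')
print(numeric.tolist())
```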

10. Summary

Incorrect data types prevent mathematical operations and machine learning. Use str.replace() to clean strings, then pd.to_numeric() to safely cast them to numbers. Use dictionaries with map() to standardize booleans. Use astype('category') to save memory on repetitive text columns. Always check df.dtypes before and after conversion.

11. Next Chapter Recommendation

In Chapter 6: Handling Missing Values, we tackle one of the most common and complex data cleaning tasks: what to do when data is simply not there.
