# CHAPTER 5
Data Types and Data Formatting
1. Chapter Introduction
A column containing salaries might look like numbers, but if someone typed "$50,000", Pandas sees it as text. You cannot calculate an average or train a machine learning model on text strings. This chapter teaches you how to diagnose incorrect data types, clean formatted strings, and convert them into usable numeric, boolean, or categorical formats.
2. Understanding Pandas Data Types (dtypes)
Pandas uses specific terms for data types:
- object or O: String/Text (or mixed types)
- int64: Integer (whole numbers)
- float64: Floating point (decimals, also used when integers have missing values)
- bool: Boolean (True/False)
- datetime64: Dates and times
- category: Categorical data (limited unique values, like "Small", "Medium", "Large")
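As a quick illustration of the types listed above, the sketch below builds a small DataFrame and inspects its dtypes. The column names and values are hypothetical, chosen only to show one column per kind of data; the exact label printed for text columns can vary with the installed Pandas version.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],                  # text
    "age": [31, 45, 28],                            # whole numbers
    "score": [88.5, None, 92.0],                    # decimals with a missing value
    "active": [True, False, True],                  # booleans
    "salary": ["$50,000", "$62,500", "$48,000"],    # looks numeric, stored as text
})

# df.dtypes lists one dtype per column.
print(df.dtypes)

# The salary column is not numeric, so arithmetic such as mean() is not meaningful yet.
print(pd.api.types.is_numeric_dtype(df["salary"]))  # False
```

Checking df.dtypes right after loading data is usually the fastest way to spot a numeric-looking column that was read as text.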
3. Converting to Numeric Data
The most common task is removing currency symbols and commas so Pandas can convert text to numbers.
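A minimal sketch of that cleaning step, assuming a hypothetical salary column in which one entry cannot be parsed at all. The symbols are removed as literal strings, and pd.to_numeric() with errors='coerce' turns anything that still fails to parse into NaN.

```python
import pandas as pd

# Hypothetical data: two formatted amounts and one unparseable entry.
df = pd.DataFrame({"salary": ["$50,000", "$62,500", "unknown"]})

# Strip the currency symbol and thousands separator as literal characters.
cleaned = (
    df["salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# Convert to numbers; values that cannot be parsed become NaN instead of raising.
df["salary_num"] = pd.to_numeric(cleaned, errors="coerce")

print(df["salary_num"].tolist())  # [50000.0, 62500.0, nan]
print(df["salary_num"].mean())    # 56250.0 (NaN is skipped)
```

The result is float64 rather than int64 because the column now contains a missing value.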
4. Handling Boolean Values
Databases and users represent True/False in many ways (1/0, Y/N, Yes/No, T/F).
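One way to standardize these variants is a dictionary passed to map(), sketched below. The mapping and the sample values are assumptions for illustration; extend the dictionary to match whatever spellings appear in your own data. Values that are not in the dictionary become missing rather than being guessed.

```python
import pandas as pd

# Hypothetical raw values in mixed spellings, including one that is not a valid flag.
raw = pd.Series(["Y", "no", "Yes", "0", "T", "maybe"])

# Illustrative mapping from lowercase spellings to real booleans.
bool_map = {
    "y": True, "yes": True, "t": True, "true": True, "1": True,
    "n": False, "no": False, "f": False, "false": False, "0": False,
}

# Normalize whitespace and case first, then map; unmapped values become NaN.
normalized = raw.str.strip().str.lower().map(bool_map)

# The nullable "boolean" dtype keeps True/False alongside missing values.
flags = normalized.astype("boolean")
print(flags.tolist())  # [True, False, True, False, True, <NA>]
```

Inspecting the missing values after mapping shows which spellings the dictionary did not cover.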
5. Categorical Data
If a string column has a few repeated values (e.g., Status, Country, Grade), converting it to category saves memory and speeds up operations.
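The memory effect can be checked directly, as in the sketch below. The status values and the row count are made up for illustration; the actual saving depends on how many distinct values there are and how long the strings are.

```python
import pandas as pd

# Hypothetical column: 30,000 rows but only three distinct values.
status = pd.Series(["Active", "Inactive", "Pending"] * 10_000)
as_category = status.astype("category")

# deep=True counts the memory used by the strings themselves.
print("text bytes:    ", status.memory_usage(deep=True))
print("category bytes:", as_category.memory_usage(deep=True))

# A categorical stores each distinct value once, plus a small integer code per row.
print(list(as_category.cat.categories))  # ['Active', 'Inactive', 'Pending']
```

The benefit shrinks as the number of distinct values approaches the number of rows, so columns such as names or free-text comments are poor candidates.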
6. The astype() Function vs Helper Functions
- Use astype() when the data is perfectly clean but the wrong type (e.g., converting integer 1 to float 1.0).
- Use pd.to_numeric(), pd.to_datetime() when the data needs smart parsing or you need to handle errors (using errors='coerce').
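The difference between the two approaches shows up as soon as one value is malformed. The sketch below uses made-up values to contrast them: astype() raises on the bad entry, while the helper functions with errors='coerce' mark it as missing and carry on.

```python
import pandas as pd

# Clean data of the wrong type: astype() is enough.
clean = pd.Series([1, 2, 3])
as_float = clean.astype(float)
print(as_float.tolist())  # [1.0, 2.0, 3.0]

# Messy data (hypothetical): one entry is not a number.
messy = pd.Series(["10", "20", "n/a"])
try:
    messy.astype(float)
    astype_failed = False
except ValueError as exc:
    astype_failed = True
    print("astype failed:", exc)

# pd.to_numeric with errors='coerce' converts what it can and uses NaN for the rest.
coerced = pd.to_numeric(messy, errors="coerce")
print(coerced.tolist())  # [10.0, 20.0, nan]

# pd.to_datetime follows the same pattern, producing NaT for unparseable dates.
dates = pd.to_datetime(pd.Series(["2024-01-15", "not a date"]), errors="coerce")
print(dates.isna().tolist())  # [False, True]
```

Coercion hides bad values rather than fixing them, so it is worth counting the resulting NaN or NaT entries before moving on.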
7. Common Mistakes
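- Using astype(int) when there are missing values (NaN): In Pandas, NaN is a floating-point value, so a column that contains NaN cannot be converted to int64 and astype(int) raises an error. You must either fill the NaNs first, leave the column as float64, or use the nullable integer type with astype('Int64').
- Relying on the default regex setting in str.replace(): Since Pandas 2.0 the pattern is treated as a literal string by default (regex=False), while older versions defaulted to regex matching. df['price'].str.replace('$', '') can therefore behave differently across versions, and a pattern such as '[$,]' only works as a regex when you pass regex=True. State regex=True or regex=False explicitly.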
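Both pitfalls can be reproduced in a few lines. The sketch below uses made-up values: it shows astype(int) failing on NaN, the nullable 'Int64' type as one alternative, and str.replace() called with the regex argument stated explicitly in both modes.

```python
import numpy as np
import pandas as pd

# Pitfall 1: NaN blocks conversion to a plain integer dtype.
counts = pd.Series([1.0, 2.0, np.nan])
try:
    counts.astype(int)
    int_cast_failed = False
except ValueError as exc:
    int_cast_failed = True
    print("astype(int) failed:", exc)

# The nullable integer dtype (capital "I") keeps whole numbers and a missing marker.
nullable = counts.astype("Int64")
print(nullable.tolist())  # [1, 2, <NA>]

# Pitfall 2: be explicit about literal versus regex replacement.
prices = pd.Series(["$1,000", "$250"])

# Literal mode: "$" is just a character to remove.
literal = prices.str.replace("$", "", regex=False)
print(literal.tolist())  # ['1,000', '250']

# Regex mode: a character class removes "$" and "," in a single pass.
pattern = prices.str.replace(r"[$,]", "", regex=True)
print(pattern.tolist())  # ['1000', '250']
```

Outside a character class, "$" is a regex anchor for the end of the string, which is why it needs escaping (r"\$") whenever regex=True is used on its own.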
8. MCQs
What data type does Pandas use for text/strings?
You have a column of prices like "$1,000". Why does df['price'].astype(float) fail?
What does errors='coerce' do in pd.to_numeric()?
Which method is best for standardizing 'Y'/'N'/'Yes'/'No' to True/False?
Why convert text columns to the category data type?
What happens if you call astype(int) on a column containing NaN?
Which method should you use to remove all non-numeric characters from a string column?
What data type is np.nan internally?
Which function safely parses dates in Pandas?
How do you check the data types of all columns in a DataFrame?
9. Interview Questions
- Q: How do you handle a column that contains both numbers as integers and numbers as text strings (e.g., 50 and "50")?
- Q: Why would you use pd.to_numeric(errors='coerce') instead of astype(float)?
10. Summary
Incorrect data types prevent mathematical operations and machine learning. Use str.replace() to clean strings, then pd.to_numeric() to safely cast them to numbers. Use dictionaries with map() to standardize booleans. Use astype('category') to save memory on repetitive text columns. Always check df.dtypes before and after conversion.