CHAPTER 14
Beginner
Data Cleaning in R
Updated: May 18, 2026
5 min read
1. Chapter Introduction
"Garbage in, garbage out": real data is always messy. Expect missing values, duplicates, wrong data types, and inconsistent formats. This chapter builds a systematic data cleaning pipeline covering every major cleaning task.
2. Data Quality Assessment
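A minimal sketch of a data quality assessment in base R. The data frame, its column names, and its values are made up for illustration and are not from the original chapter.

```r
# Hypothetical toy data; column names and values are illustrative only.
df <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  name   = c("alice", "BOB", "BOB", NA, "dana"),
  age    = c(34, NA, NA, 29, 210),
  salary = c("$52,000", "48000", "48000", NA, "61,500"),
  stringsAsFactors = FALSE
)

str(df)                                # column types at a glance
total_na   <- sum(is.na(df))           # total NA cells in the whole data frame
na_per_col <- colSums(is.na(df))       # NA count per column
na_share   <- colMeans(is.na(df))      # proportion missing per column
n_complete <- sum(complete.cases(df))  # rows with no NA in any column
n_dupes    <- sum(duplicated(df))      # rows identical to an earlier row
summary(df$age)                        # the range exposes impossible values such as 210
```

Looking at counts per column, rather than only the grand total, helps decide later whether a column should be imputed, dropped, or left alone.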
3. Handling Missing Values
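One possible way to handle missing values, assuming the dplyr and tidyr packages are installed. The `sales` data frame and the choice of strategy per column are assumptions made for this sketch: forward fill for a grouping column, median for a skewed numeric column, and a constant for free text.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; column names and values are illustrative only.
sales <- data.frame(
  region = c("North", NA, NA, "South", NA),
  units  = c(10, NA, 30, NA, 1000),
  notes  = c("ok", NA, "late", NA, "ok"),
  stringsAsFactors = FALSE
)

cleaned <- sales %>%
  # carry the last observed region forward into the NA rows below it
  fill(region, .direction = "down") %>%
  # median imputation: less sensitive to the 1000 outlier than the mean
  mutate(units = ifelse(is.na(units), median(units, na.rm = TRUE), units)) %>%
  # replace NA in a text column with a fixed label
  replace_na(list(notes = "none"))

print(cleaned)
```

Which strategy is appropriate depends on why the values are missing; forward fill, for instance, only makes sense when row order carries meaning.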
4. Removing Duplicates and Fixing Types
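A short sketch of deduplication and type fixing, assuming dplyr is installed. The `raw` data frame and its columns are hypothetical, and treating `id` as the deduplication key is an assumption of this example.

```r
library(dplyr)

# Hypothetical toy data; column names and values are illustrative only.
raw <- data.frame(
  id     = c(1, 2, 2, 3),
  joined = c("2024-01-15", "2024-02-01", "2024-03-09", "2024-02-20"),
  salary = c("$52,000", "48000", "49,500", "$61,500"),
  stringsAsFactors = FALSE
)

# TRUE marks a value that repeats an earlier id
dup_flags <- duplicated(raw$id)

typed <- raw %>%
  distinct(id, .keep_all = TRUE) %>%                 # first row per id, all columns kept
  mutate(
    joined = as.Date(joined, format = "%Y-%m-%d"),   # character -> Date
    salary = as.numeric(gsub("[,$]", "", salary))    # strip "$" and "," before converting
  )

str(typed)
```

Converting `salary` without stripping the symbols first would produce NA with a coercion warning, so the `gsub()` step comes before `as.numeric()`.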
5. Mini Project: Customer Data Cleaner
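The original project code is not reproduced here. What follows is a hedged sketch of what a customer data cleaner might look like, assuming dplyr and stringr are installed. The function name `clean_customers()`, the column names, and the 0 to 120 age rule are all assumptions made for this example.

```r
library(dplyr)
library(stringr)

# Hypothetical cleaner; the name, columns, and age rule are assumptions of this sketch.
clean_customers <- function(df) {
  df %>%
    distinct(id, .keep_all = TRUE) %>%                    # one row per customer id
    mutate(
      name   = str_to_title(trimws(name)),                # trim, then title case
      salary = as.numeric(gsub("[,$]", "", salary)),      # "$52,000" -> 52000
      signup = as.Date(signup),                           # ISO date strings -> Date
      age    = ifelse(age < 0 | age > 120, NA_real_, age), # impossible ages -> NA
      age    = ifelse(is.na(age), median(age, na.rm = TRUE), age) # then median-impute
    )
}

# Hypothetical input data.
customers <- data.frame(
  id     = c(101, 102, 102, 103, 104),
  name   = c("  ALICE JOHNSON", "bob lee", "bob lee", "Carla DIAZ ", "dan wu"),
  age    = c(34, 41, 41, 210, NA),
  salary = c("$52,000", "48,000", "48,000", "61500", "$39,900"),
  signup = c("2024-01-15", "2024-02-01", "2024-02-01", "2024-02-20", "2024-03-05"),
  stringsAsFactors = FALSE
)

cleaned <- clean_customers(customers)
print(cleaned)
```

Duplicates are removed before imputation so that repeated rows do not weight the median twice.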
6. Common Mistakes
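- Mean imputation causing bias: Mean imputation artificially shrinks variance, because every imputed value sits exactly at the mean. For analysis beyond basic reporting, consider median imputation (more robust to outliers, though it also reduces variance) or model-based imputation (mice package).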
- Removing all NAs with na.omit(): If 30% of rows have at least one NA in any column, na.omit() removes 30% of your data, possibly creating selection bias. Be targeted.
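Both mistakes above can be checked with a small base R experiment. The simulated vector and the toy data frame are assumptions made for illustration.

```r
set.seed(42)                          # fixed seed so the run is repeatable
x <- rnorm(200, mean = 50, sd = 10)   # simulated complete data
x_missing <- x
x_missing[sample(200, 60)] <- NA      # knock out 30% of the values

# Mean imputation: every NA becomes the observed mean
x_imputed <- ifelse(is.na(x_missing), mean(x_missing, na.rm = TRUE), x_missing)

sd_observed <- sd(x_missing, na.rm = TRUE)
sd_imputed  <- sd(x_imputed)          # smaller: 60 values now sit exactly on the mean

# na.omit() drops a row if ANY column has an NA
df <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"), c = 1:4)
rows_all    <- nrow(na.omit(df))          # rows with no NA anywhere
rows_target <- nrow(df[!is.na(df$a), ])   # rows kept when only column `a` matters

print(c(sd_observed = sd_observed, sd_imputed = sd_imputed))
print(c(rows_all = rows_all, rows_target = rows_target))
```

Filtering on only the columns an analysis actually uses keeps more rows than a blanket `na.omit()`.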
7. MCQs
Question 1
complete.cases(df) returns?
Question 2
distinct(id, .keep_all=TRUE) keeps?
Question 3
fill(.direction="down") imputes by?
Question 4
sum(is.na(df)) counts?
Question 5
Median imputation advantage over mean?
Question 6
gsub("[,$]", "", "$1,234") returns?
Question 7
replace_na(list(col=0)) replaces?
Question 8
str_to_title("ALICE JOHNSON") returns?
Question 9
Setting impossible ages to NA then imputing is?
Question 10
duplicated(df) returns?
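Most of the calls in the questions above can be checked directly in an R console. The sketch below assumes dplyr, tidyr, and stringr are installed and uses a made-up data frame. Outputs are deliberately not shown, so the questions can still be attempted first.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data for checking quiz answers.
df <- data.frame(id = c(1, 1, 2, 3), col = c(5, 5, NA, 7))

q1  <- complete.cases(df)
q2  <- distinct(df, id, .keep_all = TRUE)
q3  <- fill(df, col, .direction = "down")
q4  <- sum(is.na(df))
q6  <- gsub("[,$]", "", "$1,234")
q7  <- replace_na(df, list(col = 0))
q8  <- str_to_title("ALICE JOHNSON")
q10 <- duplicated(df)

# Inspect each result to compare with your answer
print(list(q1 = q1, q2 = q2, q3 = q3, q4 = q4, q6 = q6, q7 = q7, q8 = q8, q10 = q10))
```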
8. Interview Questions
- Q: What strategies do you use to handle missing data in R?
- Q: How do you detect and remove duplicate records in a data frame?
9. Summary
Data cleaning pipeline: assess (sum(is.na()), duplicated()), remove (distinct(), na.omit()), impute (replace_na(), mean/median/mode), fix types (as.Date(), as.numeric()), standardize strings (str_to_title(), trimws()), validate domain rules (impossible ages/salaries → NA). Always document cleaning steps for reproducibility.
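One lightweight way to document cleaning steps is to record row counts before and after each one. The `log_step()` helper below is hypothetical, written in base R for this sketch, and is not part of any package.

```r
# Hypothetical helper: applies one cleaning step and records row counts around it.
log_step <- function(state, label, step) {
  before <- nrow(state$data)
  state$data <- step(state$data)
  state$log <- rbind(
    state$log,
    data.frame(step = label, rows_before = before, rows_after = nrow(state$data))
  )
  state
}

# Hypothetical toy data.
df <- data.frame(id = c(1, 2, 2, 3, 4), age = c(34, 41, 41, NA, 29))

state <- list(data = df, log = NULL)
state <- log_step(state, "drop duplicate rows",
                  function(d) d[!duplicated(d), , drop = FALSE])
state <- log_step(state, "drop rows missing age",
                  function(d) d[!is.na(d$age), , drop = FALSE])

print(state$log)   # one line per step, ready to paste into a report
```

Keeping the log next to the cleaned data makes it easier to explain later why the row count changed.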