R Programming
CHAPTER 14 Beginner

Data Cleaning in R

Updated: May 18, 2026
5 min read

1. Chapter Introduction

"Garbage in, garbage out": real-world data is rarely clean. Expect missing values, duplicates, wrong data types, and inconsistent formats. This chapter builds a systematic data cleaning pipeline that covers the most common cleaning tasks.

2. Data Quality Assessment

r
library(dplyr)
library(tidyr)

# Quick data quality check
quality_check <- function(df) {
  cat("=== DATA QUALITY REPORT ===\n")
  cat("Dimensions:", nrow(df), "rows ×", ncol(df), "columns\n\n")

  cat("Missing Values:\n")
  na_counts <- colSums(is.na(df))
  na_pct    <- round(na_counts / nrow(df) * 100, 1)
  for (col in names(na_counts)) {
    if (na_counts[col] > 0)
      cat(sprintf("  %-20s %4d (%.1f%%)\n", col, na_counts[col], na_pct[col]))
  }

  cat("\nDuplicate Rows:", sum(duplicated(df)), "\n")
  cat("\nData Types:\n")
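  # class() can return several values (e.g. for date-times), so collapse them into one string
  for (col in names(df)) cat(sprintf("  %-20s %s\n", col, paste(class(df[[col]]), collapse = "/")))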
}

# Generate messy sample data
set.seed(42)
customer_data <- data.frame(
  id       = c(1:20, 5, 12),             # Duplicate IDs: 5 and 12 appear twice
  name     = c(paste0("Customer_", 1:18), NA, "BOB SMITH", "Customer_5", "Customer_12"),
  age      = c(sample(18:65, 18), -5, 200, 28, 45),  # Impossible ages
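  # "@test.com" must be pasted inside the replicated expression; as a third
  # argument to replicate() it would be taken as `simplify` and never appended
  email    = c(replicate(15, paste0(paste0(sample(letters, 6, TRUE), collapse = ""), "@test.com")),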
               NA, NA, "invalid", "a@b.com", "@bad.com", NA, "ok@test.com"),
  salary   = c(round(runif(17, 30000, 120000)), NA, NA, NA, -1000, 99999999),
  category = c(sample(c("Gold","Silver","Bronze",NA), 22, replace=TRUE)),
  stringsAsFactors = FALSE
)

quality_check(customer_data)
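
The report above only prints to the console. When the counts are needed for later steps, the same checks can be collected in a data frame. The following is a minimal base R sketch; the helper name `na_summary` and the data frame `toy` are made up for illustration and are not part of the chapter's code.

```r
# Hypothetical helper: returns per-column missing-value statistics
# as a data frame instead of printing them.
na_summary <- function(df) {
  na_counts <- colSums(is.na(df))
  data.frame(
    column      = names(na_counts),
    n_missing   = as.integer(na_counts),
    pct_missing = unname(round(100 * na_counts / nrow(df), 1)),
    class       = unname(vapply(df, function(x) paste(class(x), collapse = "/"),
                                character(1))),
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}

# Made-up data for illustration only
toy <- data.frame(
  id    = c(1, 2, 2, 4),
  score = c(10, NA, NA, 7),
  grade = c("A", "B", "B", NA),
  stringsAsFactors = FALSE
)

report <- na_summary(toy)
print(report)

# Rows 2 and 3 are identical in every column, so one of them counts as a duplicate
n_dupes <- sum(duplicated(toy))
n_dupes
```

Because the result is an ordinary data frame, it can be filtered or sorted, for example `report[report$n_missing > 0, ]` to list only the columns that need attention.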

3. Handling Missing Values

r
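library(dplyr)   # %>% and filter()
library(tidyr)   # drop_na(), fill(), replace_na()

df <- customer_data   # Work on a copy of the sample data from Section 2

# Detecting NAs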
is.na(df$salary)              # Logical vector
sum(is.na(df$salary))         # Count NAs
complete.cases(df)            # TRUE if row has no NAs
df[complete.cases(df), ]      # Remove all rows with any NA

# Removing NAs
df_clean <- na.omit(df)       # Remove rows with any NA
df %>% drop_na(salary, age)   # Remove rows with NA in specific cols
df %>% filter(!is.na(salary)) # dplyr equivalent for a single column

# Imputing NAs (replacing with values)
# Mean imputation (numeric)
df$salary[is.na(df$salary)] <- mean(df$salary, na.rm=TRUE)

# Median imputation (robust to outliers)
df$age[is.na(df$age)] <- median(df$age, na.rm=TRUE)

# Mode imputation (categorical)
mode_val <- names(sort(table(df$category), decreasing=TRUE))[1]
df$category[is.na(df$category)] <- mode_val

# Forward / backward fill (ordered data such as time series)
library(tidyr)
df %>% fill(salary, .direction="down")  # Forward fill: copy the previous non-missing value
df %>% fill(salary, .direction="up")    # Backward fill: copy the next non-missing value

# tidyr::replace_na()
df %>% replace_na(list(salary=0, category="Unknown"))
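
To make the fill directions concrete, here is a small worked example. It is a sketch on made-up data (the `readings` data frame is hypothetical) and assumes the tidyr package is installed.

```r
library(tidyr)

# Made-up daily readings with gaps (illustration only)
readings <- data.frame(
  day   = 1:5,
  value = c(10, NA, NA, 40, NA)
)

# Forward fill: each NA takes the last observed value above it
down <- fill(readings, value, .direction = "down")
down$value      # 10 10 10 40 40

# Backward fill: each NA takes the next observed value below it;
# the final NA stays because no value follows it
up <- fill(readings, value, .direction = "up")
up$value        # 10 40 40 40 NA

# Constant replacement: every NA becomes the supplied value
zeroed <- replace_na(readings, list(value = 0))
zeroed$value    # 10  0  0 40  0
```

Note that `fill()` depends on row order, so the data should be sorted (for example by date) before filling.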

4. Removing Duplicates and Fixing Types

r
# Remove exact duplicates
df_unique <- df[!duplicated(df), ]
df %>% distinct()  # tidyverse

# Remove duplicates by key column
df %>% distinct(id, .keep_all=TRUE)  # Keep first occurrence per id

# Fix data types
df$salary  <- as.numeric(gsub("[,$]", "", df$salary))  # Remove $ and , from "$85,000"
# The sample data has no date column, so only parse one if it exists
if ("date" %in% names(df)) df$date <- as.Date(df$date, format="%m/%d/%Y")
df$id      <- as.integer(df$id)
df$category <- as.factor(df$category)

# Standardize strings
library(stringr)                           # Provides str_to_title()
df$name  <- str_to_title(trimws(df$name))  # "  BOB SMITH  " → "Bob Smith"
df$email <- tolower(trimws(df$email))

# Fix impossible values (outliers)
df$age[df$age < 0 | df$age > 120] <- NA    # Impossible ages → NA
df$salary[df$salary < 0] <- NA              # Negative salary → NA
df$salary[df$salary > 500000] <- NA         # Extreme outliers → NA
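
Type conversions fail silently: values that cannot be converted become NA instead of raising an error. The sketch below, in base R on made-up values (the `raw` data frame is hypothetical), shows how to convert and then count how many values were lost.

```r
# Made-up raw values as they might arrive from a CSV export (illustration only)
raw <- data.frame(
  salary = c("$85,000", "72000", "n/a"),
  date   = c("03/15/2024", "12/01/2023", "2024-02-30"),
  stringsAsFactors = FALSE
)

# Strip "$" and "," first; anything still non-numeric becomes NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
salary_num <- suppressWarnings(as.numeric(gsub("[,$]", "", raw$salary)))
salary_num      # 85000 72000    NA

# Strings that do not match the given format parse to NA
date_parsed <- as.Date(raw$date, format = "%m/%d/%Y")
date_parsed     # "2024-03-15" "2023-12-01" NA

# Conversion failures: present in the raw data but NA afterwards
salary_failed <- sum(!is.na(raw$salary) & is.na(salary_num))
date_failed   <- sum(!is.na(raw$date)   & is.na(date_parsed))
c(salary_failed, date_failed)   # 1 1
```

Comparing the NA counts before and after a conversion is a cheap way to notice values in an unexpected format before they disappear into the missing-value statistics.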

5. Mini Project: Customer Data Cleaner

r
library(dplyr); library(stringr); library(tidyr); library(readr)

clean_customer_data <- function(df) {
  cat("Starting: ", nrow(df), "rows\n")

  df <- df %>%
    # Remove complete duplicates
    distinct() %>%
    # Remove duplicate IDs (keep first)
    distinct(id, .keep_all=TRUE) %>%
    # Fix names
    mutate(name = str_to_title(trimws(name))) %>%
    # Validate ages
    mutate(age = ifelse(age < 18 | age > 100, NA, age)) %>%
    # Fix salaries
    mutate(salary = as.numeric(gsub("[,$]", "", as.character(salary))),
           salary = ifelse(salary < 0 | salary > 1000000, NA, salary)) %>%
    # Validate emails
    mutate(email = tolower(trimws(email)),
           email = ifelse(str_detect(email, "^[a-z0-9.]+@[a-z0-9.]+\\.[a-z]{2,}$"),
                          email, NA)) %>%
    # Impute missing values
    mutate(
      age    = ifelse(is.na(age), round(median(age, na.rm=TRUE)), age),
      salary = ifelse(is.na(salary), round(mean(salary, na.rm=TRUE)), salary),
      category = ifelse(is.na(category), "Unknown", category)
    ) %>%
    # Remove rows still missing critical fields
    filter(!is.na(name), !is.na(id)) %>%
    # Sort
    arrange(id)

  cat("After cleaning:", nrow(df), "rows\n")
  cat("Missing values remaining:", sum(is.na(df)), "\n")
  df
}

# Run cleaner
cleaned <- clean_customer_data(customer_data)
write_csv(cleaned, "cleaned_customers.csv")
cat("Saved clean data!\n")
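
After a cleaner like this runs, it is worth checking that the output actually satisfies the rules it was meant to enforce. The following is a minimal base R sketch: `validate_customers` is a hypothetical helper, its thresholds are assumed to mirror the ones in the cleaner above, and the `good` and `bad` data frames are made up.

```r
# Same email pattern as in the cleaner. It is deliberately strict:
# addresses containing "_", "-" or "+" are rejected as well.
email_pattern <- "^[a-z0-9.]+@[a-z0-9.]+\\.[a-z]{2,}$"

# grepl() is the base R counterpart of str_detect(); unlike str_detect(),
# it returns FALSE rather than NA for missing input.
email_ok <- grepl(email_pattern, c("invalid", "a@b.com", "@bad.com", "ok@test.com"))
email_ok   # FALSE  TRUE FALSE  TRUE

# Hypothetical validator: one named TRUE/FALSE per rule
validate_customers <- function(df) {
  c(
    unique_ids    = !any(duplicated(df$id)),
    names_present = !any(is.na(df$name)),
    ages_valid    = all(df$age >= 18 & df$age <= 100, na.rm = TRUE),
    salary_valid  = all(df$salary >= 0 & df$salary <= 1000000, na.rm = TRUE),
    emails_valid  = all(is.na(df$email) | grepl(email_pattern, df$email))
  )
}

# Made-up data: `good` passes every rule, `bad` breaks two of them
good <- data.frame(
  id     = 1:3,
  name   = c("Ann", "Bob", "Cy"),
  age    = c(25, 40, 61),
  salary = c(50000, 62000, 48000),
  email  = c("ann@test.com", NA, "cy@test.com"),
  stringsAsFactors = FALSE
)
bad <- good
bad$id[3]  <- 1     # duplicate ID
bad$age[2] <- 200   # impossible age

good_result <- validate_customers(good)
bad_result  <- validate_customers(bad)
good_result
bad_result
```

In a script, `stopifnot(all(validate_customers(cleaned)))` would halt before a file that breaks a rule is written to disk.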

6. Common Mistakes

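  • Mean imputation causing bias: Replacing every NA with the mean shrinks the variance artificially and weakens relationships with other variables. Median imputation is more robust to outliers, but it shrinks the variance in the same way. For analysis beyond basic reporting, consider model-based imputation (mice package).
  • Removing all NAs with na.omit(): If 30% of rows have at least one NA in any column, na.omit() removes 30% of your data, possibly creating selection bias. Be targeted: drop rows only when the columns your analysis needs are missing, for example with drop_na(salary, age).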
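
Both mistakes are easy to reproduce. The sketch below uses base R and made-up numbers (`x` and `survey` are hypothetical) to show the variance shrinking after mean imputation and `na.omit()` dropping far more rows than any single column is missing.

```r
# --- Mean imputation shrinks variance ---
x <- c(10, 20, 30, 40, 50, NA, NA, NA, NA, NA)

observed <- x[!is.na(x)]
imputed  <- x
imputed[is.na(imputed)] <- mean(x, na.rm = TRUE)

mean(observed); mean(imputed)   # both 30: the mean is preserved
var_observed <- var(observed)   # 250
var_imputed  <- var(imputed)    # about 111.1: less than half the original

# --- na.omit() drops every row with an NA in any column ---
survey <- data.frame(a = 1:10, b = 1:10, c = 1:10)
survey$a[1] <- NA
survey$b[2] <- NA
survey$c[3] <- NA

rows_all      <- nrow(survey)                      # 10
rows_na_omit  <- nrow(na.omit(survey))             # 7: 30% of rows lost
rows_targeted <- nrow(survey[!is.na(survey$a), ])  # 9: only column `a` matters here
c(rows_all, rows_na_omit, rows_targeted)
```

Each column is only 10% missing, yet `na.omit()` discards 30% of the rows because the gaps fall on different rows.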

7. MCQs

Question 1

complete.cases(df) returns?

Question 2

distinct(id, .keep_all=TRUE) keeps?

Question 3

fill(.direction="down") imputes by?

Question 4

sum(is.na(df)) counts?

Question 5

Median imputation advantage over mean?

Question 6

gsub("[,$]", "", "$1,234") returns?

Question 7

replace_na(list(col=0)) replaces?

Question 8

str_to_title("ALICE JOHNSON") returns?

Question 9

Setting impossible ages to NA then imputing is?

Question 10

duplicated(df) returns?

8. Interview Questions

  • Q: What strategies do you use to handle missing data in R?
  • Q: How do you detect and remove duplicate records in a data frame?

9. Summary

Data cleaning pipeline: assess (sum(is.na()), duplicated()), remove (distinct(), na.omit()), impute (replace_na(), mean/median/mode), fix types (as.Date(), as.numeric()), standardize strings (str_to_title(), trimws()), validate domain rules (impossible ages/salaries → NA). Always document cleaning steps for reproducibility.

10. Next Chapter Recommendation

In Chapter 15: Data Manipulation with dplyr, we master the tidyverse grammar for data transformation — filter, select, mutate, group_by, and summarise.
