Classification and Clustering

Updated: May 18, 2026

# CHAPTER 24

Classification and Clustering in R

1. Chapter Introduction

Classification predicts which category new data belongs to. Clustering groups data without labels. Together they power customer segmentation, fraud detection, medical diagnosis, and recommendation systems.
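As a minimal sketch of that distinction, the following uses R's built-in iris data, which is not part of this chapter's examples and is chosen here only because it ships with R. The object names (clf, km_iris, train_accuracy) are illustrative. The classifier is trained with the Species labels, while k-means sees only the measurements.

```r
library(rpart)  # recommended package that ships with R

data(iris)

# Classification: the Species labels are used during training
clf <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(clf, newdata = iris, type = "class")
train_accuracy <- mean(pred == iris$Species)  # measured on the training data itself

# Clustering: the labels are dropped; only the four measurements are used
set.seed(42)
km_iris <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# Cluster numbers are arbitrary IDs, so they are compared to species only afterwards
print(table(cluster = km_iris$cluster, species = iris$Species))
cat(sprintf("Training accuracy of the tree: %.3f\n", train_accuracy))
```

The cross-table is only possible because iris happens to carry labels; in a real clustering task there is nothing to compare against, which is why clustering needs its own quality measures.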

2. Decision Trees

r
library(rpart)
library(rpart.plot)
library(caret)

set.seed(42)
# Customer loan default prediction
n <- 600
loans <- data.frame(
  income      = round(runif(n, 25000, 150000)),
  debt_ratio  = round(runif(n, 0.1, 0.8), 2),
  age         = sample(21:65, n, replace=TRUE),
  loan_amount = round(runif(n, 5000, 50000)),
  credit_score= sample(450:800, n, replace=TRUE)
)
loans$default <- factor(
  ifelse(loans$debt_ratio > 0.5 & loans$credit_score < 600, "Yes",
         ifelse(loans$income < 40000 & loans$debt_ratio > 0.4, "Yes", "No")),
  levels=c("No","Yes")
)

# Train/test split
train_idx <- createDataPartition(loans$default, p=0.8, list=FALSE)
train <- loans[train_idx, ]
test  <- loans[-train_idx, ]

# Decision Tree
tree_model <- rpart(default ~ ., data=train, method="class",
                     control=rpart.control(maxdepth=5, minsplit=20))

# Visualize tree
rpart.plot(tree_model, type=4, extra=104, fallen.leaves=TRUE,
            main="Loan Default Decision Tree")

# Evaluate
pred_tree <- predict(tree_model, newdata=test, type="class")
cm <- confusionMatrix(pred_tree, test$default, positive="Yes")
cat(sprintf("Decision Tree: Accuracy=%.3f, F1=%.3f\n",
             cm$overall["Accuracy"], cm$byClass["F1"]))
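To make the reported metrics concrete, here is a small sketch that computes accuracy, precision, recall, and F1 by hand from a confusion table. The label vectors are invented for illustration and are not output from the loan model; "Yes" is treated as the positive class, matching positive="Yes" above.

```r
# Hypothetical labels, invented for illustration only
actual <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "Yes", "No"),
                 levels = c("No", "Yes"))
predicted <- factor(c("Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "Yes", "No"),
                    levels = c("No", "Yes"))

# Rows are predictions, columns are the true labels
tab <- table(Predicted = predicted, Actual = actual)

tp <- tab["Yes", "Yes"]  # predicted Yes, actually Yes
fp <- tab["Yes", "No"]   # predicted Yes, actually No
fn <- tab["No", "Yes"]   # predicted No, actually Yes

accuracy  <- sum(diag(tab)) / sum(tab)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

cat(sprintf("Accuracy=%.3f, Precision=%.3f, Recall=%.3f, F1=%.3f\n",
            accuracy, precision, recall, f1))
```

When defaults are rare, accuracy can look good even if the model misses most of them, so F1 on the positive class is usually worth reporting alongside it.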

3. K-Nearest Neighbors (KNN)

r
library(class)

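# Standardize features (the last column, default, is the label and is dropped);
# the test set is scaled with the training-set mean and sd to avoid leakage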
train_x <- scale(train[, -ncol(train)])
test_x  <- scale(test[, -ncol(test)],
                  center=attr(train_x,"scaled:center"),
                  scale=attr(train_x,"scaled:scale"))

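# Find optimal k using leave-one-out cross-validation on the training set
# (choosing k by test-set error would leak test data into model selection)
set.seed(42)  # knn.cv breaks voting ties at random
k_values <- 1:20
errors <- sapply(k_values, function(k) {
  pred <- knn.cv(train_x, train$default, k=k)
  mean(pred != train$default)
})
optimal_k <- k_values[which.min(errors)]
cat(sprintf("Optimal k: %d (LOOCV error: %.3f)\n", optimal_k, min(errors)))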

# Final KNN model with optimal k
pred_knn <- knn(train_x, test_x, train$default, k=optimal_k, prob=TRUE)
cm_knn <- confusionMatrix(pred_knn, test$default, positive="Yes")
cat(sprintf("KNN (k=%d): Accuracy=%.3f, F1=%.3f\n",
             optimal_k, cm_knn$overall["Accuracy"], cm_knn$byClass["F1"]))
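What knn() does for a single prediction can be sketched in a few lines of base R: compute the Euclidean distance to every training point, keep the k nearest, and take a majority vote. The toy points and the helper knn_predict_one() are hypothetical and exist only for this illustration; the last lines cross-check the result against class::knn.

```r
library(class)  # only needed for the cross-check at the end

# Hypothetical toy data: two features and a class label
train_pts <- matrix(c(0, 0,  0, 1,  1, 0,
                      5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
train_lab <- factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
new_pt <- c(4.5, 5.2)

# Illustrative helper (not a library function): classify one point by majority vote
knn_predict_one <- function(x, pts, labels, k) {
  d <- sqrt(rowSums(sweep(pts, 2, x)^2))  # Euclidean distance to each training point
  nearest <- order(d)[seq_len(k)]         # indices of the k closest points
  votes <- table(labels[nearest])         # count labels among the neighbors
  names(votes)[which.max(votes)]          # most frequent label (first one on ties)
}

knn_result <- knn_predict_one(new_pt, train_pts, train_lab, k = 3)
knn_check  <- knn(train_pts, matrix(new_pt, nrow = 1), train_lab, k = 3)

cat(sprintf("By hand: %s, class::knn: %s\n", knn_result, as.character(knn_check)))
```

Because the vote depends entirely on distances, a feature with a much larger numeric range would dominate the neighbor search, which is why the features are standardized before calling knn().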

4. K-Means Clustering + Mini Project

r
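library(dplyr)       # %>%, select(), group_by(), summarise(), arrange()
library(ggplot2)     # theme_minimal()
library(cluster)     # silhouette()
library(factoextra)  # fviz_cluster()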

# ─── CUSTOMER SEGMENTATION SYSTEM ────────────────────
set.seed(42)
n <- 400
customers <- data.frame(
  customer_id    = 1:n,
  annual_spend   = round(runif(n, 500, 10000), -2),
  visit_frequency= sample(1:50, n, replace=TRUE),
  avg_basket     = round(runif(n, 20, 200), -1),
  tenure_months  = sample(3:72, n, replace=TRUE),
  returns_pct    = round(runif(n, 0, 0.3), 2)
)

# Feature selection and scaling
features <- customers %>% select(annual_spend, visit_frequency, avg_basket, tenure_months)
features_scaled <- scale(features)

# Find optimal clusters (elbow method)
wss <- sapply(2:10, function(k) {
  kmeans(features_scaled, centers=k, nstart=25)$tot.withinss
})
# Plot elbow
plot(2:10, wss, type="b", pch=19, main="Elbow Method",
     xlab="Number of Clusters", ylab="Total Within SS")

# Fit k-means with k=4 (fixed here for illustration; in practice, pick k
# where the elbow plot flattens)
set.seed(42)
km <- kmeans(features_scaled, centers=4, nstart=25)
customers$cluster <- factor(km$cluster)

# Cluster profiles
cluster_profile <- customers %>%
  group_by(cluster) %>%
  summarise(
    n            = n(),
    avg_spend    = round(mean(annual_spend)),
    avg_visits   = round(mean(visit_frequency), 1),
    avg_basket   = round(mean(avg_basket)),
    avg_tenure   = round(mean(tenure_months)),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_spend))

cat("=== CUSTOMER SEGMENTS ===\n")
print(cluster_profile)

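# Name the segments. k-means cluster numbers are arbitrary, so inspect
# cluster_profile and adjust this mapping to match the profiles you see.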
cluster_names <- c("1"="Champions", "2"="Loyal Customers",
                    "3"="At-Risk", "4"="New Customers")
customers$segment <- cluster_names[as.character(customers$cluster)]
cat("\nSegment Distribution:\n")
print(table(customers$segment))

# Visualize clusters
fviz_cluster(km, data=features_scaled, geom="point",
              ellipse.type="convex", ggtheme=theme_minimal(),
              main="Customer Segmentation (4 Clusters)")

# Silhouette score (cluster quality: -1 to 1, higher is better)
sil <- silhouette(km$cluster, dist(features_scaled))
cat(sprintf("\nAverage Silhouette Score: %.3f\n", mean(sil[,3])))
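Because k-means is fitted on scaled features, the cluster centers it returns are in z-score units. The sketch below shows one way to convert them back to original units using the attributes that scale() stores. The toy data frame and the names toy, km_toy, and centers_orig are assumptions made for this illustration, not part of the mini project.

```r
# Hypothetical toy data, loosely mirroring two of the customer features
set.seed(42)
toy <- data.frame(
  annual_spend    = round(runif(200, 500, 10000), -2),
  visit_frequency = sample(1:50, 200, replace = TRUE)
)

toy_scaled <- scale(toy)
km_toy <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Undo the z-score column by column: original = scaled * sd + mean
ctr <- attr(toy_scaled, "scaled:center")
scl <- attr(toy_scaled, "scaled:scale")
centers_orig <- sweep(sweep(km_toy$centers, 2, scl, "*"), 2, ctr, "+")

# Cross-check: back-transformed centers should equal per-cluster means of the raw data
group_means <- aggregate(toy, by = list(cluster = km_toy$cluster), FUN = mean)

print(round(centers_orig, 1))
print(group_means)
```

Centers in original units are easier to read when describing segments than centers in standard deviations.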

5. Common Mistakes

  • Not scaling before k-means: k-means uses Euclidean distance, so unscaled features with very different ranges (for example, salary versus a binary flag) let the large-range feature dominate the clustering. Scale the features first.
  • Choosing k by gut feeling: use the elbow method, silhouette analysis, or the gap statistic to support the choice of k. These are heuristics and can disagree, so also check that the resulting clusters are interpretable.
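The scaling problem can be seen directly in the distances. This sketch uses four made-up customers with a salary and a binary flag (both hypothetical) and compares Euclidean distances before and after scale().

```r
# Hypothetical customers: salary varies by tens of thousands, flag is 0/1
x <- data.frame(
  salary = c(30000, 30500, 90000, 90500),
  flag   = c(0, 1, 0, 1)
)

d_raw    <- as.matrix(dist(x))         # distances on raw values
d_scaled <- as.matrix(dist(scale(x)))  # distances on z-scores

# Customers 1 and 2 differ in flag but have similar salary;
# customers 1 and 3 share the flag but differ in salary
cat(sprintf("Raw:    d(1,2)=%.1f  d(1,3)=%.1f\n", d_raw[1, 2], d_raw[1, 3]))
cat(sprintf("Scaled: d(1,2)=%.3f  d(1,3)=%.3f\n", d_scaled[1, 2], d_scaled[1, 3]))
```

On raw values the flag difference is practically invisible next to the salary difference; after scaling, both features contribute on a comparable footing.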

6. MCQs

Question 1

Decision tree splits are chosen to?

Question 2

KNN prediction uses?

Question 3

Optimal k in KNN is found by?

Question 4

K-means clustering requires?

Question 5

Elbow method helps choose?

Question 6

scale(x) standardizes to?

Question 7

Silhouette score near +1 means?

Question 8

rpart.plot() visualizes?

Question 9

kmeans(x, nstart=25) means?

Question 10

Hierarchical clustering creates?

7. Interview Questions

  • Q: How do you choose the optimal number of clusters for k-means?
  • Q: What is the difference between classification and clustering?
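For the first question, one common approach is to compare the average silhouette width across candidate values of k. The sketch below does this on simulated data with three well-separated groups; the data and the names blobs, k_grid, avg_sil, and best_k are assumptions for illustration only.

```r
library(cluster)  # silhouette()

# Simulated data: three well-separated groups in two dimensions
set.seed(42)
blobs <- rbind(
  cbind(rnorm(50, 0, 1),  rnorm(50, 0, 1)),
  cbind(rnorm(50, 10, 1), rnorm(50, 0, 1)),
  cbind(rnorm(50, 5, 1),  rnorm(50, 10, 1))
)
blobs_scaled <- scale(blobs)
d <- dist(blobs_scaled)

# Average silhouette width for each candidate k (k = 1 is undefined for silhouettes)
k_grid <- 2:6
avg_sil <- sapply(k_grid, function(k) {
  fit <- kmeans(blobs_scaled, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})

best_k <- k_grid[which.max(avg_sil)]
print(data.frame(k = k_grid, avg_silhouette = round(avg_sil, 3)))
cat(sprintf("k with the highest average silhouette: %d\n", best_k))
```

On real data the peak is rarely this clear, so it helps to read the silhouette result together with the elbow plot and with whether the segments make business sense.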

8. Summary

Decision trees are interpretable rule-based classifiers, fitted with rpart() and visualized with rpart.plot(). KNN is distance-based: scale the features and tune k via cross-validation. K-means is unsupervised partitioning: scale first and use the elbow method to choose k. The silhouette score evaluates cluster quality. The customer segmentation project used RFM-style features to form 4 segments (Champions, Loyal Customers, At-Risk, New Customers). Always scale features for distance-based algorithms.

9. Next Chapter Recommendation

In Chapter 25: Working with Real-World Datasets, we analyze production-grade datasets from Kaggle with complete data science workflows.
