CHAPTER 24 Beginner

Classification and Clustering

Updated: May 18, 2026

Classification and Clustering in R

1. Chapter Introduction

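Classification is supervised: a model learns from labeled examples and predicts which category a new observation belongs to. Clustering is unsupervised: it groups similar observations without using any labels. Together they power customer segmentation, fraud detection, medical diagnosis, and recommendation systems.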
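
The contrast can be sketched on a small example. The snippet below is only an illustration: it uses R's built-in `iris` data set (an assumption, it is not part of this chapter's projects) to fit a classifier that sees the labels and a clustering that does not.

```r
library(rpart)

data(iris)
set.seed(42)

# Classification: the Species labels are used during training
clf <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(clf, newdata = iris, type = "class")
train_accuracy <- mean(pred == iris$Species)

# Clustering: the labels are withheld; only the four measurements are used
km_iris <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# Cross-tabulate cluster ids against the true species (for inspection only)
print(table(cluster = km_iris$cluster, species = iris$Species))
cat(sprintf("Tree training accuracy: %.3f\n", train_accuracy))
```

The cluster ids in the table are arbitrary numbers, not species names: k-means only discovers groups, and giving them a meaning is left to the analyst.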

2. Decision Trees

library(rpart)
library(rpart.plot)
library(caret)

set.seed(42)
# Customer loan default prediction
n <- 600
loans <- data.frame(
  income      = round(runif(n, 25000, 150000)),
  debt_ratio  = round(runif(n, 0.1, 0.8), 2),
  age         = sample(21:65, n, replace=TRUE),
  loan_amount = round(runif(n, 5000, 50000)),
  credit_score= sample(450:800, n, replace=TRUE)
)
loans$default <- factor(
  ifelse(loans$debt_ratio > 0.5 & loans$credit_score < 600, "Yes",
         ifelse(loans$income < 40000 & loans$debt_ratio > 0.4, "Yes", "No")),
  levels=c("No","Yes")
)

# Train/test split
train_idx <- createDataPartition(loans$default, p=0.8, list=FALSE)
train <- loans[train_idx, ]
test  <- loans[-train_idx, ]

# Decision Tree
tree_model <- rpart(default ~ ., data=train, method="class",
                     control=rpart.control(maxdepth=5, minsplit=20))

# Visualize tree
rpart.plot(tree_model, type=4, extra=104, fallen.leaves=TRUE,
            main="Loan Default Decision Tree")

# Evaluate
pred_tree <- predict(tree_model, newdata=test, type="class")
cm <- confusionMatrix(pred_tree, test$default, positive="Yes")
cat(sprintf("Decision Tree: Accuracy=%.3f, F1=%.3f\n",
             cm$overall["Accuracy"], cm$byClass["F1"]))
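
To see what a tree optimizes when it picks a split, it can help to compute impurity by hand. For classification, `rpart()` uses the Gini index by default. The following is a minimal sketch, not rpart's internal code: the helpers `gini()` and `split_impurity()` and the ten-row `toy` data frame are hypothetical and exist only for this illustration.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two child nodes produced by a logical split
split_impurity <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# Toy data: 10 hypothetical loans
toy <- data.frame(
  debt_ratio = c(0.2, 0.3, 0.35, 0.45, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8),
  default    = c("No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes")
)

parent      <- gini(toy$default)                                   # before splitting
after_split <- split_impurity(toy$default, toy$debt_ratio < 0.5)   # after splitting

cat(sprintf("Parent impurity: %.3f, after split: %.3f, reduction: %.3f\n",
            parent, after_split, parent - after_split))
```

A tree-growing algorithm evaluates many candidate thresholds like `debt_ratio < 0.5` and keeps the one with the largest impurity reduction, then repeats the search inside each child node.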

3. K-Nearest Neighbors (KNN)

library(class)

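# Standardize features (the last column, default, is the label and is dropped).
# The test set is scaled with the training means and SDs so that no
# information from the test set leaks into preprocessing.
train_x <- scale(train[, -ncol(train)])
test_x  <- scale(test[, -ncol(test)],
                  center=attr(train_x,"scaled:center"),
                  scale=attr(train_x,"scaled:scale"))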

# Find optimal k using leave-one-out cross-validation on the training set
# (knn.cv() from the class package); the test set is held back for the final evaluation
k_values <- 1:20
errors <- sapply(k_values, function(k) {
  pred <- knn.cv(train_x, train$default, k=k)
  mean(pred != train$default)
})
optimal_k <- k_values[which.min(errors)]
cat(sprintf("Optimal k: %d (CV error: %.3f)\n", optimal_k, min(errors)))

# Final KNN model with optimal k
pred_knn <- knn(train_x, test_x, train$default, k=optimal_k, prob=TRUE)
cm_knn <- confusionMatrix(pred_knn, test$default, positive="Yes")
cat(sprintf("KNN (k=%d): Accuracy=%.3f, F1=%.3f\n",
             optimal_k, cm_knn$overall["Accuracy"], cm_knn$byClass["F1"]))
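
What `knn()` does for each test row can be sketched in a few lines of base R: measure the distance to every training point, take the k closest, and let them vote. The function `knn_one()` and the toy matrix below are hypothetical, written only to illustrate the idea; the `class` package implementation is more general (for example, in how it breaks ties).

```r
# Classify one query point by majority vote among its k nearest training points
knn_one <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy training set: three "No" points near (1, 1), three "Yes" points near (6, 6)
toy_x <- matrix(c(1, 1,
                  1, 2,
                  2, 1,
                  6, 6,
                  7, 6,
                  6, 7), ncol = 2, byrow = TRUE)
toy_y <- factor(c("No", "No", "No", "Yes", "Yes", "Yes"))

print(knn_one(toy_x, toy_y, query = c(2, 2), k = 3))
print(knn_one(toy_x, toy_y, query = c(6, 5), k = 3))
```

Because the vote depends entirely on distances, a feature measured on a large scale would dominate the result, which is why the features are standardized before calling `knn()`.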

4. K-Means Clustering + Mini Project

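library(cluster)     # silhouette()
library(factoextra)  # fviz_cluster()
library(ggplot2)     # theme_minimal()
library(dplyr)       # %>%, select(), group_by(), summarise(), arrange()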

# ─── CUSTOMER SEGMENTATION SYSTEM ────────────────────
set.seed(42)
n <- 400
customers <- data.frame(
  customer_id    = 1:n,
  annual_spend   = round(runif(n, 500, 10000), -2),
  visit_frequency= sample(1:50, n, replace=TRUE),
  avg_basket     = round(runif(n, 20, 200), -1),
  tenure_months  = sample(3:72, n, replace=TRUE),
  returns_pct    = round(runif(n, 0, 0.3), 2)
)

# Feature selection and scaling
features <- customers %>% select(annual_spend, visit_frequency, avg_basket, tenure_months)
features_scaled <- scale(features)

# Find optimal clusters (elbow method)
wss <- sapply(2:10, function(k) {
  kmeans(features_scaled, centers=k, nstart=25)$tot.withinss
})
# Plot elbow
plot(2:10, wss, type="b", pch=19, main="Elbow Method",
     xlab="Number of Clusters", ylab="Total Within SS")

# Fit k-means with k=4
set.seed(42)
km <- kmeans(features_scaled, centers=4, nstart=25)
customers$cluster <- factor(km$cluster)

# Cluster profiles
cluster_profile <- customers %>%
  group_by(cluster) %>%
  summarise(
    n            = n(),
    avg_spend    = round(mean(annual_spend)),
    avg_visits   = round(mean(visit_frequency), 1),
    avg_basket   = round(mean(avg_basket)),
    avg_tenure   = round(mean(tenure_months)),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_spend))

cat("=== CUSTOMER SEGMENTS ===\n")
print(cluster_profile)

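# Name the segments by spend rank. k-means cluster ids are arbitrary, so the
# labels are mapped through cluster_profile (sorted by avg_spend, highest first)
# instead of being tied to fixed ids. The names are illustrative only.
segment_labels <- c("Champions", "Loyal Customers", "At-Risk", "New Customers")
cluster_names <- setNames(segment_labels, as.character(cluster_profile$cluster))
customers$segment <- cluster_names[as.character(customers$cluster)]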
cat("\nSegment Distribution:\n")
print(table(customers$segment))

# Visualize clusters
fviz_cluster(km, data=features_scaled, geom="point",
              ellipse.type="convex", ggtheme=theme_minimal(),
              main="Customer Segmentation (4 Clusters)")

# Silhouette score (cluster quality: -1 to 1, higher is better)
sil <- silhouette(km$cluster, dist(features_scaled))
cat(sprintf("\nAverage Silhouette Score: %.3f\n", mean(sil[,3])))

5. Common Mistakes

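Not scaling before k-means: k-means uses Euclidean distance, so when features have very different ranges (for example, a salary in dollars next to a 0/1 flag), the large-range feature dominates the distance and therefore the clustering. Standardize the features first, for example with scale().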
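
The effect is easy to reproduce. The sketch below uses made-up data (a hypothetical `salary` column and a 0/1 `flag` that defines the two real groups) and a small helper, `agreement()`, that is also an assumption of this example. It compares k-means on raw and on standardized features.

```r
set.seed(1)

# Hypothetical data: salary in dollars, plus a 0/1 flag that defines the real groups
flag   <- rep(c(0, 1), each = 50)
salary <- runif(100, 30000, 120000)
x <- cbind(salary = salary, flag = flag)

km_raw    <- kmeans(x, centers = 2, nstart = 25)         # salary dominates the distance
km_scaled <- kmeans(scale(x), centers = 2, nstart = 25)  # both features count equally

# Share of points whose cluster matches the flag; cluster ids are arbitrary,
# so take the better of the two possible labelings
agreement <- function(cluster, truth) {
  a <- mean((cluster - 1) == truth)
  max(a, 1 - a)
}

cat(sprintf("Agreement with flag, raw:    %.2f\n", agreement(km_raw$cluster, flag)))
cat(sprintf("Agreement with flag, scaled: %.2f\n", agreement(km_scaled$cluster, flag)))
```

On the raw data the clusters are essentially "low salary" and "high salary", so they match the flag only by chance. After scaling, the flag contributes as much to the distance as salary does and the two groups can be recovered.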

Choosing k by gut feeling: Use the elbow method, silhouette analysis, or the gap statistic to choose k from the data instead of guessing.
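
As one way to do this, the average silhouette width can be computed for a range of k values and the largest one selected. The sketch below assumes simulated data with three well-separated groups (the `blobs` matrix is hypothetical) so that the expected answer is known in advance; on real data the curve is usually flatter and should be read together with the elbow plot.

```r
library(cluster)

set.seed(42)

# Hypothetical data: three well-separated 2-D blobs of 50 points each
blobs <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Average silhouette width for each candidate k (higher is better)
k_values <- 2:6
d <- dist(blobs)
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(blobs, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

best_k <- k_values[which.max(avg_sil)]
print(data.frame(k = k_values, avg_silhouette = round(avg_sil, 3)))
cat(sprintf("k with the highest average silhouette: %d\n", best_k))
```

The silhouette is undefined for a single cluster, which is why the search starts at k = 2.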

6. MCQs

Question 1

Decision tree splits are chosen to?

Question 2

KNN prediction uses?

Question 3

Optimal k in KNN is found by?

Question 4

K-means clustering requires?

Question 5

Elbow method helps choose?

Question 6

`scale(x)` standardizes to?

Question 7

Silhouette score near +1 means?

Question 8

`rpart.plot()` visualizes?

Question 9

`kmeans(x, nstart=25)` means?

Question 10

Hierarchical clustering creates?

7. Interview Questions

Q: How do you choose the optimal number of clusters for k-means?

Q: What is the difference between classification and clustering?

8. Summary

Decision trees are interpretable, rule-based classifiers: fit them with rpart() and visualize them with rpart.plot(). KNN is distance-based: scale the features and tune k with cross-validation. K-means is unsupervised partitioning: scale first and use the elbow method to choose k. The silhouette score evaluates cluster quality. The customer segmentation mini project clusters spend, visit frequency, basket size, and tenure into 4 segments (Champions, Loyal Customers, At-Risk, New Customers). Always scale features for distance-based algorithms.

9. Next Chapter Recommendation

In Chapter 25: Working with Real-World Datasets, we analyze production-grade datasets from Kaggle with complete data science workflows.
