CHAPTER 24
Beginner
Classification and Clustering
Updated: May 18, 2026
Classification and Clustering in R
1. Chapter Introduction
Classification predicts which category new data belongs to. Clustering groups data without labels. Together they power customer segmentation, fraud detection, medical diagnosis, and recommendation systems.
2. Decision Trees
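The original code for this section did not survive, so the following is only a minimal sketch of a classification tree with rpart(). The built-in iris dataset, the 70/30 split, and the seed are assumptions chosen for illustration, not the chapter's own example. The plot step is skipped when the rpart.plot package is not installed.

```r
# Minimal sketch: classification tree on the built-in iris data
# (dataset, split ratio, and seed are illustrative assumptions).
library(rpart)

set.seed(42)
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# method = "class" requests a classification (not regression) tree
fit <- rpart(Species ~ ., data = train, method = "class")

# Predict class labels for the held-out rows and compare with the truth
pred     <- predict(fit, newdata = test, type = "class")
conf_mat <- table(predicted = pred, actual = test$Species)
accuracy <- mean(pred == test$Species)

print(conf_mat)
print(accuracy)

# Optional visualization; rpart.plot is a separate package
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(fit)
}
```

Printing `fit` lists the split rules as text, which is often enough to check that the tree is interpretable before plotting it.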
3. K-Nearest Neighbors (KNN)
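As a rough illustration of KNN with scaled features and k tuned by cross-validation, the sketch below uses knn() and knn.cv() (leave-one-out cross-validation) from the class package. The iris data, the grid of odd k values, and the 70/30 split are assumptions made here, not taken from the chapter.

```r
# Minimal sketch: KNN with scaling and leave-one-out CV for k
# (dataset, k grid, and split are illustrative assumptions).
library(class)

set.seed(42)
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
y_train <- iris$Species[train_idx]
y_test  <- iris$Species[-train_idx]

# Scale using training statistics only, then apply them to the test set
X_train <- scale(iris[train_idx, 1:4])
X_test  <- scale(iris[-train_idx, 1:4],
                 center = attr(X_train, "scaled:center"),
                 scale  = attr(X_train, "scaled:scale"))

# Leave-one-out CV accuracy on the training set for each candidate k
k_grid <- seq(1, 15, by = 2)
cv_acc <- sapply(k_grid, function(k) {
  mean(knn.cv(train = X_train, cl = y_train, k = k) == y_train)
})
best_k <- k_grid[which.max(cv_acc)]

# Fit with the selected k and evaluate on the held-out rows
pred     <- knn(train = X_train, test = X_test, cl = y_train, k = best_k)
accuracy <- mean(pred == y_test)

print(data.frame(k = k_grid, cv_accuracy = cv_acc))
print(best_k)
print(accuracy)
```

Reusing the training means and standard deviations for the test set keeps information about the held-out rows out of the model.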
4. K-Means Clustering + Mini Project
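The mini project's code is missing, so here is a hedged sketch of k-means customer segmentation on RFM-style features. The data is synthetic: the `make_group()` helper, the group means, and the column names `recency`, `frequency`, and `monetary` are all hypothetical, invented only to give kmeans() something to partition.

```r
# Minimal sketch: k-means segmentation on synthetic RFM-style data.
# make_group() and all numeric values are hypothetical.
set.seed(42)

make_group <- function(n, recency, frequency, monetary) {
  data.frame(
    recency   = pmax(rnorm(n, recency, recency * 0.2), 1),             # days since last order
    frequency = pmax(round(rnorm(n, frequency, frequency * 0.2)), 1),  # number of orders
    monetary  = pmax(rnorm(n, monetary, monetary * 0.2), 1)            # total spend
  )
}

customers <- rbind(
  make_group(50,   5, 20, 2000),
  make_group(50,  20, 10,  800),
  make_group(50, 150,  6,  500),
  make_group(50,  10,  1,   60)
)

# Scale first so that monetary does not dominate the Euclidean distance
rfm_scaled <- scale(customers)

# nstart = 25 runs 25 random initializations and keeps the best solution
km <- kmeans(rfm_scaled, centers = 4, nstart = 25)
customers$cluster <- km$cluster

# Average RFM values per cluster, on the original (unscaled) scale
profile <- aggregate(cbind(recency, frequency, monetary) ~ cluster,
                     data = customers, FUN = mean)
print(profile)
print(table(customers$cluster))
```

Cluster numbers are arbitrary, so segment names such as Champions, Loyal, At-Risk, and New would be assigned by reading the profile table (for example, low recency with high frequency and spend suggests Champions).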
5. Common Mistakes
- Not scaling before k-means: k-means uses Euclidean distance, so when features have very different ranges (for example, salary versus a binary flag), the large-range feature dominates the clustering. Always scale first.
- Choosing k by gut feeling: use the elbow method, silhouette analysis, or the gap statistic to choose k on objective grounds.
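Both points above can be sketched together: scale the features, then compare candidate values of k using the total within-cluster sum of squares (the quantity plotted in the elbow method) and the average silhouette width. The iris measurements and the range of k from 2 to 8 are assumptions for illustration. silhouette() comes from the cluster package.

```r
# Minimal sketch: elbow and silhouette diagnostics for choosing k
# (dataset and k range are illustrative assumptions).
library(cluster)

set.seed(42)
x <- scale(iris[, 1:4])   # standardize each column to mean 0, sd 1
d <- dist(x)              # Euclidean distances, reused for every k
k_values <- 2:8

# Elbow method: total within-cluster sum of squares for each k
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Silhouette analysis: average silhouette width for each k
avg_sil <- sapply(k_values, function(k) {
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

results <- data.frame(k = k_values, wss = wss, avg_silhouette = avg_sil)
print(results)

# k with the highest average silhouette width
best_k_sil <- k_values[which.max(avg_sil)]
print(best_k_sil)
```

Plotting `wss` against `k_values` with `plot(k_values, wss, type = "b")` gives the usual elbow curve. The two criteria do not always agree, so it is worth looking at both before settling on k.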
6. MCQs
Question 1
Decision tree splits are chosen to?
Question 2
KNN prediction uses?
Question 3
Optimal k in KNN is found by?
Question 4
K-means clustering requires?
Question 5
Elbow method helps choose?
Question 6
scale(x) standardizes to?
Question 7
Silhouette score near +1 means?
Question 8
rpart.plot() visualizes?
Question 9
kmeans(x, nstart=25) means?
Question 10
Hierarchical clustering creates?
7. Interview Questions
- Q: How do you choose the optimal number of clusters for k-means?
- Q: What is the difference between classification and clustering?
8. Summary
Decision trees are interpretable, rule-based classifiers built with rpart() and visualized with rpart.plot(). KNN is distance-based: scale the features and tune k with cross-validation. K-means is unsupervised partitioning: scale first and use the elbow method to choose k. The silhouette score evaluates cluster quality. The customer segmentation mini project uses RFM-style features to produce 4 segments (Champions, Loyal, At-Risk, New). Always scale features for distance-based algorithms.