CHAPTER 23 Beginner

Machine Learning Basics in R

Updated: May 18, 2026

1. Chapter Introduction

R's caret (Classification and Regression Training) package provides a unified interface to more than 200 ML algorithms: one consistent API for training, tuning, and evaluating models. This chapter builds a complete ML workflow, from preprocessing and train/test splitting through cross-validated training to evaluation.

2. ML Workflow in R

text
MACHINE LEARNING WORKFLOW:

Raw Data
    ↓
Data Preprocessing
  • Cleaning
  • Feature engineering
  • Scaling/normalization
  • Encoding categoricals
    ↓
Train/Test Split
  • 70-80% training
  • 20-30% testing
    ↓
Model Training
  • Choose algorithm
  • Set hyperparameters
  • Cross-validation
    ↓
Model Evaluation
  • Accuracy, RMSE, F1
  • ROC curve, AUC
    ↓
Prediction
  • Apply to new data
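
The train/test split step in the workflow above is usually stratified for classification, so both partitions keep similar class proportions. A minimal base R sketch of that idea follows; the data, the seed, and the helper name stratified_split() are made up for illustration, and this is not caret's own implementation (the next section uses createDataPartition() for the same purpose).

```r
set.seed(1)  # hypothetical seed, only for reproducibility

# Toy outcome with roughly 20% positives (made-up data for illustration)
y <- factor(sample(c("No", "Yes"), 200, replace = TRUE, prob = c(0.8, 0.2)))

# Hypothetical helper: sample a fraction p of the row indices within each class
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

train_idx <- stratified_split(y, p = 0.8)
train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Class proportions should be close in both partitions
print(round(prop.table(table(train_y)), 3))
print(round(prop.table(table(test_y)), 3))
```

Sampling within each class, rather than over all rows at once, is what keeps a rare class from ending up almost entirely in one partition.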

3. Data Preprocessing and Splitting

r
library(caret)
library(dplyr)

set.seed(42)
# Generate classification dataset: predict employee churn
n <- 500
employees <- data.frame(
  age         = sample(22:60, n, replace=TRUE),
  salary      = round(runif(n, 35000, 120000), -3),
  tenure      = sample(1:20, n, replace=TRUE),
  satisfaction= round(runif(n, 1, 10), 1),
  distance    = sample(1:50, n, replace=TRUE),
  dept        = sample(c("IT","HR","Finance","Sales"), n, replace=TRUE)
)
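# Churn more likely: low satisfaction, long commute, low salary.
# The intercept of 1 is an assumed value that keeps the churn rate at roughly
# 30%; a strongly negative intercept leaves too few "Yes" cases for 10-fold CV.
employees$churn <- factor(
  rbinom(n, 1, prob = plogis(1 - 0.3 * employees$satisfaction +
                               0.05 * employees$distance -
                               employees$salary / 50000)),
  levels = c(0, 1), labels = c("No", "Yes")  # explicit levels: safe if a class is absent
)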

cat("Churn Distribution:\n")
print(prop.table(table(employees$churn)) * 100)

# One-hot encode categorical variables
employees_encoded <- dummyVars(~ ., data=employees %>% select(-churn),
                                fullRank=TRUE) %>%
  predict(employees %>% select(-churn)) %>%
  as.data.frame()
employees_encoded$churn <- employees$churn

# Train/Test split (80/20)
set.seed(42)
train_idx <- createDataPartition(employees_encoded$churn, p=0.8, list=FALSE)
train <- employees_encoded[train_idx, ]
test  <- employees_encoded[-train_idx, ]
cat(sprintf("Train: %d | Test: %d\n", nrow(train), nrow(test)))

# Feature scaling (for distance-based algorithms)
preProc <- preProcess(train %>% select(-churn), method=c("center","scale"))
train_scaled <- predict(preProc, train)
test_scaled  <- predict(preProc, test)
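
What the preProcess() and predict() pair above does for centering and scaling can be illustrated by hand. The sketch below uses base R only, with made-up data and a hypothetical seed; it is meant to show the principle (learn the mean and standard deviation on the training rows, reuse them on the test rows), not to reproduce caret's internals.

```r
set.seed(1)  # hypothetical seed for a reproducible toy example

# Made-up numeric feature, split into train and test rows
x <- rnorm(100, mean = 50, sd = 10)
x_train <- x[1:80]
x_test  <- x[81:100]

# Learn the scaling parameters from the training rows only
mu    <- mean(x_train)
sigma <- sd(x_train)

# Apply the same parameters to both partitions
x_train_scaled <- (x_train - mu) / sigma
x_test_scaled  <- (x_test  - mu) / sigma

# Train is exactly centered and scaled; test is only approximately so
cat(sprintf("train: mean = %.3f, sd = %.3f\n",
            mean(x_train_scaled), sd(x_train_scaled)))
cat(sprintf("test:  mean = %.3f, sd = %.3f\n",
            mean(x_test_scaled), sd(x_test_scaled)))
```

The test partition not having a mean of exactly 0 is expected: it was transformed with parameters it did not contribute to, which is the point of fitting on train only.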

4. Model Training with caret

r
# Cross-validation setup (10-fold CV)
ctrl <- trainControl(
  method          = "cv",           # k-fold cross-validation
  number          = 10,             # 10 folds
  classProbs      = TRUE,           # Compute class probabilities
  summaryFunction = twoClassSummary,
  savePredictions = TRUE
)

# Train multiple models. Resetting the seed before each call gives every
# model the same CV folds, so resamples() compares them on identical splits.
# Logistic Regression
set.seed(42)
model_lr <- train(churn ~ ., data=train_scaled, method="glm",
                   family="binomial", trControl=ctrl, metric="ROC")
# k-Nearest Neighbors
set.seed(42)
model_knn <- train(churn ~ ., data=train_scaled, method="knn",
                    tuneLength=10, trControl=ctrl, metric="ROC")
# Random Forest (requires the randomForest package)
set.seed(42)
model_rf <- train(churn ~ ., data=train_scaled, method="rf",
                   tuneLength=5, trControl=ctrl, metric="ROC")

# Compare models
results <- resamples(list(LR=model_lr, KNN=model_knn, RF=model_rf))
summary(results)
dotplot(results, main="Model Comparison — Cross-Validated ROC")
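
To make the cross-validation step less of a black box, here is a hand-rolled k-fold loop in base R. It is a simplified sketch under stated assumptions: the data are made up, the folds are plain random (not stratified), only a logistic regression is fitted, and accuracy at a 0.5 threshold stands in for the ROC metric used above.

```r
set.seed(1)  # hypothetical seed

# Made-up binary classification data
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 + 1.5 * d$x1 - 1.0 * d$x2))

k <- 5
# Assign each row to one of k folds at random
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k-1 folds, score on the held-out fold
fold_acc <- sapply(seq_len(k), function(i) {
  train_rows <- d[fold != i, ]
  held_out   <- d[fold == i, ]
  fit  <- glm(y ~ x1 + x2, data = train_rows, family = binomial)
  prob <- predict(fit, newdata = held_out, type = "response")
  mean((prob > 0.5) == (held_out$y == 1))
})

print(round(fold_acc, 3))
cat(sprintf("Mean CV accuracy: %.3f\n", mean(fold_acc)))
```

Every row is used for evaluation exactly once and never by a model that saw it during fitting, which is why the averaged score is a more stable estimate than a single split.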

5. Model Evaluation

r
# Predict on test set
pred_lr  <- predict(model_lr,  newdata=test_scaled)
pred_rf  <- predict(model_rf,  newdata=test_scaled)

# Confusion Matrix
cm_rf <- confusionMatrix(pred_rf, test_scaled$churn, positive="Yes")
print(cm_rf)

cat(sprintf("\nRandom Forest Performance:\n"))
cat(sprintf("  Accuracy:  %.3f\n", cm_rf$overall["Accuracy"]))
cat(sprintf("  Precision: %.3f\n", cm_rf$byClass["Precision"]))
cat(sprintf("  Recall:    %.3f\n", cm_rf$byClass["Recall"]))
cat(sprintf("  F1 Score:  %.3f\n", cm_rf$byClass["F1"]))
cat(sprintf("  Kappa:     %.3f\n", cm_rf$overall["Kappa"]))

# ROC Curve
prob_rf <- predict(model_rf, newdata=test_scaled, type="prob")[, "Yes"]
library(pROC)
# direction="<" states explicitly that higher probabilities indicate "Yes"
roc_obj <- roc(test_scaled$churn, prob_rf, levels=c("No","Yes"), direction="<")
cat(sprintf("  AUC:       %.3f\n", auc(roc_obj)))
plot(roc_obj, main="ROC Curve — Random Forest", col="#1565C0", lwd=2)

# Feature Importance
imp <- varImp(model_rf)
plot(imp, main="Feature Importance — Random Forest")
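
The metrics reported above can be checked by hand on a small example. The sketch below uses made-up labels, predictions, and scores in base R; the AUC line uses the rank-based (Mann-Whitney) formulation, which is one standard way to compute it and not necessarily how pROC does so internally.

```r
# Made-up labels and predictions, purely for illustration
actual    <- factor(c("Yes","Yes","Yes","Yes","No","No","No","No","No","No"),
                    levels = c("No", "Yes"))
predicted <- factor(c("Yes","Yes","Yes","No","Yes","No","No","No","No","No"),
                    levels = c("No", "Yes"))

tp <- sum(predicted == "Yes" & actual == "Yes")  # true positives:  3
fp <- sum(predicted == "Yes" & actual == "No")   # false positives: 1
fn <- sum(predicted == "No"  & actual == "Yes")  # false negatives: 1
tn <- sum(predicted == "No"  & actual == "No")   # true negatives:  5

accuracy  <- (tp + tn) / length(actual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# AUC from made-up predicted probabilities: the share of (Yes, No) pairs
# in which the "Yes" case receives the higher score
score <- c(0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1)
r     <- rank(score)
n_pos <- sum(actual == "Yes")
n_neg <- sum(actual == "No")
auc   <- (sum(r[actual == "Yes"]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

cat(sprintf("Accuracy %.3f | Precision %.3f | Recall %.3f | F1 %.3f | AUC %.3f\n",
            accuracy, precision, recall, f1, auc))
```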

6. Common Mistakes

  • Data leakage: Fitting preProcess() on train and test combined scales the training data with test set statistics, which leaks test information into training. Always fit preprocessing on the training set only, then apply it to the test set.
  • Accuracy for imbalanced data: If 95% of employees don't churn, a model that predicts "No" for everyone achieves 95% accuracy but never identifies a single churner. Use F1, AUC, or precision-recall instead.
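
The accuracy pitfall in the second bullet is easy to reproduce. This base R sketch uses made-up labels with a 95/5 class split and a "model" that always predicts the majority class; it is an illustration of the arithmetic, not output from the chapter's models.

```r
# Made-up imbalanced outcome: 95 "No" and 5 "Yes"
actual <- factor(rep(c("No", "Yes"), times = c(95, 5)), levels = c("No", "Yes"))

# A "model" that always predicts the majority class
predicted <- factor(rep("No", 100), levels = c("No", "Yes"))

accuracy <- mean(predicted == actual)
recall   <- sum(predicted == "Yes" & actual == "Yes") / sum(actual == "Yes")

cat(sprintf("Accuracy: %.2f\n", accuracy))  # 0.95
cat(sprintf("Recall:   %.2f\n", recall))    # 0.00
```

Accuracy looks strong while recall for the class of interest is zero, which is why a class-aware metric is needed here.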

7. MCQs

Question 1

What does createDataPartition() ensure?

Question 2

What does 10-fold cross-validation do?

Question 3

What inputs does confusionMatrix() require?

Question 4

What does an AUC of 0.5 mean?

Question 5

For which kinds of algorithms is feature scaling critical?

Question 6

What does preProcess(train, method=c("center","scale")) apply?

Question 7

What does varImp(model) show?

Question 8

What does recall (sensitivity) measure?

Question 9

When does data leakage occur in ML?

Question 10

When is the F1 score useful?

8. Interview Questions

  • Q: What is cross-validation and why is it important?
  • Q: What is data leakage and how do you prevent it?

9. Summary

ML in R with caret follows a fixed sequence: createDataPartition() for a stratified split, preProcess() for scaling (fit on train only!), trainControl(method="cv") for cross-validation, and train() for model fitting. Evaluate with confusionMatrix(), which provides accuracy, precision, recall, and F1. For imbalanced data, prefer AUC-ROC and F1 over accuracy. Use resamples() with dotplot() to compare models.

10. Next Chapter Recommendation

In Chapter 24: Classification and Clustering, we implement decision trees, KNN, k-means clustering, and build a customer segmentation system.
