CHAPTER 20 Beginner

Correlation and Regression Analysis

Updated: May 18, 2026

Correlation and Regression Analysis in R

1. Chapter Introduction

Regression is one of R's core strengths: it lets you model relationships between variables, quantify them, and make predictions. This chapter covers correlation analysis, simple and multiple linear regression, and model evaluation, and closes with a mini project that builds a house price prediction model.

2. Correlation Analysis

library(dplyr)
library(ggplot2)
library(corrplot)

# Pearson correlation (linear relationship)
x <- c(30, 45, 52, 60, 72, 81, 90, 102, 115, 125)  # Advertising spend ($K)
y <- c(120, 175, 190, 220, 265, 300, 330, 390, 445, 490)  # Sales ($K)

r <- cor(x, y)
cat(sprintf("Pearson r = %.3f\n", r))  # 0.998 (very strong positive)
# Interpretation: r = 0.998 → nearly perfect linear relationship

# Correlation test (is r significantly different from 0?)
cor.test(x, y)
# cor = 0.998, p-value < 0.001 → significant correlation

# Spearman (rank-based, non-parametric)
cor(x, y, method="spearman")
cor.test(x, y, method="spearman")

# Kendall's tau
cor(x, y, method="kendall")

# Correlation matrix
car_vars <- mtcars[, c("mpg","hp","wt","disp","qsec")]  # named to avoid masking data()
cor_matrix <- cor(car_vars)
print(round(cor_matrix, 2))

# Visualize correlation matrix
corrplot(cor_matrix, method="color", type="upper",
          addCoef.col="black", number.cex=0.8,
          col=colorRampPalette(c("#F44336","white","#1565C0"))(200))
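To make the coefficients above less of a black box, here is a minimal sketch that recomputes Pearson's r from its definition and Spearman's rho as Pearson's r on ranks. It reuses the advertising and sales vectors from this section; the curved variable `z` is a hypothetical addition, included only to show where the two measures diverge.

```r
# Same illustrative data as the advertising/sales example above
x <- c(30, 45, 52, 60, 72, 81, 90, 102, 115, 125)
y <- c(120, 175, 190, 220, 265, 300, 330, 390, 445, 490)

# Pearson r: sum of cross-deviations divided by the root of the
# product of the two sums of squared deviations
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Spearman's rho: Pearson's r applied to the ranks
rho_manual <- cor(rank(x), rank(y))

# Hypothetical curved but monotonic variable, for illustration only
z <- exp(x / 20)
pearson_xz  <- cor(x, z)                       # below 1: not a straight line
spearman_xz <- cor(x, z, method = "spearman")  # 1: the ranks agree perfectly

cat(sprintf("Pearson  manual vs cor(): %.4f vs %.4f\n", r_manual, cor(x, y)))
cat(sprintf("Spearman manual vs cor(): %.4f vs %.4f\n",
            rho_manual, cor(x, y, method = "spearman")))
cat(sprintf("x vs exp(x/20): Pearson %.3f, Spearman %.3f\n",
            pearson_xz, spearman_xz))
```

Spearman only looks at ordering, so any strictly increasing transformation leaves it unchanged, while Pearson drops as the relationship bends away from a straight line.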

3. Simple Linear Regression

# SLR: y = β₀ + β₁x + ε
# H₀: β₁ = 0 (no linear relationship)
model <- lm(Sales ~ Advertising, data=data.frame(Advertising=x, Sales=y))
summary(model)

# Extract coefficients
coef(model)       # β₀ (intercept) and β₁ (slope)
confint(model)    # 95% confidence intervals for coefficients

# Fit statistics (residual diagnostics come from plot(model))
cat("R-squared:  ", summary(model)$r.squared, "\n")  # % variance explained
cat("Adj R²:     ", summary(model)$adj.r.squared, "\n")
cat("RMSE:       ", sqrt(mean(residuals(model)^2)), "\n")
cat("F-statistic:", summary(model)$fstatistic[1], "\n")

# Predictions
new_data <- data.frame(Advertising=c(50, 75, 100))
predict(model, new_data)                         # Point predictions
predict(model, new_data, interval="confidence")  # 95% CI for the mean response
predict(model, new_data, interval="prediction")  # 95% PI for a new observation (wider)

# Visualization
ggplot(data.frame(x, y), aes(x, y)) +
  geom_point(color="#1565C0", size=3) +
  geom_smooth(method="lm", color="red", se=TRUE) +
  labs(title=sprintf("Linear Regression (R² = %.3f)", summary(model)$r.squared),
       x="Advertising Spend ($K)", y="Sales ($K)") +
  theme_minimal()
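As a cross-check on what lm() reports, the slope and intercept of a simple regression can be computed directly from the least-squares formulas. The sketch below reuses the advertising data from this section; the variable names (`slope`, `intercept`, `ci_width`, `pi_width`) are illustrative choices, not part of the chapter's code.

```r
# Same illustrative data as the advertising/sales example above
x <- c(30, 45, 52, 60, 72, 81, 90, 102, 115, 125)
y <- c(120, 175, 190, 220, 265, 300, 330, 390, 445, 490)

# Least-squares estimates from the closed-form formulas
slope     <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# R-squared as 1 - SSE / SST
fitted_manual <- intercept + slope * x
sse <- sum((y - fitted_manual)^2)   # unexplained variation
sst <- sum((y - mean(y))^2)         # total variation
r2_manual <- 1 - sse / sst

# Compare with lm()
fit <- lm(y ~ x)
print(rbind(manual = c(intercept, slope), lm = unname(coef(fit))))
cat(sprintf("R-squared manual: %.4f, lm: %.4f, cor^2: %.4f\n",
            r2_manual, summary(fit)$r.squared, cor(x, y)^2))

# Interval widths at one new point: prediction interval vs confidence interval
new_x    <- data.frame(x = 75)
conf_int <- predict(fit, new_x, interval = "confidence")
pred_int <- predict(fit, new_x, interval = "prediction")
ci_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pi_width <- pred_int[, "upr"] - pred_int[, "lwr"]
cat(sprintf("CI width: %.2f, PI width: %.2f\n", ci_width, pi_width))
```

In simple regression, R² equals the squared Pearson correlation. The prediction interval is always wider than the confidence interval because it also accounts for the noise in a single new observation.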

4. Multiple Linear Regression + Mini Project

# ─── HOUSE PRICE PREDICTION MODEL ────────────────────
set.seed(42)
n <- 200
houses <- data.frame(
  size_sqft  = round(runif(n, 800, 3500)),
  bedrooms   = sample(2:6, n, replace=TRUE),
  bathrooms  = sample(1:4, n, replace=TRUE),
  age_years  = sample(1:50, n, replace=TRUE),
  garage     = sample(0:2, n, replace=TRUE),
  location   = sample(c("Urban","Suburban","Rural"), n, replace=TRUE)
)
# Simulate prices from a known linear formula plus Gaussian noise,
# rounded to the nearest $1,000
houses$price <- round(
  150000 +
  houses$size_sqft * 120 +
  houses$bedrooms * 8000 +
  houses$bathrooms * 12000 -
  houses$age_years * 1500 +
  houses$garage * 15000 +
  ifelse(houses$location=="Urban", 50000, ifelse(houses$location=="Suburban", 20000, 0)) +
  rnorm(n, 0, 15000), -3)

# Train/test split (80/20)
set.seed(42)
train_idx <- sample(1:n, floor(0.8*n))
train <- houses[train_idx, ]
test  <- houses[-train_idx, ]

# Fit MLR model
model_mlr <- lm(price ~ size_sqft + bedrooms + bathrooms + age_years + garage + location,
                 data=train)
summary(model_mlr)

# Evaluate on test set
predictions <- predict(model_mlr, newdata=test)
actuals     <- test$price
rmse <- sqrt(mean((predictions - actuals)^2))
mae  <- mean(abs(predictions - actuals))
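# Out-of-sample R²: 1 - SSE/SST (squared correlation would ignore bias in the predictions)
r2   <- 1 - sum((actuals - predictions)^2) / sum((actuals - mean(actuals))^2)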

cat("\n=== MODEL EVALUATION ===\n")
cat(sprintf("Train R²: %.4f\n", summary(model_mlr)$r.squared))
cat(sprintf("Test R²:  %.4f\n", r2))
cat(sprintf("RMSE:     $%.0f\n", rmse))
cat(sprintf("MAE:      $%.0f\n", mae))

# Rough predictor ranking by |t value| (these are not standardized coefficients)
cat("\nCoefficients ordered by |t value|:\n")
coef_summary <- summary(model_mlr)$coefficients
coef_summary <- coef_summary[order(abs(coef_summary[,"t value"]), decreasing=TRUE), ]
print(round(coef_summary, 3))

# Diagnostic plots
par(mfrow=c(2,2))
plot(model_mlr)
par(mfrow=c(1,1))
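The listing above ranks predictors by t-value. For comparison, here is a minimal sketch of standardized coefficients, which put numeric predictors on a common scale. To stay self-contained it uses the built-in mtcars data rather than the simulated houses, and the choice of predictors is an assumption made for illustration. Categorical predictors such as location would need separate handling.

```r
# Illustrative subset of the built-in mtcars data (numeric columns only)
vars <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# Fit on the raw scale
raw_fit <- lm(mpg ~ hp + wt + qsec, data = vars)

# Fit after z-scoring every column (mean 0, sd 1)
scaled  <- as.data.frame(scale(vars))
std_fit <- lm(mpg ~ hp + wt + qsec, data = scaled)
std_coefs <- coef(std_fit)[-1]   # drop the intercept, which is ~0 after scaling

# Equivalent shortcut: raw coefficient * sd(predictor) / sd(response)
manual <- coef(raw_fit)[-1] *
  sapply(vars[, c("hp", "wt", "qsec")], sd) / sd(vars$mpg)

# Sort by absolute size: change in mpg (in sd units) per 1-sd change in x
print(round(sort(abs(std_coefs), decreasing = TRUE), 3))
```

Both routes give the same numbers, so the shortcut is handy when refitting on scaled data is inconvenient.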

5. Common Mistakes

Correlation ≠ Causation: A strong correlation between ice cream sales and drownings (both rise in summer) does not mean ice cream causes drowning. Always consider confounders.
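The ice cream example can be simulated. In the sketch below every number is invented for illustration: a hypothetical `temperature` variable drives both series, and neither depends on the other. The raw correlation is strong, but it largely disappears once temperature is regressed out of both variables (a partial correlation).

```r
set.seed(1)
n <- 500

# Hypothetical confounder and two outcomes that depend only on it
temperature <- rnorm(n, mean = 20, sd = 8)
ice_cream   <- 50 + 5.0 * temperature + rnorm(n, 0, 20)
drownings   <-  2 + 0.3 * temperature + rnorm(n, 0, 1.5)

# Raw correlation between the two outcomes
raw_r <- cor(ice_cream, drownings)

# Partial correlation: correlate the residuals left after
# removing the linear effect of temperature from each outcome
res_ice   <- residuals(lm(ice_cream ~ temperature))
res_drown <- residuals(lm(drownings ~ temperature))
partial_r <- cor(res_ice, res_drown)

cat(sprintf("Raw correlation:     %.3f\n", raw_r))
cat(sprintf("Partial correlation: %.3f\n", partial_r))
```

This only works when the confounder is measured. With real data, an unmeasured confounder cannot be removed this way.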

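Not checking regression assumptions: Linear regression assumes linearity, independent errors, normally distributed residuals, homoscedasticity (constant residual variance), and, in multiple regression, no severe multicollinearity. Always run plot(model) to inspect the diagnostic plots.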
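Alongside the plots, a few numeric checks can be sketched with base R alone. The example below uses mtcars with an arbitrarily chosen set of predictors. The variance inflation factors are computed by hand, and the heteroscedasticity check follows the idea of the studentized Breusch-Pagan test. Treat it as a rough illustration under those assumptions rather than a replacement for dedicated packages.

```r
predictors <- c("hp", "wt", "qsec")   # illustrative choice
fit <- lm(mpg ~ hp + wt + qsec, data = mtcars)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal)
shapiro_p <- shapiro.test(residuals(fit))$p.value

# Multicollinearity: VIF_j = 1 / (1 - R2_j), where R2_j comes from
# regressing predictor j on the remaining predictors
vif_manual <- sapply(predictors, function(p) {
  aux <- lm(reformulate(setdiff(predictors, p), response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

# Homoscedasticity: regress squared residuals on the predictors;
# n * R2 is compared against a chi-squared distribution
res_sq  <- residuals(fit)^2
aux_fit <- lm(res_sq ~ hp + wt + qsec, data = mtcars)
bp_stat <- nrow(mtcars) * summary(aux_fit)$r.squared
bp_p    <- pchisq(bp_stat, df = length(predictors), lower.tail = FALSE)

cat(sprintf("Shapiro-Wilk p-value:        %.3f\n", shapiro_p))
cat(sprintf("Heteroscedasticity p-value:  %.3f\n", bp_p))
print(round(vif_manual, 2))
```

Small p-values suggest a violated assumption. For VIF, common rules of thumb flag values above roughly 5 to 10, though the cutoff is a convention and not a strict rule.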

6. MCQs

Question 1

Pearson correlation r = -0.9 means?

Question 2

`lm(y ~ x, data)` fits?

Question 3

R-squared measures?

Question 4

`predict(model, newdata, interval="prediction")` provides?

Question 5

`cor.test()` tests?

Question 6

Multiple regression `lm(y ~ x1 + x2 + x3)`?

Question 7

RMSE measures?

Question 8

`residuals(model)` extracts?

Question 9

Spearman correlation is preferred when?

Question 10

`plot(model)` produces?

7. Interview Questions

Q: What is the difference between R-squared and adjusted R-squared?
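One way to explore this question is to add a predictor that is pure noise. R² can never decrease when a term is added, while adjusted R² applies a penalty for each extra parameter and typically falls when the new term adds nothing. The sketch below uses mtcars; the `noise` column and the `adj_r2` helper are hypothetical names introduced for illustration.

```r
set.seed(123)

# Baseline model and the same model with a pure-noise predictor added
base_fit <- lm(mpg ~ wt, data = mtcars)
cars_noise <- mtcars
cars_noise$noise <- rnorm(nrow(cars_noise))   # unrelated to mpg by construction
noise_fit <- lm(mpg ~ wt + noise, data = cars_noise)

# Adjusted R-squared: n observations, p predictors (excluding the intercept)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- nrow(mtcars)
r2_base  <- summary(base_fit)$r.squared
r2_noise <- summary(noise_fit)$r.squared

cat(sprintf("R-squared:     %.4f -> %.4f\n", r2_base, r2_noise))
cat(sprintf("Adj R-squared: %.4f -> %.4f\n",
            adj_r2(r2_base, n, 1), adj_r2(r2_noise, n, 2)))
```

The hand-written formula matches what summary() reports as adj.r.squared, which makes adjusted R² the safer number when comparing models with different numbers of predictors.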

Q: How do you check linear regression assumptions in R?

8. Summary

Correlation: cor() (Pearson, Spearman, Kendall), with cor.test() for significance. Simple regression: lm(y ~ x), evaluated with R² and RMSE. Multiple regression: lm(y ~ x1 + x2 + ...). Predictions: predict(model, newdata). Model diagnostics: plot(model) produces four diagnostic plots for checking assumptions. Coefficient t-values give a rough ranking of predictors. Always evaluate on a held-out test set for an unbiased estimate of performance. Correlation ≠ causation.

9. Next Chapter Recommendation

In Chapter 21: Time Series Analysis in R, we analyze temporal data — trends, seasonality, decomposition, and forecasting.
