Python for Data Science
CHAPTER 21 Beginner

Introduction to Machine Learning

Updated: May 18, 2026
5 min read

1. Chapter Introduction

Up until now, you have used Pandas and Seaborn to analyze data from the *past*. You found out that 1st-class passengers on the Titanic survived at a higher rate. But what if you want to predict the *future*? If a new passenger buys a ticket today, will they survive? This is where Machine Learning (ML) comes in. This chapter introduces the concepts of ML and the industry-standard library: Scikit-Learn.

2. What is Machine Learning?

Traditional Programming relies on hardcoded rules: *Data + Rules = Output* (e.g., If Age >= 18: output "Adult").

Machine Learning flips this. You give the computer the data and the historical answers, and it *learns the rules itself*: *Data + Historical Answers = Learned Rules (A Model)*

Once the model is trained, you can give it new data, and it will use its learned rules to predict the output.
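
To make the contrast concrete, here is a minimal pure-Python sketch. The ages, the labels, and the function names `is_adult_rule`, `learn_threshold`, and `predict` are invented for illustration. Real models learn far richer rules, but the idea is the same: the cutoff comes from the historical answers rather than from the programmer.

```python
# Traditional programming: the programmer writes the rule by hand.
def is_adult_rule(age):
    return "Adult" if age >= 18 else "Minor"


# Machine learning (toy version): the rule is learned from labeled examples.
def learn_threshold(ages, labels):
    """Return the cutoff that classifies the most training examples correctly."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted(set(ages)):
        correct = sum(
            ("Adult" if age >= cutoff else "Minor") == label
            for age, label in zip(ages, labels)
        )
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


# Hypothetical historical data: ages and their known answers.
ages = [5, 12, 17, 18, 25, 40]
labels = ["Minor", "Minor", "Minor", "Adult", "Adult", "Adult"]

# "Training": the learned model here is just a single number.
cutoff = learn_threshold(ages, labels)


def predict(age):
    """Apply the learned rule to new, unseen data."""
    return "Adult" if age >= cutoff else "Minor"


print(cutoff)
print(predict(30), predict(10))
```

Nobody typed the number 18 into `learn_threshold`; it was recovered from the examples.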

3. Supervised vs. Unsupervised Learning

There are two main branches of Machine Learning:

1. Supervised Learning: The data has a known "answer" or "label". Most applied ML work falls into this branch.

  • Regression: Predicting a continuous number. (e.g., Predicting House Prices).
  • Classification: Predicting a category. (e.g., Is this email Spam or Not Spam?).

2. Unsupervised Learning: The data has NO labels. The algorithm must find hidden structures on its own.

  • Clustering: Grouping similar data together. (e.g., Customer Segmentation based on shopping habits, without knowing who they are).
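
The three task types above can be sketched side by side with Scikit-Learn. This is only an illustration: the data is made up, and `LinearRegression`, `LogisticRegression`, and `KMeans` are just one possible choice of algorithm for each task. Notice that the two supervised models receive labels in `fit()`, while the clustering model receives none.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up feature matrix: six samples, one feature each.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Supervised, regression: the label is a continuous number.
y_number = np.array([10.0, 20.0, 30.0, 100.0, 110.0, 120.0])
regressor = LinearRegression().fit(X, y_number)

# Supervised, classification: the label is a category.
y_category = np.array(["low", "low", "low", "high", "high", "high"])
classifier = LogisticRegression().fit(X, y_category)

# Unsupervised, clustering: no labels are passed to fit().
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(regressor.predict([[5.0]]))   # a number
print(classifier.predict([[2.0]]))  # a category
print(clusterer.labels_)            # one cluster id per sample
```

The cluster ids are arbitrary integers. The algorithm only says which samples belong together, not what each group means.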

4. The Standard ML Workflow

Most supervised Machine Learning projects follow this pipeline:

  1. Prepare Data: Clean NaNs, scale numbers, encode text to numbers.
  2. Split Data: Separate data into a Training Set (to learn) and a Testing Set (to evaluate).
  3. Choose Algorithm: Pick a Scikit-Learn model (e.g., Linear Regression).
  4. Fit (Train): Pass the Training Data to the model so it can learn.
  5. Predict: Ask the model to guess the answers for the Testing Data.
  6. Evaluate: Compare the model's guesses to the real answers to score how well it performs.
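
The six steps can be sketched end to end as follows. Everything specific here is an assumption made for illustration: the data is synthetic (a made-up "square footage" feature and a noisy "price" target), the 80/20 split is one common choice, and R² is used as the score because this is a regression problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 1. Prepare data: synthetic, so it is already clean and numeric.
rng = np.random.default_rng(seed=0)
X = rng.uniform(500, 3500, size=(200, 1))            # e.g. square footage
y = 150 * X[:, 0] + rng.normal(0, 10_000, size=200)  # e.g. price, with noise

# 2. Split data: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Choose algorithm.
model = LinearRegression()

# 4. Fit (train) on the training set only.
model.fit(X_train, y_train)

# 5. Predict on the held-out testing set.
predictions = model.predict(X_test)

# 6. Evaluate: R^2 compares the guesses with the real answers (1.0 is perfect).
score = r2_score(y_test, predictions)
print(f"R^2 on the test set: {score:.3f}")
```

The model never sees `X_test` or `y_test` during training, which is what makes the final score an honest estimate of how it would do on new data.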

5. Introduction to Scikit-Learn

Scikit-Learn (imported as sklearn) is one of the most widely used machine learning libraries for Python. It bundles a large collection of algorithms, all accessible through one consistent, simple syntax.

Installation:

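```bash
# Run in a terminal:
pip install scikit-learn

# In a Jupyter notebook cell, prefix the command with "!" instead:
# !pip install scikit-learn
```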
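
After installing, a quick way to confirm that Python can find the library is to import it and print its version. Note that the package is installed as `scikit-learn` but imported as `sklearn`. The version string you see depends on your environment.

```python
import sklearn

# Prints the installed version; an ImportError here means the install failed.
print(sklearn.__version__)
```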

6. The Scikit-Learn API Design (The 4-Step Process)

Scikit-Learn is famous for its elegant, consistent design. Its supervised models all work using the same four steps:

```python
# 1. IMPORT the algorithm you want to use
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset for illustration: the answer is always 2 * x
X_train, y_train = [[1], [2], [3]], [2, 4, 6]
X_test = [[4]]

# 2. INITIALIZE the model (create an instance of it)
model = LinearRegression()

# 3. FIT the model (train it by giving it Data (X) and Answers (y))
model.fit(X_train, y_train)

# 4. PREDICT (ask it to guess on new data)
predictions = model.predict(X_test)
print(predictions)
```

7. Terminology: X and y

In Machine Learning notation:

  • X (Capital X): The Features. The data you are using to make the prediction (e.g., Square Footage, Number of Bedrooms, Zip Code). This is usually a 2D Pandas DataFrame.
  • y (Lowercase y): The Target. The answer you are trying to predict (e.g., House Price). This is usually a 1D Pandas Series.
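
As a small sketch of how X and y are usually carved out of one table, assume a hypothetical housing DataFrame. The column names and values below are invented for illustration.

```python
import pandas as pd

# Hypothetical housing data; the column names and values are invented.
df = pd.DataFrame(
    {
        "square_feet": [850, 1200, 1500, 2100],
        "bedrooms": [2, 3, 3, 4],
        "price": [155000, 210000, 260000, 350000],
    }
)

X = df[["square_feet", "bedrooms"]]  # double brackets -> 2D DataFrame
y = df["price"]                      # single brackets -> 1D Series

print(X.shape)  # (4, 2): 4 rows (samples), 2 columns (features)
print(y.shape)  # (4,): one target value per row
```

Each row of X lines up with the value of y at the same position, which is how the model knows which answer belongs to which example.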

8. Common Mistakes

  • Thinking ML is magic: ML algorithms are just complex statistical equations. If you feed them terrible, messy data, they will learn terrible, messy rules. (Garbage In, Garbage Out).
  • Using Classification for Numbers: Trying to use a Logistic Regression (Classification) model to predict a House Price (Regression). You must understand your Target variable before choosing an algorithm.
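
One way to check the Target before choosing an algorithm is Scikit-Learn's `type_of_target` utility, which reports whether a set of values looks continuous or categorical. The helper `suggest_task` and the sample values below are hypothetical, and the mapping is a rough rule of thumb rather than a complete decision procedure.

```python
from sklearn.utils.multiclass import type_of_target


def suggest_task(y):
    """Hypothetical helper: map the target's type to a task family."""
    kind = type_of_target(y)
    if kind == "continuous":
        return "Regression"
    if kind in ("binary", "multiclass"):
        return "Classification"
    return f"Unclear (target type: {kind})"


# Made-up targets for illustration.
house_prices = [250000.5, 310500.75, 199999.99]
spam_labels = ["spam", "not spam", "spam"]

print(suggest_task(house_prices))  # Regression
print(suggest_task(spam_labels))   # Classification
```

Whole-number targets such as [1, 2, 3] are reported as "multiclass" even when they are really counts or prices, so treat the result as a hint and confirm it against what the column actually means.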

9. MCQs

Question 1

What is the fundamental difference between traditional programming and machine learning?

Question 2

Which branch of ML is used when your data has historical "answers" (labels) to learn from?

Question 3

Predicting whether a tumor is Malignant or Benign is an example of what?

Question 4

Predicting the temperature for tomorrow (e.g., 72.5 degrees) is an example of what?

Question 5

What is the industry-standard Machine Learning library in Python?

Question 6

In ML terminology, what does X represent?

Question 7

In ML terminology, what does y represent?

Question 8

What Scikit-Learn method is called to actually TRAIN the model?

Question 9

Grouping customers into 3 distinct marketing segments based on behavior, without having predefined labels, is an example of?

Question 10

What Scikit-Learn method is called to make guesses on unseen data?

10. Interview Questions

  • Q: Explain the difference between Supervised and Unsupervised Machine Learning. Give a business use-case for both.
  • Q: What is the difference between Classification and Regression tasks?

11. Summary

Machine Learning allows computers to learn patterns from data rather than relying on hardcoded rules. Supervised learning predicts known targets (Regression for numbers, Classification for categories), while Unsupervised learning finds hidden clusters. The entire process is standardized in Python using Scikit-Learn's .fit() and .predict() API, using X for features and y for the target.

12. Next Chapter Recommendation

In Chapter 22: Data Preprocessing for Machine Learning, we will look at the mandatory steps required *before* you can run .fit(). Algorithms only understand numbers, so we must learn how to convert text categories into math, and how to perform Train-Test Splits.
