# CHAPTER 21
Introduction to Machine Learning
1. Chapter Introduction
Up until now, you have used Pandas and Seaborn to analyze data from the *past*. You found out that 1st-class passengers on the Titanic survived at a higher rate. But what if you want to predict the *future*? If a new passenger buys a ticket today, will they survive? This is where Machine Learning (ML) comes in. This chapter introduces the concepts of ML and the industry-standard library: Scikit-Learn.
2. What is Machine Learning?
Traditional Programming relies on hardcoded rules:
*Data + Rules = Output* (e.g., If Age > 18: output "Adult")
Machine Learning flips this. You give the computer the data and the historical answers, and it *learns the rules itself*: *Data + Historical Answers = Learned Rules (A Model)*
Once the model is trained, you can give it new data, and it will use its learned rules to predict the output.
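The contrast can be sketched in plain Python. This is a toy illustration, not a real ML algorithm: the ages, labels, and the helper names `hardcoded_rule`, `learn_threshold`, and `predict_with_learned_rule` are all made up for this example. The "learning" step simply tries each candidate cut-off and keeps the one that makes the fewest mistakes on the historical answers.

```python
def hardcoded_rule(age):
    """Traditional programming: the programmer writes the rule."""
    return "Adult" if age > 18 else "Minor"


def predict_with_learned_rule(age, cut):
    """Apply a cut-off that was learned from data rather than typed in."""
    return "Adult" if age > cut else "Minor"


def learn_threshold(ages, labels):
    """Machine learning (toy version): find the cut-off with the fewest errors."""
    best_cut, best_errors = None, len(ages) + 1
    for cut in sorted(set(ages)):
        errors = sum(
            predict_with_learned_rule(age, cut) != label
            for age, label in zip(ages, labels)
        )
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


# Data + historical answers (hypothetical values)
ages = [5, 12, 17, 18, 19, 25, 40, 63]
labels = ["Minor", "Minor", "Minor", "Minor", "Adult", "Adult", "Adult", "Adult"]

# "Training" produces the learned rule (the model)
learned_cut = learn_threshold(ages, labels)

# The learned rule is then applied to new, unseen data
new_prediction = predict_with_learned_rule(30, learned_cut)
print(learned_cut, new_prediction)
```

Nobody typed the number 18 into `learn_threshold`; it was recovered from the labeled examples, which is the sense in which the rule is "learned".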
3. Supervised vs. Unsupervised Learning
There are two main branches of Machine Learning:
1. Supervised Learning: The data has a known "answer" or "label" (this covers most applied ML work).
- Regression: Predicting a continuous number. (e.g., Predicting House Prices).
- Classification: Predicting a category. (e.g., Is this email Spam or Not Spam?).
2. Unsupervised Learning: The data has NO labels. The algorithm must find hidden structures on its own.
- Clustering: Grouping similar data together. (e.g., Customer Segmentation based on shopping habits, without knowing who they are).
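The three task types above can be sketched side by side with Scikit-Learn. The numbers are invented toy data, and the choice of LinearRegression, LogisticRegression, and KMeans is only one reasonable option per task, not the only one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression (supervised): the label is a continuous number.
sizes = np.array([[50], [80], [100], [120]])        # hypothetical house sizes
prices = np.array([150.0, 240.0, 300.0, 360.0])     # hypothetical prices
reg = LinearRegression().fit(sizes, prices)
predicted_price = reg.predict(np.array([[90]]))[0]

# Classification (supervised): the label is a category (1 = spam, 0 = not spam).
suspicious_words = np.array([[0], [1], [2], [8], [9], [10]])
is_spam = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(suspicious_words, is_spam)
predicted_label = clf.predict(np.array([[12]]))[0]

# Clustering (unsupervised): there are no labels, only features.
shopping = np.array([[1, 2], [2, 1], [1, 1], [20, 21], [21, 20], [20, 20]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(shopping)
segments = km.labels_  # cluster ids are arbitrary; only the grouping matters

print(predicted_price, predicted_label, segments)
```

Note that the two supervised models receive both features and labels in `.fit()`, while KMeans receives features only.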
4. The Standard ML Workflow
Most supervised Machine Learning projects follow this pipeline:
- 1. Prepare Data: Clean NaNs, scale numbers, encode text to numbers.
- 2. Split Data: Separate data into a Training Set (to learn) and a Testing Set (to evaluate).
- 3. Choose Algorithm: Pick a Scikit-Learn model (e.g., Linear Regression).
- 4. Fit (Train): Pass the Training Data to the model so it can learn.
- 5. Predict: Ask the model to guess the answers for the Testing Data.
- 6. Evaluate: Compare the model's guesses to the real answers to score its accuracy.
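The six steps above can be sketched end to end as follows. The data is synthetic (generated with NumPy under an assumed linear relationship between a hypothetical `area` and `price`), the preparation step is reduced to dropping missing rows, and R² is used as the score because "accuracy" in the strict sense applies to classification rather than regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: price is roughly 3 * area + 50, plus noise (an assumption).
rng = np.random.default_rng(0)
area = rng.uniform(40, 200, size=200)
price = 3.0 * area + 50.0 + rng.normal(0, 10, size=200)
area[::25] = np.nan  # inject a few missing values to clean up

# 1. Prepare Data: drop rows with a missing feature and shape X as 2D.
keep = ~np.isnan(area)
X = area[keep].reshape(-1, 1)
y = price[keep]

# 2. Split Data: hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 3. Choose Algorithm
model = LinearRegression()

# 4. Fit (Train): the model sees only the training set.
model.fit(X_train, y_train)

# 5. Predict: guess the answers for the held-out test set.
y_pred = model.predict(X_test)

# 6. Evaluate: compare guesses to the real answers (R² is 1.0 for a perfect fit).
score = r2_score(y_test, y_pred)
print(f"R² on the test set: {score:.3f}")
```

The test set is never passed to `.fit()`, so the score reflects how the model handles data it has not seen.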
5. Introduction to Scikit-Learn
Scikit-Learn (sklearn) is the premier machine learning library for Python. It contains hundreds of algorithms, all accessible via a unified, simple syntax.
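Installation: Scikit-Learn is published on PyPI under the package name scikit-learn and is typically installed with pip (pip install scikit-learn).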
6. The Scikit-Learn API Design (The 3-Step Process)
Scikit-Learn is famous for its elegant, consistent design. Its predictive models all work using the same three steps: create the model object, train it with .fit(), and make predictions with .predict().
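A minimal sketch of the three steps, assuming they are "create the model, fit, predict" (consistent with the .fit() and .predict() methods named in the summary). The toy data and the choice of two classifiers are illustrative only; the point is that the loop body does not change when the algorithm does.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy training data: one feature, two classes (hypothetical values).
X_train = [[1], [2], [3], [10], [11], [12]]
y_train = [0, 0, 0, 1, 1, 1]
X_new = [[2.5], [11.5]]

predictions = {}
# Step 1: create the model object (two different algorithms here).
for model in (LogisticRegression(), DecisionTreeClassifier(random_state=0)):
    # Step 2: train it on features and labels.
    model.fit(X_train, y_train)
    # Step 3: predict labels for new data.
    predictions[type(model).__name__] = model.predict(X_new).tolist()

print(predictions)
```

Swapping in another classifier only requires changing the object created in step 1.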
7. Terminology: X and y
In Machine Learning notation:
- X (Capital X): The Features. The data you are using to make the prediction (e.g., Square Footage, Number of Bedrooms, Zip Code). This is usually a 2D Pandas DataFrame.
- y (Lowercase y): The Target. The answer you are trying to predict (e.g., House Price). This is usually a 1D Pandas Series.
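As a short sketch of how X and y are usually pulled out of a single table, assuming a small made-up DataFrame named `houses` with hypothetical column names:

```python
import pandas as pd

# Hypothetical housing data; the column names are illustrative only.
houses = pd.DataFrame(
    {
        "square_feet": [850, 1200, 1500, 2100],
        "bedrooms": [2, 3, 3, 4],
        "price": [150000, 210000, 250000, 340000],
    }
)

# X: double brackets select a list of columns and return a 2D DataFrame.
X = houses[["square_feet", "bedrooms"]]

# y: single brackets select one column and return a 1D Series.
y = houses["price"]

print(X.shape, y.shape)
```

Equivalently, `houses.drop(columns="price")` would give the same X when every remaining column is a feature.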
8. Common Mistakes
- Thinking ML is magic: ML algorithms are just complex statistical equations. If you feed them terrible, messy data, they will learn terrible, messy rules. (Garbage In, Garbage Out).
- Using Classification for Numbers: Trying to use a Logistic Regression (Classification) model to predict a House Price (Regression). You must understand your Target variable before choosing an algorithm.
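The second mistake can be demonstrated directly. In this sketch (toy, invented prices), a classifier is handed a continuous target; Scikit-Learn classifiers validate their targets and are expected to reject non-integer float labels with a ValueError, while a regression model accepts the same data.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[50], [80], [100], [120]]        # hypothetical house sizes
y = [150.5, 240.25, 300.75, 360.5]    # continuous target (hypothetical prices)

# Wrong tool: a classifier expects discrete categories, not continuous numbers.
try:
    LogisticRegression().fit(X, y)
    classifier_error = None
except ValueError as exc:
    classifier_error = str(exc)

# Right tool: a regression model is designed for a continuous target.
reg = LinearRegression().fit(X, y)
estimate = reg.predict([[90]])[0]

print("Classifier raised:", classifier_error)
print("Regression estimate:", round(estimate, 2))
```

Despite its name, Logistic Regression is a classification algorithm, which is why it belongs on the "wrong tool" side here.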
9. MCQs
What is the fundamental difference between traditional programming and machine learning?
Which branch of ML is used when your data has historical "answers" (labels) to learn from?
Predicting whether a tumor is Malignant or Benign is an example of what?
Predicting the temperature for tomorrow (e.g., 72.5 degrees) is an example of what?
What is the industry-standard Machine Learning library in Python?
In ML terminology, what does X represent?
In ML terminology, what does y represent?
What Scikit-Learn method is called to actually TRAIN the model?
Grouping customers into 3 distinct marketing segments based on behavior, without having predefined labels, is an example of?
What Scikit-Learn method is called to make guesses on unseen data?
10. Interview Questions
- Q: Explain the difference between Supervised and Unsupervised Machine Learning. Give a business use-case for both.
- Q: What is the difference between Classification and Regression tasks?
11. Summary
Machine Learning allows computers to learn patterns from data rather than relying on hardcoded rules. Supervised learning predicts known targets (Regression for numbers, Classification for categories), while Unsupervised learning finds hidden clusters. The entire process is standardized in Python using Scikit-Learn's .fit() and .predict() API, using X for features and y for the target.
12. Next Chapter Recommendation
In Chapter 22: Data Preprocessing for Machine Learning, we will look at the mandatory steps required *before* you can run .fit(). Algorithms only understand numbers, so we must learn how to convert text categories into numbers, and how to perform Train-Test Splits.