Setting Up Python and Scikit-learn Environment
# CHAPTER 2
1. Introduction
A chef cannot cook without a kitchen, and a data scientist cannot build models without the proper environment. Before we write our first machine learning algorithm, we need to install Python, an Integrated Development Environment (IDE), and the necessary scientific libraries. In this chapter, we will set up a professional Machine Learning development environment that will serve you throughout your career.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Install the latest version of Python on Windows, macOS, or Linux.
- Understand and create Python Virtual Environments.
- Install VS Code and the Jupyter Notebook extension.
- Use pip to install Scikit-learn, NumPy, and Pandas.
- Verify your installation with a simple test script.
3. Installing Python
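Scikit-learn requires Python. It is highly recommended to use Python 3.8 or newer; recent Scikit-learn releases require a newer Python than that, so check the Scikit-learn installation page for the current minimum.
Windows:
- 1. Go to python.org/downloads.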
- 2. Download the Windows installer.
- 3. CRITICAL STEP: When you run the installer, check the box labeled "Add Python to PATH" (recent installers label it "Add python.exe to PATH") at the bottom of the window before clicking "Install Now".
macOS:
- 1. macOS may already include a copy of Python (for example, through Apple's command line developer tools), but it is often an older version.
- 2. The best way to install modern Python on a Mac is using Homebrew. Open your terminal and run:
brew install python
Linux (Ubuntu/Debian):
Open your terminal and run:
sudo apt update
sudo apt install python3 python3-pip python3-venv
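Whichever operating system you use, a quick way to confirm that the interpreter meets the recommendation above is to ask Python itself. The following is a minimal sketch: the helper name meets_minimum and the MINIMUM constant are illustrative, and the 3.8 threshold is simply this chapter's recommendation.

```python
import sys

# Minimum version recommended in this chapter (raise it if your
# Scikit-learn release requires something newer).
MINIMUM = (3, 8)


def meets_minimum(version_info, minimum=MINIMUM):
    """Return True if the (major, minor) part of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


print("Running Python", sys.version.split()[0])
if meets_minimum(sys.version_info):
    print("This interpreter meets the recommended minimum.")
else:
    print("This interpreter is older than recommended; install a newer Python.")
```

Save it under any name you like (check_python.py is a hypothetical choice) and run it with python on Windows or python3 on Mac/Linux.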
4. Virtual Environments
When you work on multiple Python projects, they might require different versions of libraries (e.g., Project A needs Scikit-learn 1.0, Project B needs Scikit-learn 1.2). If you install everything globally, projects will conflict and break. Virtual Environments solve this by creating an isolated folder for each project's dependencies.
Let's create a folder for our course and set up an environment:
- 1. Open your terminal/command prompt.
- 2. Create a folder: mkdir ml_course and enter it: cd ml_course
- 3. Create the virtual environment (named env):
  - Windows: python -m venv env
  - Mac/Linux: python3 -m venv env
- 4. Activate the environment:
  - Windows Command Prompt: env\Scripts\activate
  - Windows PowerShell: .\env\Scripts\Activate.ps1
  - Mac/Linux: source env/bin/activate
*You will know it worked because your terminal prompt will now start with (env).*
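If you want a second confirmation beyond the (env) prefix, Python can report whether it is running from a virtual environment. This is a small sketch, assuming the environment was created with venv as above; the function name in_virtualenv is illustrative.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv.

    Inside a venv, sys.prefix points at the environment folder while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


# sys.executable shows which python binary is actually running this code.
print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtualenv())
```

With the environment active, the interpreter path should sit inside the env folder of ml_course.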
5. Installing the ML Libraries
With the virtual environment active, we use Python's package manager, pip, to install our tools.
Run this command in your terminal:
*Note: This will download and install the core libraries we need for machine learning, data manipulation, and visualization.*
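Once the installation finishes, one way to confirm that the libraries landed in the active environment is to ask Python whether it can find them. The sketch below is illustrative only: the REQUIRED mapping and the missing_packages helper are hypothetical names, and including matplotlib is an assumption based on the mention of visualization. Note that the pip package scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Map each pip package name to the module name used in `import`.
# matplotlib is an assumed choice for the visualization library.
REQUIRED = {
    "scikit-learn": "sklearn",
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
}


def missing_packages(required):
    """Return the pip names whose modules cannot be found by this interpreter."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Not found:", ", ".join(missing))
    print("Install with: python -m pip install " + " ".join(missing))
else:
    print("All listed libraries can be imported from this environment.")
```

Running python -m pip rather than plain pip makes sure the packages go to the same interpreter that runs the check.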
6. Setting Up VS Code and Jupyter Notebooks
While you can write ML code in a standard .py file, Data Scientists prefer Jupyter Notebooks. Notebooks allow you to write code in chunks (cells), run them one at a time, and see visualizations (like charts and graphs) directly beneath the code.
- 1. Download and install Visual Studio Code (VS Code) from code.visualstudio.com.
- 2. Open VS Code and go to the Extensions tab (the squares icon on the left).
- 3. Search for and install the Python extension (by Microsoft).
- 4. Search for and install the Jupyter extension (by Microsoft).
7. Step-by-Step Implementation: Your First Notebook
Let's verify everything is working.
- 1. In VS Code, open the ml_course folder you created earlier.
- 2. Create a new file named test.ipynb. The .ipynb extension tells VS Code this is a Jupyter Notebook.
- 3. Open test.ipynb. In the top right corner, click "Select Kernel" -> "Python Environments" -> and select the env virtual environment we created earlier.
- 4. Type the following code into the first cell and click the "Play" button next to the cell to run it:
*If this prints the versions without an error, you are ready to go!*
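As one possible version of this check, the sketch below imports each library and prints its version. It is an illustration rather than the chapter's original cell: the library_version helper is a hypothetical name, and it reports a missing library by name instead of stopping with an error.

```python
import importlib
import sys


def library_version(module_name):
    """Import a module and return its version string, or None if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Most scientific libraries expose their version as __version__.
    return getattr(module, "__version__", "unknown")


print("Python:", sys.version.split()[0])
for name in ("sklearn", "numpy", "pandas"):
    print(name + ":", library_version(name) or "not installed in this kernel")
```

If a library shows up as not installed, the notebook is probably using a different kernel than the env environment, or the install step was run without activating it.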
8. Alternative Setup: Google Colab
If your computer is very old or you are struggling with the installation, you can skip all of this and use Google Colab. Google Colab is a free, cloud-based Jupyter Notebook environment that runs in your browser. Python, Scikit-learn, and Pandas are already pre-installed. Just go to colab.research.google.com and start coding!
9. Common Mistakes
- Forgetting to activate the virtual environment: If you open a new terminal tomorrow and run python script.py, you will get a ModuleNotFoundError because you forgot to run the activate command first.
- Installing globally using sudo pip on Mac/Linux: This can overwrite system Python libraries and break your operating system. Always use virtual environments.
10. Best Practices
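- requirements.txt: When you share your code, others need to know what libraries to install. You can automatically generate a list of your installed libraries by running pip freeze > requirements.txt. Anyone with the file can then install the same versions by running pip install -r requirements.txt.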
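To see what kind of information ends up in such a file, the sketch below lists the packages visible to the current interpreter in the same name==version style. It is a rough approximation for illustration, not a replacement for pip freeze (which applies its own filtering and formatting), and freeze_lines is a hypothetical helper name.

```python
from importlib.metadata import distributions


def freeze_lines():
    """Return sorted 'name==version' lines for packages visible to this interpreter."""
    lines = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with incomplete metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in freeze_lines():
    print(line)
```

Run inside the activated environment, the output should stay short and contain only what you installed for the project plus its dependencies.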
11. Exercises
- 1. Open your terminal, create a new folder called ml_practice, create a virtual environment inside it, activate it, and install only numpy.
- 2. Create a Jupyter Notebook, import numpy as np, and print np.array([1, 2, 3]).
12. MCQ Quiz with Answers
Why is it highly recommended to use Python Virtual Environments?
Which tool allows Data Scientists to write Python code in interactive cells and view charts directly inline?
13. Interview Questions
- Q: Explain what pip is and how it relates to Scikit-learn.
- Q: What is the difference between writing Python code in a .py file versus a .ipynb (Jupyter Notebook) file?