Introduction to Pandas
# CHAPTER 13
1. Chapter Introduction
NumPy is fast, but it is limited. A NumPy array is designed for homogeneous data (all numbers or all text), and it has no column names. Real-world data is messy: a single table may contain names (text), ages (integers), and salaries (floats). To handle this, we need Pandas. Often described as "Excel on steroids," Pandas is central to the Python data science workflow.
2. What is Pandas?
Pandas is a software library built *on top* of NumPy. It provides high-performance, easy-to-use data structures and data analysis tools.
If NumPy is for pure math, Pandas is for data manipulation, cleaning, and preparation.
Installation & Importing:
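A minimal sketch of this step, assuming pip is your package manager (other installers such as conda work too). The install command runs in a terminal, so it appears here only as a comment.

```python
# Install once from a terminal, not from inside Python (assumes pip is available):
#   pip install pandas

import pandas as pd  # "pd" is the conventional alias

# Confirm the import worked by printing the installed version.
print(pd.__version__)
```

Every later example in this chapter assumes the pd alias is already imported.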
3. The Two Core Structures
Pandas has two primary data structures: the Series (1D) and the DataFrame (2D).
1. The Pandas Series: A Series is essentially a single column of data. It is a 1D array, but unlike a NumPy array, it carries an explicitly defined "Index" (a label for each row).
2. The Pandas DataFrame: A DataFrame is a 2D table (like an Excel sheet or SQL table). It is essentially a collection of Series sharing the same index.
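The two structures can be sketched as follows. The names, ages, and salaries are made-up illustration values, not data from a real source.

```python
import pandas as pd

# A Series: one column of values plus an index of row labels.
ages = pd.Series([25, 32, 41], index=["alice", "bob", "carol"], name="Age")
print(ages["bob"])  # look up a value by its label

# A DataFrame: several columns that share one index.
people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Carol"],       # text
        "Age": [25, 32, 41],                     # integers
        "Salary": [52000.0, 61500.5, 78250.0],   # floats
    }
)
print(people)
print(people.dtypes)  # each column keeps its own data type

# Every column of the DataFrame is itself a Series on the same index.
print(type(people["Age"]))
```

Because no index was passed to the DataFrame, Pandas assigns the default labels 0, 1, 2 to the rows.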
4. Data Loading
Data scientists rarely type out DataFrames by hand; they load them from files. Pandas can read many common formats, including CSV, Excel, JSON, and SQL query results, usually with a single function call.
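A short sketch of loading a CSV. To keep the example self-contained, the CSV text lives in a string and io.StringIO makes it behave like a file; the contents and the file name in the comment are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; StringIO lets read_csv treat the string like a file.
csv_text = """Name,Age,Salary
Alice,25,52000.0
Bob,32,61500.5
Carol,41,78250.0
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)

# With a real file you would pass a path instead, for example:
#   df = pd.read_csv("employees.csv")   # hypothetical file name
# Other readers follow the same pattern: pd.read_excel, pd.read_json, pd.read_sql.
```

Note that read_csv infers a data type for each column, so Age arrives as integers and Salary as floats without any extra work.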
5. Data Inspection
Once the data is loaded, the very first thing you do is inspect it to see what you are dealing with.
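The usual first-look calls can be sketched like this. The small DataFrame is hypothetical and stands in for a freshly loaded file; one salary is deliberately missing so the effect shows up in the summaries.

```python
import pandas as pd

# Hypothetical data standing in for a freshly loaded file.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Carol", "Dan", "Eve", "Frank"],
        "Age": [25, 32, 41, 29, 37, 50],
        "Salary": [52000.0, 61500.5, None, 48000.0, 91000.0, 67000.0],
    }
)

print(df.shape)       # (rows, columns)
print(df.head())      # first 5 rows by default
df.info()             # column names, non-null counts, dtypes (prints directly)
print(df.describe())  # count, mean, std, min, quartiles, max for numeric columns
```

df.shape is an attribute rather than a method, so it takes no parentheses.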
6. Mini Project: Employee Dataset Analyzer
Let's simulate loading an employee dataset and profiling it.
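One possible version of the project is sketched below. The employee records, the column names, and the profile helper are all invented for illustration, and the CSV is held in a string so the example runs without a file on disk. The helper also uses df.isna(), which this chapter has not introduced, to count missing values.

```python
import io
import pandas as pd

# Hypothetical employee records; a real project would read a file from disk.
EMPLOYEE_CSV = """EmployeeID,Name,Department,Age,Salary
101,Alice,Engineering,25,52000
102,Bob,Marketing,32,61500
103,Carol,Engineering,41,78250
104,Dan,Sales,29,
105,Eve,Engineering,37,91000
106,Frank,Sales,50,67000
"""


def profile(df):
    """Return a small dictionary summarising a DataFrame (illustrative helper)."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "column_names": list(df.columns),
        "missing_per_column": df.isna().sum().to_dict(),
        "average_salary": df["Salary"].mean(),  # missing values are skipped
    }


employees = pd.read_csv(io.StringIO(EMPLOYEE_CSV))

# Step 1: look at the data.
print(employees.head())

# Step 2: check data types and non-null counts.
employees.info()

# Step 3: basic statistics for the numeric columns.
print(employees.describe())

# Step 4: a compact profile built from the pieces above.
summary = profile(employees)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Dan's salary is left blank on purpose, so the Salary column reports one missing value and is loaded as floats rather than integers.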
7. Common Mistakes
- Printing the whole DataFrame: If you load a 1-million-row CSV and type print(df), the result is hard to read. Pandas truncates the display by default, and forcing it to render every row can make Jupyter slow or unresponsive. Use df.head() to look at the data instead.
- Confusing Series and DataFrames: A Series is a single column. A DataFrame is a table. If you extract one column from a DataFrame with df['Name'], the result is a Series.
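Both points can be checked in a few lines. The data is made up, and the double-bracket selection is shown only for contrast with the single-bracket form mentioned above.

```python
import pandas as pd

# Mistake 1: preview a large table with head() instead of printing all of it.
big = pd.DataFrame({"x": range(1_000_000)})
preview = big.head()  # only the first 5 rows
print(preview)

# Mistake 2: know whether you are holding a Series or a DataFrame.
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 32]})

one_column = df["Name"]      # single brackets give a Series
still_table = df[["Name"]]   # double brackets give a one-column DataFrame

print(type(one_column).__name__, one_column.shape)
print(type(still_table).__name__, still_table.shape)
```

The shapes make the difference visible: the Series has one dimension, while the one-column DataFrame still has two.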
8. MCQs
What is Pandas?
What is the standard alias for importing Pandas?
What is a 1D column of data called in Pandas?
What is a 2D table of data called in Pandas?
How do you load a CSV file into a Pandas DataFrame?
What method should you always use to preview the first 5 rows of a newly loaded DataFrame?
Which method provides a technical summary, showing the number of non-null values and data types for each column?
Which method calculates the mean, minimum, and maximum for all numerical columns?
If a DataFrame has 100 rows and 5 columns, what does df.shape return?
9. Interview Questions
- Q: What is the difference between a Pandas Series and a Pandas DataFrame?
- Q: Walk me through the exact steps (and functions) you would use immediately after loading an unknown CSV file into Pandas.
10. Summary
Pandas is the core of the data science workflow. It allows you to load mixed tabular data from CSVs or SQL databases into a DataFrame. Once the data is loaded, use df.head() to visually inspect it, df.info() to check for missing values and data types, and df.describe() to understand the basic statistics of your numeric columns.
11. Next Chapter Recommendation
In Chapter 14: Pandas Series and DataFrames, we will learn how to navigate this data: selecting specific rows, filtering on conditions (like Salary > 50000), and adding new columns.