Python for Data Science
CHAPTER 13 Beginner

Introduction to Pandas

Updated: May 18, 2026
5 min read

1. Chapter Introduction

NumPy is fast, but it is limited for tabular work. A NumPy array normally holds a single data type (all numbers or all text), and its columns have no names. Real-world data is messy: one table may contain names (text), ages (integers), and salaries (floats). To handle this, we need Pandas. Often described as "Excel on steroids," Pandas is central to the Python data science workflow.
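
To make the homogeneity point concrete, here is a small sketch. The sample records are invented for illustration. It shows how a NumPy array built from mixed values ends up with one common dtype, while a DataFrame keeps a separate dtype per column.

```python
import numpy as np
import pandas as pd

# Illustrative records (made-up values): name, age, salary
records = [["Alice", 25, 52000.5], ["Bob", 30, 61000.0]]

# NumPy coerces every element to one common dtype (here: Unicode strings)
arr = np.array(records)
print(arr.dtype.kind)   # 'U' means Unicode string

# A DataFrame stores each column with its own dtype
df = pd.DataFrame(records, columns=["Name", "Age", "Salary"])
print(df.dtypes)
```

After the coercion, the ages and salaries in `arr` are text, so arithmetic on them would first require converting them back to numbers.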

2. What is Pandas?

Pandas is a software library built *on top* of NumPy. It provides high-performance, easy-to-use data structures and data analysis tools.

If NumPy is for pure math, Pandas is for data manipulation, cleaning, and preparation.
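
The relationship between the two libraries can be seen in a short sketch (the salary values are made up): the values of a numeric Series can be exposed as a NumPy array, and NumPy functions accept a Series directly.

```python
import numpy as np
import pandas as pd

# Made-up salary values for illustration
salaries = pd.Series([60000, 85000, 50000], name="Salary")

# The values of a numeric Series can be exposed as a NumPy array
values = salaries.to_numpy()
print(type(values))

# NumPy functions accept a Series directly
print(np.mean(salaries))   # 65000.0
```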

Installation & Importing:

bash
# Run in a terminal. Inside a Jupyter notebook cell, use: %pip install pandas
pip install pandas
python
# 'pd' is the conventional alias used throughout the pandas community
import pandas as pd

3. The Two Core Structures

Pandas has two primary data structures: the Series (1D) and the DataFrame (2D).

1. The Pandas Series: A Series is essentially a single column of data. It is a 1D array, but unlike a plain NumPy array, it carries an explicitly defined "Index" (a label for each row). If you do not supply labels, Pandas uses 0, 1, 2, and so on.

python
import pandas as pd

# Create a Series from a list
ages = pd.Series([22, 35, 58], name="Age")
print(ages)
# Output:
# 0    22
# 1    35
# 2    58
# Name: Age, dtype: int64
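
The index does not have to be the default 0, 1, 2. As a small sketch, the same ages can be given explicit labels (the names below are hypothetical, chosen only for illustration) and then looked up either by label or by position.

```python
import pandas as pd

# Same ages as above, but with explicit labels instead of 0, 1, 2
# (the names are made up for illustration)
ages = pd.Series([22, 35, 58], index=["Ann", "Ben", "Cara"], name="Age")

print(ages["Ben"])        # look up by label: 35
print(ages.iloc[0])       # look up by position: 22
print(list(ages.index))   # ['Ann', 'Ben', 'Cara']
```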

2. The Pandas DataFrame: A DataFrame is a 2D table (like an Excel sheet or SQL table). It is essentially a collection of Series sharing the same index.

python
# Create a DataFrame from a Dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Department": ["Sales", "IT", "HR"]
}

df = pd.DataFrame(data)
print(df)
# Output:
#       Name  Age Department
# 0    Alice   25      Sales
# 1      Bob   30         IT
# 2  Charlie   35         HR
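
The idea that a DataFrame is a collection of Series sharing one index can be sketched by going in the other direction: starting from two Series with the same labels and combining them column-wise. The labels and the use of `pd.concat` here are illustrative choices, not the only way to do this.

```python
import pandas as pd

# Two Series that share the same labels (names invented for illustration)
labels = ["Alice", "Bob", "Charlie"]
age = pd.Series([25, 30, 35], index=labels, name="Age")
dept = pd.Series(["Sales", "IT", "HR"], index=labels, name="Department")

# Combining them column-wise yields a DataFrame with that shared index
df = pd.concat([age, dept], axis=1)
print(df)
print(df.shape)                      # (3, 2)
print(df.index.equals(age.index))    # True
```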

4. Data Loading

Data scientists rarely type out DataFrames by hand; they load them from files or databases. Pandas can read many common formats (CSV, Excel, JSON, SQL query results) with a single line of code.

python
# One of the most common first steps in a pandas workflow:
# read a CSV file into a DataFrame
df = pd.read_csv('employee_data.csv')

# Other formats:
# pd.read_excel('data.xlsx')   # needs an Excel engine such as openpyxl installed
# pd.read_json('data.json')
# pd.read_sql('SELECT * FROM employees', connection)   # 'employees' is a placeholder table name
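
If you do not have a CSV file on disk yet, the loading step can still be tried out. The sketch below assumes nothing beyond the standard library and pandas: it wraps some made-up CSV text in `io.StringIO`, which `read_csv` accepts in place of a file path.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (rows are made up)
csv_text = """Name,Age,Department
Alice,25,Sales
Bob,30,IT
Charlie,35,HR
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)           # (3, 3)
print(list(df.columns))   # ['Name', 'Age', 'Department']
```

The first line of the text is treated as the header row, and the column types are inferred from the values.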

5. Data Inspection

Once the data is loaded, the very first thing you do is inspect it to see what you are dealing with.

python
# 1. Look at the top 5 rows (CRITICAL)
print(df.head())

# 2. Look at the bottom 3 rows
print(df.tail(3))

# 3. Get the shape (Rows, Columns)
print(df.shape) 

# 4. Get a technical summary (Column names, non-null counts, data types)
# info() prints its report itself and returns None, so do not wrap it in print()
df.info()

# 5. Get a statistical summary (Mean, min, max for numerical columns)
print(df.describe())
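
One detail about these inspection methods is worth a short sketch: `head()` and `describe()` return new DataFrames, while `info()` writes its report out and returns `None`. The tiny table below is made up, with one missing salary so that the counts differ.

```python
import io
import pandas as pd

# Small made-up table with one missing salary
df = pd.DataFrame({
    "Name": ["John", "Sarah", "Mike"],
    "Salary": [60000.0, None, 50000.0],
})

preview = df.head(2)          # returns a DataFrame
stats = df.describe()         # returns a DataFrame

# info() writes its report to a stream; buf= captures it as text
buffer = io.StringIO()
result = df.info(buf=buffer)
print(result)                 # None
print(buffer.getvalue())

# describe() counts only non-missing values
print(stats.loc["count", "Salary"])   # 2.0
```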

6. Mini Project: Employee Dataset Analyzer

Let's simulate loading an employee dataset and profiling it.

python
import pandas as pd

# Simulating data loading
data = {
    "ID": [101, 102, 103, 104],
    "Name": ["John", "Sarah", "Mike", "Emma"],
    "Salary": [60000, 85000, 50000, 92000],
    "Years_Exp": [2, 5, 1, 7]
}
df = pd.DataFrame(data)

print("--- DATASET PREVIEW ---")
print(df.head(2))

print("\n--- QUICK STATISTICS ---")
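# By default, .describe() summarizes only numeric columns, so 'Name' is skipped.
# Note that 'ID' is numeric too, so it is summarized even though its mean is meaningless.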
print(df.describe())
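
As a possible refinement of the project, you can restrict the summary to the columns where statistics are meaningful. This sketch rebuilds the same dataset and selects the columns by name before calling `describe()`; which columns to include is a judgment call.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "Name": ["John", "Sarah", "Mike", "Emma"],
    "Salary": [60000, 85000, 50000, 92000],
    "Years_Exp": [2, 5, 1, 7],
})

# Summarize only the columns where statistics are meaningful
stats = df[["Salary", "Years_Exp"]].describe()
print(stats.loc["mean", "Salary"])    # 71750.0
print(stats.loc["max", "Years_Exp"])  # 7.0
```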

7. Common Mistakes

  • Printing the whole DataFrame: By default Pandas truncates large output to the first and last few rows, so print(df) on a 1-million-row CSV shows only a sample. If the display limits have been raised (for example, display.max_rows set to None), rendering can become very slow. Use df.head() to look at the data.
  • Confusing Series and DataFrames: A Series is a single column. A DataFrame is a table. If you extract one column from a DataFrame (df['Name']), the result is a Series.
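
Both points can be checked with a short sketch on a made-up two-row table. It compares single-bracket and double-bracket column selection, and reads the display option that controls how many rows are printed before output is truncated.

```python
import pandas as pd

# Made-up table for illustration
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

one_col_series = df["Name"]     # single brackets -> Series
one_col_frame = df[["Name"]]    # double brackets -> one-column DataFrame

print(type(one_col_series).__name__)  # Series
print(type(one_col_frame).__name__)   # DataFrame
print(one_col_series.shape)           # (2,)
print(one_col_frame.shape)            # (2, 1)

# The row limit pandas applies before truncating printed output
print(pd.get_option("display.max_rows"))
```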

8. MCQs

Question 1

What is Pandas?

Question 2

What is the standard alias for importing Pandas?

Question 3

What is a 1D column of data called in Pandas?

Question 4

What is a 2D table of data called in Pandas?

Question 5

How do you load a CSV file into a Pandas DataFrame?

Question 6

What method should you always use to preview the first 5 rows of a newly loaded DataFrame?

Question 7

Which method provides a technical summary, showing the number of non-null values and data types for each column?

Question 8

Which method calculates the mean, minimum, and maximum for all numerical columns?

Question 9

If a DataFrame has 100 rows and 5 columns, what does df.shape return?

Question 10

Can a DataFrame hold mixed data types (e.g., text, floats, booleans) across different columns?

a) Yes, that is the main advantage over a NumPy matrix
b) No, it must be homogeneous

Answer: a

9. Interview Questions

  • Q: What is the difference between a Pandas Series and a Pandas DataFrame?
  • Q: Walk me through the exact steps (and functions) you would use immediately after loading an unknown CSV file into Pandas.

10. Summary

Pandas is the core of the Data Science workflow. It allows you to load mixed tabular data from CSVs or SQL databases into a DataFrame. Once loaded, use df.head() to visually inspect the data, df.info() to check for missing values and data types, and df.describe() to understand the basic statistics of your numbers.

11. Next Chapter Recommendation

In Chapter 14: Pandas Series and DataFrames, we will learn how to navigate this data—selecting specific rows, filtering for specific conditions (like Salary > 50000), and adding new columns.
