Skip to main content
Pandas & NumPy
CHAPTER 12 Beginner

Data Selection and Filtering

Updated: May 18, 2026
5 min read

# CHAPTER 12

Data Selection and Filtering

1. Chapter Introduction

Selecting the right subset of data is the most frequent Pandas operation. This chapter covers all four access methods — loc, iloc, boolean indexing, and query() — with real-world patterns.

2. loc — Label-Based Selection

python
1234567891011121314151617181920
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    'Dept': ['Eng', 'Mkt', 'Eng', 'Sales', 'HR'],
    'Salary': [85000, 62000, 91000, 55000, 58000],
    'Experience': [5, 3, 8, 2, 4],
    'Rating': [4.5, 3.8, 4.9, 3.5, 4.1]
}, index=['E001', 'E002', 'E003', 'E004', 'E005'])

# loc[row_label, col_label] — label-based
print(df.loc['E001'])                        # Single row
print(df.loc['E001', 'Salary'])              # Single value: 85000
print(df.loc['E001':'E003'])                 # Row slice (inclusive!)
print(df.loc['E001':'E003', ['Name', 'Salary']])  # Rows + specific cols
print(df.loc[:, 'Salary':'Rating'])          # All rows, col range

# Update using loc
df.loc['E001', 'Salary'] = 90000
df.loc['E002', ['Rating', 'Experience']] = [4.0, 4]

3. iloc — Position-Based Selection

python
123456789101112
# iloc[row_pos, col_pos] — integer position-based
print(df.iloc[0])                   # First row
print(df.iloc[0, 2])                # Row 0, col 2: Salary
print(df.iloc[1:4])                 # Rows 1,2,3 (stop exclusive)
print(df.iloc[::2])                 # Every other row
print(df.iloc[:, [0, 2]])           # Cols 0 and 2 (Name, Salary)
print(df.iloc[-1])                  # Last row
print(df.iloc[0:3, 0:3])           # Top-left 3x3 block

# Key difference: loc slice is INCLUSIVE, iloc is EXCLUSIVE at end
# df.loc['E001':'E003'] → 3 rows (E001, E002, E003)
# df.iloc[0:3]          → 3 rows (0, 1, 2) same result here

4. Boolean Filtering

python
12345678910111213141516171819202122232425
df = pd.read_csv('employees.csv') if False else pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'David', 'Eve', 'Frank'],
    'Dept': ['Eng', 'Mkt', 'Eng', 'Sales', 'HR', 'Eng'],
    'Salary': [85000, 62000, 91000, 55000, 58000, 78000],
    'Experience': [5, 3, 8, 2, 4, 6],
    'Rating': [4.5, 3.8, 4.9, 3.5, 4.1, 4.3]
})

# Single condition
high_salary = df[df['Salary'] > 70000]
print(high_salary)

# Multiple conditions (& = AND, | = OR, ~ = NOT)
senior_eng = df[(df['Dept'] == 'Eng') & (df['Experience'] >= 5)]
print(senior_eng)

low_or_high = df[(df[&#039;Salary'] < 60000) | (df['Salary'] > 85000)]
not_sales = df[~(df[&#039;Dept'] == 'Sales')]

# isin — filter by list of values
selected_depts = df[df[&#039;Dept'].isin(['Eng', 'HR'])]

# str operations on string columns
engineers = df[df[&#039;Name'].str.startswith('A')]
upper_case = df[df[&#039;Dept'].str.upper() == 'ENG']

5. query() — SQL-Like Filtering

python
1234567891011121314
# query() — cleaner syntax for complex filters
high_rated = df.query("Rating > 4.0")
senior_high_pay = df.query("Experience >= 5 and Salary > 75000")

# Using Python variables in query (@ prefix)
min_salary = 65000
max_exp = 7
result = df.query("Salary > @min_salary and Experience <= @max_exp")

# String queries
eng_staff = df.query("Dept == &#039;Eng'")
not_mkt = df.query("Dept != &#039;Mkt'")

print(df.query("Salary > 70000 and Rating >= 4.0"))

6. Selecting and Filtering Together

python
1234567891011121314
# Get specific columns after filtering
result = df.loc[df[&#039;Dept'] == 'Eng', ['Name', 'Salary', 'Rating']]

# Get all salaries above median as Series
median_sal = df[&#039;Salary'].median()
above_median = df.loc[df[&#039;Salary'] > median_sal, 'Salary']

# Update filtered rows
df.loc[df[&#039;Rating'] >= 4.5, 'Level'] = 'Star'
df.loc[df[&#039;Rating'] < 4.5, 'Level'] = 'Standard'

# at and iat — fastest single-cell access
print(df.at[0, &#039;Name'])     # Label-based single cell
print(df.iat[0, 0])         # Position-based single cell

7. Common Mistakes

  • df[df['col'] == 'val']['other'] = x: This is chained indexing — doesn't reliably update. Use df.loc[condition, 'other'] = x.
  • and/or vs &/|: Python's and/or doesn't work element-wise. Use & and | with parentheses around each condition.

8. MCQs

Question 1

df.loc[0:3] on default index returns?

Question 2

df.iloc[0:3] returns?

Question 3

df[df['Dept'].isin(['Eng', 'HR'])] filters?

Question 4

~mask in boolean filter?

Question 5

df.query("Salary > @threshold") uses?

Question 6

df.at[0, 'Name'] is faster than df.loc[0, 'Name'] because?

Question 7

df.loc[:, 'A':'C'] selects?

Question 8

Boolean condition with | requires?

Question 9

df[df['Name'].str.startswith('A')] returns?

Question 10

Correct way to update filtered rows?

9. Interview Questions

  • Q: What is the difference between loc and iloc in Pandas?
  • Q: How do you filter a DataFrame with multiple conditions?

10. Summary

Four selection tools: loc (label), iloc (position), boolean filtering (condition masks), query() (SQL-like strings). Always use loc for conditional updates to avoid chained indexing. Parentheses around each condition in multi-condition boolean filters are mandatory.

11. Next Chapter Recommendation

In Chapter 13: Data Cleaning in Pandas, we tackle real-world messy data — duplicates, formatting issues, inconsistent values, and column standardization.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·