
Final Projects and Real-World Applications

Updated: May 18, 2026

# CHAPTER 30


1. Chapter Introduction

This final chapter brings together the skills from the course in six complete, portfolio-ready data science projects, ranging from financial analytics to machine learning preprocessing pipelines.

---

Project 1: Financial Dashboard

```python
import pandas as pd
import numpy as np

np.random.seed(42)  # fix the seed so the simulated prices are reproducible

class FinancialDashboard:
    def __init__(self, tickers, start='2024-01-01', periods=252):
        # Simulate one year of business-day prices as a compounded random walk
        dates = pd.date_range(start, periods=periods, freq='B')
        self.prices = pd.DataFrame({
            ticker: 100 * np.cumprod(1 + np.random.normal(0.0005, 0.02, periods))
            for ticker in tickers
        }, index=dates)
        self.returns = self.prices.pct_change().dropna()

    def summary(self):
        print("=== FINANCIAL DASHBOARD ===")
        print(f"\nAnalysis period: {self.prices.index[0].date()} to {self.prices.index[-1].date()}")
        # Annualize using 252 trading days per year
        annual_ret = (1 + self.returns.mean()) ** 252 - 1
        annual_vol = self.returns.std() * np.sqrt(252)
        sharpe = annual_ret / annual_vol  # risk-free rate taken as zero
        # Max drawdown: deepest percentage fall from a running peak
        running_peak = self.prices.cummax()
        max_dd = ((self.prices - running_peak) / running_peak).min()

        summary = pd.DataFrame({
            'Annual Return (%)': (annual_ret * 100).round(2),
            'Annual Vol (%)': (annual_vol * 100).round(2),
            'Sharpe Ratio': sharpe.round(3),
            'Max Drawdown (%)': (max_dd * 100).round(2),
            'Final Price': self.prices.iloc[-1].round(2)
        })
        print(summary)
        return summary

    def correlation_analysis(self):
        print("\nReturn Correlation Matrix:")
        print(self.returns.corr().round(3))

    def top_days(self, n=5):
        # Equal-weighted portfolio return: the mean of the tickers' daily returns
        portfolio = self.returns.mean(axis=1)
        best = portfolio.nlargest(n)
        worst = portfolio.nsmallest(n)
        print(f"\nTop {n} Best Portfolio Days:")
        print(best.round(4))
        print(f"\nTop {n} Worst Portfolio Days:")
        print(worst.round(4))

dash = FinancialDashboard(['AAPL', 'GOOGL', 'MSFT', 'AMZN'])
dash.summary()
dash.correlation_analysis()
dash.top_days()
```
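
The max drawdown calculation in `summary` is easier to trust once it has been checked by hand. The following is a minimal sketch on a hypothetical five-point price series (the values are made up for illustration and are not part of the dashboard); it applies the same `cummax`-based formula as the project code.

```python
import pandas as pd

# Hypothetical toy price path, chosen so the numbers are easy to verify by hand
prices = pd.Series([100.0, 120.0, 90.0, 110.0, 132.0])

# Highest price seen so far at each step: 100, 120, 120, 120, 132
running_peak = prices.cummax()

# Percentage distance below the running peak (0 at a new high, negative otherwise)
drawdown = (prices - running_peak) / running_peak

# The most negative value is the maximum drawdown
max_drawdown = drawdown.min()

print(f"Max drawdown: {max_drawdown:.1%}")  # Max drawdown: -25.0%
```

The worst point is the fall from the peak of 120 to 90, a loss of 30/120 = 25%, even though the series later recovers to a new high.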

---

Project 2: HR Analytics System

```python
import pandas as pd
import numpy as np

class HRAnalytics:
    def __init__(self, n=500):
        np.random.seed(42)
        self.df = pd.DataFrame({
            'Employee_ID': [f'E{i:04d}' for i in range(1, n+1)],
            'Dept': np.random.choice(['Engineering','Marketing','Sales','HR','Finance'], n,
                                      p=[0.35, 0.20, 0.25, 0.10, 0.10]),
            'Gender': np.random.choice(['Male','Female'], n, p=[0.55, 0.45]),
            'Age': np.random.randint(22, 60, n),
            'Tenure': np.random.randint(1, 20, n),
            'Salary': np.random.normal(75000, 20000, n).clip(30000, 200000).astype(int),
            'Performance': np.random.choice([1,2,3,4,5], n, p=[0.05,0.15,0.35,0.30,0.15]),
            'Satisfaction': np.random.uniform(1, 10, n).round(1),
            'Left': np.nan
        })
        # Attrition model: low satisfaction + low performance → higher churn
        churn_p = (0.3 * (self.df['Satisfaction'] < 5).astype(float) +
                   0.2 * (self.df['Performance'] <= 2).astype(float) +
                   0.15 * (self.df['Tenure'] < 2).astype(float))
        self.df['Left'] = (np.random.random(n) < churn_p / churn_p.max() * 0.35).astype(int)

    def headcount_report(self):
        print("=== HEADCOUNT REPORT ===")
        print(self.df.groupby('Dept').agg(
            Count=('Employee_ID','count'),
            Avg_Salary=('Salary','mean'),
            Avg_Performance=('Performance','mean'),
            Avg_Satisfaction=('Satisfaction','mean')
        ).round(2))

    def diversity_report(self):
        print("\n=== GENDER DIVERSITY ===")
        pivot = pd.crosstab(self.df['Dept'], self.df['Gender'], normalize='index') * 100
        print(pivot.round(1))

    def attrition_report(self):
        print("\n=== ATTRITION ANALYSIS ===")
        overall = self.df['Left'].mean() * 100
        print(f"Overall attrition rate: {overall:.1f}%")
        print("\nAttrition by Department:")
        print((self.df.groupby('Dept')['Left'].mean() * 100).round(1).sort_values(ascending=False))
        high_risk = self.df[(self.df['Satisfaction'] < 4) & (self.df['Left'] == 0)]
        print(f"\nHigh-risk employees (low satisfaction, still active): {len(high_risk)}")

hr = HRAnalytics()
hr.headcount_report()
hr.diversity_report()
hr.attrition_report()
```
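
The `diversity_report` method relies on `pd.crosstab(..., normalize='index')`. As a rough illustration of what that option does, here is a small sketch on a hypothetical six-person roster (the `staff` frame and its values are invented for this example), small enough to verify by counting.

```python
import pandas as pd

# Hypothetical six-person roster, small enough to check by hand
staff = pd.DataFrame({
    "Dept":   ["Eng", "Eng", "Eng", "Eng", "Sales", "Sales"],
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Male"],
})

# normalize="index" divides each row by its own row total,
# so every department's percentages add up to 100
share = pd.crosstab(staff["Dept"], staff["Gender"], normalize="index") * 100

print(share.round(1))
```

Engineering has three men out of four people (75% / 25%) and Sales splits evenly (50% / 50%), so each row describes the mix inside one department rather than that department's share of the whole company.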

---

Project 3: ML Preprocessing Pipeline

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

class MLPreprocessingPipeline:
    """Complete, reusable ML preprocessing pipeline."""

    def __init__(self):
        self.imputer = SimpleImputer(strategy='median')
        self.scaler  = StandardScaler()
        self.fitted  = False

    def clean(self, df):
        df = df.drop_duplicates()
        # thresh is the minimum number of non-null values a row needs to be kept,
        # so this drops rows that are mostly null
        df = df.dropna(thresh=int(len(df.columns) * 0.5))
        return df

    def encode(self, df, cat_cols):
        # dtype=float keeps the dummy columns numeric for the imputer and scaler
        return pd.get_dummies(df, columns=cat_cols, drop_first=True, dtype=float)

    def fit_transform(self, df, target_col, cat_cols=None, drop_cols=None):
        df = self.clean(df)
        if drop_cols:
            df = df.drop(columns=drop_cols)
        if cat_cols:
            df = self.encode(df, cat_cols)
        X = df.drop(columns=[target_col])
        y = df[target_col]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        # Fit the imputer and scaler on the training rows only, then reuse them on the test rows
        X_train_imputed = self.imputer.fit_transform(X_train)
        X_test_imputed  = self.imputer.transform(X_test)
        X_train_scaled  = self.scaler.fit_transform(X_train_imputed)
        X_test_scaled   = self.scaler.transform(X_test_imputed)
        self.fitted = True
        self.feature_names = X_train.columns.tolist()
        print(f"Pipeline complete: {X_train_scaled.shape[0]} train, {X_test_scaled.shape[0]} test samples")
        print(f"Features: {len(self.feature_names)}")
        return X_train_scaled, X_test_scaled, y_train, y_test

# Usage
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'Age': np.random.normal(35, 10, n).clip(18, 70).astype(int),
    'Salary': np.random.normal(60000, 20000, n),
    'Dept': np.random.choice(['Eng','Mkt','Sales'], n),
    'Experience': np.random.randint(0, 30, n),
    'Left': np.random.choice([0, 1], n, p=[0.75, 0.25])
})
# Inject 50 missing salaries so the imputer has something to fill
df.loc[np.random.choice(n, 50, replace=False), 'Salary'] = np.nan

pipeline = MLPreprocessingPipeline()
X_train, X_test, y_train, y_test = pipeline.fit_transform(
    df, target_col='Left', cat_cols=['Dept']
)
```
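
The pipeline fits its imputer and scaler on the training split and only calls `transform` on the test split. The sketch below isolates that pattern on hypothetical one-column arrays (`X_train` and `X_new` are made-up values, not the project data) so the learned statistics can be inspected directly.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature data with a missing value in each set
X_train = np.array([[1.0], [2.0], [np.nan], [6.0]])
X_new = np.array([[np.nan], [3.0]])

imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

# fit_transform learns the statistics from the training data and applies them
X_train_scaled = scaler.fit_transform(imputer.fit_transform(X_train))

# transform reuses those training statistics; nothing is re-learned from X_new
X_new_scaled = scaler.transform(imputer.transform(X_new))

print("Median learned from training data:", imputer.statistics_[0])  # 2.0
print("Mean learned from training data:", scaler.mean_[0])           # 2.75
print(X_new_scaled.ravel())
```

The missing value in `X_new` is filled with 2.0, the training median, and both new rows are scaled with the training mean and standard deviation. Calling `fit_transform` on `X_new` instead would recompute those statistics from the new rows, which is the leakage the pipeline is written to avoid.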

MCQs

Question 1

What does the Sharpe ratio measure?

Question 2

What does np.cumprod(1 + returns) simulate?

Question 3

How is the HR attrition rate defined?

Question 4

What value does SimpleImputer(strategy='median') use to fill missing entries?

Question 5

What does drop_first=True in get_dummies prevent?

Question 6

Which rows does dropna keep when thresh=int(len(df.columns)*0.5)?

Question 7

What does pd.crosstab(..., normalize='index') show?

Question 8

What does max drawdown measure?

Question 9

Should the scaler be applied to test data with fit_transform or with transform?

Question 10

What does an object-oriented (OOP) design enable in data science projects?
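
Two of the mechanics the questions above refer to, the `thresh` argument of `dropna` and `drop_first=True` in `get_dummies`, can be checked on tiny inputs. This is a minimal sketch using hypothetical frames (`df` and `dept` are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical 3-row, 4-column frame with 4, 2, and 1 non-null values per row
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 6.0, 7.0],
})

# thresh is the minimum number of NON-null values a row needs in order to be kept
min_non_null = int(len(df.columns) * 0.5)   # 4 columns -> 2
kept = df.dropna(thresh=min_non_null)
print(kept.index.tolist())                  # [0, 1]

# drop_first=True drops the first category's column; that category becomes
# the baseline implied when all remaining dummy columns are 0
dept = pd.DataFrame({"Dept": ["Eng", "Mkt", "Sales"]})
dummies = pd.get_dummies(dept, columns=["Dept"], drop_first=True)
print(dummies.columns.tolist())             # ['Dept_Mkt', 'Dept_Sales']
```

Row 1 survives because it has exactly two non-null values, which meets the threshold; row 2 has only one and is dropped.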

Interview Questions

  • Q: Design a complete data preprocessing pipeline for a churn prediction model.
  • Q: How would you build a financial dashboard using Pandas time series features?
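
For the dashboard question, one possible starting point is to derive rolling and resampled features from a date-indexed price series. The sketch below is only an outline under stated assumptions: the prices are simulated, and the 20-day window and weekly frequency are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Simulated business-day prices (hypothetical data, seeded for reproducibility)
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=60, freq="B")
prices = pd.Series(
    100 * np.cumprod(1 + rng.normal(0.0005, 0.02, len(dates))),
    index=dates,
    name="price",
)

daily_return = prices.pct_change()

features = pd.DataFrame({
    "price": prices,
    "daily_return": daily_return,
    # 20-day moving average of the price
    "ma_20": prices.rolling(window=20).mean(),
    # 20-day rolling volatility, annualized with 252 trading days
    "vol_20": daily_return.rolling(window=20).std() * np.sqrt(252),
})

# Downsample to weekly frequency, keeping the last price in each week
weekly_close = prices.resample("W").last()

print(features.tail(3).round(3))
print(weekly_close.head(3).round(2))
```

The rolling columns are `NaN` until a full window of observations is available, so a dashboard built this way would typically drop or mask the warm-up rows before plotting.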

Course Complete! 🎉

You have mastered the complete Pandas & NumPy course:

NumPy Foundation:
✅ ndarrays — creation, shapes, dtypes, indexing
✅ Vectorized operations — arithmetic, broadcasting, ufuncs
✅ Statistical & math functions
✅ Random module — simulations, sampling
✅ Advanced — structured arrays, memmap, strides

Pandas Mastery:
✅ Series & DataFrame — creation, attributes, access
✅ File I/O — CSV, Excel, JSON, Parquet, SQL
✅ Selection — loc, iloc, boolean, query
✅ Cleaning — duplicates, formatting, dtype fixing
✅ Missing data — detection, removal, imputation
✅ Transformation — sort, apply, map, string ops
✅ GroupBy & aggregation — agg, transform, pivot_table
✅ Merging — concat, merge (all join types)
✅ Time series — datetime, resample, rolling
✅ Visualization — line, bar, histogram, scatter

Professional Skills:
✅ EDA workflow
✅ Statistical analysis
✅ Large dataset handling
✅ SQL integration
✅ ML preprocessing
✅ Performance optimization
✅ Interview preparation
✅ 6 real-world projects

Your next steps:
→ Scikit-learn for machine learning models
→ Matplotlib/Seaborn for advanced visualization
→ Plotly for interactive dashboards
→ PySpark for big data
→ Kaggle competitions to apply skills
