AI Ethics Tutorial
CHAPTER 05 Beginner

Fairness in AI

Updated: May 14, 2026

# CHAPTER 5

Bias and Fairness in AI Systems

1. Introduction

"Computers are objective; they just do math!" This is one of the most persistent myths in the technology sector. While the mathematical formulas inside a neural network are objective, the data fed into those formulas is generated by a flawed, historically biased human society. In this chapter, we will explore Algorithmic Bias: how machines learn human prejudices, why this causes serious societal harm, and the technical steps engineers can take to improve Fairness.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Define Algorithmic Bias and understand its root causes.
  • Identify real-world examples of biased AI systems.
  • Explain the concept of "Proxy Variables" and hidden discrimination.
  • Understand techniques for auditing and reducing bias in datasets.

3. Beginner-Friendly Explanation

Imagine a father teaching his child what a "Doctor" is. Every time they watch TV, read a book, or visit a hospital, the father only points to men and says, "Doctor." Eventually, the child learns the pattern: Doctor = Man. If a woman walks into the room wearing a stethoscope, the child will be confused and say, "That's not a doctor." The child isn't inherently malicious; the child's *training data* was biased. An AI model is the child. If you train an image generator on millions of historical photos from the 1950s, the AI will mathematically learn that CEOs are men, nurses are women, and criminals look a certain way. It simply regurgitates the biases of its training data.

4. Types of Bias in AI

Bias enters AI systems in several ways:
  • Historical Bias: The data is perfectly accurate, but reality is flawed. (e.g., An AI trained on 50 years of Fortune 500 CEO data will predict that CEOs should be men named 'John', because historically, that is a factual pattern).
  • Representation Bias: The dataset does not include enough examples of some groups. (e.g., A facial recognition model trained on a dataset containing 90% light-skinned faces and 10% dark-skinned faces will typically make far more errors on dark-skinned users).
  • Measurement Bias: The way data is collected is flawed, so the recorded value is not the thing you actually care about. (e.g., An AI predicting crime hotspots is trained on arrest records, which measure where police make arrests rather than where crime occurs. Since police patrol low-income neighborhoods more heavily, more arrests happen there, causing a feedback loop where the AI keeps sending police to the same neighborhoods).

5. The Danger of Proxy Variables

Sometimes engineers explicitly remove race and gender from the dataset to make it "fair." *However, the AI still discriminates! How?* Through Proxy Variables. Even if you delete the "Race" column, the AI looks at the "Zip Code" column. Because neighborhoods are often historically segregated by demographics, "Zip Code" acts as a proxy for race. The AI mathematically discovers the pattern and discriminates based on the zip code, completely bypassing the engineer's attempt at fairness.
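One way to see a proxy at work is to ask how well a "neutral" column lets you guess the protected attribute. The sketch below is a minimal illustration on made-up records; the function names (`baseline`, `proxy_strength`), the zip codes, and the group labels are all hypothetical. It compares blind guessing against guessing from zip code alone.

```python
from collections import Counter, defaultdict

def baseline(rows, protected):
    """Accuracy of always guessing the most common protected value."""
    counts = Counter(row[protected] for row in rows)
    return counts.most_common(1)[0][1] / len(rows)

def proxy_strength(rows, feature, protected):
    """Accuracy of guessing the protected value from the feature alone,
    by predicting the most common protected value for each feature value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[protected]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(rows)

# Made-up, segregated toy data: each zip code is 90% one group.
rows = (
    [{"zip_code": "11111", "group": "A"}] * 45
    + [{"zip_code": "11111", "group": "B"}] * 5
    + [{"zip_code": "22222", "group": "A"}] * 5
    + [{"zip_code": "22222", "group": "B"}] * 45
)

print(baseline(rows, "group"))                    # 0.5
print(proxy_strength(rows, "zip_code", "group"))  # 0.9
```

In this toy data the "Race" column is gone, yet zip code alone recovers the group 90% of the time, compared with 50% for a blind guess. A large gap between the two numbers is a warning sign that the column can act as a proxy.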

6. Real-World Case Study: Healthcare Bias

In 2019, a widely used health algorithm that US hospitals relied on to predict which patients needed extra medical care was found to be heavily biased against Black patients. The algorithm used "past healthcare spending" as a proxy for "sickness." Because Black patients historically had less access to healthcare, less money was spent on their care, so the AI mathematically concluded they were "healthier" and prioritized white patients for extra care, even when the Black patients were objectively sicker.
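The mechanism can be reproduced in a toy simulation. This is not the real algorithm or real data: the patient list, the spending figures in `SPEND_PER_CONDITION`, and the helper names are invented for illustration. Both groups are equally sick by construction, but less is spent per condition on group B.

```python
from collections import Counter

# Hypothetical patients: 10 per group, with 1..10 chronic conditions each,
# so both groups are equally sick by construction.
patients = [(group, conditions) for group in ("A", "B") for conditions in range(1, 11)]

# Assumed access gap: less money is spent per condition on group B.
SPEND_PER_CONDITION = {"A": 1000, "B": 600}

def spending(patient):
    """Proxy label: past healthcare spending."""
    group, conditions = patient
    return conditions * SPEND_PER_CONDITION[group]

def need(patient):
    """True target: how sick the patient is."""
    return patient[1]

def top_k_by(score, k=6):
    """Pick the k highest-scoring patients and count them by group."""
    chosen = sorted(patients, key=score, reverse=True)[:k]
    return Counter(group for group, _ in chosen)

by_spending = top_k_by(spending)
by_need = top_k_by(need)
print(dict(by_spending))  # {'A': 5, 'B': 1}
print(dict(by_need))      # {'A': 3, 'B': 3}
```

Ranking by actual need splits the six extra-care slots evenly. Ranking by spending gives five of the six to group A, even though group B patients with 7, 8, and 9 conditions are left out.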

7. Reducing Bias (Technical Solutions)

How do engineers fix this?
  1. Data Auditing: Before training, statistically analyze the dataset. If it is 80% Male and 20% Female but the people the model will serve are split roughly evenly, rebalance it (collect more female data or downsample the male data) until the proportions match, for example 50/50.
  2. Algorithmic Fairness Constraints: Writing code that penalizes the AI during training if it produces different accuracy rates across different demographic groups.
  3. Continuous Monitoring: Bias creeps in over time as real-world data drifts away from the training data. Models must be re-audited on a regular schedule in production (for example, monthly).
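The fairness-constraint idea can be sketched as a number added to the training loss. The snippet below is only an illustration: `group_accuracy`, `accuracy_gap`, and `penalized_loss` are hypothetical helpers, the labels are made up, and a real training loop would need a differentiable version of the penalty.

```python
from collections import defaultdict

def group_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each demographic group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

def accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())

def penalized_loss(base_loss, gap, weight=1.0):
    """Ordinary loss plus a penalty that grows with the accuracy gap."""
    return base_loss + weight * gap

# Made-up labels and predictions for two groups of four people each.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["m"] * 4 + ["f"] * 4

per_group = group_accuracy(y_true, y_pred, groups)
gap = accuracy_gap(per_group)
print(per_group)                               # {'m': 1.0, 'f': 0.5}
print(gap)                                     # 0.5
print(penalized_loss(0.25, gap, weight=2.0))   # 1.25
```

The same `group_accuracy` check can be rerun on fresh production data at each monitoring interval to see whether the gap is widening.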

8. Pseudocode: Checking for Representation Bias

Engineers write scripts to validate the fairness of their data *before* training.
// Concept: Dataset Demographics Auditor

Function Audit_Dataset(training_images):
    light_skin_count = count_demographic(training_images, "light_skin")
    dark_skin_count = count_demographic(training_images, "dark_skin")

    total = light_skin_count + dark_skin_count

    // Guard against dividing by zero when no image carries a label
    If total == 0:
        print("CRITICAL ERROR: No labeled images found.")
        HALT_TRAINING()

    // Share of dark-skinned examples; 0.50 means perfectly balanced
    ratio = dark_skin_count / total

    If ratio < 0.45 or ratio > 0.55:
        print("CRITICAL ERROR: Dataset is demographically skewed.")
        print("Action Required: Collect more diverse data before training.")
        HALT_TRAINING()
    Else:
        print("Dataset passes Representation Check. Proceed.")
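For readers who want something runnable, here is one possible Python translation of the pseudocode. It assumes each image already comes with a demographic label (a plain list of strings here), and it raises an exception where the pseudocode calls `HALT_TRAINING()`. The function name, parameters, and thresholds mirror the pseudocode but are otherwise illustrative.

```python
from collections import Counter

def audit_dataset(labels, group="dark_skin", low=0.45, high=0.55):
    """Return the share of `group` in `labels`, or raise if it is out of range.

    `labels` is assumed to hold one demographic label per training image.
    """
    if not labels:
        raise ValueError("No labeled images found.")
    share = Counter(labels)[group] / len(labels)
    if not low <= share <= high:
        raise ValueError(
            f"Dataset is demographically skewed: {group} share is {share:.2f}. "
            "Collect more diverse data before training."
        )
    return share

balanced = ["light_skin"] * 50 + ["dark_skin"] * 50
skewed = ["light_skin"] * 90 + ["dark_skin"] * 10

print(audit_dataset(balanced))  # 0.5

try:
    audit_dataset(skewed)
except ValueError as error:
    print("Audit failed:", error)
```

Raising an exception means a training script that calls `audit_dataset` first will stop before any model is fit on skewed data.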

9. Mini Project

Spot the Proxy: You are building an AI to approve auto loans. To be fair, you delete the columns for "Race", "Gender", and "Age" from your database. However, you leave the following columns: Income, Length of Employment, Music Streaming Preferences, and Years of Education. Which of these remaining columns could accidentally act as a "Proxy Variable" for Age, causing the AI to discriminate against young people? *(Answer: 'Length of Employment' and 'Years of Education'. A 20-year-old cannot have 15 years of employment or a PhD, so the AI can use these fields to infer that the applicant is young and deny them the loan. 'Music Streaming Preferences' may also correlate with age, though less directly.)*

10. Best Practices

  • Diverse Engineering Teams: One of the most effective bias mitigation tools is a diverse team. An all-male engineering team might not realize their voice-recognition software performs worse on higher-pitched female voices until after the product is launched. Diverse teams catch blind spots early.

11. Common Mistakes

  • The "Colorblind" Approach: Assuming that simply deleting the "Race" or "Gender" columns from a database will make the algorithm fair. As we saw with Proxy Variables, this is not enough on its own. You must actively audit the *outputs* of the model across demographics, not just hide the inputs, which means keeping the demographic labels available for auditing even if the model never sees them.
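An output audit can be as simple as comparing approval rates per group. The sketch below uses made-up loan decisions; `selection_rates` and `disparate_impact_ratio` are hypothetical helper names. The ratio of the lowest to the highest approval rate is sometimes checked against a rule-of-thumb threshold such as 0.8, but the right threshold depends on the application.

```python
from collections import defaultdict

def selection_rates(decisions, groups):
    """Share of positive decisions (1 = approved) for each group."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for decision, group in zip(decisions, groups):
        total[group] += 1
        approved[group] += decision
    return {group: approved[group] / total[group] for group in total}

def disparate_impact_ratio(rates):
    """Lowest selection rate divided by the highest (1.0 means equal rates)."""
    return min(rates.values()) / max(rates.values())

# Made-up model outputs: group A approved 8 of 10, group B approved 4 of 10.
decisions = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6
groups = ["A"] * 10 + ["B"] * 10

rates = selection_rates(decisions, groups)
print(rates)                          # {'A': 0.8, 'B': 0.4}
print(disparate_impact_ratio(rates))  # 0.5
```

Note that the `groups` list is used only for the audit. The model itself never needs to see it.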

12. Exercises

  1. Explain the difference between Representation Bias (some groups are under-sampled in the dataset) and Historical Bias (the data is accurate but reflects a flawed reality).

13. MCQs with Answers

Question 1

What is the root cause of Algorithmic Bias in Machine Learning systems?

Question 2

An engineer deletes the "Race" and "Gender" columns from a dataset, but the AI still discriminates based on the user's "Zip Code." What is "Zip Code" acting as in this scenario?

14. Interview Questions

  • Q: Explain how a seemingly objective data point, like "past healthcare spending," can result in severe racial bias when used to predict a patient's medical needs.
  • Q: What steps would you take as a Data Scientist to ensure your training dataset does not suffer from Representation Bias?

15. FAQs

Q: Can we ever build a 100% perfectly fair AI?
A: Probably not, because philosophers cannot even agree on the definition of "Fairness." Does fairness mean everyone gets the exact same outcome, or does it mean everyone gets the exact same opportunity? AI Ethics is about minimizing harm and maximizing equity, but mathematical perfection is likely impossible.
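The tension between "same outcome" and "same opportunity" shows up even in tiny examples. In the made-up data below, both groups are approved at the same rate (equal outcome), yet qualified people in group B are approved less often than qualified people in group A (unequal opportunity). The helper names are illustrative only.

```python
def selection_rate(y_pred):
    """Share of people who receive a positive decision ("same outcome")."""
    return sum(y_pred) / len(y_pred)

def true_positive_rate(y_true, y_pred):
    """Share of qualified people who receive a positive decision
    ("same opportunity")."""
    preds_for_qualified = [pred for truth, pred in zip(y_true, y_pred) if truth == 1]
    return sum(preds_for_qualified) / len(preds_for_qualified)

# Made-up data: 1 = qualified (y_true) or approved (y_pred).
group_a = {"y_true": [1, 1, 0, 0], "y_pred": [1, 1, 0, 0]}
group_b = {"y_true": [1, 1, 1, 0], "y_pred": [1, 1, 0, 0]}

for name, data in (("A", group_a), ("B", group_b)):
    print(
        name,
        selection_rate(data["y_pred"]),
        round(true_positive_rate(data["y_true"], data["y_pred"]), 2),
    )
# A 0.5 1.0
# B 0.5 0.67
```

A model can therefore pass one fairness definition while failing another, which is part of why a single "perfectly fair" score is hard to define.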

16. Summary

In Chapter 5, we destroyed the myth of the "objective machine." AI models are mirrors reflecting human history. If our historical data is plagued by racism, sexism, and inequality, our AI models will learn, amplify, and automate those prejudices. By understanding proxy variables and strictly auditing our datasets for representation, ethical engineers can break this cycle and build algorithms that are fairer than the humans who programmed them.

17. Next Chapter Recommendation

We know AI can be biased. But how do we find out *why* it made a biased decision? Proceed to Chapter 6: AI Transparency and Explainability to look inside the Black Box.
