Classification Algorithms
CHAPTER 11 Intermediate

Naive Bayes Classification

Updated: May 16, 2026

1. Introduction

Most of the algorithms we have used so far work by drawing lines or other geometric boundaries through data points. But how do you draw a line through a text document? It is far from obvious, so text calls for a different approach. Naive Bayes sets geometry aside and relies purely on probability. It powered many of the early successful spam filters, and it remains one of the fastest and most effective baseline tools for Natural Language Processing (NLP).

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Understand the basics of Bayes' Theorem.
  • Explain why the algorithm is called "Naive".
  • Train a MultinomialNB model for text classification.
  • Train a GaussianNB model for continuous numerical data.
  • Build a basic Sentiment Analysis system.

3. Bayes' Theorem Simplified

Bayes' Theorem is a mathematical formula for calculating Conditional Probability: *What is the probability of an event happening, given that another event has already happened?*

In machine learning, we ask: *What is the probability that this email is SPAM, given that it contains the word "VIAGRA"?*

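  • The algorithm estimates this from historical data: how often does "VIAGRA" appear in past Spam emails, how often does it appear in normal emails, and how common is Spam overall? Bayes' Theorem combines these three numbers into one answer. If, for example, 99% of the past emails containing "VIAGRA" were Spam, the model confidently predicts Spam.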
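
To make the calculation concrete, here is a minimal sketch that applies Bayes' Theorem, P(spam | word) = P(word | spam) × P(spam) / P(word), to a made-up mailbox. All counts are illustrative assumptions, chosen so the result matches the 99% figure above.

```python
# Hypothetical counts from a made-up mailbox (illustrative only).
total_emails = 1000
spam_emails = 200        # emails labelled Spam
spam_with_word = 99      # Spam emails that contain the word
ham_with_word = 1        # normal emails that contain the word

# The three ingredients of Bayes' Theorem.
p_spam = spam_emails / total_emails                        # P(spam)
p_word_given_spam = spam_with_word / spam_emails           # P(word | spam)
p_word = (spam_with_word + ham_with_word) / total_emails   # P(word)

# P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(spam | word) = {p_spam_given_word:.2f}")
# Output: P(spam | word) = 0.99
```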

4. Why is it "Naive"?

If an email contains the phrase "Free Money", a human knows those two words belong together. The Naive Bayes algorithm is "Naive" because it assumes that, once the class (Spam or Not Spam) is known, every single word (feature) is completely independent of every other word. In other words, seeing "Free" tells the model nothing extra about whether "Money" will appear. This assumption is almost never true for real text, yet the algorithm works astonishingly well in practice: to pick the right class, the correct label only needs the highest score, not a perfectly accurate probability.
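
The independence assumption is what makes the math so simple: the score for a class is just its prior multiplied by one probability per word. The sketch below illustrates this with made-up probabilities (the `word_probs` and `priors` values and the `naive_score` helper are hypothetical, not learned from data). It adds logarithms instead of multiplying, a common trick to avoid very small numbers.

```python
import math

# Hypothetical per-class word probabilities (made-up numbers).
word_probs = {
    "spam": {"free": 0.05, "money": 0.04},
    "ham": {"free": 0.01, "money": 0.005},
}
priors = {"spam": 0.3, "ham": 0.7}  # assumed share of each class


def naive_score(words, label):
    """Log of: prior * product of per-word probabilities for this class."""
    score = math.log(priors[label])
    for word in words:
        # Each word contributes independently of the others ("naive").
        score += math.log(word_probs[label][word])
    return score


email = ["free", "money"]
scores = {label: naive_score(email, label) for label in priors}
best = max(scores, key=scores.get)

print(best)
# Output: spam
```

Note that "free" and "money" are scored separately; the model never asks how often they appear together.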

5. Gaussian vs. Multinomial

Scikit-learn offers different versions of Naive Bayes depending on your data:
  • Gaussian Naive Bayes (GaussianNB): Use this when your features are continuous numbers (like Height, Weight, Salary). It assumes that, within each class, every feature follows a bell curve (Normal Distribution).
  • Multinomial Naive Bayes (MultinomialNB): Use this when your features are discrete counts (like the number of times a word appears in an email). This is the standard choice for Text Classification.
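
As a minimal sketch of the Gaussian variant, the example below trains GaussianNB on synthetic height and weight measurements. The two classes, their means, and their spreads are invented for illustration and are generated with a fixed random seed.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic [height_cm, weight_kg] samples for two made-up classes.
class_0 = rng.normal(loc=[160.0, 55.0], scale=[5.0, 4.0], size=(50, 2))
class_1 = rng.normal(loc=[185.0, 90.0], scale=[5.0, 4.0], size=(50, 2))

X = np.vstack([class_0, class_1])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB()
model.fit(X, y)

# theta_ holds the mean the model learned for each feature in each class.
print(model.theta_.round(1))

# Classify two new people, one near each class centre.
pred = model.predict([[162.0, 57.0], [183.0, 88.0]])
print(pred)
```

Because the two synthetic groups are far apart, the new points should land in the class whose centre they sit next to.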

6. Mini Project: Sentiment Analysis System

Let's build a simple MultinomialNB model to classify movie reviews as Positive (1) or Negative (0). *Note: Algorithms cannot read text. We must convert the text into numerical word counts using a CountVectorizer first.*
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Historical Text Data
reviews = [
    "I loved this movie it was great",  # Positive
    "Best film of the year",            # Positive
    "Terrible movie complete garbage",  # Negative
    "I hated the acting it was bad",    # Negative
]
y_train = [1, 1, 0, 0]  # 1=Positive, 0=Negative

# 2. Build the Pipeline
# CountVectorizer converts the sentences into a matrix of word counts
# MultinomialNB calculates the probabilities
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 3. Train the Model
model.fit(reviews, y_train)

# 4. Make a Prediction!
new_review = ["The movie was totally terrible and bad"]
prediction = model.predict(new_review)

print(f"Predicted Sentiment: {'Positive' if prediction[0] == 1 else 'Negative'}")
# Output: Predicted Sentiment: Negative
```

7. The Zero-Frequency Problem (Laplace Smoothing)

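What if the word "Atrocious" appears in the Negative training reviews but *never* in a Positive one? The estimated probability of "Atrocious" given the Positive class is then 0 divided by the total number of Positive words, which is exactly 0. Because Naive Bayes multiplies the per-word probabilities together, that single 0 wipes out the entire Positive score, no matter what the other words in the review say. The Fix: MultinomialNB applies *Laplace Smoothing* by default (alpha=1.0), which adds a baseline count of "1" to every word in every class, ensuring the math never hits exactly zero. (A word that never appeared in training at all is simply ignored by CountVectorizer, because it is not part of the learned vocabulary.)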
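
The effect of smoothing can be checked by hand. The sketch below uses the standard additive-smoothing formula, (count + alpha) / (total + alpha × vocabulary size); the `smoothed_prob` helper and the counts are hypothetical and chosen only for illustration.

```python
def smoothed_prob(word_count, total_words, vocab_size, alpha=1.0):
    """Additive (Laplace) smoothing: (count + alpha) / (total + alpha * V)."""
    return (word_count + alpha) / (total_words + alpha * vocab_size)


# Hypothetical class: 11 words seen in total, vocabulary of 17 distinct words,
# and the word we care about was seen 0 times in this class.
unsmoothed = smoothed_prob(0, 11, 17, alpha=0.0)  # plain 0 / 11
smoothed = smoothed_prob(0, 11, 17, alpha=1.0)    # (0 + 1) / (11 + 17)

print(unsmoothed)
# Output: 0.0
print(round(smoothed, 4))
# Output: 0.0357
```

With alpha=0 the zero would cancel every other word in the product; with alpha=1 the word merely counts as unlikely.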

8. Common Mistakes

  • Using GaussianNB for Text: Word counts are non-negative integers that are mostly zero, which looks nothing like a bell curve. If you pass them into GaussianNB, its Normal Distribution assumption fits poorly and accuracy usually suffers. Always match the algorithm to the data type.
  • Using Naive Bayes for complex numerical patterns: While NB is incredible for text, it is often outperformed by Random Forests and Logistic Regression on standard tabular (CSV) data, because the "naive" assumption of feature independence rarely holds for real-world features such as those in finance or healthcare.

9. Best Practices

  • Text Classification Baseline: Whenever you face an NLP problem (Spam, Sentiment, Topic Categorization), run a CountVectorizer + MultinomialNB pipeline first. It takes about 3 lines of code, trains very quickly, and often gives a surprisingly strong baseline without any tuning, so you know what score a more complex model has to beat.

10. Exercises

  1. According to the "Naive" assumption in Naive Bayes, what is the relationship between the features in the dataset?
  2. Which variant of Naive Bayes should be used for predicting categories based on text frequency counts?

11. MCQ Quiz with Answers

Question 1

Why is the Naive Bayes algorithm called "Naive"?

Question 2

To train a Naive Bayes model on raw English sentences, what must be done to the text first?

12. Interview Questions

  • Q: Explain Bayes' Theorem in the context of a Spam Filter. What exactly is the model calculating?
  • Q: What is Laplace Smoothing, and why is it mathematically required in a Multinomial Naive Bayes text classifier?

13. FAQs

Q: Is Naive Bayes still used today, or has Deep Learning replaced it?
A: Deep Learning (Transformers/LLMs) is usually far more accurate on complex NLP tasks. However, Naive Bayes needs only a tiny fraction of the computing power, trains in milliseconds on small datasets, and works with very little data. It is still used for high-speed, lightweight text filtering.

14. Summary

By stepping away from geometric boundaries and embracing the laws of probability, Naive Bayes offers a uniquely fast and scalable approach to classification. While its "naive" assumption of feature independence makes it less ideal for complex tabular data, it remains a go-to baseline for Natural Language Processing.

15. Next Chapter Recommendation

We have learned single algorithms like Logistic Regression, SVM, and Naive Bayes. But the modern AI industry rarely uses single algorithms. In Chapter 12: Ensemble Learning and Boosting, we will learn how to combine hundreds of models together to build Kaggle-winning superpowers like Gradient Boosting and AdaBoost.
