Chapter 12 of 20

Naive Bayes

Learn how Naive Bayes turns Bayes' theorem into a fast probabilistic classifier: the conditional-independence assumption, the GaussianNB, MultinomialNB and BernoulliNB variants, a full spam-detection pipeline with CountVectorizer, and Laplace smoothing.

Meritshot · 16 min read
Tags: Machine Learning, Naive Bayes, Bayes Theorem, Text Classification, Probability, Scikit-Learn

What Is Naive Bayes?

Naive Bayes is a family of probabilistic classifiers built directly on Bayes' theorem. Instead of learning a decision boundary directly like Logistic Regression, or voting among nearby training points like K-Nearest Neighbors, it asks a very natural question: given everything I can observe about this example, which class is most probable? It then picks that class.

It is called "naive" because it makes one bold simplifying assumption — that every feature is conditionally independent of every other feature, given the class. That assumption is almost never literally true in the real world, yet the classifier works remarkably well, trains in a single pass over the data, and needs very little tuning. This combination of speed, simplicity, and surprisingly strong accuracy is why Naive Bayes remains the go-to baseline for text classification: spam filters, sentiment tagging, topic labelling, and news categorisation.

Intuitive analogy. Imagine Priya sorting her incoming email into "spam" and "not spam." She doesn't reason about the whole message at once. She glances at individual clues — the word "lottery," the word "invoice," a suspicious link, the sender's name — and each clue independently nudges her belief. "Lottery" pushes toward spam; "invoice from my accountant" pushes toward legitimate. She combines all those nudges and goes with whichever verdict the evidence favours most. Naive Bayes formalises exactly this: it treats each word as a separate piece of evidence, multiplies their effects together, and returns the class with the highest combined probability.

Goal: compute the probability of each class given the observed features, and predict the class with the maximum posterior probability.

Examples:
→ Classify an email as spam or ham (the classic use case)
→ Tag a product review as positive / negative / neutral
→ Route a support ticket to the right department by its text
→ Detect the language of a short piece of text

Bayes' Theorem: The Foundation

Everything starts with Bayes' theorem, which tells us how to update a belief after seeing evidence. Written with a class C and observed features X:

Bayes' theorem:

           P(X | C) · P(C)
P(C | X) = ---------------
                P(X)

Where:
P(C | X) = POSTERIOR  — probability of class C given the features X
P(X | C) = LIKELIHOOD — probability of seeing features X within class C
P(C)     = PRIOR      — how common class C is before seeing any features
P(X)     = EVIDENCE   — probability of the features under any class

For classification we compare the posterior across every class and keep the winner. Because the denominator P(X) is the same for all classes, it doesn't change which class is largest — so we can drop it and work with a proportionality:

P(C | X) ∝ P(X | C) · P(C)

Predicted class = argmax over C of  [ P(X | C) · P(C) ]

That is the whole idea. The only hard part is estimating the likelihood P(X | C) when X is a whole vector of features. That is where the "naive" assumption rescues us.

The "Naive" Conditional-Independence Assumption

Suppose an example has features X = (x₁, x₂, ..., xₙ). In general the joint likelihood P(x₁, x₂, ..., xₙ | C) is impractical to estimate reliably: the number of possible feature combinations grows exponentially with n, so we would never have enough data to see each one.

The naive assumption cuts this knot: assume every feature is conditionally independent of the others given the class. That lets us break the joint likelihood into a simple product of per-feature terms:

Naive conditional-independence assumption:

P(x₁, x₂, ..., xₙ | C) = P(x₁ | C) · P(x₂ | C) · ... · P(xₙ | C)
                       = ∏  P(xᵢ | C)

So the full decision rule becomes:

Predicted class = argmax over C of  [ P(C) · ∏ P(xᵢ | C) ]

Now each P(xᵢ | C) is a one-dimensional quantity we can estimate easily by counting or by fitting a simple distribution.
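The decision rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the priors, the per-feature likelihood values, the feature names and the `predict` helper are all hypothetical, and each listed feature is treated as an observed event.

```python
from math import prod

# Hypothetical priors and per-feature likelihoods P(x_i | C), made up for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"has_link": 0.7, "mentions_prize": 0.5},
    "ham": {"has_link": 0.2, "mentions_prize": 0.05},
}

def predict(observed_features):
    """Return (best_class, scores) where score = P(C) * product of P(x_i | C)."""
    scores = {
        c: priors[c] * prod(likelihoods[c][f] for f in observed_features)
        for c in priors
    }
    return max(scores, key=scores.get), scores

label, scores = predict(["has_link", "mentions_prize"])
print(label, scores)
```

With these made-up numbers the spam score is 0.4 × 0.7 × 0.5 = 0.14 and the ham score is 0.6 × 0.2 × 0.05 = 0.006, so the argmax is spam.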

Why It Still Works Despite Being Wrong

In real data, features are correlated — in an email, the word "free" and the word "prize" tend to appear together, so they are not truly independent. So why does Naive Bayes still classify well?

  • Classification only needs the ranking, not the exact probability. Even when the assumption distorts the magnitude of P(C | X), it often preserves which class is largest. The argmax survives even when the numbers are miscalibrated.
  • Correlated features push in the same direction. They may over-count evidence, but usually toward the correct class, so the winner is unchanged.
  • Low variance beats low bias here. With few parameters to estimate, Naive Bayes is stable on small datasets where more flexible models overfit.

A practical caveat: because correlated evidence gets double-counted, the predicted probabilities themselves are often poorly calibrated (pushed toward 0 or 1). Trust the predicted class more than the raw predict_proba numbers.
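One way to see the double-counting effect is to feed the same piece of evidence in several times, as a perfectly correlated feature would. The sketch below uses hypothetical probabilities and a hypothetical helper name, `posterior_spam`; it only illustrates the arithmetic.

```python
def posterior_spam(prior_spam, p_word_spam, p_word_ham, copies):
    """Posterior P(spam | evidence) when the same word is counted `copies` times."""
    spam_score = prior_spam * p_word_spam ** copies
    ham_score = (1 - prior_spam) * p_word_ham ** copies
    return spam_score / (spam_score + ham_score)

# Hypothetical values: the word is 3x more likely in spam than in ham.
once = posterior_spam(0.5, 0.3, 0.1, copies=1)
thrice = posterior_spam(0.5, 0.3, 0.1, copies=3)
print(round(once, 3), round(thrice, 3))  # 0.75 0.964
```

The winning class is spam in both cases, but counting one clue three times moves the reported probability from 0.75 to about 0.96. The ranking survives; the calibration does not.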

Working the Numbers by Hand

A tiny worked example makes the mechanics concrete. Suppose we classify weather as "Play" or "No Play" using one feature, Outlook.

Training counts:
                 Play=Yes   Play=No
Outlook=Sunny        2         3
Outlook=Overcast     4         0
Outlook=Rainy        3         2
Totals               9         5   → 14 examples

Priors:
P(Yes) = 9/14 = 0.643
P(No)  = 5/14 = 0.357

New day: Outlook = Sunny. Which class?

Likelihoods:
P(Sunny | Yes) = 2/9 = 0.222
P(Sunny | No)  = 3/5 = 0.600

Posteriors (proportional):
P(Yes | Sunny) ∝ 0.222 × 0.643 = 0.143
P(No  | Sunny) ∝ 0.600 × 0.357 = 0.214

argmax → No Play (0.214 > 0.143)

Normalised: P(No | Sunny) = 0.214 / (0.214 + 0.143) = 0.60

With more features you would multiply in one P(xᵢ | C) term per feature before taking the argmax.
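The hand calculation can be checked with a short script. This sketch copies the counts from the Outlook table and uses exact fractions so no rounding creeps in; the `posteriors` helper is a name introduced here for illustration, and it applies no smoothing.

```python
from fractions import Fraction

# Training counts copied from the Outlook table above.
counts = {
    "Yes": {"Sunny": 2, "Overcast": 4, "Rainy": 3},
    "No": {"Sunny": 3, "Overcast": 0, "Rainy": 2},
}
class_totals = {c: sum(v.values()) for c, v in counts.items()}  # Yes: 9, No: 5
n_examples = sum(class_totals.values())                         # 14

def posteriors(outlook):
    """Normalised P(class | outlook) from raw counts, with no smoothing."""
    unnormalised = {
        c: Fraction(class_totals[c], n_examples)            # prior P(C)
        * Fraction(counts[c][outlook], class_totals[c])     # likelihood P(outlook | C)
        for c in counts
    }
    evidence = sum(unnormalised.values())
    return {c: score / evidence for c, score in unnormalised.items()}

result = posteriors("Sunny")
print({c: float(p) for c, p in result.items()})  # {'Yes': 0.4, 'No': 0.6}
```

The normalised posterior for No Play comes out at exactly 3/5, matching the 0.60 computed by hand.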

Working in Log Space

Multiplying many small probabilities together quickly underflows to zero in floating-point arithmetic — a 500-word email would multiply 500 numbers, each less than 1. The standard fix is to work with the logarithm of the posterior. Because log is monotonic, the argmax is unchanged, and products turn into sums:

log P(C | X) = log P(C) + Σ log P(xᵢ | C) − log P(X)
(log P(X) is the same for every class, so it drops out of the argmax)

Multiplying probabilities  →  adding log-probabilities (numerically safe)

Every scikit-learn Naive Bayes estimator does this internally, which is why it stays stable even on long documents.
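The underflow is easy to reproduce. In this sketch the 500 per-word likelihoods are all set to a hypothetical value of 0.01 to keep the arithmetic obvious; real likelihoods would vary from word to word.

```python
import math

# 500 hypothetical per-word likelihoods, each 0.01 (values chosen for illustration).
word_probs = [0.01] * 500

naive_product = math.prod(word_probs)              # underflows to exactly 0.0
log_score = sum(math.log(p) for p in word_probs)   # stays finite

print(naive_product)        # 0.0
print(round(log_score, 1))  # -2302.6
```

The true product is 10⁻¹⁰⁰⁰, far below the smallest positive double, so the direct product collapses to zero while the log-sum remains an ordinary number that can still be compared across classes.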

The Three Main Variants

The variants of Naive Bayes differ only in how they model P(xᵢ | C) — the likelihood of a single feature. Choosing the right one for your data type is the single most important decision.

GaussianNB — Continuous Features

GaussianNB assumes each continuous feature follows a normal (Gaussian) distribution within each class. During training it just estimates the mean and variance of each feature per class; at prediction time it plugs the feature value into the Gaussian density.

For a continuous feature x in class C with mean μ and variance σ²:

               1                (x − μ)²
P(x | C) = ---------- · exp( − ---------- )
           √(2π σ²)              2 σ²

Use for: numeric measurements — height, income, sensor readings, petal length.
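As a rough sketch of what the density formula computes, the function below evaluates it directly. The class names and the per-class mean and variance values are hypothetical, chosen only to show one class fitting a measurement far better than the other.

```python
import math

def gaussian_likelihood(x, mean, variance):
    """Gaussian density P(x | C) for one feature, given that class's mean and variance."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)

# Hypothetical per-class (mean, variance) for a petal-length feature in cm.
stats = {"short_petal": (1.5, 0.04), "long_petal": (5.5, 0.30)}

x = 1.7
for name, (mean, variance) in stats.items():
    print(name, gaussian_likelihood(x, mean, variance))
```

Note that this is a density, not a probability, so values above 1 are perfectly legitimate when the variance is small. Only the comparison between classes matters.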

MultinomialNB — Counts (Text)

MultinomialNB models feature values as counts — how many times each word (or token) appears. It is the workhorse of text classification, pairing naturally with word-count or TF-IDF features. P(word | C) is essentially the fraction of all words in class C that are this word.

P(wordᵢ | C) = (count of wordᵢ in class C + α) / (total words in class C + α·V)

V = vocabulary size, α = smoothing constant (see below)

Use for: bag-of-words / TF-IDF text features, and any non-negative count data.
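To connect the formula to the library, the sketch below computes the smoothed estimate by hand and compares it with what a fitted MultinomialNB stores in feature_log_prob_. It assumes NumPy and scikit-learn are installed; the count matrix is a hypothetical four-document, three-word example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents x 3 vocabulary words.
X = np.array([
    [2, 1, 0],
    [3, 0, 0],
    [0, 1, 3],
    [0, 2, 2],
])
y = np.array([0, 0, 1, 1])
alpha = 1.0

# Manual estimate: (count of word in class + alpha) / (total words in class + alpha * V)
V = X.shape[1]
manual = np.vstack([
    (X[y == c].sum(axis=0) + alpha) / (X[y == c].sum() + alpha * V)
    for c in (0, 1)
])

clf = MultinomialNB(alpha=alpha).fit(X, y)
fitted = np.exp(clf.feature_log_prob_)  # feature_log_prob_ stores log P(word | class)

print(manual.round(3))
print(np.allclose(manual, fitted))
```

Each row of the manual matrix sums to 1, because the smoothed estimates form a probability distribution over the vocabulary for that class.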

BernoulliNB — Binary Features

BernoulliNB models each feature as a boolean present/absent flag. Unlike MultinomialNB it explicitly accounts for words that are absent — the non-occurrence of a word is itself evidence. It shines on short texts (tweets, SMS, subject lines) where whether a word appears matters more than how often.

Each feature is 0/1. For each word the model estimates P(word present | C),
and the absence of an expected word actively lowers a class's score.

Use for: binary/boolean features, short documents, presence-absence signals.
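The role of absent words is easiest to see in a hand-rolled score. The sketch below is an illustration under assumed numbers: the three-word vocabulary, the P(word present | class) values and the helper name `bernoulli_log_likelihood` are all hypothetical, and the class prior is left out to keep the focus on the likelihood.

```python
import math

# Hypothetical P(word present | class) for a three-word vocabulary.
p_present = {
    "spam": {"free": 0.8, "win": 0.6, "meeting": 0.05},
    "ham": {"free": 0.1, "win": 0.05, "meeting": 0.5},
}

def bernoulli_log_likelihood(present_words, cls):
    """Sum log P(x_i | C) over the WHOLE vocabulary, present and absent words alike."""
    total = 0.0
    for word, p in p_present[cls].items():
        total += math.log(p) if word in present_words else math.log(1.0 - p)
    return total

message = {"free"}  # "win" and "meeting" are absent
for cls in p_present:
    print(cls, round(bernoulli_log_likelihood(message, cls), 3))
```

Because "win" is expected in spam (0.6), its absence contributes log(0.4) rather than log(0.6) to the spam score. A multinomial model would simply skip the word.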

Choosing a Variant

Variant         Feature type                        Typical use                                       Models absence?
GaussianNB      Continuous, real-valued             Sensor data, measurements, mixed numeric tables   N/A
MultinomialNB   Non-negative counts / frequencies   Longer documents, bag-of-words, TF-IDF            No
BernoulliNB     Binary present/absent               Short texts, SMS, subject lines, boolean flags    Yes

Laplace / Additive Smoothing (alpha)

Here is a subtle but fatal problem. Suppose the word "meritshot" never appeared in any spam email during training. Then P("meritshot" | spam) = 0. Because the likelihoods are multiplied, a single zero wipes out the entire product — the email can never be classified as spam no matter how many other spammy words it contains. This is the zero-frequency problem.

Laplace smoothing (also called additive smoothing) fixes it by pretending we saw every feature at least a little. We add a small constant α to every count:

Smoothed likelihood for a word:

              count(word, C) + α
P(word | C) = ---------------------
              total(words in C) + α · V

α  = smoothing strength (α = 1 is classic "Laplace"; α < 1 is "Lidstone")
V  = vocabulary size (number of distinct features)

α = 1   → strong smoothing, never zero
α → 0   → little smoothing, closer to raw counts (risky: zeros reappear)

In scikit-learn this is the alpha hyperparameter, and it is essentially the only knob you tune for the text variants. The default alpha = 1.0 is a sensible starting point; tune it with cross-validation (values roughly in the range 0.01 to 1.0 are common). Setting alpha = 0 asks for no smoothing at all and reopens the zero-frequency problem; depending on the version and the force_alpha setting, scikit-learn may instead clip such a value to a tiny positive number and warn.
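The effect of α on an unseen word can be checked with the smoothing formula alone. This is a plain-Python sketch, independent of scikit-learn; the token count of 1,000 and vocabulary size of 5,000 are hypothetical, and `smoothed_likelihood` is a helper name introduced here.

```python
def smoothed_likelihood(word_count, total_words, vocab_size, alpha):
    """P(word | C) with additive smoothing; alpha=0 gives the raw relative frequency."""
    return (word_count + alpha) / (total_words + alpha * vocab_size)

# Hypothetical counts: the word never appeared among 1,000 spam tokens, vocabulary of 5,000.
for alpha in (0.0, 0.01, 1.0):
    p = smoothed_likelihood(word_count=0, total_words=1000, vocab_size=5000, alpha=alpha)
    print(alpha, p)
```

With α = 0 the estimate is exactly zero, which would wipe out the whole product. Any positive α gives a small but non-zero likelihood, and larger values pull every word's estimate closer to the uniform 1/V.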

A Complete Spam-Detection Pipeline

Let's build the classic example end to end: classify SMS messages as spam or ham (legitimate). We turn raw text into word counts with CountVectorizer, then feed those counts to MultinomialNB. Using a scikit-learn Pipeline keeps the vectoriser and the classifier bundled so the same text transformation is applied to training and to new messages.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# A small illustrative dataset (in practice load a CSV of thousands of messages)
messages = [
    "Congratulations you WON a free lottery prize claim now",
    "Free entry win cash click this link urgently",
    "URGENT your account has won a cash reward call now",
    "Claim your free ringtone reply WIN to 80085",
    "Hey Rahul are we still meeting for lunch at 1pm",
    "Please review the attached invoice from the accounts team",
    "Priya sent the project report can you check it tonight",
    "Reminder your electricity bill of Rs 1450 is due tomorrow",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# Pipeline: raw text -> word-count matrix -> Multinomial Naive Bayes
model = Pipeline([
    ("vectorizer", CountVectorizer(lowercase=True, stop_words="english")),
    ("classifier", MultinomialNB(alpha=1.0)),
])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Illustrative output (small toy dataset — real numbers depend on the full corpus):

[[1 0]
 [0 1]]

              precision    recall  f1-score   support
         ham       1.00      1.00      1.00         1
        spam       1.00      1.00      1.00         1
    accuracy                           1.00         2

Once fitted, classifying a brand-new message is instant:

new_messages = [
    "You have won a free prize, claim your reward now!",
    "Hi Priya, can you share the sales report by evening?",
]
print(model.predict(new_messages))
print(model.predict_proba(new_messages).round(3))
Illustrative output:

['spam' 'ham']
[[0.041 0.959]      # P(ham), P(spam) for message 1  -> leans spam
 [0.972 0.028]]     # for message 2                  -> leans ham

CountVectorizer vs TfidfVectorizer

CountVectorizer produces raw word counts. TfidfVectorizer re-weights those counts by TF-IDF, which downweights words that appear in almost every document (like "the") and upweights rarer, more distinctive words. Swapping one for the other is a one-line change in the pipeline:

from sklearn.feature_extraction.text import TfidfVectorizer

model = Pipeline([
    ("vectorizer", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("classifier", MultinomialNB(alpha=0.5)),
])

TF-IDF often gives a small accuracy bump for longer documents; for very short messages plain counts (or BernoulliNB on binary presence) can be just as good. Try both with cross-validation — see the Train-Test Split & Cross-Validation chapter.

A Continuous-Feature Example with GaussianNB

When your features are numeric measurements rather than text, reach for GaussianNB. Here we classify iris flowers from their petal and sepal measurements.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)   # 4 continuous features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
Illustrative output:

Accuracy: 0.956    # GaussianNB is a strong, fast baseline here

Notice there is no alpha here — smoothing is a counting concern, so it applies to the text variants, not to GaussianNB, which instead estimates a mean and variance per feature per class. (GaussianNB does expose a tiny var_smoothing parameter that stabilises variance estimates, but you rarely touch it.)

Pros and Cons

Aspect                    Details
Speed                     Extremely fast to train and predict: a single pass over the data, closed-form estimates
Data efficiency           Works well even with small training sets and very high-dimensional text
Scalability               Handles tens of thousands of features (words) with ease; supports partial_fit for streaming
Simplicity                Almost no tuning; really just alpha for the text variants
Baseline value            An excellent first model to beat before trying heavier algorithms
Independence assumption   Naive assumption is usually violated; strongly correlated features get over-counted
Probability calibration   Predicted probabilities are often pushed toward 0 or 1; trust the class, not the raw score
Feature-type sensitivity  Must match the variant to the data (Gaussian for continuous, Multinomial/Bernoulli for text)
Expressiveness            Cannot learn feature interactions; a boundary needing correlated features will elude it

When to use it: text and document classification, spam filtering, real-time or streaming classification, and any time you need a fast, strong baseline. When to look elsewhere: when feature interactions clearly matter, consider Support Vector Machines, Random Forests, or other ensemble methods; when you need well-calibrated probabilities, calibrate the model's output or choose a model whose probabilities are more reliable, such as Logistic Regression.

Common Mistakes

1. Using GaussianNB on non-continuous data

GaussianNB assumes each feature is normally distributed. Feeding it word counts, one-hot categories, or heavily skewed values violates that assumption and quietly hurts accuracy. Use MultinomialNB for counts, BernoulliNB for binary flags, and reserve GaussianNB for genuinely continuous numeric measurements.

2. Disabling smoothing and hitting the zero-frequency trap

Setting alpha = 0 (or forgetting smoothing entirely) means any feature never seen with a class gets likelihood zero, which zeroes the entire product. One unseen word can make a class impossible. Always keep alpha greater than 0 — the default alpha = 1.0 is a safe start.

3. Throwing strongly correlated features at it

The naive assumption over-counts correlated evidence. If you include the raw text column and ten hand-crafted features derived from that same text, they all pull in the same direction and inflate one class's score. Drop redundant features or use a model that handles correlation.

4. Fitting the vectoriser on the full dataset before splitting

Calling CountVectorizer().fit_transform(all_data) and then splitting leaks vocabulary statistics from the test set into training — data leakage. Always fit the vectoriser inside a Pipeline (or on the training fold only) so the test set stays truly unseen.

5. Trusting predict_proba as a true confidence

Because likelihoods are multiplied under a false independence assumption, Naive Bayes probabilities are usually mis-calibrated and cluster near 0 or 1. Use predict_proba for ranking if you must, but don't read 0.98 as "98% sure." Calibrate with CalibratedClassifierCV if you need honest probabilities.

6. Mismatching training and prediction text preprocessing

If you lowercase and strip stop-words during training but skip that on new inputs, the token counts won't line up and predictions degrade. Bundling the vectoriser and classifier in a single Pipeline guarantees identical preprocessing on both sides.
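Mistakes 4 and 6 share one remedy: keep the vectoriser inside the Pipeline and hand the whole pipeline to the evaluation routine. The sketch below assumes scikit-learn is installed and uses a tiny hypothetical corpus, so the scores it prints say nothing about real-world accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, just large enough to run 2-fold cross-validation.
texts = [
    "win a free prize now", "free cash reward claim now",
    "claim your free lottery prize", "urgent cash prize win now",
    "lunch meeting at noon", "please review the invoice",
    "project report due tonight", "team meeting moved to friday",
]
labels = ["spam"] * 4 + ["ham"] * 4

# The vectoriser lives inside the pipeline, so each fold refits its vocabulary
# on that fold's training messages only and reuses it unchanged on the held-out ones.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
scores = cross_val_score(pipeline, texts, labels, cv=2)

print(scores)
```

Because cross_val_score clones and refits the entire pipeline per fold, the held-out messages never influence the vocabulary, and the same preprocessing is applied at fit and predict time.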

Practice Exercises

  1. By hand. Using the weather table from earlier, compute P(Yes | Overcast) and P(No | Overcast). Which class wins, and why does the zero count for No, Overcast make smoothing important here?

  2. Smoothing sweep. Take the spam pipeline and loop alpha over [0.01, 0.1, 0.5, 1.0, 2.0], using cross-validation to record accuracy for each. Which value works best on your data, and what happens as alpha grows large?

  3. Count vs TF-IDF. Build two pipelines — one with CountVectorizer, one with TfidfVectorizer — on the same messages. Compare their cross-validated F1 scores and explain any difference.

  4. Variant match-up. On the same short-message dataset, compare MultinomialNB (with CountVectorizer) against BernoulliNB (with CountVectorizer(binary=True)). Which variant does better on short texts, and why does modelling word absence help?

  5. Right tool check. You are given a table of customers with continuous features (age, monthly spend in ₹, tenure in months) and a churn label. Which Naive Bayes variant is appropriate, and what would go wrong if you used MultinomialNB instead?

  6. Calibration probe. Fit MultinomialNB on the spam data, print predict_proba for several test messages, and discuss whether the confidence values look realistic given the independence assumption.

Summary

In this chapter you learned:

  • Naive Bayes is a probabilistic classifier that applies Bayes' theorem, P(C | X) ∝ P(X | C) · P(C), and predicts the class with the highest posterior.
  • The naive assumption treats features as conditionally independent given the class, turning the joint likelihood into a product: P(X | C) = ∏ P(xᵢ | C).
  • It still works well because classification only needs the ranking of classes, not perfectly calibrated probabilities.
  • Computations run in log space (sums of log probabilities) to avoid numerical underflow on long documents.
  • Three variants match three data types: GaussianNB for continuous features, MultinomialNB for counts/text, and BernoulliNB for binary present/absent features.
  • Laplace / additive smoothing (alpha) adds a small count to every feature so unseen features don't zero out the whole product — never set alpha = 0.
  • The classic pipeline is CountVectorizer (or TfidfVectorizer) chained into MultinomialNB, giving a blazing-fast, strong text-classification baseline.
  • Watch out for using Gaussian on non-continuous data, disabling smoothing, correlated features, vectoriser data leakage, and over-trusting predict_proba.

Naive Bayes gives you a fast, dependable baseline — especially for text — that any fancier model has to earn its keep against.

Next up: Support Vector Machines (SVM) — a geometric classifier that finds the maximum-margin boundary between classes and, via the kernel trick, bends that boundary into powerful non-linear shapes.