Case Study: Naive Bayes Spam Classifier
1. What is Naive Bayes?
Before the era of Transformers (BERT, GPT), the internet was saved from drowning in spam by a simple, elegant algorithm: Naive Bayes. Even today, it remains a baseline for text classification due to its speed and interpretability.
In this case study, we will design a Spam Filtering System that classifies emails as either Spam (Junk) or Ham (Legitimate).
Try it yourself: Open your email’s “Spam” folder. Notice how it catches “Urgent: You won $1,000,000” but lets “Meeting at 5pm” through. We will build the math engine behind this.
2. Requirements & Goals
Functional Requirements
- Classify: Given an email text, predict P(Spam | Text).
- Explain: The system must be interpretable (Why was this marked as spam?).
- Learn: The system should update its probabilities as new spam trends emerge.
Non-Functional Requirements
- Latency: Inference must be under 10ms per email (Real-time).
- Precision: We must minimize False Positives. Marking a legitimate job offer as spam is worse than letting a spam email through.
- Scalability: Must scale horizontally to handle around a billion emails per day (tens of thousands per second).
3. Capacity Estimation
- Traffic: 1 Billion emails/day ≈ 12,000 emails/sec.
- Vocabulary: English has ~170,000 words. We might track top 50,000 features.
- Storage:
  - We store counts Count(Word | Spam) and Count(Word | Ham).
  - 50,000 words × 2 classes × 4 bytes (int) ≈ 400 KB.
- Conclusion: The entire model fits in CPU cache! This is why Naive Bayes is blazingly fast.
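A quick sanity check of that arithmetic (the constants are just the estimates above; nothing here is tied to a real deployment):

```python
# Back-of-the-envelope model size, using the estimates from this section.
VOCAB_SIZE = 50_000       # top features we track
NUM_CLASSES = 2           # Spam, Ham
BYTES_PER_COUNT = 4       # 32-bit integer counter

total_bytes = VOCAB_SIZE * NUM_CLASSES * BYTES_PER_COUNT
print(f"Model size: {total_bytes / 1024:.0f} KB")   # ~391 KB, i.e. roughly 400 KB
```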
4. System APIs
We define a simple REST API for our inference service.
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/classify | Returns spam_score and verdict for a given text. |
| POST | /v1/feedback | User marks an email as “Not Spam” (updates counters). |
// Response Example
{
"verdict": "SPAM",
"confidence": 0.998,
"top_features": ["winner", "free", "cash"]
}
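For illustration, here is what a client call against /v1/classify might look like. The host name is a placeholder, and the request body shape ({"text": ...}) is an assumption, not part of the spec above.

```python
import requests

# Hypothetical host; /v1/classify matches the API table above.
# The request body shape is assumed for illustration.
response = requests.post(
    "https://spam-filter.example.com/v1/classify",
    json={"text": "Congratulations! You won $1,000,000. Claim now!"},
    timeout=1.0,
)
result = response.json()
print(result["verdict"], result["confidence"], result["top_features"])
```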
5. Database Design
For training, we need to store the raw counts. A simple Key-Value store (Redis) or Columnar Store (Cassandra) works well.
Schema: WordCounts
| Key | Count (Spam) | Count (Ham) |
| :--- | :--- | :--- |
| word:money | 14,050 | 200 |
| word:meeting | 5 | 8,900 |
We also need global counters: TotalSpamEmails and TotalHamEmails to calculate Priors.
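A sketch of how those counters could be maintained with Redis (using the redis-py client); the key names mirror the schema above, and record_email is an illustrative helper, not a prescribed interface.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_email(tokens, is_spam):
    """Increment per-word and global counters for one labeled email."""
    field = "spam" if is_spam else "ham"
    for token in tokens:
        r.hincrby(f"word:{token.lower()}", field, 1)   # e.g. word:money -> {spam: n, ham: m}
    r.incr("TotalSpamEmails" if is_spam else "TotalHamEmails")

record_email(["Win", "Money", "Now"], is_spam=True)
```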
6. High-Level Design
- Client sends email to Load Balancer.
- Inference Service tokenizes text and queries the Model Cache (In-Memory).
- Model Cache holds the pre-calculated Log-Probabilities.
- Feedback Loop: When users click “Report Spam”, events are sent to a Kafka topic.
- Training Worker consumes Kafka, updates DB, and recalculates Log-Probs to update the Cache.
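To make the feedback loop concrete, here is a sketch of the Training Worker's consume loop, assuming the kafka-python client, an illustrative topic name (spam-feedback), and a simple event shape; the in-memory dicts stand in for the DB update and cache refresh.

```python
import json
from collections import defaultdict
from kafka import KafkaConsumer  # assuming the kafka-python client

# Illustrative topic/broker; events look like {"tokens": ["win", "money"], "label": "spam"}.
consumer = KafkaConsumer(
    "spam-feedback",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

word_counts = {"spam": defaultdict(int), "ham": defaultdict(int)}
email_counts = {"spam": 0, "ham": 0}

for message in consumer:
    event = message.value
    label = event["label"]
    email_counts[label] += 1
    for token in event["tokens"]:
        word_counts[label][token] += 1
    # Periodically: recompute log-probabilities and push the refreshed model to the cache.
```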
7. Component Design: The Algorithm
Our Goal: Calculate the probability that a message is Spam given its words w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>.
7.1 Bayes’ Theorem
P(Spam | Message) = [ P(Message | Spam) · P(Spam) ] / P(Message)
We can ignore the denominator P(Message) because it is the same for both Spam and Ham calculations. We just compare the numerators.
7.2 The “Naive” Independence Assumption
Calculating P(w<sub>1</sub>, w<sub>2</sub> ... | Spam) directly is infeasible (we’d need a dataset containing every possible sentence).
So, we assume word independence:
P(w<sub>1</sub>, w<sub>2</sub>, …, w<sub>n</sub> | Spam) ≈ P(w<sub>1</sub> | Spam) · P(w<sub>2</sub> | Spam) · … · P(w<sub>n</sub> | Spam)
This is “Naive” because “Artificial” and “Intelligence” are clearly correlated, but we treat them as independent coin flips.
7.3 Engineering Challenge: Arithmetic Underflow
Probabilities are small numbers (e.g., 0.0001). Multiplying hundreds of them results in numbers so small that computers round them to zero.
Solution: Work in Log Space.
log(a · b) = log(a) + log(b)
Our score becomes a linear sum:
Score = log(P(Spam)) + ∑ log(P(w<sub>i</sub> | Spam))
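A minimal demonstration of why this matters, plus a scoring function that mirrors the formula above (all names are illustrative):

```python
import math

# Multiplying a few hundred small probabilities underflows to exactly 0.0 ...
p = 0.0001
print(p ** 200)             # 0.0  (underflow)
# ... but the equivalent log-space sum is an ordinary float.
print(200 * math.log(p))    # about -1842.07

def spam_log_score(tokens, log_prior_spam, log_prob_given_spam):
    """Score = log(P(Spam)) + sum of log(P(w_i | Spam))."""
    return log_prior_spam + sum(log_prob_given_spam[t] for t in tokens)
```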
8. Reliability & Edge Cases
The Zero Probability Problem
What if the word “Bitcoin” never appeared in our Spam training set?
P("Bitcoin" | Spam) = 0.
log(0) = -∞.
The entire sum becomes -∞: a single unseen word forces the Spam score to negative infinity, no matter how spammy the rest of the message is.
Solution: Laplace Smoothing
We add a small count (usually α=1) to every word count.
P(w | C) = (count(w, C) + α) / (count(C) + α · V)
Where V is the vocabulary size and count(C) is the total number of word tokens in class C. This ensures no probability is ever exactly zero.
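In code, smoothing is a one-line change to how each log-probability is computed (a sketch; the argument names are illustrative):

```python
import math

def smoothed_log_prob(word_count_in_class, total_words_in_class, vocab_size, alpha=1.0):
    """Laplace-smoothed log P(w | C): an unseen word gets a small non-zero
    probability instead of log(0) = -infinity."""
    return math.log((word_count_in_class + alpha) /
                    (total_words_in_class + alpha * vocab_size))

# "bitcoin" never seen among 100,000 spam tokens, with a 50,000-word vocabulary:
print(smoothed_log_prob(0, 100_000, 50_000))   # about -11.9: finite and usable
```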
9. System Walkthrough (Dry Run)
Let’s trace the email: “Win Money Now”
Priors:
- P(Spam) = 0.4 → log(0.4) = -0.92
- P(Ham) = 0.6 → log(0.6) = -0.51
Likelihoods (Log-Probs):
- “Win”: P(w|S) = -2.3, P(w|H) = -4.6 (Spammy)
- “Money”: P(w|S) = -2.1, P(w|H) = -5.1 (Spammy)
- “Now”: P(w|S) = -3.0, P(w|H) = -3.0 (Neutral)
Calculation:
- Spam Score: -0.92 + (-2.3) + (-2.1) + (-3.0) = -8.32
- Ham Score: -0.51 + (-4.6) + (-5.1) + (-3.0) = -13.21
Decision:
Since -8.32 > -13.21, we classify as SPAM.
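The same walkthrough as a short script, using the log-probabilities from the tables above:

```python
import math

log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}   # -0.92 and -0.51
log_likelihood = {
    "spam": {"win": -2.3, "money": -2.1, "now": -3.0},
    "ham":  {"win": -4.6, "money": -5.1, "now": -3.0},
}

def classify(tokens):
    """Return the winning class and both cumulative log scores."""
    scores = {
        label: log_prior[label] + sum(log_likelihood[label][t] for t in tokens)
        for label in log_prior
    }
    return max(scores, key=scores.get), scores

print(classify(["win", "money", "now"]))   # ('spam', {'spam': ~-8.32, 'ham': ~-13.21})
```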
10. Interactive Visualizer: The Log-Likelihood Tug of War
Type a message below to see how the model “fights” to classify it.
- Red Bars: Push score towards SPAM (Right).
- Green Bars: Push score towards HAM (Left).
- The Rope: The cumulative score. If it crosses the center line, the verdict flips.
11. Summary
- Naive Bayes is the “Hello World” of NLP, but powerful enough for production baselines.
- Log-Space Arithmetic is mandatory for probabilistic ML to prevent underflow.
- Laplace Smoothing is the standard fix for the “Zero Frequency” bug.