Introduction
Today’s talk is structured into four major parts:
1. How to evaluate how good a classifier is (Section 5.2.5 “Classifier accuracy”).
   We’ll introduce the confusion matrix, accuracy, precision, recall, FPR, and related metrics, and explain why focusing on “accuracy” alone is often dangerous.
2. How to interpret these metrics through a decision threshold, and how to compare classifiers globally using the ROC curve.
   We’ll use Figure 5.10 and Figure 5.11 to build a continuous logical chain: “decision threshold ↔ false-positive / false-negative area ↔ ROC curve.”
3. How to derive a classifier from probabilistic modeling (Section 5.3 “Generative classifiers”).
   We’ll derive posterior probabilities via Bayes’ rule and show how the sigmoid function (Figure 5.12) and the softmax function emerge naturally.
4. Continuous inputs and Gaussian class-conditional distributions (Section 5.3.1 “Continuous inputs”).
   We’ll explain why, under the assumption that each class generates data from a Gaussian distribution, the posterior probability can be written as a sigmoid / softmax applied to a linear function, which in turn yields a linear decision boundary (Figure 5.13), and how, when class covariances are no longer shared, the boundary becomes quadratic (Figure 5.14).
A linear boundary corresponds to Linear Discriminant Analysis (LDA), where all classes share the same covariance matrix. A quadratic boundary corresponds to Quadratic Discriminant Analysis (QDA), where each class is allowed to have its own covariance matrix, making the boundaries more flexible but also more prone to overfitting, especially when data are limited.
Part I: 5.2.5 “Classifier accuracy” — Classifier Accuracy and the Confusion Matrix
1.1 The four possible prediction outcomes: the confusion matrix
In a binary classification task (for example, “diseased vs. healthy”), every prediction made by the model falls into one of four categories:
- True Positive (TP): The true label is positive (e.g. diseased), and the model predicts positive.
- False Positive (FP): The true label is negative (healthy), but the model predicts positive. This is a false alarm, also called a Type I error.
- True Negative (TN): The true label is negative, and the model predicts negative.
- False Negative (FN): The true label is positive, but the model predicts negative. This is a missed detection, also called a Type II error.
In the book, these counts are denoted $N_{TP}, N_{FP}, N_{TN}, N_{FN}$.
The total number of samples in the dataset, $N$, is the sum of all four:
$$
N = N_{TP} + N_{FP} + N_{TN} + N_{FN}.
\tag{5.28}
$$
This 2×2 table of counts is called the confusion matrix. It is the raw material from which essentially all common classification performance metrics are defined.
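To make the bookkeeping concrete, here is a minimal sketch (not from the book; the 0/1 label encoding and names are my own) that tallies the four counts from lists of true and predicted labels:

```python
# Minimal sketch: tally the confusion-matrix counts for a binary problem,
# encoding the positive class as 1 and the negative class as 0.
def confusion_counts(y_true, y_pred):
    n_tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    n_fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return n_tp, n_fp, n_tn, n_fn

# Example: 6 samples; note that N_TP + N_FP + N_TN + N_FN equals N, as in (5.28).
y_true = [1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(confusion_counts(y_true, y_pred))   # (2, 1, 2, 1)
```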
1.2 The definition and limitation of accuracy
The most obvious performance measure is accuracy, i.e. “the fraction of predictions that are correct”:
$$
\text{Accuracy} =
\frac{N_{TP} + N_{TN}}
{N_{TP} + N_{FP} + N_{TN} + N_{FN}}.
\tag{5.29}
$$
Accuracy looks perfectly sensible when (i) the two classes are roughly balanced and (ii) the costs of different kinds of mistakes are symmetric.
But there’s a serious problem: on highly imbalanced datasets, accuracy can be very misleading.
For example, suppose in a population of 1,000 people only 1 person is actually ill (positive class) and 999 are healthy (negative class).
A “bad” classifier could just always predict “healthy” (always negative):
- It would correctly classify all 999 healthy people (TN is huge).
- It would only make one mistake (the single sick person is missed, FN = 1).
- It would produce almost no false positives (FP ≈ 0).
Then the accuracy would be about 999/1000 = 99.9%.
But this system has zero diagnostic value: it fails to identify the only truly ill person.
This kind of scenario is common in medical screening, fraud detection, and security.
Conclusion: relying on accuracy alone is dangerous, especially when the positive class is rare, or when the costs of different errors are very asymmetric (for example, “missing a cancer case” is typically much worse than “unnecessarily flagging a healthy patient for follow-up”).
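As a quick numerical check of the screening example above, a few lines of arithmetic with the 1,000-person counts make the problem explicit:

```python
# The "always predict healthy" classifier on 1 sick person out of 1,000.
n_tp, n_fp, n_tn, n_fn = 0, 0, 999, 1

accuracy = (n_tp + n_tn) / (n_tp + n_fp + n_tn + n_fn)   # (5.29)
print(accuracy)   # 0.999, looks excellent
print(n_tp)       # 0, yet not a single sick person is detected
```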
1.3 Other key metrics: Precision, Recall, FPR, and FDR
We now introduce four metrics that reflect real-world costs and benefits more faithfully.
(1) Precision
$$
\text{Precision} =
\frac{N_{TP}}{N_{TP} + N_{FP}}.
\tag{5.30}
$$
Interpretation: “Among all the cases I predicted as positive, how many are actually positive?”
This is also known as positive predictive value: it answers, “When I tell you ‘you’re positive,’ how trustworthy is that statement?”
(2) Recall / Sensitivity / True Positive Rate (TPR)
$$
\text{Recall} =
\frac{N_{TP}}{N_{TP} + N_{FN}}.
\tag{5.31}
$$
Interpretation: “Out of all truly positive individuals, how many did I successfully catch?”
This is also called the detection rate or sensitivity. In medical diagnosis, recall (a.k.a. TPR) measures how serious your missed diagnoses are: a low recall means you are failing to detect many truly sick patients.
(3) False Positive Rate (FPR)
$$
\text{False positive rate} =
\frac{N_{FP}}{N_{FP} + N_{TN}}.
\tag{5.32}
$$
Interpretation: “Out of all the truly negative individuals, how many did I incorrectly label as positive?”
This is sometimes called the false alarm rate or “$1 -$ specificity.” A low FPR means “don’t frighten healthy people / don’t accuse the innocent.”
(4) False Discovery Rate (FDR)
$$
\text{False discovery rate} =
\frac{N_{FP}}{N_{FP} + N_{TP}}.
\tag{5.33}
$$
Interpretation: “Among all the cases I called positive, what fraction are actually false alarms?”
In many screening and retrieval tasks, FDR measures “how much junk is mixed into my positive detections.”
You can view FDR as the complement of precision: if precision is ‘how clean are my positives,’ FDR is ‘how contaminated are my positives.’
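These four definitions translate directly into code. Below is a minimal helper of my own (not from the book) that computes them from the confusion-matrix counts; zero-denominator edge cases are ignored for brevity.

```python
def metrics(n_tp, n_fp, n_tn, n_fn):
    precision = n_tp / (n_tp + n_fp)   # (5.30): how clean are my positives?
    recall    = n_tp / (n_tp + n_fn)   # (5.31): how many true positives did I catch? (TPR)
    fpr       = n_fp / (n_fp + n_tn)   # (5.32): how many negatives did I falsely alarm?
    fdr       = n_fp / (n_fp + n_tp)   # (5.33): equals 1 - precision
    return precision, recall, fpr, fdr

# Illustrative counts (not from the book).
print(metrics(n_tp=80, n_fp=20, n_tn=880, n_fn=20))
# (0.8, 0.8, 0.0222..., 0.2)
```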
1.4 How the decision threshold affects these metrics: Figure 5.10
To build direct intuition, the book uses a 1D example in Figure 5.10.
Structure of the figure:
The horizontal axis is some continuous measurement $x$, such as the concentration of a biomarker in the blood.
The figure shows two curves:
- $p(x, C_1)$: the joint density of “$x$ and true class $C_1$” in the overall population; in other words, how frequently we see value $x$ coming from class $C_1$.
- $p(x, C_2)$: the joint density of “$x$ and true class $C_2$.”
Visually, you can think of these as two probability “hills,” one for each class.
We draw a vertical line at $ \hat{x} $; this is our current decision threshold:
- Region $R_1 = \{\, x < \hat{x} \,\}$: predict class $C_1$ (e.g. “healthy”).
- Region $R_2 = \{\, x \ge \hat{x} \,\}$: predict class $C_2$ (e.g. “diseased”).
That single threshold $\hat{x}$ is literally how the classifier is making its yes/no decision.
Where errors come from:
In the left region $R_1$, we predict “healthy.” But some truly diseased individuals (true class $C_2$) might still fall there. Those are false negatives (FN) — missed detections. In the figure, these are indicated with specific shaded areas (often red + green shading in the book’s convention).
In the right region $R_2$, we predict “diseased.” But some truly healthy individuals (true class $C_1$) may fall there. Those are false positives (FP) — false alarms. In the figure, these are shown by blue shading.
Area = probability mass = expected count fraction:
The shaded area under a curve corresponds to the probability mass of “that kind of mistake.”
If the population has total size $N$, then (probability mass) × $N$ ≈ (expected count of that error type).
The book labels several subregions of area as $A, B, C, D, E,$ etc., and then writes:
$$
\frac{N_{FP}}{N} = E,
\tag{5.34}
$$
$$
\frac{N_{TP}}{N} = D + E,
\tag{5.35}
$$
$$
\frac{N_{FN}}{N} = B + C,
\tag{5.36}
$$
$$
\frac{N_{TN}}{N} = A + C.
\tag{5.37}
$$
Here’s how to read these:
- $E$: the area in the “predict positive” region that actually comes from the negative class → FP frequency.
- $D + E$: the area in the “predict positive” region that actually comes from the positive class → TP frequency.
- $B + C$: the area in the “predict negative” region that actually comes from the positive class → FN frequency (missed positives).
- $A + C$: the area in the “predict negative” region that actually comes from the negative class → TN frequency.
Core intuition:
Figure 5.10 makes you literally see that:
- The position of the threshold $\hat{x}$ directly determines which parts of the two class distributions are labeled positive vs. negative.
- That, in turn, sets the counts TP / FP / TN / FN.
- Which, in turn, determines all the metrics: Accuracy, Precision, Recall, FPR, FDR, etc.
If we move $\hat{x}$ to the left, the classifier becomes more aggressive about calling “positive”:
- We’ll catch more real positives (Recall↑, TPR↑).
- But we’ll also generate more false alarms among true negatives (FPR↑).
If we move $\hat{x}$ to the right, the classifier becomes conservative:
- FPR goes down (fewer false alarms),
- but Recall/TPR goes down too (more misses, more FN).
In other words, tuning the threshold implements an explicit trade-off between detection rate and false alarm rate.
This is exactly what the ROC curve will visualize.
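To see the trade-off numerically, here is a small sketch with a toy 1D model of my own (two Gaussian classes with illustrative means, spread, and priors; this is not the book’s figure): sliding the threshold $\hat{x}$ to the left raises both TPR and FPR, exactly as argued above.

```python
from scipy.stats import norm

# Toy 1D stand-in for Figure 5.10 (illustrative numbers of my own):
# class C1 ("healthy") and class C2 ("diseased"), each generating x from a Gaussian.
p_c1, p_c2 = 0.7, 0.3              # class priors
mu1, mu2, sigma = 0.0, 2.0, 1.0    # class means and shared spread

def outcome_masses(x_hat):
    """Probability mass of each outcome when we predict C2 ("positive") for x >= x_hat."""
    tp = p_c2 * (1.0 - norm.cdf(x_hat, mu2, sigma))   # N_TP / N, cf. (5.35)
    fp = p_c1 * (1.0 - norm.cdf(x_hat, mu1, sigma))   # N_FP / N, cf. (5.34)
    fn = p_c2 * norm.cdf(x_hat, mu2, sigma)           # N_FN / N, cf. (5.36)
    tn = p_c1 * norm.cdf(x_hat, mu1, sigma)           # N_TN / N, cf. (5.37)
    return tp, fp, tn, fn

for x_hat in [2.5, 1.5, 0.5]:      # sliding the threshold to the left
    tp, fp, tn, fn = outcome_masses(x_hat)
    print(x_hat, "TPR =", round(tp / (tp + fn), 3), "FPR =", round(fp / (fp + tn), 3))
```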
1.5 The F-score (F1 score): balancing Precision and Recall
The book introduces the F-score (often $F_1$) to summarize the balance between Precision and Recall. It is given in equations (5.38)–(5.39):
$$
F =
\frac{2 \times \text{Precision} \times \text{Recall}}
{\text{Precision} + \text{Recall}}.
\tag{5.38}
$$
By expanding Precision and Recall in terms of confusion matrix counts, we can write:
$$
F =
\frac{2 N_{TP}}
{2N_{TP} + N_{FP} + N_{FN}}.
\tag{5.39}
$$
Key property: this is the harmonic mean of Precision and Recall, not their simple arithmetic mean.
If either Precision or Recall is poor, the F-score will drop sharply.
This forces us to care about both “don’t over-alert” (high Precision) and “don’t miss true positives” (high Recall), instead of optimizing just one.
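A quick numeric illustration of the harmonic-mean behavior: with precision 0.9 but recall 0.1, the arithmetic mean would still be a comfortable 0.5, while the F-score collapses to 0.18.

```python
def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)   # (5.38)

print(f_score(0.9, 0.1))   # 0.18: the poor recall drags the score down
print(f_score(0.5, 0.5))   # 0.5: when both are equal, F matches them
```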
Part II (continuing from 5.2.5): The ROC Curve
(Receiver Operating Characteristic, Figure 5.11)
2.1 Axes of the ROC curve
The ROC curve (Figure 5.11) is a global way to evaluate a binary classifier.
- Horizontal axis: False Positive Rate (FPR), from equation (5.32).
- Vertical axis: True Positive Rate (TPR), which is the same as Recall in (5.31).
So, each point on the ROC curve answers:
“At this threshold, what’s my false alarm rate versus how many true positives I’m catching?”
In signal detection theory and medical diagnosis:
- TPR is also called sensitivity,
- FPR is the false alarm rate.
The ROC curve therefore plots “detection ability vs. false alarm cost” over all thresholds.
2.2 How an ROC curve is constructed
Recall the threshold $\hat{x}$ from Figure 5.10.
Now imagine sliding $\hat{x}$ gradually:
- Start extremely strict (almost no one is labeled positive).
- Move to extremely lenient (almost everyone is labeled positive).
For each threshold setting, compute:
- $ \text{TPR} = \frac{N_{TP}}{N_{TP} + N_{FN}} $
- $ \text{FPR} = \frac{N_{FP}}{N_{FP} + N_{TN}} $
Plot TPR vs. FPR.
Connecting those points across all possible thresholds gives the ROC curve.
In clinical screening language, this asks:
“If I lower the bar for calling someone ‘diseased,’ I’ll catch more true disease cases (TPR↑), but I’ll also falsely alarm more healthy people (FPR↑). How does that trade-off evolve as I keep sliding the bar?”
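In code, constructing an ROC curve is just a threshold sweep over the classifier’s scores. Here is a minimal sketch of my own (not the book’s algorithm); it assumes higher scores mean “more likely positive” and that labels are encoded as 1 (positive) and 0 (negative).

```python
def roc_points(scores, labels):
    """Sweep the decision threshold over the observed scores and return (FPR, TPR) pairs."""
    pos = sum(labels)                 # number of truly positive examples
    neg = len(labels) - pos           # number of truly negative examples
    points = [(0.0, 0.0)]             # strictest limit: nothing is called positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))   # (FPR, TPR) at this threshold
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]   # toy classifier scores
labels = [1,   1,   0,   1,   1,    0,   0,   0]     # toy ground-truth labels
print(roc_points(scores, labels))   # ends at (1.0, 1.0): everything called positive
```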
2.3 Special positions on the ROC plot
- The point $(0, 1)$: FPR = 0, TPR = 1. This corresponds to a perfect classifier at some threshold: zero false alarms and zero misses. In practice, this is nearly unattainable.
- The diagonal line TPR = FPR: this represents random guessing. For example, flipping a coin to decide “positive vs. negative” gives TPR ≈ FPR for all thresholds. A classifier whose ROC curve hugs this diagonal is basically no better than random guessing.
- The closer a classifier’s ROC curve is to the upper-left corner, the better it is overall: at the same FPR, it achieves a higher TPR. If one curve consistently lies above another, we say it “dominates” the other.
2.4 AUC (Area Under the Curve)
In practice we often summarize the entire ROC curve using the AUC (Area Under Curve):
- AUC = 1 means essentially perfect discrimination.
- AUC = 0.5 means essentially random guessing.
A common interpretation is:
AUC is the probability that the classifier assigns a higher “score” to a randomly chosen true positive example than to a randomly chosen true negative example.
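That interpretation can be checked directly: the sketch below (my own, with toy scores) estimates AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as one half.

```python
def auc_pairwise(scores, labels):
    """AUC as P(score of a random positive > score of a random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   1,    0,   0,   0]
print(auc_pairwise(scores, labels))   # 0.875 for this toy data
```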
Why is this valuable?
ROC and AUC let us compare classifiers without first committing to any specific threshold.
This is especially important in highly imbalanced or high-stakes domains (rare disease detection, fraud detection, security screening), because different applications may tolerate different trade-offs between “catching positives” and “avoiding false alarms.”
Part III: 5.3 “Generative classifiers” — The Generative Classifier View
Now we shift from evaluation to modeling.
All the metrics above (Accuracy, Precision, Recall, ROC and AUC, etc.) assumed we already had a classifier that can output some score or probability.
But where do those probabilities come from?
Section 5.3 answers this by introducing the idea of a generative classifier.
The idea is very direct:
- For each class $C_k$, we specify a class prior $p(C_k)$.
- For each class $C_k$, we also specify a class-conditional density $p(\mathbf{x} \mid C_k)$, which says, “If a sample truly belongs to class $C_k$, how is the observed feature vector $\mathbf{x}$ generated?”
- Then we apply Bayes’ rule to “invert” this generative story and obtain a posterior:
$$
p(C_k \mid \mathbf{x})
=
\frac{p(\mathbf{x} \mid C_k)\, p(C_k)}
{\sum_j p(\mathbf{x} \mid C_j)\, p(C_j)}.
\tag{cf. 5.45}
$$
This gives us not just a hard label, but an entire posterior probability distribution $\{p(C_k \mid \mathbf{x})\}_k$.
It’s called “generative” because it explicitly models how the data arise in each class.
This contrasts with “discriminative” methods, which directly model $p(C_k \mid \mathbf{x})$ or directly learn a decision boundary, without necessarily modeling how each class generates data.
3.1 The binary case: the emergence of the sigmoid (Equations (5.40)–(5.44), Figure 5.12)
Consider two classes $C_1$ and $C_2$.
Bayes’ rule gives:
$$
p(C_1 \mid \mathbf{x})
=
\frac{p(\mathbf{x} \mid C_1)\, p(C_1)}
{p(\mathbf{x} \mid C_1)\, p(C_1)
+ p(\mathbf{x} \mid C_2)\, p(C_2)}.
\tag{5.40}
$$
Define the log-odds (log-likelihood ratio including priors):
$$
a
= \ln
\frac{p(\mathbf{x} \mid C_1)\, p(C_1)}
{p(\mathbf{x} \mid C_2)\, p(C_2)}.
\tag{5.41}
$$
Then we can rewrite the posterior as the classic logistic sigmoid:
$$
p(C_1 \mid \mathbf{x})
=
\frac{1}{1+\exp(-a)}
=
\sigma(a),
\tag{5.42}
$$
where
$$
\sigma(a) = \frac{1}{1+\exp(-a)}
$$
is the sigmoid function.
Key properties of $\sigma(a)$:
- As $a \to +\infty$, $\sigma(a) \to 1$: overwhelming evidence for $C_1$.
- As $a \to -\infty$, $\sigma(a) \to 0$: overwhelming evidence for $C_2$.
- At $a = 0$, $\sigma(0)=0.5$: the model is exactly undecided between the two classes. (In Figure 5.10, this corresponds to the special threshold $x_0$ where the two class joint densities intersect; the evidence is balanced.)
- $\sigma(a)$ is strictly increasing: stronger evidence for $C_1$ means higher posterior probability for $C_1$.
The book then states two important identities:
Symmetry:
$$
\sigma(-a) = 1 - \sigma(a).
\tag{5.43}
$$
Therefore,
$$
p(C_2 \mid \mathbf{x})
= 1 - p(C_1 \mid \mathbf{x})
= \sigma(-a).
$$
Inverse (the logit):
$$
a
=
\ln
\frac{\sigma}{1-\sigma}
=
\ln
\frac{p(C_1 \mid \mathbf{x})}
{1 - p(C_1 \mid \mathbf{x})}.
\tag{5.44}
$$
Figure 5.12 plots $\sigma(a)$ as a red S-shaped curve:
- The horizontal axis is $a$ (the log-odds).
- The vertical axis is $p(C_1 \mid \mathbf{x}) = \sigma(a)$.
The figure also overlays a blue dashed curve based on a (scaled) Gaussian cumulative distribution function, the squashing function used in probit regression.
The two curves lie very close to each other.
This shows that both logistic regression (with sigmoid) and probit regression (with the Gaussian CDF) are doing essentially the same conceptual job:
they take a real-valued “evidence score” and squash it into the $[0,1]$ interval to interpret it as a probability.
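To get a feel for how close the two curves in Figure 5.12 are, here is a small comparison of my own. It assumes the figure’s blue curve is the Gaussian CDF scaled so that its slope at the origin matches the sigmoid’s (i.e. $\lambda = \sqrt{\pi/8}$); that scaling choice is my assumption, not something stated in this summary.

```python
import math
from scipy.stats import norm

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Scale the Gaussian CDF so its slope at a = 0 matches the sigmoid's slope of 1/4;
# the resulting lambda = sqrt(pi / 8) is my assumption about the figure's scaling.
lam = math.sqrt(math.pi / 8.0)

for a in [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0]:
    print(a, round(sigmoid(a), 3), round(norm.cdf(lam * a), 3))
# The two columns never differ by more than a few hundredths.
```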
3.2 The multiclass case: the softmax function (Equations (5.45)–(5.46))
For $K$ classes $\{C_1, \dots, C_K\}$, the same reasoning yields:
$$
p(C_k \mid \mathbf{x})
=
\frac{p(\mathbf{x} \mid C_k)\, p(C_k)}
{\sum_j p(\mathbf{x} \mid C_j)\, p(C_j)}.
\tag{5.45}
$$
Define
$$
a_k
=
\ln\big(
p(\mathbf{x} \mid C_k)\, p(C_k)
\big),
\tag{5.46}
$$
then we can write
$$
p(C_k \mid \mathbf{x})
=
\frac{\exp(a_k)}
{\sum_j \exp(a_j)}.
$$
This normalized exponential form is the softmax function.
Softmax converts any arbitrary real-valued scores $\{a_k\}$ into a valid categorical probability distribution $\{p(C_k \mid \mathbf{x})\}$ that sums to 1 and has nonnegative entries.
This is exactly why deep neural networks often end with a linear layer followed by a softmax: it provides a probabilistic interpretation over classes.
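For completeness, here is a minimal, numerically stable softmax sketch (the max-subtraction trick is standard practice, not something specific to the book):

```python
import math

def softmax(a):
    """Normalized exponential; subtracting max(a) avoids overflow without changing the result."""
    m = max(a)
    exps = [math.exp(a_k - m) for a_k in a]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax([2.0, 1.0, 0.1]))        # nonnegative entries that sum to 1
print(sum(softmax([1000.0, 999.0])))   # still 1.0: stable even for huge scores
```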
Part IV: 5.3.1 “Continuous inputs” — Continuous Inputs, Gaussian Class-Conditional Models, and Linear vs. Quadratic Decision Boundaries
Now we ground all of the above in a concrete and widely used modeling assumption:
Each class generates feature vectors $\mathbf{x} \in \mathbb{R}^D$ according to a multivariate Gaussian distribution.
4.1 Gaussian class-conditional densities (Equation (5.47))
We assume that for each class $C_k$, the class-conditional density is a $D$-dimensional Gaussian:
$$
p(\mathbf{x} \mid C_k)
=
\frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}}
\exp\left(
-\frac{1}{2}
(\mathbf{x} - \boldsymbol{\mu}_k)^T
\boldsymbol{\Sigma}^{-1}
(\mathbf{x} - \boldsymbol{\mu}_k)
\right),
\tag{5.47}
$$
where $\boldsymbol{\mu}_k$ is the mean vector of class $C_k$, and $\boldsymbol{\Sigma}$ is the covariance matrix.
Crucial assumption:
All classes share the same covariance matrix $\boldsymbol{\Sigma}$.
That is, the “shape” and “orientation” of the Gaussian ellipsoids are the same for every class, and only their centers $\boldsymbol{\mu}_k$ differ.
In classical statistical learning, this is exactly the assumption behind Linear Discriminant Analysis (LDA):
each class is modeled as a Gaussian with its own mean but a common covariance.
Visually (as in Figure 5.13’s left panel), each class looks like an elliptical “Gaussian hill” of the same shape/orientation, just shifted to a different center.
4.2 The two-class case: the posterior is a sigmoid of a linear function (Equations (5.48)–(5.50))
Let’s focus on two classes, $C_1$ and $C_2$.
We already know from (5.40)–(5.42) that
$$
p(C_1 \mid \mathbf{x}) = \sigma(a(\mathbf{x})),
$$
where $a(\mathbf{x})$ is the log-odds in (5.41).
If we substitute the Gaussian form (5.47) into (5.41), and crucially use the fact that both classes share the same $\boldsymbol{\Sigma}$, something beautiful happens:
- All the quadratic terms in $\mathbf{x}$ cancel out (because they involve the same $\boldsymbol{\Sigma}^{-1}$).
- What remains is a linear function of $\mathbf{x}$.
Concretely, the book shows:
$$
p(C_1 \mid \mathbf{x})
=
\sigma(\mathbf{w}^T \mathbf{x} + w_0),
\tag{5.48}
$$
with
$$
\mathbf{w}
=
\boldsymbol{\Sigma}^{-1}
(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2),
\tag{5.49}
$$
and
$$
w_0
=
-\frac{1}{2}
\boldsymbol{\mu}_1^T
\boldsymbol{\Sigma}^{-1}
\boldsymbol{\mu}_1
+
\frac{1}{2}
\boldsymbol{\mu}_2^T
\boldsymbol{\Sigma}^{-1}
\boldsymbol{\mu}_2
+
\ln \frac{p(C_1)}{p(C_2)}.
\tag{5.50}
$$
These formulas reveal several deep facts:
(1) The posterior is sigmoid(linear score).
Under the Gaussian + shared-covariance assumption, the posterior probability has exactly the same form as logistic regression: a linear score $\mathbf{w}^T \mathbf{x} + w_0$ pushed through a sigmoid. In other words, the functional form that logistic regression assumes is here derived from a generative story.
(2) The decision boundary is a linear hyperplane.
A common prediction rule is “predict $C_1$ if $p(C_1 \mid \mathbf{x}) > 0.5$.”
Because $\sigma(z) > 0.5 \iff z > 0$, this is equivalent to:
$$
\mathbf{w}^T \mathbf{x} + w_0 > 0.
$$
The set of points satisfying $\mathbf{w}^T \mathbf{x} + w_0 = 0$ is a straight line in 2D, a plane in 3D, and more generally a hyperplane in $D$ dimensions. So the decision boundary is linear. Geometrically, $\mathbf{w}$ is normal (perpendicular) to that boundary and points toward the region more strongly associated with class $C_1$.
(3) The prior affects the bias term $w_0$.
The term $\ln \frac{p(C_1)}{p(C_2)}$ inside $w_0$ shifts the boundary. Intuitively, if class $C_1$ is believed to be more common or more important, the boundary moves to favor predicting $C_1$ more often. This ties directly back to the threshold trade-offs we discussed with ROC curves: changing the effective bias changes the balance between catching positives and avoiding false alarms.
Figure 5.13 (right panel) visualizes this for two classes in 2D:
The surface plotted is $p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0)$.
- Regions shaded strongly red correspond to posterior near 1 (the model is highly confident it’s class $C_1$).
- Regions shaded strongly blue correspond to posterior near 0 (high confidence in class $C_2$).
- The middle region (a purplish/gray transition band) corresponds to posterior $\approx 0.5$, i.e. uncertainty.
This transition band projects down onto the $(x_1, x_2)$ plane as a straight line: $\mathbf{w}^T \mathbf{x} + w_0 = 0$.
This is precisely the linear decision boundary.
In summary, with two Gaussian classes sharing the same covariance matrix, you get:
- a linear discriminant function,
- a sigmoid posterior,
- and a straight-line boundary in the input space.
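The mapping from $(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}, p(C_1))$ to $(\mathbf{w}, w_0)$ is short enough to write out directly. The sketch below implements equations (5.48)–(5.50) with NumPy; the parameter values are illustrative, not from the book.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lda_two_class(mu1, mu2, Sigma, p1):
    """Return (w, w0) of (5.49)-(5.50) for two Gaussian classes with shared covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)                    # (5.49)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(p1 / (1.0 - p1)))               # (5.50)
    return w, w0

# Illustrative parameters (not from the book).
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, w0 = lda_two_class(mu1, mu2, Sigma, p1=0.5)

x = np.array([0.5, 0.0])
print(sigmoid(w @ x + w0))   # p(C1 | x) via (5.48); the boundary is w^T x + w0 = 0
```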
4.3 The multiclass case: a family of linear discriminant functions (Equations (5.51)–(5.53))
Now suppose we have $K$ classes, still under the assumption that all classes share the same covariance matrix $\boldsymbol{\Sigma}$.
In this case, each class $C_k$ gets its own linear discriminant function:
$$
a_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}, \tag{5.51}
$$
with
$$
\mathbf{w}_k
=
\boldsymbol{\Sigma}^{-1}
\boldsymbol{\mu}_k,
\tag{5.52}
$$
and
$$
w_{k0}
= -\frac{1}{2}
\boldsymbol{\mu}_k^T
\boldsymbol{\Sigma}^{-1}
\boldsymbol{\mu}_k + \ln p(C_k).
\tag{5.53}
$$
We then feed these linear scores $\{a_k(\mathbf{x})\}$ into the softmax from (5.45)–(5.46):
$$
p(C_k \mid \mathbf{x})
=
\frac{\exp(a_k(\mathbf{x}))}
{\sum_j \exp(a_j(\mathbf{x}))}.
\quad\text{(same structure as (5.45)–(5.46))}
$$
The geometric meaning is elegant:
- Each class $C_k$ assigns a linear score $a_k(\mathbf{x})$.
- We can predict the class with the largest score, or use the softmax to interpret the scores as posterior probabilities.
- The decision boundary between any two classes $C_i$ and $C_j$ is given by
$$
a_i(\mathbf{x}) = a_j(\mathbf{x}),
$$
which expands to
$$
(\mathbf{w}_i - \mathbf{w}_j)^T \mathbf{x} + (w_{i0} - w_{j0}) = 0.
$$
This is still a linear hyperplane.
Therefore, in the multiclass setting with a shared covariance matrix, all pairwise decision boundaries are linear.
So the input space is split into regions by a collection of flat hyperplanes.
This is exactly the multiclass generalization of LDA’s geometric picture.
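The multiclass version is just as direct. The sketch below (again with illustrative numbers of my own) builds each $a_k(\mathbf{x})$ from (5.51)–(5.53) and pushes the scores through the softmax; predicting the class with the largest $a_k$ gives the same label as the arg-max of the posterior.

```python
import numpy as np

def lda_scores(x, mus, Sigma, priors):
    """Linear discriminants a_k(x) from (5.51)-(5.53), shared covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = []
    for mu_k, p_k in zip(mus, priors):
        w_k = Sigma_inv @ mu_k                                   # (5.52)
        w_k0 = -0.5 * mu_k @ Sigma_inv @ mu_k + np.log(p_k)      # (5.53)
        scores.append(w_k @ x + w_k0)                            # (5.51)
    return np.array(scores)

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

# Illustrative 2D example with three classes (not from the book).
mus = [np.array([0.0, 2.0]), np.array([2.0, 0.0]), np.array([-2.0, -1.0])]
Sigma = np.eye(2)
priors = [1/3, 1/3, 1/3]

a = lda_scores(np.array([1.0, 1.0]), mus, Sigma, priors)
print(softmax(a), np.argmax(a))   # posterior over the three classes, predicted class
```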
4.4 Figure 5.14: When covariance matrices are not shared, the boundary bends
Figure 5.14 pushes us one step further:
What happens if we drop the assumption of a shared covariance matrix?
Figure 5.14, left panel:
- The axes are $x_1$ and $x_2$.
- We see three classes, shown in red, green, and blue.
- Each class is drawn using a set of elliptical contour lines, illustrating its class-conditional density $p(\mathbf{x} \mid C_k)$.
- The red and blue classes have contour ellipses with similar shape and orientation. This represents the case where they share the same covariance matrix.
- The green class, however, has ellipses that are stretched/tilted differently, indicating a different covariance matrix.
Mathematically, this means we are now allowing
$$
p(\mathbf{x} \mid C_k)
=
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
$$
with class-specific covariance matrices $\boldsymbol{\Sigma}_k$, instead of a single shared $\boldsymbol{\Sigma}$.
This model is known as Quadratic Discriminant Analysis (QDA).
QDA keeps the Gaussian assumption but allows each class to have its own covariance.
That extra flexibility captures more complex class shapes, but it also increases the number of parameters and can overfit if data are limited.
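For comparison, when each class keeps its own $\boldsymbol{\Sigma}_k$, the quadratic term no longer cancels. Below is a minimal QDA-style sketch using the standard log of $p(\mathbf{x} \mid C_k)\,p(C_k)$ up to an additive constant; this particular formulation is mine, not an equation quoted from this section.

```python
import numpy as np

def qda_score(x, mu_k, Sigma_k, p_k):
    """log p(x | C_k) + log p(C_k), dropping the constant -D/2 * log(2*pi).
    The quadratic term in x no longer cancels between classes with different Sigma_k."""
    Sigma_inv = np.linalg.inv(Sigma_k)
    diff = x - mu_k
    return (-0.5 * np.log(np.linalg.det(Sigma_k))
            - 0.5 * diff @ Sigma_inv @ diff
            + np.log(p_k))

# Two classes with different covariances (illustrative numbers).
x = np.array([0.0, 1.0])
s1 = qda_score(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
s2 = qda_score(x, np.array([1.0, 1.0]), np.array([[2.0, 0.0], [0.0, 0.5]]), 0.5)
print("class 1" if s1 > s2 else "class 2")   # class 2 wins at this x
```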
Figure 5.14, right panel:
The plane is color-coded by the posterior probabilities of the three classes:
- The intensity of red at a point $\mathbf{x}$ encodes $p(C_{\text{red}} \mid \mathbf{x})$.
- The intensity of green encodes $p(C_{\text{green}} \mid \mathbf{x})$.
- The intensity of blue encodes $p(C_{\text{blue}} \mid \mathbf{x})$.
So each location in $(x_1,x_2)$ space is painted with a blend of red/green/blue proportional to the posterior probabilities of the three classes at that point.
Overlaid in white are the decision boundaries, i.e. where two classes have equal posterior (or equal discriminant score).
- The boundary between the red and blue classes is drawn as a straight line. This is because those two classes share the same covariance matrix, so their log-odds difference cancels the quadratic terms and stays linear — just like in LDA.
- The boundaries between red vs. green or blue vs. green, however, are visibly curved. Because green has a different covariance matrix, the quadratic terms in $\mathbf{x}$ no longer cancel. The result is a quadratic (curved) decision boundary.
The three regions meet in a sort of Y-shaped junction, showing how the space is partitioned into zones where each class dominates in posterior probability.
This produces three crucial visual lessons:
- If two classes share the same covariance matrix, their mutual boundary is still a linear hyperplane.
- If two classes have different covariance matrices, their boundary becomes a curved quadratic contour.
- In a multiclass setting, the input plane is split into multiple colored “territories,” and the borders between these territories can be straight lines or smooth curves depending on whether the classes involved share covariance structure.
Comparing Figure 5.13 and Figure 5.14:
- In Figure 5.13, with two classes sharing a single covariance, the posterior surface $p(C_1 \mid \mathbf{x})$ transitions smoothly from “mostly blue” to “mostly red,” and that transition projects down as a straight line boundary.
- In Figure 5.14, with classes having different covariances, the posterior coloring is divided by curved boundaries.
This is the essential geometric difference between Linear Discriminant Analysis (LDA) — shared covariance, linear boundaries — and Quadratic Discriminant Analysis (QDA) — class-specific covariance, curved (quadratic) boundaries.
Conclusion: A Unified Chain from Evaluation to Modeling
We can now assemble Sections 5.2.5 → 5.3.1 into a single coherent story:
Start with the confusion matrix (Section 5.2.5).
- The four counts $N_{TP}, N_{FP}, N_{TN}, N_{FN}$ define the confusion matrix (5.28).
- Accuracy (5.29) can be misleading, especially under severe class imbalance.
- We therefore introduce Precision (5.30), Recall/TPR (5.31), FPR (5.32), FDR (5.33), and the F-score (5.38)–(5.39), which explicitly encode the costs of false positives vs. false negatives.
Understand errors via a threshold (Figure 5.10).
- Figure 5.10 draws the joint densities $p(x, C_1)$ and $p(x, C_2)$ as two “probability hills,” and slices them with a threshold $\hat{x}$.
- The shaded areas (blue/green/red regions) correspond to FP and FN probability mass.
- Equations (5.34)–(5.37) map these areas to the expected fractions $\frac{N_{TP}}{N}$, $\frac{N_{FP}}{N}$, $\frac{N_{TN}}{N}$, $\frac{N_{FN}}{N}$.
- Moving $\hat{x}$ shifts those areas and thus changes Precision / Recall / FPR, etc.
Move to the ROC curve (Figure 5.11).
- By sweeping the threshold across all possible values and computing (FPR, TPR) for each, we get the ROC curve.
- The ROC curve visualizes the trade-off between “catching positives” and “avoiding false alarms.”
- The Area Under the Curve (AUC) summarizes this trade-off in a single number; the closer the ROC is to the upper-left corner, the better.
Ask: Where do these probabilities come from? (Section 5.3).
- Instead of treating the classifier as a black box, we model how each class generates data using $p(\mathbf{x} \mid C_k)$ and a prior $p(C_k)$.
- Bayes’ rule then yields the posterior $p(C_k \mid \mathbf{x})$.
- In the binary case, this becomes a sigmoid of the log-odds (5.40)–(5.44), as shown in Figure 5.12; in the multiclass case, it becomes a softmax (5.45)–(5.46).
Finally, tie this to geometry in feature space (Section 5.3.1).
- Assume each class-conditional is Gaussian with a shared covariance matrix (5.47).
- Then the posterior probabilities become a sigmoid / softmax of a linear function of $\mathbf{x}$ (5.48)–(5.53), and the decision boundaries are linear hyperplanes. This is exactly the setting of Linear Discriminant Analysis (LDA), visualized in Figure 5.13.
- If we allow each class to have its own covariance matrix, we get Quadratic Discriminant Analysis (QDA): boundaries between classes with different covariances become smooth curves (quadratic surfaces), as shown in Figure 5.14.
- Thus, the assumption about covariance — shared vs. class-specific — directly controls whether your decision surfaces are linear or quadratic.
In short, from 5.2.5 to 5.3.1 we have traveled the full arc:
- Evaluation layer: confusion matrix, Precision/Recall, FPR, F-score, ROC, AUC — which tell us how a classifier behaves under different thresholds and different application costs.
- Probabilistic layer: posterior probabilities $p(C_k \mid \mathbf{x})$ obtained by combining class-conditional models with class priors via Bayes’ rule, producing sigmoid and softmax forms.
- Geometric layer: under Gaussian assumptions, those posteriors correspond to either linear or quadratic decision boundaries in input space.
Shared covariance ⇒ linear (LDA, Figure 5.13).
Class-specific covariance ⇒ curved quadratic surfaces (QDA, Figure 5.14).
This chain connects “how we measure mistakes,” “how we interpret probabilities,” and “what the classifier looks like in the geometry of feature space.”
It also provides the conceptual bridge between classical statistical pattern recognition and modern deep learning classifiers, which often end with a linear layer plus a sigmoid or softmax.