Picture this. We can understand a classifier (or model) as a machine that sorts apples from oranges. We're usually interested in the more general case of sorting 'all fruits', but even a very general predictive model can either succeed or fail, and in that sense every model is, at some point, sorting apples from oranges. Say we like apples and hate oranges. We're given a black box whose contents are unknown. The person who gave it to us loves us and doesn't want us to suffer, so we assume it's an apple. The model - the idea that if they love us they will not give us oranges - is a sentence which can be true or false. Measuring the extent to which it is true or false is, in essence, the idea behind the Receiver Operating Characteristic (ROC) curve and its sister, the Area Under the Curve (AUC). More concretely: pick one truly positive case (a 'suspected' apple that turns out to be an actual apple) and one truly negative case at random; ask whether your model ranks the positive higher. The probability of that event is the area under the ROC curve (AUC). This post motivates the statement, derives it in a few lines, and provides Python to generate figures that make the idea tangible.
1. Motivation: a gate through fog
Scores live on a line. Negatives follow a distribution with CDF \(F_0\), positives \(F_1\). Move a threshold \(t\) from right to left; call everything to the right “positive.” At each location, the false‑positive and true‑positive rates are
\[ \mathrm{FPR}(t)=\Pr(S>t\mid 0)=1-F_0(t),\qquad \mathrm{TPR}(t)=\Pr(S>t\mid 1)=1-F_1(t). \]
Plotting \(\mathrm{TPR}\) against \(\mathrm{FPR}\) as \(t\) sweeps gives the ROC curve. Its area is AUC.
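To make the two definitions concrete, here is a minimal sketch, assuming the same binormal setup as the simulated figures below (negatives \(\sim N(0,1)\), positives \(\sim N(1.25,1)\)) and that scipy is available:
# FPR(t) and TPR(t) from the class-conditional CDFs (binormal example; assumes scipy)
from scipy.stats import norm
t = 0.8                                      # an arbitrary threshold
fpr = 1 - norm.cdf(t, loc=0.0, scale=1.0)    # Pr(S > t | negative) = 1 - F0(t)
tpr = 1 - norm.cdf(t, loc=1.25, scale=1.0)   # Pr(S > t | positive) = 1 - F1(t)
print(f"t={t}: FPR={fpr:.3f}, TPR={tpr:.3f}")
Sweeping t from right to left traces out exactly the (FPR, TPR) pairs that make up the curve.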
Figure 1 — code: simulate two overlapping classes and show score distributions
# Figure 1: Score distributions (negatives vs positives)
# Save as: assets/img/auc/distributions.png
import os, numpy as np, matplotlib.pyplot as plt
os.makedirs("assets/img/auc", exist_ok=True)
rng = np.random.default_rng(42)
neg = rng.normal(0.0, 1.0, 4000)
pos = rng.normal(1.25, 1.0, 4000)
bins = np.linspace(min(neg.min(), pos.min()) - 0.5,
                   max(neg.max(), pos.max()) + 0.5, 60)
plt.figure(figsize=(7, 4.5))
plt.hist(neg, bins=bins, alpha=0.6, density=True, label="Negative")
plt.hist(pos, bins=bins, alpha=0.6, density=True, label="Positive")
plt.xlabel("Score S"); plt.ylabel("Density")
plt.title("Score distributions")
plt.legend(); plt.tight_layout()
plt.savefig("assets/img/auc/distributions.png", dpi=150); plt.close()
2. The crux in one sentence
AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. Formally, with independent scores \(S_1\sim f_1\) (positive) and \(S_0\sim f_0\) (negative):
\[ \boxed{\ \mathrm{AUC}=\Pr(S_1>S_0)\ } \quad \text{(with ties counted as }+\tfrac{1}{2}\Pr(S_1=S_0)\text{).} \]
This is the Wilcoxon–Mann–Whitney view of AUC and explains its invariances: AUC measures ranking, not calibration.
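When scores can tie (integer or heavily quantized outputs), the half-win convention matters. A tiny illustration on made-up discrete scores (the arrays below are hypothetical, chosen only to force ties):
# AUC with ties counted as half-wins (hypothetical discrete scores)
import numpy as np
pos = np.array([3, 3, 2, 1])   # scores for positives (made up)
neg = np.array([1, 2, 2, 0])   # scores for negatives (made up)
wins = (pos[:, None] > neg[None, :]).mean()    # P(S1 > S0)
ties = (pos[:, None] == neg[None, :]).mean()   # P(S1 = S0)
auc = wins + 0.5 * ties
print(f"P(S1>S0)={wins:.3f}, P(S1=S0)={ties:.3f}, AUC={auc:.3f}")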
3. The derivation, start to finish
Begin with the geometric definition of area under the ROC:
\[ \mathrm{AUC}=\int_{0}^{1} y(x)\,dx,\quad x(t)=1-F_0(t),\ y(t)=1-F_1(t). \]
Parameterize by the threshold: as \(t\) runs from \(+\infty\) down to \(-\infty\), \(x\) runs from \(0\) to \(1\), and \(dx=-f_0(t)\,dt\); swapping the limits of integration cancels the minus sign, so
\[ \mathrm{AUC}=\int_{-\infty}^{+\infty}\!\big[1-F_1(t)\big]\,f_0(t)\,dt =\mathbb{E}_{S_0\sim f_0}\big[1-F_1(S_0)\big] =\Pr(S_1>S_0). \]
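The identity is easy to verify numerically. For the binormal example used in the figures, \(S_1-S_0\) is itself normal, so \(\Pr(S_1>S_0)=\Phi\big((\mu_1-\mu_0)/\sqrt{\sigma_0^2+\sigma_1^2}\big)\) gives a closed-form reference value. A small check of my own (it assumes scipy for the normal CDF):
# Numerical check: E[1 - F1(S0)] = P(S1 > S0), with the binormal closed form (assumes scipy)
import numpy as np
from scipy.stats import norm
mu0, mu1, sigma = 0.0, 1.25, 1.0
rng = np.random.default_rng(3)
s0 = rng.normal(mu0, sigma, 200_000)   # negative scores
s1 = rng.normal(mu1, sigma, 200_000)   # positive scores
lhs = np.mean(1.0 - norm.cdf(s0, loc=mu1, scale=sigma))         # E[1 - F1(S0)]
rhs = np.mean(s1 > s0)                                          # P(S1 > S0)
ref = norm.cdf((mu1 - mu0) / np.sqrt(sigma**2 + sigma**2))      # Phi(dmu / sqrt(sigma0^2 + sigma1^2))
print(f"E[1-F1(S0)]={lhs:.4f}  P(S1>S0)={rhs:.4f}  closed form={ref:.4f}")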
Figure 2 — code: visualize AUC as pairwise wins (positives vs negatives)
# Figure 2: Pairwise wins matrix; its mean ≈ AUC
# Save as: assets/img/auc/pairwise_matrix.png
import os, numpy as np, matplotlib.pyplot as plt
os.makedirs("assets/img/auc", exist_ok=True)
rng = np.random.default_rng(0)
neg = np.sort(rng.normal(0.0, 1.0, 250))
pos = np.sort(rng.normal(1.25, 1.0, 250))
M = (pos[:, None] > neg[None, :]).astype(float)
plt.figure(figsize=(6, 6))
plt.imshow(M, aspect="auto", origin="lower", interpolation="nearest")
plt.colorbar(label="1 if S_pos > S_neg else 0")
plt.xlabel("Negative samples (sorted)"); plt.ylabel("Positive samples (sorted)")
plt.title(f"Pairwise wins; mean = {M.mean():.3f} ≈ AUC")
plt.tight_layout(); plt.savefig("assets/img/auc/pairwise_matrix.png", dpi=150); plt.close()
4. Reading the ROC like a map
Differentiating the parametric form gives \(dx/dt=-f_0(t)\) and \(dy/dt=-f_1(t)\), so the instantaneous slope of the ROC is the likelihood ratio at threshold \(t\):
\[ \frac{dy}{dx}=\frac{f_1(t)}{f_0(t)}. \]
Where the curve is steep, tiny relaxations of the threshold buy many true positives per false positive. This is the Neyman–Pearson lens, in geometry.
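A quick closed-form cross-check, assuming the same unit-variance binormal setup as the figures and scipy for the normal pdf/cdf: a finite-difference slope of the analytic ROC should match \(f_1(t)/f_0(t)\) at the same threshold.
# Slope of the analytic ROC vs the likelihood ratio f1(t)/f0(t) (binormal; assumes scipy)
from scipy.stats import norm
mu1 = 1.25
t, h = 0.8, 1e-4
fpr = lambda u: 1.0 - norm.cdf(u)            # x(t) = 1 - F0(t), negatives ~ N(0,1)
tpr = lambda u: 1.0 - norm.cdf(u - mu1)      # y(t) = 1 - F1(t), positives ~ N(mu1,1)
slope = (tpr(t + h) - tpr(t - h)) / (fpr(t + h) - fpr(t - h))   # finite-difference dy/dx
lr = norm.pdf(t, loc=mu1) / norm.pdf(t, loc=0.0)                # f1(t) / f0(t)
print(f"slope ≈ {slope:.4f}   likelihood ratio = {lr:.4f}")
The empirical version of the same tangent appears in Figure 3 below.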
Figure 3 — code: empirical ROC and a numerical tangent (slope ≈ likelihood ratio)
# Figure 3: Empirical ROC with a local tangent
# Save as: assets/img/auc/roc_curve.png
import os, numpy as np, matplotlib.pyplot as plt
os.makedirs("assets/img/auc", exist_ok=True)
rng = np.random.default_rng(1)
neg = rng.normal(0.0, 1.0, 4000); pos = rng.normal(1.25, 1.0, 4000)
scores = np.concatenate([neg, pos])
ytrue = np.concatenate([np.zeros_like(neg, dtype=int), np.ones_like(pos, dtype=int)])
# Minimal ROC from scratch
order = np.argsort(-scores, kind="mergesort")
y = ytrue[order]
P = y.sum(); N = len(y) - P
tps = np.cumsum(y); fps = np.cumsum(1 - y)
tpr = np.concatenate(([0.0], tps / P, [1.0]))
fpr = np.concatenate(([0.0], fps / N, [1.0]))
# Pick a point near FPR≈0.2 and estimate slope numerically
idx = np.argmin(np.abs(fpr - 0.2))
x0, y0 = fpr[idx], tpr[idx]
i1 = max(1, idx - 3); i2 = min(len(fpr) - 2, idx + 3)
slope = (tpr[i2] - tpr[i1]) / (fpr[i2] - fpr[i1])
# Plot
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, lw=2, label="ROC")
plt.plot([0, 1], [0, 1], "--", lw=1, label="Chance")
x = np.array([x0 - 0.15, x0 + 0.15]); yline = y0 + slope * (x - x0)
plt.plot(x, yline, lw=1, label=f"Tangent slope ≈ {slope:.2f}")
plt.scatter([x0], [y0], s=30)
plt.xlim(0, 1); plt.ylim(0, 1)
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.title("ROC and local slope (likelihood ratio)")
plt.legend(); plt.tight_layout()
plt.savefig("assets/img/auc/roc_curve.png", dpi=150); plt.close()
5. Three equivalent pictures of AUC
- Integral under tradeoff: \(\displaystyle \mathrm{AUC}=\int_0^1 y(x)\,dx\).
- Pairwise probability: \(\mathrm{AUC}=\Pr(S_1>S_0)\) (with ties as half‑wins).
- Difference distribution: \(D=S_1-S_0\); then \(\mathrm{AUC}=\Pr(D>0)\).
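To see the three pictures agree numerically, here is a consolidation sketch of my own (same simulated binormal scores as the figures; the third estimate is Monte Carlo, so it agrees only approximately):
# The same AUC three ways: ROC integral, pairwise wins, P(D > 0)
import numpy as np
rng = np.random.default_rng(11)
neg = rng.normal(0.0, 1.0, 2000)
pos = rng.normal(1.25, 1.0, 2000)
# (1) area under the empirical ROC
scores = np.concatenate([neg, pos])
labels = np.concatenate([np.zeros(len(neg)), np.ones(len(pos))])
order = np.argsort(-scores, kind="mergesort")
y = labels[order]
tpr = np.concatenate(([0.0], np.cumsum(y) / len(pos)))
fpr = np.concatenate(([0.0], np.cumsum(1 - y) / len(neg)))
auc_integral = np.trapz(tpr, fpr)
# (2) exhaustive pairwise comparison
auc_pairs = (pos[:, None] > neg[None, :]).mean()
# (3) Monte Carlo estimate of P(D > 0) from random cross-pairs
d = rng.choice(pos, 100_000) - rng.choice(neg, 100_000)
auc_diff = (d > 0).mean()
print(f"integral={auc_integral:.4f}  pairwise={auc_pairs:.4f}  P(D>0)={auc_diff:.4f}")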
Figure 4 — code: difference distribution; area right of 0 equals AUC
# Figure 4: Distribution of differences D = S_pos - S_neg
# Save as: assets/img/auc/difference_distribution.png
import os, numpy as np, matplotlib.pyplot as plt
os.makedirs("assets/img/auc", exist_ok=True)
rng = np.random.default_rng(7)
neg = rng.normal(0.0, 1.0, 10000)
pos = rng.normal(1.25, 1.0, 10000)
d = rng.choice(pos, 20000) - rng.choice(neg, 20000)
plt.figure(figsize=(7, 4.5))
plt.hist(d, bins=80, density=True)
plt.axvline(0.0, linestyle="--")
plt.xlabel("D = S_pos - S_neg"); plt.ylabel("Density")
plt.title(f"P(D>0) ≈ {np.mean(d>0):.3f} ≈ AUC")
plt.tight_layout(); plt.savefig("assets/img/auc/difference_distribution.png", dpi=150); plt.close()
6. Practicalities in one place
- Empirical AUC is pair counting. With \(m\) positives and \(n\) negatives, average \(\mathbf{1}\{s_{1,i}>s_{0,j}\}\) over all \(mn\) cross‑pairs; give ties half‑credit. This is a \(U\)‑statistic; DeLong’s method provides a widely used nonparametric variance and a test for comparing curves (see references). A rank‑based way to compute the same count appears in the sketch after this list.
- Prevalence‑free and rank‑only. Because AUC conditions on class and depends only on order, it’s stable against prior shifts and monotone re‑scalings of scores.
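A minimal sketch of that pair count via pooled ranks, using the Mann–Whitney \(U\) identity (it assumes scipy.stats.rankdata, whose mid-ranks give ties exactly the half-credit convention above):
# Empirical AUC via the Mann-Whitney U statistic (mid-ranks give ties half-credit; assumes scipy)
import numpy as np
from scipy.stats import rankdata

def auc_by_ranks(y_true, y_score):
    y_true = np.asarray(y_true).astype(bool)
    ranks = rankdata(y_score)              # mid-ranks over the pooled sample
    m = y_true.sum()                       # number of positives
    n = len(y_true) - m                    # number of negatives
    u = ranks[y_true].sum() - m * (m + 1) / 2.0
    return u / (m * n)

# Tiny check against brute-force pair counting
rng = np.random.default_rng(5)
pos = rng.normal(1.25, 1.0, 300); neg = rng.normal(0.0, 1.0, 400)
y = np.r_[np.ones(300), np.zeros(400)]; s = np.r_[pos, neg]
brute = (pos[:, None] > neg[None, :]).mean()
print(f"rank-based={auc_by_ranks(y, s):.4f}  brute-force={brute:.4f}")
The rank form runs in \(O((m+n)\log(m+n))\) rather than \(O(mn)\), which matters for large samples.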
Minimal ROC/AUC — code
# Minimal ROC/AUC from scores and labels
# Usage: fpr, tpr, auc = roc_auc(y_true, y_score)
import numpy as np
def roc_auc(y_true, y_score):
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score).astype(float)
    order = np.argsort(-y_score, kind="mergesort")
    y = y_true[order]
    P = y.sum(); N = len(y) - P
    tps = np.cumsum(y); fps = np.cumsum(1 - y)
    tpr = np.concatenate(([0.0], tps / P, [1.0]))
    fpr = np.concatenate(([0.0], fps / N, [1.0]))
    auc = np.trapz(tpr, fpr)
    return fpr, tpr, auc
If you want a library call, see scikit‑learn’s roc_curve and roc_auc_score.
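For reference, the corresponding library calls would look roughly like this (a usage sketch assuming scikit-learn is installed; the synthetic data just mirrors the figures):
# Same quantities via scikit-learn (assumes scikit-learn is installed)
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
rng = np.random.default_rng(2)
y_true = np.r_[np.zeros(500, dtype=int), np.ones(500, dtype=int)]
y_score = np.r_[rng.normal(0.0, 1.0, 500), rng.normal(1.25, 1.0, 500)]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"sklearn AUC = {auc:.4f}")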
7. Notes & references
- ROC overview and AUC‑as‑probability (includes derivation): site:wikipedia.org
- Wilcoxon–Mann–Whitney connection to AUC: site:wikipedia.org
- The ROC’s tangent slope equals the likelihood ratio at the corresponding threshold: Choi (1998)
- Precision–Recall vs ROC under class imbalance should be interpreted with care: Saito & Rehmsmeier (2015)
- Variance/CI for AUC and comparing ROCs: DeLong et al. (1988)
- The rescaling \(2\,\mathrm{AUC}-1\) (the Gini coefficient) is widely used in credit scoring: site:wikipedia.org