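The mathematical derivation (Gonzalez & Woods): Given an image with L gray levels and a candidate threshold T, define: w₁(T) = probability of the background class (pixels ≤ T), w₂(T) = 1 − w₁(T) = probability of the foreground class (pixels > T), m₁(T) = mean of the background class, m₂(T) = mean of the foreground class, m_G = w₁(T) m₁(T) + w₂(T) m₂(T) = global mean. The between-class variance is σ²_B(T) = w₁(T) [m₁(T) − m_G]² + w₂(T) [m₂(T) − m_G]² = w₁(T) × w₂(T) × [m₁(T) − m₂(T)]². Otsu's threshold is the T that maximizes σ²_B(T). Because the total variance σ²_G = σ²_W(T) + σ²_B(T) does not depend on T, maximizing the between-class variance is equivalent to minimizing the within-class variance σ²_W(T) = w₁(T) σ₁²(T) + w₂(T) σ₂²(T). This is not a general minimum-error guarantee: the Otsu threshold approximates the minimum-classification-error threshold only when the two classes are roughly Gaussian with similar variances and similar sizes. Once the histogram has been built in one pass over the pixels, the search over T is O(L), linear in the number of gray levels, when cumulative sums are used, which makes the method extremely fast.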
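The definitions above translate fairly directly into code. The following is a minimal sketch rather than a reference implementation: the function name otsu_threshold and the levels parameter are illustrative choices, and the image is assumed to hold integer gray levels in [0, levels − 1]. It uses the algebraically equivalent form σ²_B(T) = [m_G w₁(T) − w₁(T) m₁(T)]² / [w₁(T) w₂(T)] so that all thresholds can be scored from cumulative sums.

```python
import numpy as np

def otsu_threshold(image, levels=256):
    """Return (T, sigma_b) where T maximizes the between-class variance.

    Assumption: `image` contains integer gray levels in [0, levels - 1].
    """
    hist = np.bincount(np.asarray(image, dtype=np.int64).ravel(), minlength=levels)
    counts = np.cumsum(hist)                       # number of pixels <= T
    valid = (counts > 0) & (counts < hist.sum())   # both classes non-empty

    p = hist / hist.sum()                          # normalized histogram
    w1 = np.cumsum(p)                              # w1(T) = P(pixel <= T)
    w2 = 1.0 - w1                                  # w2(T) = P(pixel > T)
    cum_mean = np.cumsum(np.arange(levels) * p)    # equals w1(T) * m1(T)
    m_g = cum_mean[-1]                             # global mean

    sigma_b = np.zeros(levels)
    # Same quantity as w1 * w2 * (m1 - m2)^2, written with cumulative sums.
    sigma_b[valid] = (m_g * w1[valid] - cum_mean[valid]) ** 2 / (w1[valid] * w2[valid])
    return int(np.argmax(sigma_b)), sigma_b

# Usage on a toy image with two gray levels in equal proportion.
toy = np.array([[50] * 4 + [200] * 4] * 4, dtype=np.uint8)
t, sigma_b = otsu_threshold(toy)
```

When several thresholds give the same maximum, as happens across an empty gap in the histogram, np.argmax returns the lowest of them; some implementations average the tied values instead.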
Solomon & Breckon's intuition — "tightest clustering": The method "finds that threshold which minimizes the within-class variance of the thresholded black and white pixels... selects the threshold which results in the tightest clustering of the two groups." This is the clearest non-mathematical description: Otsu picks the dividing line that makes each side of the line most internally uniform.
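The "tightest clustering" description and the between-class formulation are two views of the same criterion, because within-class and between-class variance add up to the fixed total variance. The sketch below checks this numerically by brute force on synthetic data. The mixture parameters and the helper name class_variances are hypothetical, chosen only for illustration.

```python
import numpy as np

def class_variances(pixels, t):
    """Return (within, between) class variance for threshold t."""
    bg = pixels[pixels <= t]
    fg = pixels[pixels > t]
    w1 = bg.size / pixels.size
    w2 = fg.size / pixels.size
    within = w1 * bg.var() + w2 * fg.var()
    between = w1 * w2 * (bg.mean() - fg.mean()) ** 2
    return within, between

# Synthetic bimodal intensities (assumed parameters, not from any real image).
rng = np.random.default_rng(0)
pixels = np.concatenate([rng.normal(70, 12, 6000), rng.normal(170, 15, 4000)])
pixels = np.clip(np.rint(pixels), 0, 255).astype(int)

# Thresholds from min to max - 1 keep both classes non-empty.
candidates = np.arange(pixels.min(), pixels.max())
scores = np.array([class_variances(pixels, t) for t in candidates])
within, between = scores[:, 0], scores[:, 1]

t_within = candidates[np.argmin(within)]    # tightest clustering
t_between = candidates[np.argmax(between)]  # largest class separation
```

For every candidate threshold, within + between should equal pixels.var(), so the threshold with the tightest clusters is also the one with the largest separation between class means, up to floating-point ties.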
When Otsu fails: three cases. Gonzalez & Woods identify critical failure modes: (1) Small object area: when the foreground occupies a tiny fraction of the image (e.g., sparse fluorescent dots on a large dark background), the foreground peak is too small to influence the histogram significantly, and Otsu may place the threshold in the middle of the background distribution; (2) High noise: noise fills the valley between the histogram peaks, making the bimodal structure less distinct and the optimal threshold less stable; (3) Non-uniform illumination: a single global threshold cannot correctly classify all regions when background intensity varies spatially. Adaptive (local) thresholding addresses case (3) by computing a separate threshold for each small neighborhood, for example by applying Otsu tile by tile, so that the threshold can follow the changing background.
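Case (3) can be illustrated with a synthetic example. The sketch below builds an image whose background brightens from left to right, then compares one global Otsu threshold with a simple tile-by-tile variant. The image layout, the tile width, and the function names are assumptions made for the illustration. Every tile is deliberately constructed to contain both classes, since a tile holding only background would still be split in two by Otsu.

```python
import numpy as np

def otsu_threshold(pixels, levels=256):
    """Threshold maximizing between-class variance (integer gray levels assumed)."""
    hist = np.bincount(np.asarray(pixels, dtype=np.int64).ravel(), minlength=levels)
    n1 = np.cumsum(hist)                       # pixels <= T
    n2 = hist.sum() - n1                       # pixels > T
    s1 = np.cumsum(hist * np.arange(levels))   # intensity sum of pixels <= T
    s2 = s1[-1] - s1
    valid = (n1 > 0) & (n2 > 0)
    sigma_b = np.zeros(levels)
    m1 = s1[valid] / n1[valid]
    m2 = s2[valid] / n2[valid]
    sigma_b[valid] = n1[valid] * n2[valid] * (m1 - m2) ** 2   # proportional to w1*w2*(m1-m2)^2
    return int(np.argmax(sigma_b))

def tiled_otsu_mask(image, tile_width):
    """Apply Otsu independently to non-overlapping vertical strips."""
    mask = np.zeros(image.shape, dtype=bool)
    for x0 in range(0, image.shape[1], tile_width):
        tile = image[:, x0:x0 + tile_width]
        mask[:, x0:x0 + tile_width] = tile > otsu_threshold(tile)
    return mask

# Synthetic scene: background ramps from 40 to 167, objects are 60 levels brighter.
height, width, tile_width = 64, 256, 32        # hypothetical sizes
background = 40 + np.arange(width) // 2
truth = np.zeros((height, width), dtype=bool)
truth[(np.arange(height) // 8) % 2 == 1, :] = True   # horizontal object bands
image = np.tile(background, (height, 1)) + 60 * truth

global_mask = image > otsu_threshold(image)
local_mask = tiled_otsu_mask(image, tile_width)
global_accuracy = (global_mask == truth).mean()
local_accuracy = (local_mask == truth).mean()
```

In this construction the bright end of the background (up to 167) overlaps the dark end of the foreground (from 100), so no single threshold can separate them, while inside each strip the two classes remain well apart. Practical adaptive methods usually use overlapping or sliding neighborhoods and need a rule for regions that contain only one class.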
Otsu in the broader context: Otsu's method ignores spatial structure: it classifies each pixel independently, based on intensity alone. It can be interpreted as approximating a Bayes classifier for two Gaussian classes with equal variances and equal priors, applied to the one-dimensional intensity domain. More sophisticated approaches (MRF-based segmentation, learned classifiers) also use spatial context, namely the fact that neighboring pixels tend to belong to the same class, and often give better results. But Otsu's simplicity, speed, and parameter-free nature make it the standard first approach.
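The link to a Gaussian classifier can be checked on synthetic data. For two Gaussian classes with equal variances and equal priors, the Bayes decision boundary is the midpoint of the two means. The sketch below draws such a mixture (the means, standard deviation, and sample sizes are hypothetical) and compares the Otsu threshold with that midpoint. It is an empirical illustration under these assumptions, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2, sigma = 80.0, 160.0, 15.0            # hypothetical class parameters
samples = np.concatenate([rng.normal(mu1, sigma, 20000),
                          rng.normal(mu2, sigma, 20000)])   # equal priors
pixels = np.clip(np.rint(samples), 0, 255).astype(np.int64)

# Otsu via cumulative counts and intensity sums over the 256-bin histogram.
hist = np.bincount(pixels, minlength=256)
n1 = np.cumsum(hist)                           # pixels <= T
n2 = hist.sum() - n1                           # pixels > T
s1 = np.cumsum(hist * np.arange(256))          # intensity sum of pixels <= T
s2 = s1[-1] - s1
valid = (n1 > 0) & (n2 > 0)
sigma_b = np.zeros(256)
sigma_b[valid] = (n1[valid] * n2[valid]
                  * (s1[valid] / n1[valid] - s2[valid] / n2[valid]) ** 2)
otsu_t = int(np.argmax(sigma_b))

bayes_t = (mu1 + mu2) / 2.0                    # equal-variance, equal-prior boundary
```

Under these assumptions otsu_t should land within a few gray levels of bayes_t. If the class sizes or variances are made very unequal, the two thresholds drift apart, which is consistent with the small-object failure mode.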
Otsu finds the threshold that makes the two groups (foreground and background) internally most uniform and maximally different from each other. It works beautifully when the histogram has two clear peaks with a valley between them. It struggles when the foreground is sparse (tiny peak lost in the background), when noise fills the valley, or when illumination is uneven. Despite these limitations, it remains the standard automatic thresholding method because it's fast, simple, and parameter-free.