ScientiaLux strataquest Glossary

Classifier (ML Engine)

Machine-learning tissue structure recognition

Definition
When simple gating rules aren't enough to distinguish cell types, the Classifier engine uses machine learning to learn the distinction from examples you provide. You select representative cells of each type — "these are tumor cells, these are lymphocytes, these are stromal cells" — and the algorithm learns the combination of measurements that best separates them. It then applies this learned distinction to classify every cell in the dataset, handling subtle patterns that would be impossible to express as explicit threshold rules.
  - Supervised Learning: learns from expert-annotated examples
  - Five Algorithm Options: Random Forest, KNN, SVM, Naive Bayes, ANN
  - Multi-Feature Decision: combines dozens of measurements
  - Texture Features: spatial patterns beyond intensity alone

How It Works

The Classifier engine implements supervised machine learning for cell classification:

  1. Training set creation — The analyst selects representative cells for each class by clicking on them in the image. Each selected cell contributes its measurement vector (all intensity, shape, and texture features) as a labeled training example.
  2. Feature selection — Choose which measurements to include as classification features. More features provide more discriminating information but increase the risk of overfitting if training sets are small.
  3. Algorithm selection — Choose from Random Forest, K-Nearest Neighbors, Support Vector Machine, Naive Bayes, or Artificial Neural Network.
  4. Training — The algorithm learns decision boundaries from the training data. The learned model is a function: feature vector → class label.
  5. Classification — Every cell in the dataset is classified by applying the learned model to its feature vector. Confidence scores indicate how certain the classification is for each cell.
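The five steps above can be sketched with scikit-learn. The data, feature count, and class names here are illustrative placeholders, not the engine's actual internals:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Steps 1-2. Training set: one measurement vector per expert-selected cell,
# with a class label (synthetic data standing in for real measurements).
X_train = rng.normal(size=(90, 12))          # 90 cells x 12 features
X_train[:30] += 2.0                          # make the "tumor" class separable
y_train = np.array(["tumor"] * 30 + ["lymphocyte"] * 30 + ["stroma"] * 30)

# Steps 3-4. Train: the model learns the mapping feature vector -> class label.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Step 5. Classify every cell; per-cell confidence is the highest class
# probability (for Random Forest, the fraction of trees voting for it).
X_all = rng.normal(size=(1000, 12))
labels = clf.predict(X_all)
confidence = clf.predict_proba(X_all).max(axis=1)
```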
Simplified

You show the classifier examples of each cell type. It learns what distinguishes them — which combination of marker intensities and shape features separates tumor cells from lymphocytes from stromal cells. Then it applies these learned distinctions to classify every cell in the dataset automatically.

Science Behind It

Feature selection criteria (Solomon & Breckon): Good classification features should have two properties: "(i) distribution over classes should be as widely separated as possible and (ii) features should be statistically independent of each other." The first criterion means the feature actually discriminates between classes (CD3 intensity separates T cells from tumor cells). The second means each feature adds new information (if two features are perfectly correlated, one is redundant).
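Criterion (ii) can be checked mechanically: a minimal numpy sketch that flags feature pairs whose correlation is near 1, using synthetic data in which one feature is deliberately a linear copy of another:

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.normal(size=(200, 4))         # 200 cells x 4 features
features[:, 3] = 2 * features[:, 0] + 0.5    # feature 3 duplicates feature 0

corr = np.corrcoef(features, rowvar=False)   # 4 x 4 correlation matrix
redundant = [(i, j) for i in range(4) for j in range(i + 1, 4)
             if abs(corr[i, j]) > 0.95]
# redundant -> [(0, 3)]: one of this pair adds no new information
```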

The Bayesian foundation: All classification can be framed as Bayesian inference: assign cell x to class C_j if P(C_j|x) is highest, where P(C_j|x) ∝ P(x|C_j) × P(C_j). Naive Bayes assumes features are conditionally independent given the class. SVM finds the hyperplane that maximizes the margin between classes. Random Forest builds many decision trees on random feature subsets and votes. Each algorithm makes different assumptions about the structure of the data.
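The argmax-posterior rule can be illustrated with a Gaussian Naive Bayes toy example (synthetic one-dimensional data; class means of 0 and 5 are chosen arbitrarily):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
# Two classes with well-separated means on a single feature.
X = np.concatenate([rng.normal(0.0, 1.0, 100),
                    rng.normal(5.0, 1.0, 100)])[:, None]
y = np.array([0] * 100 + [1] * 100)

nb = GaussianNB().fit(X, y)
post = nb.predict_proba([[4.8]])[0]   # posteriors P(C_j | x = 4.8), sum to 1
label = int(np.argmax(post))          # assign to the highest posterior
```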

Texture features — Haralick and beyond: Solomon & Breckon describe Haralick texture features derived from the Gray-Level Co-occurrence Matrix (GLCM): contrast, correlation, energy, homogeneity. These capture spatial patterns in intensity — a cell with granular staining has high GLCM contrast; a cell with uniform staining has high homogeneity. Texture features are particularly valuable when two cell types have similar mean intensities but different staining patterns (e.g., diffuse cytoplasmic vs. punctate vesicular).
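The GLCM idea can be sketched in plain numpy without an imaging library: a toy co-occurrence matrix over horizontal neighbor pairs, with the contrast and homogeneity formulas from the text. The two patches are made up to mimic uniform vs. granular staining:

```python
import numpy as np

def glcm_stats(img, levels=4):
    """Co-occurrence of horizontally adjacent gray levels -> (contrast, homogeneity)."""
    glcm = np.zeros((levels, levels))
    for a, b in zip(img[:, :-1].ravel(), img[:, 1:].ravel()):
        glcm[a, b] += 1
    p = glcm / glcm.sum()                        # normalize to probabilities
    i, j = np.indices(p.shape)
    contrast = (p * (i - j) ** 2).sum()          # penalizes neighbor differences
    homogeneity = (p / (1 + np.abs(i - j))).sum()  # rewards similar neighbors
    return contrast, homogeneity

uniform = np.full((8, 8), 2)               # flat staining: all pairs identical
granular = np.indices((8, 8)).sum(0) % 4   # rapidly varying gray levels

c_u, h_u = glcm_stats(uniform)
c_g, h_g = glcm_stats(granular)
# granular patch: higher contrast; uniform patch: higher homogeneity
```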

The curse of dimensionality: With dozens of features available (6 channels × 3 compartments × 5 statistics = 90+ features), the feature space is high-dimensional. Classification accuracy can paradoxically decrease as more features are added if the training set is too small (overfitting). The Fisher Linear Discriminant approach — projecting high-dimensional data onto the axis that maximally separates classes — is the theoretical foundation for feature selection: keep the features that contribute most to class separation.
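A Fisher-style feature ranking can be sketched as between-class separation divided by within-class spread, computed per feature (synthetic data; only feature 0 is constructed to separate the classes):

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature ratio of between-class scatter to within-class scatter."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / within

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 5))
y = np.array([0] * 60 + [1] * 60)
X[y == 1, 0] += 3.0          # only feature 0 separates the classes

scores = fisher_scores(X, y)
best = int(np.argmax(scores))   # keep the highest-scoring features
```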

Asymmetric error costs: Solomon & Breckon note: "the cost of a false-negative (abnormal classified as normal) is considerably greater than the other kind of misclassification." In tissue analysis, misclassifying a tumor cell as a lymphocyte (false negative for tumor) may have different consequences than the reverse. Classification thresholds can be adjusted to favor sensitivity (catch all tumor cells) or specificity (don't misclassify any lymphocytes) depending on the clinical context.
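Threshold adjustment can be sketched as calling "tumor" whenever the predicted tumor probability exceeds a cutoff below the symmetric 0.5 (the data and the 0.2 cutoff are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = np.concatenate([rng.normal(0.0, 1.0, (200, 2)),
                    rng.normal(1.5, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)            # 1 = tumor

clf = LogisticRegression().fit(X, y)
p_tumor = clf.predict_proba(X)[:, 1]

default_call = p_tumor >= 0.5    # symmetric threshold
sensitive_call = p_tumor >= 0.2  # favor sensitivity: miss fewer tumor cells
# Lowering the cutoff can only add tumor calls, never remove them,
# trading more false positives for fewer false negatives.
```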

Simplified

The classifier learns to distinguish cell types by finding the combination of measurements that best separates them — like learning to distinguish apple varieties by combining color, size, and texture rather than any single feature. Random Forest is often the best starting choice because it handles complex patterns, doesn't require feature scaling, and is robust to irrelevant features. The main risk is overfitting: using too many features with too few training examples makes the model memorize the training data rather than learning generalizable patterns.

Parameters & Settings

Parameter | Type | Description
Training Data | Cell selections | Expert-selected representative cells for each class.
Features | Multi-select | Which measurements to use as classification features.
Algorithm | Selection | Random Forest (recommended default), KNN, SVM, Naive Bayes, or ANN.
Algorithm Parameters | Algorithm-specific | Trees (RF), K (KNN), kernel/C (SVM), hidden layers (ANN).
Cross-Validation | Toggle | Evaluate accuracy using leave-one-out or k-fold cross-validation on training data.
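The effect of the cross-validation toggle can be approximated with a k-fold sketch in scikit-learn (synthetic data; 5 folds chosen arbitrarily):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(90, 6))
X[:30] += 2.5                                  # make one class separable
y = np.array(["tumor"] * 30 + ["lymphocyte"] * 30 + ["stroma"] * 30)

# Each fold trains on 4/5 of the labeled cells and tests on the held-out 1/5,
# so the score estimates accuracy on cells the model has not seen.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
fold_scores = cross_val_score(clf, X, y, cv=5)
mean_acc = fold_scores.mean()
```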
Simplified

Select representative cells for each class, choose which measurements to use as features, and pick an algorithm (Random Forest is a safe default). Cross-validation estimates how well the classifier generalizes by repeatedly holding out part of the training data and testing on it; if accuracy is low, add more training examples or try different features.

Practical Example

Classifying cells in H&E-stained tissue where simple marker gating isn't available:

  1. Select ~30 tumor cells, ~30 lymphocytes, ~30 stromal cells from representative regions
  2. Features: nuclear area, compactness, eccentricity, hematoxylin intensity, eosin intensity, GLCM texture features
  3. Random Forest classifier with 100 trees
  4. Cross-validation accuracy: 94%
  5. Apply to all 45,000 cells → per-cell classification with confidence scores

Without multiplexed markers, classification relies on morphological and textural features that require a learned classifier rather than simple gating rules.

Simplified

In H&E tissue without specific markers, the classifier learns to distinguish cell types by morphology — tumor cells are larger with irregular nuclei, lymphocytes are small and round, stromal cells are elongated. Training on 30 examples per class and using nuclear shape and texture features achieves ~94% accuracy across all 45,000 cells.
