System Comparison & Quality Gates

A/B testing, go/no-go decisions, and effect sizes for LLM evaluation

Decision Flowchart

Before diving into the API, use this to pick the right tool:

Need to compare two systems?
  ├── Binary scores  → compare_systems(score_type="binary")
  ├── Ordinal scores → compare_systems(score_type="ordinal")
  └── Continuous scores → compare_systems(score_type="continuous")

Need a go/no-go decision?
  ├── Minimum quality bar → threshold_gate()
  └── "Don't regress"     → non_inferiority_test()

Testing multiple metrics at once?
  └── Correct p-values   → multiple_comparison_correction()

System Comparison

compare_systems

compare_systems(
    scores_a,               # Baseline system scores
    scores_b,               # Candidate system scores
    score_type="binary",    # "binary" | "ordinal" | "continuous"
    n_resamples=10_000,     # Bootstrap iterations
    confidence_level=0.95,  # CI width
    seed=None,              # Reproducibility
) -> ComparisonResult

Auto-selects the appropriate statistical test based on score_type and sample size:

| Score type | Test selected |
|---|---|
| binary (large n) | McNemar's test |
| binary (small n) | Permutation test |
| ordinal | Wilcoxon signed-rank |
| continuous | Paired bootstrap |

Returns a ComparisonResult with delta, ci_lower, ci_upper, p_value, effect_size, and a human-readable interpretation.

from latent.stats import compare_systems

result = compare_systems(scores_gpt35, scores_gpt4, score_type="binary")

print(result.delta)           # 0.150
print(result.p_value)         # 0.0023
print(result.interpretation)
# "System B scores 0.150 higher (statistically significant, p=0.0023)"

Tip

Scores must be paired -- the same examples evaluated by both systems, in the same order. Paired tests are almost always more powerful than unpaired alternatives for LLM evaluation because they control for per-example difficulty.

mcnemars_test

mcnemars_test(scores_a, scores_b) -> ComparisonResult

Specialized test for paired binary data. Tests whether the discordant pairs (cases where the two systems disagree) are symmetric.

When to use: binary outcomes with n >= 25. For smaller samples, compare_systems automatically falls back to a permutation test.

from latent.stats import mcnemars_test

result = mcnemars_test(baseline_binary, candidate_binary)
print(f"p = {result.p_value:.4f}")

Note

McNemar's test only considers the discordant pairs -- examples where one system is correct and the other is wrong. Concordant pairs (both right or both wrong) contribute no information and are ignored.
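To make the discordant-pair logic concrete, here is a minimal from-scratch sketch of the statistic. This is stdlib-only illustration, not the library's implementation: `mcnemar_sketch` is a hypothetical helper, it omits the continuity correction some implementations apply, and it assumes at least one discordant pair exists.

```python
from math import erfc, sqrt

def mcnemar_sketch(scores_a, scores_b):
    # Count the two kinds of discordant pairs; concordant pairs are ignored.
    b01 = sum(1 for a, b in zip(scores_a, scores_b) if a == 1 and b == 0)
    b10 = sum(1 for a, b in zip(scores_a, scores_b) if a == 0 and b == 1)
    # McNemar's chi-square statistic (no continuity correction).
    stat = (b01 - b10) ** 2 / (b01 + b10)
    # Survival function of chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return b01, b10, p_value
```

Note that only `b01` and `b10` enter the statistic, which is exactly why concordant pairs contribute no information.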


Quality Gates

threshold_gate

threshold_gate(
    metric,                     # A MetricResult (e.g. from bootstrap_ci)
    threshold,                  # Minimum acceptable value
    strictness="lower_ci",      # "point_estimate" | "lower_ci"
) -> GatingResult

Check if a metric exceeds a minimum quality bar. Returns a GatingResult with passed, threshold, actual_value, and margin.

| Strictness | Behavior | Recommended? |
|---|---|---|
| "point_estimate" | Pass if the mean exceeds the threshold | No -- too lenient |
| "lower_ci" | Pass only if the lower bound of the CI exceeds the threshold | Yes |

import sys
from latent.stats import bootstrap_ci, threshold_gate

metric = bootstrap_ci(f1_scores, seed=42)
gate = threshold_gate(metric, threshold=0.80, strictness="lower_ci")

if not gate.passed:
    print(f"BLOCKED: F1 lower CI is {gate.actual_value:.3f}, need >= {gate.threshold}")
    sys.exit(1)  # Fail the pipeline

Warning

Point-estimate gating is available but discouraged. A model scoring 0.86 on 50 examples could easily have a true accuracy below 0.80. Use "lower_ci" to gate on statistical confidence, not luck.
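To see why, here is a quick back-of-envelope check using a normal-approximation lower bound -- an illustrative stand-in for the bootstrap CI the library actually computes, with a hypothetical helper name:

```python
from math import sqrt

def normal_lower_bound(successes, n, z=1.96):
    # Normal-approximation lower bound on a binomial proportion --
    # a rough stand-in for the bootstrap lower CI used by "lower_ci".
    p = successes / n
    return p - z * sqrt(p * (1 - p) / n)

point = 43 / 50                      # 0.86: clears a 0.80 point-estimate gate
lower = normal_lower_bound(43, 50)   # ~0.764: fails a 0.80 lower-CI gate
```

With only 50 examples, a 0.86 point estimate is consistent with a true accuracy well below 0.80, which is exactly the case the lower-CI gate blocks.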

non_inferiority_test

non_inferiority_test(
    scores_a,               # Baseline system scores
    scores_b,               # Candidate system scores
    margin,                 # Maximum acceptable regression
    score_type="binary",
    confidence_level=0.95,
    n_resamples=10_000,
    seed=None,
) -> GatingResult

Tests whether System B is "not worse than A by more than margin." This is the right tool for CI/CD: deploy a new model only if it doesn't regress beyond a tolerance.

from latent.stats import non_inferiority_test

gate = non_inferiority_test(
    baseline_scores,
    candidate_scores,
    margin=0.05,  # Allow up to 5% regression
)

if gate.passed:
    print("Safe to deploy -- candidate is non-inferior")
else:
    print(f"Regression detected: delta={gate.actual_value:.3f}")
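Conceptually, the gate passes when the lower confidence bound on delta = mean(B) - mean(A) stays above -margin. The following stdlib-only sketch illustrates that one-sided rule with a paired bootstrap; `non_inferiority_sketch` is a hypothetical helper and the library's exact procedure may differ:

```python
import random

def non_inferiority_sketch(scores_a, scores_b, margin, n_resamples=10_000, seed=0):
    # Paired bootstrap of delta = mean(B) - mean(A); pass if the lower
    # 5th percentile of delta stays above -margin (one-sided 95% rule).
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(scores_b[i] - scores_a[i] for i in idx) / n)
    deltas.sort()
    lower = deltas[int(0.05 * n_resamples)]
    return lower > -margin
```

Note the asymmetry: the candidate does not need to beat the baseline, it only needs a delta whose plausible worst case stays within the tolerance.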

Choosing a margin

The margin should reflect what regression is acceptable in practice. Common values: 0.02 for safety-critical metrics, 0.05 for general quality, 0.10 for low-stakes comparisons. When in doubt, start strict and relax later.

multiple_comparison_correction

multiple_comparison_correction(
    p_values,           # list of raw p-values
    method="holm",      # "holm" | "fdr_bh"
) -> list[float]

Adjust p-values when testing multiple metrics simultaneously to control false positives.

| Method | Controls | Use when |
|---|---|---|
| "holm" | Family-wise error rate (FWER) | You need every claim to be valid |
| "fdr_bh" | False discovery rate (FDR) | You tolerate some false positives among many tests |

from latent.stats import multiple_comparison_correction

metric_names = ["exact_match", "f1", "bleu", "rouge_l", "toxicity"]  # one per p-value
raw_pvals = [0.01, 0.04, 0.03, 0.20, 0.005]
adjusted = multiple_comparison_correction(raw_pvals, method="holm")

for name, raw, adj in zip(metric_names, raw_pvals, adjusted):
    sig = "significant" if adj < 0.05 else "not significant"
    print(f"{name}: raw={raw:.4f}, adjusted={adj:.4f} ({sig})")

Note

If you are comparing two systems across 5 metrics at alpha=0.05, there is a ~23% chance of at least one false positive without correction. Always correct when reporting multiple comparisons.
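The ~23% figure follows directly from the complement rule, assuming the five tests are independent:

```python
alpha, n_tests = 0.05, 5
# P(at least one false positive) = 1 - P(no false positives in any test)
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(f"{p_at_least_one:.1%}")  # 22.6%
```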


Effect Sizes

Effect sizes quantify how large a difference is, independent of sample size. A comparison can be statistically significant but practically negligible, or vice versa. Always report effect sizes alongside p-values.

The following functions are covered in detail in Core Primitives:

| Function | Score type | Interpretation |
|---|---|---|
| cohens_d(scores_a, scores_b) | Continuous | Small: 0.2, Medium: 0.5, Large: 0.8 |
| odds_ratio(scores_a, scores_b) | Binary | How much more likely System B is to succeed |
| risk_ratio(scores_a, scores_b) | Binary | Relative risk of success between systems |
| common_language_effect_size(scores_a, scores_b) | Any | Probability a random draw from B beats A |

from latent.stats import cohens_d, common_language_effect_size

d = cohens_d(scores_baseline, scores_candidate)
print(f"Cohen's d = {d:.2f}")  # e.g. 0.45 (medium effect)

cles = common_language_effect_size(scores_baseline, scores_candidate)
print(f"P(candidate > baseline) = {cles:.1%}")  # e.g. 62.5%

Tip

common_language_effect_size is the most intuitive measure: "If you pick a random example, what is the probability that the candidate scores higher than the baseline?" Use this when presenting results to non-statistical audiences.
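For intuition, CLES can be computed from scratch by comparing every pair of scores. This is a hypothetical sketch that counts ties as half a win; the library may handle ties differently:

```python
def cles_sketch(scores_a, scores_b):
    # Fraction of (a, b) pairs where b beats a, with ties counted as 0.5.
    wins = sum((b > a) + 0.5 * (b == a) for a in scores_a for b in scores_b)
    return wins / (len(scores_a) * len(scores_b))

print(cles_sketch([1, 2, 3], [2, 3, 4]))  # 7/9 ~ 0.778
```

A value of 0.5 means the systems are indistinguishable on a random example; 1.0 means the candidate wins every pairing.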


Full Example: Model Upgrade Pipeline

Putting it all together -- compare a candidate model against the current baseline, enforce quality gates, and log the result:

import sys
from latent.stats import (
    compare_systems,
    threshold_gate,
    non_inferiority_test,
    bootstrap_ci,
    common_language_effect_size,
)

# -- 1. Compare systems --
comparison = compare_systems(
    baseline_scores, candidate_scores, score_type="binary", seed=42
)
print(comparison.interpretation)

# -- 2. Absolute quality gate --
candidate_metric = bootstrap_ci(candidate_scores, seed=42)
abs_gate = threshold_gate(candidate_metric, threshold=0.85, strictness="lower_ci")

# -- 3. Non-inferiority gate --
ni_gate = non_inferiority_test(
    baseline_scores, candidate_scores, margin=0.03, seed=42
)

# -- 4. Effect size for context --
cles = common_language_effect_size(baseline_scores, candidate_scores)

# -- 5. Decision --
if abs_gate.passed and ni_gate.passed:
    print(f"DEPLOY: candidate passes all gates (CLES={cles:.1%})")
else:
    reasons = []
    if not abs_gate.passed:
        reasons.append(f"quality below threshold ({abs_gate.actual_value:.3f} < {abs_gate.threshold})")
    if not ni_gate.passed:
        reasons.append(f"regression exceeds margin ({ni_gate.actual_value:.3f})")
    print(f"BLOCK: {', '.join(reasons)}")
    sys.exit(1)