System Comparison & Quality Gates¶
A/B testing, go/no-go decisions, and effect sizes for LLM evaluation
Decision Flowchart¶
Before diving into the API, use this to pick the right tool:
```text
Need to compare two systems?
├── Binary scores     → compare_systems(score_type="binary")
├── Ordinal scores    → compare_systems(score_type="ordinal")
└── Continuous scores → compare_systems(score_type="continuous")

Need a go/no-go decision?
├── Minimum quality bar → threshold_gate()
└── "Don't regress"     → non_inferiority_test()

Testing multiple metrics at once?
└── Correct p-values → multiple_comparison_correction()
```
System Comparison¶
compare_systems¶
```python
compare_systems(
    scores_a,                # Baseline system scores
    scores_b,                # Candidate system scores
    score_type="binary",     # "binary" | "ordinal" | "continuous"
    n_resamples=10_000,      # Bootstrap iterations
    confidence_level=0.95,   # CI width
    seed=None,               # Reproducibility
) -> ComparisonResult
```
Auto-selects the appropriate statistical test based on score_type and sample size:
| Score type | Test selected |
|---|---|
| `binary` (large n) | McNemar's test |
| `binary` (small n) | Permutation test |
| `ordinal` | Wilcoxon signed-rank |
| `continuous` | Paired bootstrap |
Returns a ComparisonResult with delta, ci_lower, ci_upper, p_value, effect_size, and a human-readable interpretation.
```python
from latent.stats import compare_systems

result = compare_systems(scores_gpt35, scores_gpt4, score_type="binary")
print(result.delta)    # 0.150
print(result.p_value)  # 0.0023
print(result.interpretation)
# "System B scores 0.150 higher (statistically significant, p=0.0023)"
```
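For intuition about the small-n binary fallback, a paired permutation test can be sketched as an exact sign-flip test on the per-example differences. This is a minimal illustration under that interpretation, not the library's actual implementation:

```python
import itertools

def sign_flip_permutation_test(scores_a, scores_b):
    """Exact paired sign-flip permutation test on per-example deltas.

    Enumerates all 2^n sign assignments, so it is only practical for
    small n. Returns a two-sided p-value for the mean difference.
    """
    deltas = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(deltas))
    n = len(deltas)
    extreme = 0
    for signs in itertools.product([1, -1], repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, deltas)))
        if stat >= observed:
            extreme += 1
    return extreme / 2 ** n

# Identical systems: every sign flip is at least as extreme, so p = 1.0
p_null = sign_flip_permutation_test([1, 0, 1, 0, 1], [1, 0, 1, 0, 1])
```

Because the null distribution is enumerated exactly, the p-value is valid at any sample size; the cost is the 2^n enumeration, which is why larger samples get McNemar's test instead.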
Tip
Scores must be paired -- the same examples evaluated by both systems, in the same order. Paired tests are almost always more powerful than unpaired alternatives for LLM evaluation because they control for per-example difficulty.
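A quick simulation makes the point: when a shared per-example difficulty drives both systems' scores, it cancels out of the paired differences, shrinking the variance the test has to overcome. The data and variable names here are synthetic and purely illustrative:

```python
import random

random.seed(0)
# Per-example "difficulty" affects both systems, inducing correlation.
difficulty = [random.gauss(0, 1) for _ in range(2000)]
scores_a = [0.70 - 0.2 * d + random.gauss(0, 0.05) for d in difficulty]
scores_b = [0.75 - 0.2 * d + random.gauss(0, 0.05) for d in difficulty]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# The difficulty term cancels in b - a, leaving only the noise.
paired_var = variance([b - a for a, b in zip(scores_a, scores_b)])
# An unpaired comparison faces the sum of the two variances instead.
unpaired_var = variance(scores_a) + variance(scores_b)
```

Here `paired_var` comes out roughly an order of magnitude smaller than `unpaired_var`, which is exactly the power advantage the tip describes.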
mcnemars_test¶
Specialized test for paired binary data. Tests whether the discordant pairs (cases where the two systems disagree) are symmetric.
When to use: binary outcomes with n >= 25. For smaller samples, compare_systems automatically falls back to a permutation test.
```python
from latent.stats import mcnemars_test

result = mcnemars_test(baseline_binary, candidate_binary)
print(f"p = {result.p_value:.4f}")
```
Note
McNemar's test only considers the discordant pairs -- examples where one system is correct and the other is wrong. Concordant pairs (both right or both wrong) contribute no information and are ignored.
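A minimal sketch of the statistic shows why only discordant pairs matter. This is the plain chi-square form without continuity correction; the library may apply a correction or an exact binomial variant:

```python
import math

def mcnemar_p(baseline, candidate):
    """Chi-square McNemar test on paired 0/1 scores (no continuity correction)."""
    # b: baseline right, candidate wrong; c: baseline wrong, candidate right
    b = sum(1 for x, y in zip(baseline, candidate) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(baseline, candidate) if x == 0 and y == 1)
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence either way
    stat = (b - c) ** 2 / (b + c)  # ~ chi-square with 1 df under H0
    # Survival function of chi2(1) via the complementary error function
    return math.erfc(math.sqrt(stat / 2))

# b=20, c=5: heavily asymmetric discordance → p ≈ 0.0027
p = mcnemar_p([1]*20 + [0]*5 + [1]*30, [0]*20 + [1]*5 + [1]*30)
```

Note that the 30 concordant pairs in the example change nothing: adding or removing them leaves `b`, `c`, and therefore the p-value untouched.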
Quality Gates¶
threshold_gate¶
```python
threshold_gate(
    metric,                  # A MetricResult (e.g. from bootstrap_ci)
    threshold,               # Minimum acceptable value
    strictness="lower_ci",   # "point_estimate" | "lower_ci"
) -> GatingResult
```
Check if a metric exceeds a minimum quality bar. Returns a GatingResult with passed, threshold, actual_value, and margin.
| Strictness | Behavior | Recommended? |
|---|---|---|
| `"point_estimate"` | Pass if the mean exceeds the threshold | No -- too lenient |
| `"lower_ci"` | Pass only if the lower bound of the CI exceeds the threshold | Yes |
```python
import sys

from latent.stats import bootstrap_ci, threshold_gate

metric = bootstrap_ci(f1_scores, seed=42)
gate = threshold_gate(metric, threshold=0.80, strictness="lower_ci")

if not gate.passed:
    print(f"BLOCKED: F1 lower CI is {gate.actual_value:.3f}, need >= {gate.threshold}")
    sys.exit(1)  # Fail the pipeline
```
Warning
Point-estimate gating is available but discouraged. A model scoring 0.86 on
50 examples could easily have a true accuracy below 0.80. Use "lower_ci"
to gate on statistical confidence, not luck.
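The warning can be made concrete with a quick normal-approximation interval. This is a rough back-of-the-envelope sketch, not what `bootstrap_ci` computes (which resamples instead):

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% CI for a proportion (illustration only)."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

lo, hi = wald_interval(0.86, 50)
# lo ≈ 0.764: a 0.86 point estimate on 50 examples is entirely
# consistent with a true accuracy below the 0.80 bar.
```

Point-estimate gating would pass this model; `"lower_ci"` gating would correctly block it until more evidence accumulates.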
non_inferiority_test¶
```python
non_inferiority_test(
    scores_a,                # Baseline system scores
    scores_b,                # Candidate system scores
    margin,                  # Maximum acceptable regression
    score_type="binary",
    confidence_level=0.95,
    n_resamples=10_000,
    seed=None,
) -> GatingResult
```
Tests whether System B is "not worse than A by more than margin." This is the right tool for CI/CD: deploy a new model only if it doesn't regress beyond a tolerance.
```python
from latent.stats import non_inferiority_test

gate = non_inferiority_test(
    baseline_scores,
    candidate_scores,
    margin=0.05,  # Allow up to 5% regression
)

if gate.passed:
    print("Safe to deploy -- candidate is non-inferior")
else:
    print(f"Regression detected: delta={gate.actual_value:.3f}")
```
Choosing a margin
The margin should reflect what regression is acceptable in practice. Common
values: 0.02 for safety-critical metrics, 0.05 for general quality, 0.10
for low-stakes comparisons. When in doubt, start strict and relax later.
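Under the hood, a gate of this kind can be sketched as checking whether the bootstrap lower confidence bound on the paired score delta clears `-margin`. This is an illustrative re-implementation under that assumption, not the library's actual code:

```python
import random

def non_inferior(baseline, candidate, margin, n_resamples=10_000, seed=0):
    """Pass if the bootstrap lower 95% bound on (candidate - baseline)
    stays above -margin, i.e. any regression is within tolerance."""
    rng = random.Random(seed)
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]  # 2.5th percentile
    return lower > -margin

non_inferior([1]*100, [1]*100, margin=0.05)  # identical systems: passes
non_inferior([1]*100, [0]*100, margin=0.05)  # total regression: fails
```

Note the asymmetry: the candidate does not have to beat the baseline, it only has to avoid losing by more than the margin with confidence.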
multiple_comparison_correction¶
```python
multiple_comparison_correction(
    p_values,        # list of raw p-values
    method="holm",   # "holm" | "fdr_bh"
) -> list[float]
```
Adjust p-values when testing multiple metrics simultaneously to control false positives.
| Method | Controls | Use when |
|---|---|---|
| `"holm"` | Family-wise error rate (FWER) | You need every claim to be valid |
| `"fdr_bh"` | False discovery rate (FDR) | You can tolerate some false positives among many tests |
```python
from latent.stats import multiple_comparison_correction

metric_names = ["accuracy", "f1", "coherence", "fluency", "toxicity"]  # one per p-value
raw_pvals = [0.01, 0.04, 0.03, 0.20, 0.005]
adjusted = multiple_comparison_correction(raw_pvals, method="holm")

for name, raw, adj in zip(metric_names, raw_pvals, adjusted):
    sig = "significant" if adj < 0.05 else "not significant"
    print(f"{name}: raw={raw:.4f}, adjusted={adj:.4f} ({sig})")
```
Note
If you are comparing two systems across 5 metrics at alpha=0.05, there is a ~23% chance of at least one false positive without correction. Always correct when reporting multiple comparisons.
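The ~23% figure is just `1 - 0.95**5`. For intuition, here is a minimal sketch of the Holm step-down procedure, which I assume is what the `"holm"` method implements (the library may differ in details such as tie handling):

```python
def holm_adjust(p_values):
    """Holm step-down p-value adjustment (controls the FWER)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1) ...
        adj = min(1.0, (m - rank) * p_values[i])
        # ... and enforce monotonicity over the sorted sequence.
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted

holm_adjust([0.01, 0.04, 0.03, 0.20, 0.005])
# ≈ [0.04, 0.09, 0.09, 0.20, 0.025]

p_any_false_positive = 1 - 0.95 ** 5  # ≈ 0.226 with 5 independent tests
```

Holm is uniformly more powerful than plain Bonferroni while giving the same FWER guarantee, which is why it is the default here.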
Effect Sizes¶
Effect sizes quantify how large a difference is, independent of sample size. A comparison can be statistically significant but practically negligible, or vice versa. Always report effect sizes alongside p-values.
The following functions are covered in detail in Core Primitives:
| Function | Score type | Interpretation |
|---|---|---|
| `cohens_d(scores_a, scores_b)` | Continuous | Small: 0.2, Medium: 0.5, Large: 0.8 |
| `odds_ratio(scores_a, scores_b)` | Binary | How much more likely System B is to succeed |
| `risk_ratio(scores_a, scores_b)` | Binary | Relative risk of success between systems |
| `common_language_effect_size(scores_a, scores_b)` | Any | Probability a random draw from B beats A |
```python
from latent.stats import cohens_d, common_language_effect_size

d = cohens_d(scores_baseline, scores_candidate)
print(f"Cohen's d = {d:.2f}")  # e.g. 0.45 (medium effect)

cles = common_language_effect_size(scores_baseline, scores_candidate)
print(f"P(candidate > baseline) = {cles:.1%}")  # e.g. 62.5%
```
Tip
common_language_effect_size is the most intuitive measure: "If you pick a
random example, what is the probability that the candidate scores higher than
the baseline?" Use this when presenting results to non-statistical audiences.
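For reference, the generic all-pairs definition of this quantity fits in a few lines. This sketch counts ties as half a win; the library's version presumably handles paired data and edge cases with more care:

```python
def cles(scores_a, scores_b):
    """P(random draw from B > random draw from A), ties counted as half."""
    wins = ties = 0
    for a in scores_a:
        for b in scores_b:
            if b > a:
                wins += 1
            elif b == a:
                ties += 1
    return (wins + 0.5 * ties) / (len(scores_a) * len(scores_b))

cles([0, 0, 0], [1, 1, 1])  # candidate always wins → 1.0
cles([1, 2, 3], [1, 2, 3])  # identical distributions → 0.5
```

The 0.5 baseline for identical systems is what makes this measure easy to present: anything meaningfully above 0.5 reads directly as "the candidate usually wins."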
Full Example: Model Upgrade Pipeline¶
Putting it all together -- compare a candidate model against the current baseline, enforce quality gates, and log the result:
```python
import sys

from latent.stats import (
    compare_systems,
    threshold_gate,
    non_inferiority_test,
    bootstrap_ci,
    common_language_effect_size,
)

# -- 1. Compare systems --
comparison = compare_systems(
    baseline_scores, candidate_scores, score_type="binary", seed=42
)
print(comparison.interpretation)

# -- 2. Absolute quality gate --
candidate_metric = bootstrap_ci(candidate_scores, seed=42)
abs_gate = threshold_gate(candidate_metric, threshold=0.85, strictness="lower_ci")

# -- 3. Non-inferiority gate --
ni_gate = non_inferiority_test(
    baseline_scores, candidate_scores, margin=0.03, seed=42
)

# -- 4. Effect size for context --
cles = common_language_effect_size(baseline_scores, candidate_scores)

# -- 5. Decision --
if abs_gate.passed and ni_gate.passed:
    print(f"DEPLOY: candidate passes all gates (CLES={cles:.1%})")
else:
    reasons = []
    if not abs_gate.passed:
        reasons.append(
            f"quality below threshold ({abs_gate.actual_value:.3f} < {abs_gate.threshold})"
        )
    if not ni_gate.passed:
        reasons.append(f"regression exceeds margin ({ni_gate.actual_value:.3f})")
    print(f"BLOCK: {', '.join(reasons)}")
    sys.exit(1)
```