# Statistical Analysis

Rigorous statistical analysis for LLM evaluation pipelines.

`latent.stats` provides production-grade statistical primitives for evaluating LLM systems, from confidence intervals to drift detection, designed for the unique challenges of LLM evaluation: ordinal rubrics, multi-judge agreement, conversation-level metrics, and bias correction for automated judges.
## Quick Start
### Installation

Install with pip, or with uv (recommended).
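Assuming the distribution is published under the name `latent` (matching the `latent.stats` import path — check your package index if it differs):

```bash
# With pip (assumed package name):
pip install latent

# Or with uv (recommended):
uv add latent
```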
### Basic Usage
```python
import numpy as np
from latent.stats import analyze

scores = np.array([4, 5, 3, 5, 4, 5, 4, 3, 5, 4])
report = analyze(scores={"quality": scores}, score_types={"quality": "ordinal"})

for m in report.metrics:
    print(f"{m.name}: {m.point_estimate:.3f} [{m.ci_lower:.3f}, {m.ci_upper:.3f}]")
```
> **Tip:** `analyze()` is the high-level entry point that picks the right statistical methods based on your score types. For fine-grained control, use the individual functions directly.
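To build intuition for what a bootstrap CI like the one behind `bootstrap_ci` reports, here is a minimal percentile-bootstrap sketch in plain NumPy. This is an illustration of the technique, not the library's implementation (the function name `percentile_bootstrap_ci` is ours):

```python
import numpy as np

def percentile_bootstrap_ci(scores, stat=np.mean, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample with replacement, take empirical quantiles."""
    rng = np.random.default_rng(seed)
    # Each row of idx is one resampled dataset of the same size as the original.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_stats = stat(scores[idx], axis=1)
    lower, upper = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return stat(scores), lower, upper

scores = np.array([4, 5, 3, 5, 4, 5, 4, 3, 5, 4])
point, lo, hi = percentile_bootstrap_ci(scores)
```

The same resampling idea works for any statistic (median, trimmed mean, pass rate) by swapping the `stat` argument.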
## Key Capabilities
| Capability | Functions | Use Case |
|---|---|---|
| Bootstrap CIs | `bootstrap_ci`, `paired_bootstrap_ci` | Confidence intervals for any statistic |
| Wilson Intervals | `wilson_ci` | Binary proportions near 0% or 100% |
| System Comparison | `compare_systems`, `mcnemars_test` | A/B testing two LLM systems |
| Classification | `classification_metrics`, `per_class_metrics` | Intent detection, routing accuracy |
| Ordinal Scores | `ordinal_distribution`, `binarize` | 1-5 rubric scores |
| Bias Correction | `ppi_mean`, `stratified_ppi` | Correcting automated judge bias |
| Inter-Judge Agreement | `cohens_kappa`, `fleiss_kappa`, `krippendorffs_alpha` | Multi-judge reliability |
| Conversation Metrics | `turn_level_scores`, `turn_position_analysis` | Multi-turn evaluation |
| SOP Compliance | `phase_completion_score`, `sequence_compliance` | Process adherence |
| Drift Detection | `detect_drift`, `drift_report` | Monitoring score degradation |
| Quality Gates | `threshold_gate`, `non_inferiority_test` | CI/CD pass/fail decisions |
| Reporting | `render_markdown`, `log_to_mlflow` | Human-readable reports, experiment tracking |
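As a concrete example from the agreement row: Cohen's kappa corrects two judges' raw percent agreement for the agreement expected by chance. A minimal NumPy sketch of the unweighted statistic (illustrative only; the library's `cohens_kappa` may support weighting and handle edge cases differently):

```python
import numpy as np

def cohens_kappa_simple(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    labels = np.union1d(a, b)
    p_observed = np.mean(a == b)
    # Chance agreement: sum over labels of the product of marginal frequencies.
    p_chance = sum(np.mean(a == k) * np.mean(b == k) for k in labels)
    return (p_observed - p_chance) / (1 - p_chance)

judge_a = [1, 1, 0, 1, 0, 1, 1, 0]
judge_b = [1, 1, 0, 0, 0, 1, 1, 1]
kappa = cohens_kappa_simple(judge_a, judge_b)  # well below the raw 75% agreement
```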
## Common Patterns
### Comparing Two Systems
Run a paired comparison between a baseline and a candidate model, with bootstrap confidence intervals on the difference:
```python
from latent.stats import compare_systems

result = compare_systems(
    baseline_scores,
    candidate_scores,
    score_type="continuous",
    seed=42,
)

print(f"Delta: {result.delta:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
print(f"p-value: {result.p_value:.4f}")
```
> **Note:** `compare_systems` auto-selects the test based on `score_type`: McNemar's for binary, Wilcoxon for ordinal, paired bootstrap for continuous. It always uses paired tests, which are more powerful for LLM evaluation.
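For the binary case, McNemar's test only looks at discordant pairs — examples where exactly one of the two systems is correct. A self-contained sketch of the exact (binomial) version, as an illustration of the idea rather than the library's `mcnemars_test`:

```python
from math import comb

def mcnemar_exact(baseline_correct, candidate_correct):
    """Exact McNemar's test: two-sided binomial test on discordant pairs."""
    b = sum(1 for x, y in zip(baseline_correct, candidate_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, candidate_correct) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    # Under H0, discordant pairs split 50/50; two-sided tail of Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

baseline = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
candidate = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
p_value = mcnemar_exact(baseline, candidate)
```

Note how the eight concordant pairs contribute nothing: with only two discordant pairs, even a candidate that wins both cannot reach significance.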
### CI/CD Quality Gate
Use confidence intervals to make pass/fail decisions in your deployment pipeline. This avoids the false confidence of point estimates:
```python
import sys

from latent.stats import analyze, threshold_gate

report = analyze(
    scores={"accuracy": acc_scores, "safety": safety_scores},
    score_types={"accuracy": "binary", "safety": "binary"},
)

for metric in report.metrics:
    gate = threshold_gate(
        metric,
        threshold={"accuracy": 0.85, "safety": 0.95}[metric.name],
        strictness="lower_ci",  # pass only if the lower bound of the CI exceeds the threshold
    )
    if not gate.passed:
        print(f"BLOCKED: {gate.metric_name} = {gate.actual_value:.3f} < {gate.threshold}")
        sys.exit(1)
```
> **Warning:** Point-estimate gating (`strictness="point_estimate"`) is available but discouraged. A model that scores 0.86 on 50 examples could easily have a true accuracy below 0.80. Use `strictness="lower_ci"` to gate on statistical confidence.
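The warning's numbers are easy to verify. A quick sketch of the standard Wilson score interval (the textbook formula, not the library's `wilson_ci`) shows that 43/50 correct — 0.86 observed — has a 95% lower bound well under 0.80:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_interval(43, 50)  # observed accuracy 0.86; lo is roughly 0.74
```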
### Bias-Corrected Judge Scores
When using an LLM-as-judge, correct for systematic bias using prediction-powered inference (PPI). Provide a small set of human-labeled examples alongside the full set of judge scores:
```python
from latent.stats import ppi_mean

result = ppi_mean(
    judge_scores=all_judge_scores,          # N judge labels (large)
    calibration_judge=judge_subset_scores,  # n judge labels on the human-labeled subset
    calibration_human=human_subset_scores,  # n human labels (small)
)

print(f"Corrected mean: {result.point_estimate:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
```
> **Tip:** You only need 50-200 human labels to meaningfully correct bias across thousands of judge evaluations. The key requirement is that the human-labeled subset is randomly sampled from the same distribution as the full evaluation set.
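The classical PPI mean estimator is simple to state: the judge mean over all N examples, plus the mean human-minus-judge residual on the calibration subset. A hedged sketch with a normal-approximation CI (simplified; our own function name, and without the tuning factor some PPI variants add):

```python
import numpy as np

def ppi_mean_simple(judge_all, judge_cal, human_cal, z=1.96):
    """PPI point estimate and normal-approximation CI for a mean."""
    judge_all = np.asarray(judge_all, dtype=float)
    residuals = np.asarray(human_cal, dtype=float) - np.asarray(judge_cal, dtype=float)
    # Correct the judge mean by the average human-judge disagreement.
    estimate = judge_all.mean() + residuals.mean()
    # Variance splits into a judge term over N and a residual term over n;
    # the better the judge tracks humans, the smaller the residual term.
    se = np.sqrt(judge_all.var(ddof=1) / len(judge_all)
                 + residuals.var(ddof=1) / len(residuals))
    return estimate, estimate - z * se, estimate + z * se

judge_all = [4, 5, 3, 4, 5, 4, 3, 5]   # large set: judge labels only
judge_cal = [4, 5, 3, 4]               # judge labels on the human-labeled subset
human_cal = [3, 4, 3, 4]               # matching human labels: judge overrates by 0.5
est, lo, hi = ppi_mean_simple(judge_all, judge_cal, human_cal)
```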
## Next Steps
Dive deeper into the individual modules:
- Core Primitives -- Bootstrap, Wilson, Bayesian, permutation tests
- Classification & Ordinal -- Classification metrics, structured output, ordinal scores
- System Comparison -- A/B testing, gating, effect sizes
- Bias Correction -- PPI, calibration, inter-judge agreement
- Conversation Metrics -- Turn-level scoring, SOP compliance, trajectories
- Drift & Sampling -- Drift detection, stratified sampling
- Reporting -- analyze(), markdown reports, MLflow integration