# Statistical Analysis

Rigorous statistical analysis for LLM evaluation pipelines.

`latent.stats` provides production-grade statistical primitives for evaluating LLM systems, from confidence intervals to drift detection, designed for the unique challenges of LLM evaluation: ordinal rubrics, multi-judge agreement, conversation-level metrics, and bias correction for automated judges.
## Quick Start
### Installation

Install with pip, or with uv (recommended).
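Assuming the distribution is published under the name `latent` (matching the `latent.stats` import path — check your package index if it differs):

```bash
# With pip (assumed package name):
pip install latent

# Or with uv (recommended):
uv add latent
```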
### Basic Usage
```python
import numpy as np
from latent.stats import analyze

scores = np.array([4, 5, 3, 5, 4, 5, 4, 3, 5, 4])
report = analyze(scores={"quality": scores}, score_types={"quality": "ordinal"})

for m in report.metrics:
    print(f"{m.name}: {m.point_estimate:.3f} [{m.ci_lower:.3f}, {m.ci_upper:.3f}]")
```
> **Tip:** `analyze()` is the high-level entry point that picks the right statistical methods based on your score types. For fine-grained control, use the individual functions directly.
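To build intuition for what a bootstrap CI like the one behind `bootstrap_ci` reports, here is a minimal percentile-bootstrap sketch in plain NumPy. This is an illustration of the technique, not the library's implementation (the function name `percentile_bootstrap_ci` is ours):

```python
import numpy as np

def percentile_bootstrap_ci(scores, stat=np.mean, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample with replacement, take empirical quantiles."""
    rng = np.random.default_rng(seed)
    # Each row of idx is one resampled dataset of the same size as the original.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_stats = stat(scores[idx], axis=1)
    lower, upper = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return stat(scores), lower, upper

scores = np.array([4, 5, 3, 5, 4, 5, 4, 3, 5, 4])
point, lo, hi = percentile_bootstrap_ci(scores)
```

The same resampling idea works for any statistic (median, trimmed mean, pass rate) by swapping the `stat` argument.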
## Key Capabilities
| Capability | Functions | Use Case |
|---|---|---|
| Bootstrap CIs | `bootstrap_ci`, `paired_bootstrap_ci` | Confidence intervals for any statistic |
| Wilson Intervals | `wilson_ci` | Binary proportions near 0% or 100% |
| System Comparison | `compare_systems`, `mcnemars_test` | A/B testing two LLM systems |
| Classification | `classification_metrics`, `per_class_metrics` | Intent detection, routing accuracy |
| Ordinal Scores | `ordinal_distribution`, `binarize` | 1-5 rubric scores |
| Bias Correction | `ppi_mean`, `stratified_ppi` | Correcting automated judge bias |
| Inter-Judge Agreement | `cohens_kappa`, `fleiss_kappa`, `krippendorffs_alpha` | Multi-judge reliability |
| Conversation Metrics | `turn_level_scores`, `turn_position_analysis` | Multi-turn evaluation |
| SOP Compliance | `phase_completion_score`, `sequence_compliance` | Process adherence |
| Drift Detection | `detect_drift`, `drift_report` | Monitoring score degradation |
| Quality Gates | `threshold_gate`, `non_inferiority_test` | CI/CD pass/fail decisions |
| Reporting | `render_markdown`, `log_to_mlflow` | Human-readable reports, experiment tracking |
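As a concrete example from the agreement row: Cohen's kappa corrects two judges' raw percent agreement for the agreement expected by chance. A minimal NumPy sketch of the unweighted statistic (illustrative only; the library's `cohens_kappa` may support weighting and handle edge cases differently):

```python
import numpy as np

def cohens_kappa_simple(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    labels = np.union1d(a, b)
    p_observed = np.mean(a == b)
    # Chance agreement: sum over labels of the product of marginal frequencies.
    p_chance = sum(np.mean(a == k) * np.mean(b == k) for k in labels)
    return (p_observed - p_chance) / (1 - p_chance)

judge_a = [1, 1, 0, 1, 0, 1, 1, 0]
judge_b = [1, 1, 0, 0, 0, 1, 1, 1]
kappa = cohens_kappa_simple(judge_a, judge_b)  # well below the raw 75% agreement
```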
## Common Patterns
### Comparing Two Systems
Run a paired comparison between a baseline and a candidate model, with bootstrap confidence intervals on the difference:
```python
from latent.stats import compare_systems

result = compare_systems(
    baseline_scores,
    candidate_scores,
    score_type="continuous",
    seed=42,
)

print(f"Delta: {result.delta:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
print(f"p-value: {result.p_value:.4f}")
```
> **Note:** `compare_systems` auto-selects the test based on `score_type`: McNemar's for binary, Wilcoxon for ordinal, paired bootstrap for continuous. It always uses paired tests, which are more powerful for LLM evaluation.
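For the binary case, McNemar's test only looks at discordant pairs — examples where exactly one of the two systems is correct. A self-contained sketch of the exact (binomial) version, as an illustration of the idea rather than the library's `mcnemars_test`:

```python
from math import comb

def mcnemar_exact(baseline_correct, candidate_correct):
    """Exact McNemar's test: two-sided binomial test on discordant pairs."""
    b = sum(1 for x, y in zip(baseline_correct, candidate_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, candidate_correct) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    # Under H0, discordant pairs split 50/50; two-sided tail of Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

baseline = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
candidate = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
p_value = mcnemar_exact(baseline, candidate)
```

Note how the eight concordant pairs contribute nothing: with only two discordant pairs, even a candidate that wins both cannot reach significance.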
### CI/CD Quality Gate
Use confidence intervals to make pass/fail decisions in your deployment pipeline. This avoids the false confidence of point estimates:
```python
import sys

from latent.stats import analyze, threshold_gate

report = analyze(
    scores={"accuracy": acc_scores, "safety": safety_scores},
    score_types={"accuracy": "binary", "safety": "binary"},
)

for metric in report.metrics:
    gate = threshold_gate(
        metric,
        threshold={"accuracy": 0.85, "safety": 0.95}[metric.name],
        strictness="lower_ci",  # pass only if the lower bound of the CI exceeds the threshold
    )
    if not gate.passed:
        print(f"BLOCKED: {gate.metric_name} = {gate.actual_value:.3f} < {gate.threshold}")
        sys.exit(1)
```
> **Warning:** Point-estimate gating (`strictness="point_estimate"`) is available but discouraged. A model that scores 0.86 on 50 examples could easily have a true accuracy below 0.80. Use `strictness="lower_ci"` to gate on statistical confidence.
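The warning's numbers are easy to verify. A quick sketch of the standard Wilson score interval (the textbook formula, not the library's `wilson_ci`) shows that 43/50 correct — 0.86 observed — has a 95% lower bound well under 0.80:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_interval(43, 50)  # observed accuracy 0.86; lo is roughly 0.74
```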
### Bias-Corrected Judge Scores
When using an LLM-as-judge, correct for systematic bias using prediction-powered inference (PPI). Provide a small set of human-labeled examples alongside the full set of judge scores:
```python
from latent.stats import ppi_mean

result = ppi_mean(
    judge_scores=all_judge_scores,          # N judge labels (large)
    calibration_judge=judge_subset_scores,  # n judge labels on the human-labeled subset
    calibration_human=human_subset_scores,  # n human labels (small)
)

print(f"Corrected mean: {result.point_estimate:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
```
> **Tip:** You only need 50-200 human labels to meaningfully correct bias across thousands of judge evaluations. The key requirement is that the human-labeled subset is randomly sampled from the same distribution as the full evaluation set.
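The classical PPI mean estimator is simple to state: the judge mean over all N examples, plus the mean human-minus-judge residual on the calibration subset. A hedged sketch with a normal-approximation CI (simplified; our own function name, and without the tuning factor some PPI variants add):

```python
import numpy as np

def ppi_mean_simple(judge_all, judge_cal, human_cal, z=1.96):
    """PPI point estimate and normal-approximation CI for a mean."""
    judge_all = np.asarray(judge_all, dtype=float)
    residuals = np.asarray(human_cal, dtype=float) - np.asarray(judge_cal, dtype=float)
    # Correct the judge mean by the average human-judge disagreement.
    estimate = judge_all.mean() + residuals.mean()
    # Variance splits into a judge term over N and a residual term over n;
    # the better the judge tracks humans, the smaller the residual term.
    se = np.sqrt(judge_all.var(ddof=1) / len(judge_all)
                 + residuals.var(ddof=1) / len(residuals))
    return estimate, estimate - z * se, estimate + z * se

judge_all = [4, 5, 3, 4, 5, 4, 3, 5]   # large set: judge labels only
judge_cal = [4, 5, 3, 4]               # judge labels on the human-labeled subset
human_cal = [3, 4, 3, 4]               # matching human labels: judge overrates by 0.5
est, lo, hi = ppi_mean_simple(judge_all, judge_cal, human_cal)
```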
## Next Steps
Dive deeper into the individual modules:
- Core Primitives -- Bootstrap, Wilson, Bayesian, permutation tests
- Classification & Ordinal -- Classification metrics, structured output, ordinal scores
- System Comparison -- A/B testing, gating, effect sizes
- Bias Correction -- PPI, calibration, inter-judge agreement
- Conversation Metrics -- Turn-level scoring, SOP compliance, trajectories
- Drift & Sampling -- Drift detection, stratified sampling
- Reporting -- analyze(), markdown reports, MLflow integration