Statistical Analysis

Rigorous statistical analysis for LLM evaluation pipelines

latent.stats provides production-grade statistical primitives for evaluating LLM systems. It covers everything from confidence intervals to drift detection, and it is designed for the unique challenges of LLM evaluation: ordinal rubrics, multi-judge agreement, conversation-level metrics, and bias correction for automated judges.

Quick Start

Installation

pip install latent

Or with uv (recommended):

uv add latent

Basic Usage

import numpy as np
from latent.stats import analyze

scores = np.array([4, 5, 3, 5, 4, 5, 4, 3, 5, 4])
report = analyze(scores={"quality": scores}, score_types={"quality": "ordinal"})

for m in report.metrics:
    print(f"{m.name}: {m.point_estimate:.3f} [{m.ci_lower:.3f}, {m.ci_upper:.3f}]")

Tip

analyze() is the high-level entry point that picks the right statistical methods based on your score types. For fine-grained control, use the individual functions directly.
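The confidence intervals printed above come from resampling. As a point of reference, the percentile bootstrap can be sketched in plain NumPy; this is a generic illustration of the method (the function name and defaults here are illustrative, not the library's internals):

```python
import numpy as np

def percentile_bootstrap_ci(scores, stat=np.mean, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, take empirical quantiles."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    n = len(scores)
    # Each row of idx selects one resampled dataset of the original size.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_stats = stat(scores[idx], axis=1)
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return stat(scores), lo, hi

scores = np.array([4, 5, 3, 5, 4, 5, 4, 3, 5, 4])
point, lo, hi = percentile_bootstrap_ci(scores)
print(f"mean: {point:.3f} [{lo:.3f}, {hi:.3f}]")
```

Note how wide the interval is for only ten scores; that width is exactly the information a bare point estimate hides.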

Key Capabilities

| Capability | Functions | Use Case |
| --- | --- | --- |
| Bootstrap CIs | bootstrap_ci, paired_bootstrap_ci | Confidence intervals for any statistic |
| Wilson Intervals | wilson_ci | Binary proportions near 0% or 100% |
| System Comparison | compare_systems, mcnemars_test | A/B testing two LLM systems |
| Classification | classification_metrics, per_class_metrics | Intent detection, routing accuracy |
| Ordinal Scores | ordinal_distribution, binarize | 1-5 rubric scores |
| Bias Correction | ppi_mean, stratified_ppi | Correcting automated judge bias |
| Inter-Judge Agreement | cohens_kappa, fleiss_kappa, krippendorffs_alpha | Multi-judge reliability |
| Conversation Metrics | turn_level_scores, turn_position_analysis | Multi-turn evaluation |
| SOP Compliance | phase_completion_score, sequence_compliance | Process adherence |
| Drift Detection | detect_drift, drift_report | Monitoring score degradation |
| Quality Gates | threshold_gate, non_inferiority_test | CI/CD pass/fail decisions |
| Reporting | render_markdown, log_to_mlflow | Human-readable reports, experiment tracking |
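To make the agreement row concrete, Cohen's kappa for two judges reduces to a few lines of arithmetic: observed agreement, corrected for the agreement two independent judges would reach by chance. A minimal hand-rolled sketch of that formula (illustrating the statistic, not the cohens_kappa implementation):

```python
import numpy as np

def two_judge_kappa(a, b):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    a, b = np.asarray(a), np.asarray(b)
    labels = np.union1d(a, b)
    p_o = np.mean(a == b)  # observed agreement rate
    # Chance agreement: product of each judge's marginal label frequencies.
    p_e = sum(np.mean(a == l) * np.mean(b == l) for l in labels)
    return (p_o - p_e) / (1 - p_e)

judge_a = [1, 1, 0, 1, 0, 1, 1, 0]
judge_b = [1, 0, 0, 1, 0, 1, 1, 1]
print(f"kappa: {two_judge_kappa(judge_a, judge_b):.3f}")
```

Here the judges agree on 6 of 8 items (75%), but because both label "1" most of the time, much of that agreement is expected by chance, so kappa lands well below 0.75.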

Common Patterns

Comparing Two Systems

Run a paired comparison between a baseline and a candidate model, with bootstrap confidence intervals on the difference:

from latent.stats import compare_systems

result = compare_systems(
    baseline_scores,
    candidate_scores,
    score_type="continuous",
    seed=42,
)

print(f"Delta: {result.delta:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
print(f"p-value: {result.p_value:.4f}")

Note

compare_systems auto-selects the test based on score_type: McNemar's test for binary scores, the Wilcoxon signed-rank test for ordinal scores, and a paired bootstrap for continuous scores. All of these are paired tests: because both systems are evaluated on the same examples, pairing cancels per-example difficulty and gives more statistical power than unpaired alternatives.
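The paired bootstrap for the continuous case resamples whole examples so each baseline/candidate pair stays together. A minimal sketch of the idea (an illustration of the technique, not compare_systems itself; the data here are synthetic):

```python
import numpy as np

def paired_bootstrap_delta(baseline, candidate, n_boot=10_000, alpha=0.05, seed=42):
    """Bootstrap CI on the mean per-example difference; pairs are never split."""
    diffs = np.asarray(candidate) - np.asarray(baseline)
    rng = np.random.default_rng(seed)
    # Resample example indices, which resamples (baseline, candidate) pairs jointly.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_deltas = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_deltas, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lo, hi

rng = np.random.default_rng(0)
baseline = rng.normal(0.70, 0.10, size=50)
candidate = baseline + rng.normal(0.05, 0.03, size=50)  # small but consistent lift
delta, lo, hi = paired_bootstrap_delta(baseline, candidate)
print(f"delta: {delta:.3f} [{lo:.3f}, {hi:.3f}]")
```

Because the per-example lift is consistent, the CI on the difference excludes zero even though the two marginal score distributions overlap heavily; an unpaired test on the same data would be far less conclusive.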

CI/CD Quality Gate

Use confidence intervals to make pass/fail decisions in your deployment pipeline. This avoids the false confidence of point estimates:

import sys
from latent.stats import analyze, threshold_gate

report = analyze(
    scores={"accuracy": acc_scores, "safety": safety_scores},
    score_types={"accuracy": "binary", "safety": "binary"},
)

for metric in report.metrics:
    gate = threshold_gate(
        metric,
        threshold={"accuracy": 0.85, "safety": 0.95}[metric.name],
        strictness="lower_ci",  # pass only if lower bound of CI exceeds threshold
    )
    if not gate.passed:
        print(f"BLOCKED: {gate.metric_name} = {gate.actual_value:.3f} < {gate.threshold}")
        sys.exit(1)

Warning

Point-estimate gating (strictness="point_estimate") is available but discouraged. A model that scores 0.86 on 50 examples could easily have a true accuracy below 0.80. Use strictness="lower_ci" to gate on statistical confidence.
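The Warning's arithmetic is easy to verify: the Wilson score lower bound for 43/50 correct (a 0.86 point estimate) falls well below 0.80. A hand-rolled version of the standard Wilson formula (a self-contained check, not the library's wilson_ci function):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% for z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_interval(43, 50)  # point estimate: 43/50 = 0.86
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```

With only 50 examples the lower bound is roughly 0.74, so a lower_ci gate at 0.80 correctly blocks a deployment that a point-estimate gate would wave through.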

Bias-Corrected Judge Scores

When using an LLM-as-judge, correct for systematic bias using prediction-powered inference (PPI). Provide a small set of human-labeled examples alongside the full set of judge scores:

from latent.stats import ppi_mean

result = ppi_mean(
    judge_scores=all_judge_scores,           # N judge labels (large)
    calibration_judge=judge_subset_scores,   # n judge labels on the human-labeled subset
    calibration_human=human_subset_scores,   # n human labels (small)
)

print(f"Corrected mean: {result.point_estimate:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")

Tip

You only need 50-200 human labels to meaningfully correct bias across thousands of judge evaluations. The key requirement is that the human-labeled subset is sampled uniformly at random from the same distribution as the full evaluation set; a hand-picked or skewed subset will miscalibrate the correction.
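The core of the PPI mean estimator is simple: take the judge's mean over everything, then subtract the judge's average error as measured on the human-labeled subset. A minimal sketch of that estimator on synthetic data (illustrating the method, not ppi_mean's internals; the helper name is ours):

```python
import numpy as np

def ppi_mean_estimate(judge_all, judge_cal, human_cal, z=1.96):
    """PPI mean: judge mean over N examples plus the n-example bias correction."""
    judge_all, judge_cal, human_cal = map(np.asarray, (judge_all, judge_cal, human_cal))
    rectifier = human_cal - judge_cal            # per-example judge error
    theta = judge_all.mean() + rectifier.mean()  # bias-corrected estimate
    # Uncertainty combines the large judge sample with the small calibration sample.
    se = np.sqrt(judge_all.var(ddof=1) / len(judge_all)
                 + rectifier.var(ddof=1) / len(rectifier))
    return theta, theta - z * se, theta + z * se

rng = np.random.default_rng(7)
truth = rng.normal(0.70, 0.10, size=2000)                # unobserved true quality
judge_all = truth + 0.08 + rng.normal(0, 0.05, 2000)     # judge biased upward by 0.08
cal_idx = rng.choice(2000, size=100, replace=False)      # random human-labeled subset
theta, lo, hi = ppi_mean_estimate(judge_all, judge_all[cal_idx], truth[cal_idx])
print(f"raw judge mean: {judge_all.mean():.3f}  corrected: {theta:.3f} [{lo:.3f}, {hi:.3f}]")
```

The raw judge mean sits near 0.78, while the corrected estimate recovers the true mean near 0.70 using only 100 human labels, which is exactly the leverage the tip describes.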

Next Steps

Dive deeper into the individual modules: