Core Statistical Primitives

The building blocks for computing confidence intervals, comparing systems, and quantifying effect sizes. All functions return structured Pydantic models (MetricResult, ComparisonResult, PowerAnalysis) that integrate with the rest of latent.stats.

from latent.stats import (
    bootstrap_ci, paired_bootstrap_ci,
    wilson_ci, beta_binomial_posterior,
    permutation_test, post_hoc_power,
    cohens_d, odds_ratio, risk_ratio, common_language_effect_size,
)

Bootstrap Confidence Intervals

bootstrap_ci(
    scores: np.ndarray,
    statistic: Callable = np.mean,
    n_resamples: int = 10_000,
    confidence_level: float = 0.95,
    seed: int | None = None,
) -> MetricResult

Computes a BCa (bias-corrected and accelerated) bootstrap confidence interval for any statistic. BCa adjusts for both bias and skewness in the bootstrap distribution, producing more accurate intervals than the basic percentile method.

Returns a MetricResult with point_estimate, ci_lower, ci_upper, method="bootstrap_bca", and sample_size.

import numpy as np
from latent.stats import bootstrap_ci

f1_scores = np.array([0.82, 0.79, 0.85, 0.88, 0.76, 0.91, 0.84, 0.80, 0.87, 0.83,
                       0.78, 0.86, 0.81, 0.90, 0.77, 0.84, 0.89, 0.82, 0.85, 0.80])

result = bootstrap_ci(f1_scores, seed=42)
print(f"Mean F1: {result.point_estimate:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
# Mean F1: 0.834
# 95% CI: [0.806, 0.860]

When to use

Use bootstrap BCa as your default for any continuous or binary metric with n >= 20. It works for means, medians, quantiles, or any custom statistic you pass via the statistic parameter.
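The BCa machinery aside, the core resampling loop is easy to picture. Here is a minimal percentile-bootstrap sketch (not the library's BCa implementation, which additionally corrects for bias and skew) for a custom statistic, the median, on synthetic scores:

```python
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(0.8, 0.05, size=200)          # stand-in for a real eval metric

# Resample with replacement, recomputing the statistic for each resample
idx = rng.integers(0, len(scores), size=(10_000, len(scores)))
boot_medians = np.median(scores[idx], axis=1)

lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"Median: {np.median(scores):.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```

With bootstrap_ci you get the same effect by passing statistic=np.median, plus the BCa corrections.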


Paired Bootstrap

paired_bootstrap_ci(
    scores_a: np.ndarray,
    scores_b: np.ndarray,
    statistic: Callable = np.mean,
    n_resamples: int = 10_000,
    confidence_level: float = 0.95,
    seed: int | None = None,
) -> ComparisonResult

Compares two systems evaluated on the same eval set (paired samples). Bootstraps the per-item difference to produce a delta, confidence interval, and two-sided p-value.

Returns a ComparisonResult with delta, ci_lower, ci_upper, p_value, method="paired_bootstrap", and a human-readable interpretation.

from latent.stats import paired_bootstrap_ci

model_a = np.array([0.82, 0.79, 0.85, 0.88, 0.76, 0.91, 0.84, 0.80, 0.87, 0.83])
model_b = np.array([0.85, 0.83, 0.87, 0.90, 0.81, 0.92, 0.86, 0.84, 0.89, 0.86])

result = paired_bootstrap_ci(model_a, model_b, seed=42)
print(f"Delta (B - A): {result.delta:+.3f}")
print(f"95% CI: [{result.ci_lower:+.3f}, {result.ci_upper:+.3f}]")
print(f"p-value: {result.p_value:.4f}")
print(result.interpretation)
# Delta (B - A): +0.028
# 95% CI: [+0.020, +0.036]
# p-value: 0.0002
# System B scores 0.028 higher than A (statistically significant, p=0.0002)

Paired means same items

Both arrays must be the same length, with scores_a[i] and scores_b[i] corresponding to the same eval example. A ValueError is raised if lengths differ.
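Pairing is what makes the comparison powerful: items are resampled, not systems, so each item's A and B scores stay together. A percentile sketch of the resampling idea (the library's BCa corrections and p-value computation are omitted), reusing the arrays from the example above:

```python
import numpy as np

model_a = np.array([0.82, 0.79, 0.85, 0.88, 0.76, 0.91, 0.84, 0.80, 0.87, 0.83])
model_b = np.array([0.85, 0.83, 0.87, 0.90, 0.81, 0.92, 0.86, 0.84, 0.89, 0.86])

rng = np.random.default_rng(42)
diffs = model_b - model_a                         # per-item differences
idx = rng.integers(0, len(diffs), size=(10_000, len(diffs)))
boot_deltas = diffs[idx].mean(axis=1)             # resample items, pairs stay intact

lo, hi = np.percentile(boot_deltas, [2.5, 97.5])
print(f"Delta: {diffs.mean():+.3f}, 95% CI: [{lo:+.3f}, {hi:+.3f}]")
```

Because every per-item difference here is positive, the entire bootstrap distribution sits above zero, which is why the p-value bottoms out near its floor.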


Wilson Score Intervals

wilson_ci(
    successes: int,
    total: int,
    confidence_level: float = 0.95,
) -> MetricResult

Computes the Wilson score interval for a binomial proportion. Unlike the normal approximation (p +/- z*sqrt(p(1-p)/n)), the Wilson interval is asymmetric and never produces bounds outside [0, 1]. This matters most when the proportion is near 0% or 100%.

from latent.stats import wilson_ci

# 97 out of 100 tests passed
result = wilson_ci(successes=97, total=100)
print(f"Pass rate: {result.point_estimate:.1%}")
print(f"95% CI: [{result.ci_lower:.1%}, {result.ci_upper:.1%}]")
# Pass rate: 97.0%
# 95% CI: [91.5%, 99.0%]

Why not the normal approximation?

With 97/100 successes, the normal approximation gives a CI of [93.7%, 100.3%] -- the upper bound exceeds 100%. Wilson gives [91.5%, 99.0%], which is both valid and better calibrated.
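The asymmetry falls straight out of the formula. A stdlib-only sketch of the Wilson computation, reproducing the 97/100 interval above:

```python
from math import sqrt
from statistics import NormalDist

s, n = 97, 100
z = NormalDist().inv_cdf(0.975)                   # ~1.96 for a 95% interval
p = s / n

# Wilson score interval: recentered and shrunk toward 0.5
center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
print(f"[{center - half:.3f}, {center + half:.3f}]")   # [0.915, 0.990]
```

Note the interval is not centered on 0.97: the Wilson center pulls slightly toward 0.5, which is exactly what keeps the bounds inside [0, 1].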


Bayesian Estimation

beta_binomial_posterior(
    successes: int,
    total: int,
    prior_alpha: float = 1.0,
    prior_beta: float = 1.0,
    confidence_level: float = 0.95,
) -> MetricResult

Computes the posterior mean and credible interval from a Beta-Binomial model. The default prior_alpha=1.0, prior_beta=1.0 is a uniform (uninformative) prior. Pass informative priors from previous eval runs to regularize small-sample estimates.

The method field in the result is "bayesian_uninformative" when using default priors, or "bayesian_empirical" when custom priors are provided.

from latent.stats import beta_binomial_posterior

# Small sample: 8 out of 10 passed, with prior from previous run (alpha=20, beta=5)
result = beta_binomial_posterior(
    successes=8, total=10,
    prior_alpha=20.0, prior_beta=5.0,
)
print(f"Posterior mean: {result.point_estimate:.3f}")
print(f"95% credible interval: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
# Posterior mean: 0.800
# 95% credible interval: [0.668, 0.901]
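The conjugate update behind this is simple arithmetic: add the successes to alpha and the failures to beta. A sketch verifying the posterior mean above (the credible interval requires Beta quantiles, which the library computes for you):

```python
prior_alpha, prior_beta = 20.0, 5.0
successes, total = 8, 10

post_alpha = prior_alpha + successes              # 28.0
post_beta = prior_beta + (total - successes)      # 7.0
post_mean = post_alpha / (post_alpha + post_beta)
print(f"Posterior mean: {post_mean:.3f}")         # 0.800
```

Here the raw rate (8/10 = 0.80) and the prior mean (20/25 = 0.80) happen to agree, so the posterior mean is also 0.80; when they disagree, the posterior lands between them, weighted by the prior's effective sample size (alpha + beta).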

Getting priors from previous runs

Use get_empirical_prior(metric_name, experiment_name) from latent.stats.bayesian to automatically retrieve Beta prior parameters from the most recent MLflow run containing that metric.

from latent.stats.bayesian import get_empirical_prior, beta_binomial_posterior

alpha, beta = get_empirical_prior("pass_rate", experiment_name="my_eval")
result = beta_binomial_posterior(successes=8, total=10, prior_alpha=alpha, prior_beta=beta)

Permutation Test

permutation_test(
    scores_a: np.ndarray,
    scores_b: np.ndarray,
    statistic: Callable = np.mean,
    n_permutations: int = 10_000,
    seed: int | None = None,
) -> ComparisonResult

A paired sign-flip permutation test for comparing two systems. Under the null hypothesis, the assignment of each difference to positive or negative is random. Computes a two-sided p-value from the permutation distribution.

Best for small samples (n < 50) where bootstrap may be unreliable. The p-value is never exactly zero -- it is floored at 1 / n_permutations.

from latent.stats import permutation_test

# 30 samples each -- too small for reliable bootstrap
system_a = np.random.default_rng(42).normal(0.75, 0.1, size=30)
system_b = np.random.default_rng(43).normal(0.78, 0.1, size=30)

result = permutation_test(system_a, system_b, seed=42)
print(f"Delta: {result.delta:+.3f}")
print(f"p-value: {result.p_value:.4f}")
print(result.interpretation)

How it differs from bootstrap

Permutation tests make fewer distributional assumptions than bootstrap. They test a sharp null (no difference at all) rather than constructing a confidence interval. Use bootstrap when you need a CI; use permutation when you need a reliable p-value with small n.
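The sign-flip recipe is only a few lines of numpy. A sketch on illustrative data (the library's version follows the same scheme, including the 1 / n_permutations floor):

```python
import numpy as np

rng = np.random.default_rng(42)
diffs = rng.normal(0.03, 0.05, size=30)           # per-item B - A differences
observed = diffs.mean()

# Null hypothesis: each difference's sign is a fair coin flip
n_perm = 10_000
signs = rng.choice([-1.0, 1.0], size=(n_perm, len(diffs)))
perm_stats = (signs * diffs).mean(axis=1)

exceed = int((np.abs(perm_stats) >= abs(observed)).sum())
p_value = max(exceed, 1) / n_perm                 # floored at 1 / n_permutations
print(f"Delta: {observed:+.3f}, p = {p_value:.4f}")
```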


Power Analysis

post_hoc_power(
    sample_size: int,
    observed_effect: float | None = None,
    baseline_rate: float = 0.5,
    alpha: float = 0.05,
    power: float = 0.80,
) -> PowerAnalysis

Answers the question: "Was my sample large enough to detect the effect I observed?" Computes the minimum detectable effect (MDE) for a two-sample proportion test at the given power level. If observed_effect is provided and falls below the MDE, a warning is included.

Returns a PowerAnalysis with sample_size, mde, power, alpha, and an optional warning string.

from latent.stats import post_hoc_power

result = post_hoc_power(sample_size=100, observed_effect=0.03, baseline_rate=0.80)
print(f"MDE at 80% power: {result.mde:.3f}")
if result.warning:
    print(f"Warning: {result.warning}")
# MDE at 80% power: 0.157
# Warning: Observed effect (0.030) is smaller than the minimum detectable effect
#   (0.157) at 80% power. The study is underpowered to detect this effect size.

Post-hoc power is for planning, not interpreting

A non-significant result with low power does not confirm the null. Use the MDE to plan your next eval's sample size: if you need to detect a 5% difference, you need enough samples for mde <= 0.05.
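For that planning step, the standard normal-approximation sample-size formula for a two-proportion test can be computed with the stdlib alone. A sketch (this is one common formula; the library's MDE calculation may differ in detail, so treat the result as a planning estimate):

```python
from math import ceil
from statistics import NormalDist

p1, p2 = 0.80, 0.85                               # baseline rate, target rate (+5%)
alpha, power = 0.05, 0.80

z_a = NormalDist().inv_cdf(1 - alpha / 2)         # ~1.96
z_b = NormalDist().inv_cdf(power)                 # ~0.84

# Samples per group to detect p2 - p1 at the given alpha and power
n = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(f"~{ceil(n)} samples per system")
```

Detecting a 5-point difference at an 80% baseline takes roughly 900 samples per system, which is why the n=100 study above is underpowered for a 3-point effect.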


Effect Sizes

Effect sizes quantify the magnitude of a difference, independent of sample size. Always report them alongside p-values.

cohens_d(scores_a, scores_b) -> float

Standardized mean difference using pooled standard deviation. Positive values mean B > A.

Magnitude   Cohen's d
Small       0.2
Medium      0.5
Large       0.8

from latent.stats import cohens_d

d = cohens_d(model_a_scores, model_b_scores)
print(f"Cohen's d: {d:.2f}")  # e.g. 0.45 = medium effect
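A sketch of the arithmetic, assuming the pooled-SD definition given above and reusing the two score arrays from the paired bootstrap example:

```python
import numpy as np

a = np.array([0.82, 0.79, 0.85, 0.88, 0.76, 0.91, 0.84, 0.80, 0.87, 0.83])
b = np.array([0.85, 0.83, 0.87, 0.90, 0.81, 0.92, 0.86, 0.84, 0.89, 0.86])

na, nb = len(a), len(b)
pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                    / (na + nb - 2))
d = (b.mean() - a.mean()) / pooled_sd
print(f"Cohen's d: {d:.2f}")                      # 0.71
```

A raw difference of just 0.028 comes out as d ~ 0.71 because the scores vary so little; effect sizes are relative to the spread of the data.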

odds_ratio(a_success, a_total, b_success, b_total) -> float

Odds ratio for binary outcomes. Values > 1 mean B has higher odds of success. Includes a 0.5 continuity correction to handle zero cells.

from latent.stats import odds_ratio

# Model A: 85/100 pass, Model B: 92/100 pass
or_val = odds_ratio(85, 100, 92, 100)
print(f"Odds ratio: {or_val:.2f}")  # B has ~2x the odds of passing
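The arithmetic, assuming the continuity correction adds 0.5 to all four cells of the 2x2 table (the Haldane-Anscombe convention), on the same 85/100 vs 92/100 example:

```python
a_s, a_n = 85, 100                                # model A: successes, total
b_s, b_n = 92, 100                                # model B: successes, total

# Add 0.5 to each cell so zero failures cannot produce a division by zero
odds_a = (a_s + 0.5) / (a_n - a_s + 0.5)          # 85.5 / 15.5
odds_b = (b_s + 0.5) / (b_n - b_s + 0.5)          # 92.5 / 8.5
print(f"Odds ratio: {odds_b / odds_a:.2f}")       # 1.97
```

Without the correction the ratio would be ~2.03; the correction nudges it slightly toward 1, which matters most when a cell is zero or near zero.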

risk_ratio(a_success, a_total, b_success, b_total) -> float

Relative risk for binary outcomes. Values > 1 mean B has a higher success rate. More interpretable than odds ratio when the base rate is not rare.

from latent.stats import risk_ratio

rr = risk_ratio(85, 100, 92, 100)
print(f"Risk ratio: {rr:.2f}")  # B's pass rate is 1.08x A's

common_language_effect_size(scores_a, scores_b) -> float

The probability that a randomly drawn score from B exceeds a randomly drawn score from A. Returns a value in [0, 1] where 0.5 means no difference. Easy to explain to non-statisticians.

from latent.stats import common_language_effect_size

cles = common_language_effect_size(model_a_scores, model_b_scores)
print(f"P(B > A): {cles:.0%}")  # e.g. "68% of the time, B beats A"
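The statistic is just a pairwise comparison over all (A, B) score pairs. A sketch reusing the arrays from the paired bootstrap example, with ties split evenly (a common convention; the library's tie handling is assumed, not confirmed, here):

```python
import numpy as np

a = np.array([0.82, 0.79, 0.85, 0.88, 0.76, 0.91, 0.84, 0.80, 0.87, 0.83])
b = np.array([0.85, 0.83, 0.87, 0.90, 0.81, 0.92, 0.86, 0.84, 0.89, 0.86])

wins = (b[:, None] > a[None, :]).mean()           # fraction of pairs where B beats A
ties = (b[:, None] == a[None, :]).mean()          # exact ties, credited half each
cles = wins + 0.5 * ties
print(f"P(B > A): {cles:.0%}")                    # 69%
```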

Which effect size to use

Scenario                          Effect size              Why
Continuous scores, two systems    Cohen's d                Standard, widely understood
Binary outcomes, moderate rates   Risk ratio               Intuitive ("B is 1.2x more likely to pass")
Binary outcomes, rare events      Odds ratio               Stable when the base rate is very low
Explaining to stakeholders        Common language (CLES)   "B beats A X% of the time"

Method Selection Guide

Choosing the right method depends on your sample size, metric type, and what you need to report.

Situation                           Recommended method        Function
Continuous metric, n >= 20          Bootstrap BCa (default)   bootstrap_ci
Comparing two systems, n >= 50      Paired bootstrap          paired_bootstrap_ci
Comparing two systems, n < 50       Permutation test          permutation_test
Binary proportion (pass/fail)       Wilson score interval     wilson_ci
Binary, small sample, have priors   Bayesian beta-binomial    beta_binomial_posterior
"Was my sample big enough?"         Post-hoc power analysis   post_hoc_power

Rules of thumb

  1. Always report effect sizes alongside p-values. A significant p-value with a tiny effect size is not actionable.
  2. Bootstrap BCa is the default -- it handles skewed distributions and works for any statistic (mean, median, quantiles, custom).
  3. Use permutation tests for small n -- below 50 samples, bootstrap CIs can have poor coverage.
  4. Use Wilson for binary metrics -- the normal approximation fails near 0% and 100%.
  5. Use Bayesian when you have prior information -- especially for small samples where the prior meaningfully regularizes the estimate.