Reporting

The analyze() orchestrator, markdown summaries, and MLflow integration

latent.stats provides a single entry point -- analyze() -- that runs a full statistical analysis pipeline and returns a structured StatisticalReport. From there, render the results as markdown tables or log them to MLflow for experiment tracking.

from latent.stats import analyze, render_markdown, log_to_mlflow

The analyze() Function

analyze(
    scores: dict[str, np.ndarray],
    score_types: dict[str, str] | None = None,
    rubrics: dict[str, MetricRubric] | None = None,
    calibration_data: dict | None = None,
    comparison_scores: dict[str, np.ndarray] | None = None,
    gates: dict[str, float] | None = None,
    n_resamples: int = 10_000,
    confidence_level: float = 0.95,
    seed: int | None = None,
) -> StatisticalReport

The main entry point that orchestrates a full statistical analysis pipeline. It:

  1. Auto-detects score types if score_types is not provided (binary, ordinal, or continuous)
  2. Computes confidence intervals for each metric using the appropriate method
  3. Builds ordinal distributions when rubrics are provided
  4. Applies PPI bias correction when calibration_data is available
  5. Runs paired comparisons when comparison_scores are provided
  6. Evaluates quality gates against the thresholds in gates
  7. Explains which methods were used and why for each metric

Basic Usage

import numpy as np
from latent.stats import analyze

report = analyze(
    scores={"accuracy": np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])},
    score_types={"accuracy": "binary"},
)

for m in report.metrics:
    print(f"{m.name}: {m.point_estimate:.3f} [{m.ci_lower:.3f}, {m.ci_upper:.3f}]")

Full Pipeline

from latent.stats import analyze
from latent.stats.config import MetricRubric

report = analyze(
    scores={
        "accuracy": accuracy_scores,
        "faithfulness": faithfulness_scores,
    },
    score_types={
        "accuracy": "binary",
        "faithfulness": "ordinal",
    },
    rubrics={
        "faithfulness": MetricRubric(
            type="ordinal",
            scale=[1, 2, 3, 4, 5],
            labels={
                1: "Hallucinated",
                2: "Mostly wrong",
                3: "Partial",
                4: "Minor issues",
                5: "Faithful",
            },
            pass_threshold=4,
        ),
    },
    calibration_data={
        "judge": cal_judge_scores,
        "human": cal_human_scores,
    },
    comparison_scores={
        "accuracy": baseline_accuracy,
        "faithfulness": baseline_faithfulness,
    },
    gates={
        "accuracy": 0.85,
        "faithfulness": 3.5,
    },
    seed=42,
)

Score type auto-detection

When score_types is omitted, analyze() infers types from the data: arrays containing only 0s and 1s are classified as "binary", integer arrays with 10 or fewer unique values as "ordinal", and everything else as "continuous". Pass score_types explicitly if you need to override this heuristic.
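
The heuristic can be sketched as follows. This is an illustrative reimplementation of the documented rules only, not the library's actual code:

```python
import numpy as np

def detect_score_type(values: np.ndarray) -> str:
    """Sketch of the documented auto-detection heuristic."""
    unique = np.unique(values)
    # Only 0s and 1s -> binary
    if set(unique.tolist()) <= {0, 1}:
        return "binary"
    # Integer-valued with at most 10 distinct levels -> ordinal
    if np.issubdtype(values.dtype, np.integer) and unique.size <= 10:
        return "ordinal"
    return "continuous"

detect_score_type(np.array([0, 1, 1, 0]))        # "binary"
detect_score_type(np.array([1, 2, 3, 4, 5]))     # "ordinal"
detect_score_type(np.array([0.91, 0.34, 0.77]))  # "continuous"
```

Note the edge case this heuristic implies: a float array containing only 0.0 and 1.0 still counts as binary, while judge scores recorded as floats (e.g. 4.0 on a 1-5 scale) fall through to continuous, which is one reason to pass score_types explicitly.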

explain_method

explain_method(
    metric_name: str,
    score_type: str,
    sample_size: int,
    has_calibration: bool = False,
    is_comparison: bool = False,
) -> str

Returns a human-readable explanation of which statistical method was used for a metric and why it was chosen. analyze() calls it internally to populate the "Methods Used" section, but it is also available for custom pipelines.

from latent.stats import explain_method

print(explain_method("accuracy", "binary", sample_size=200))
# "**accuracy** (binary, n=200): Used Wilson score interval. Unlike the normal
#  approximation (p ± z·√(p(1−p)/n)), Wilson intervals are asymmetric and never
#  produce bounds outside [0, 1]. This matters most when the proportion is near
#  0% or 100%."

print(explain_method("f1", "continuous", sample_size=15, is_comparison=True))
# "**f1** (continuous, n=15): Used a permutation test because the sample size
#  is below 30, where bootstrap confidence intervals can have poor coverage.
#  Permutation tests make fewer distributional assumptions and provide reliable
#  p-values regardless of sample size."

StatisticalReport Model

analyze() returns a StatisticalReport -- a Pydantic model that holds every result from the pipeline. All fields default to empty lists or dicts, so you only get what you asked for.

class StatisticalReport(BaseModel):
    metrics: list[MetricResult] = []
    comparisons: list[ComparisonResult] = []
    gates: list[GatingResult] = []
    ordinal_distributions: list[OrdinalDistribution] = []
    drift: list[DriftResult] = []
    power_analyses: list[PowerAnalysis] = []
    method_explanations: list[str] = []
    metadata: dict[str, str] = {}

| Field | Populated when | Contains |
|-------|----------------|----------|
| metrics | Always | One MetricResult per entry in scores |
| comparisons | comparison_scores provided | One ComparisonResult per matched metric |
| gates | gates provided | One GatingResult per matched metric |
| ordinal_distributions | Ordinal metrics with rubrics | Full level-by-level breakdown |
| drift | Populated by drift_report() | Severity, delta, p-value per metric |
| power_analyses | Populated by post_hoc_power() | MDE, power, warnings |
| method_explanations | Always | One explanation per metric of the method used and why |
| metadata | User-provided or tool-generated | Free-form key-value pairs |

Because StatisticalReport is a Pydantic model, serialization is built in:

# To JSON
report.model_dump_json(indent=2)

# To dict
report.model_dump()

Markdown Reports

render_markdown(report) -> str

Produces formatted markdown tables covering every populated section of the report: metrics, comparisons, quality gates, ordinal distributions, drift results, and method explanations.

from latent.stats import analyze, render_markdown

report = analyze(scores={"f1": f1_scores})
print(render_markdown(report))

Output:

# Statistical Report

## Metrics

| metric | estimate | 95% CI | method | n |
|--------|----------|--------|--------|---|
| f1 | 0.8470 | [0.7920, 0.9010] | bootstrap_bca | 100 |

## Methods Used

- **f1** (continuous, n=100): Used BCa (bias-corrected and accelerated) bootstrap CI. BCa adjusts for both bias and skewness in the bootstrap distribution, producing more accurate intervals than the basic percentile method. It works for any statistic without distributional assumptions.

When comparisons or gates are present, additional tables are rendered automatically:

report = analyze(
    scores={"accuracy": acc_scores},
    comparison_scores={"accuracy": baseline_scores},
    gates={"accuracy": 0.85},
    seed=42,
)
md = render_markdown(report)

Output:

## Comparisons

| metric | delta | CI | p-value | effect size | method |
|--------|-------|----|---------|-------------|--------|
| accuracy | +0.0450 | [0.0120, 0.0780] | 0.0082 | 0.3200 | paired_bootstrap |

## Quality Gates

| metric | passed | threshold | actual | CI | strictness |
|--------|--------|-----------|--------|----|------------|
| accuracy | PASS | 0.8500 | 0.8950 | [0.8610, 0.9290] | lower_ci |

Tip

render_markdown is purely functional -- it takes a StatisticalReport and returns a string. Pipe the output to a file, print to stdout, or embed it in a PR comment.


MLflow Integration

log_to_mlflow(report) -> None

Logs a StatisticalReport to the active MLflow run. This is a no-op if MLflow is not installed or not configured, so it is always safe to call.

What gets logged:

| MLflow concept | What is stored |
|----------------|----------------|
| Metrics | Point estimates and CI bounds for each metric (e.g. accuracy_estimate, accuracy_ci_lower, accuracy_ci_upper) |
| Params | Method, sample size, confidence level, calibration status, comparison p-values, gate results |
| Artifacts | statistical_report.json (full Pydantic dump) and statistical_report.md (rendered markdown) |

import mlflow
from latent.stats import analyze, log_to_mlflow

with mlflow.start_run():
    report = analyze(scores={"accuracy": scores}, seed=42)
    log_to_mlflow(report)

After the run completes, the MLflow UI will show:

  • Point estimates and CI bounds as tracked metrics (chartable over time)
  • Statistical method and configuration as run parameters
  • A downloadable JSON report and human-readable markdown summary as artifacts

Note

Metric names are sanitized before logging: spaces and slashes are replaced with underscores. A score named "f1 / macro" becomes f1___macro_estimate in MLflow.
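
The sanitization rule described above amounts to a one-line character substitution. This is an illustrative sketch consistent with the documented behavior; the library's real rule may cover additional characters:

```python
import re

def sanitize_metric_name(name: str) -> str:
    # Replace each space and slash with an underscore, per the
    # documented behavior. "f1 / macro" has a space, a slash, and
    # another space, hence the triple underscore.
    return re.sub(r"[ /]", "_", name)

sanitize_metric_name("f1 / macro") + "_estimate"  # "f1___macro_estimate"
```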


CI/CD Integration

Combine analyze() with quality gates to make automated pass/fail decisions in your deployment pipeline. The pattern is straightforward: run the analysis, check the gates, exit non-zero on failure.

import sys
from latent.stats import analyze, render_markdown

report = analyze(
    scores=scores,
    gates={"accuracy": 0.85, "safety": 0.95},
)

failed_gates = [g for g in report.gates if not g.passed]
if failed_gates:
    print(render_markdown(report))
    sys.exit(1)
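
The same exit-code pattern extends to CI systems that surface markdown. On GitHub Actions, for instance, the rendered report can double as the job's step summary. A sketch; write_step_summary is a hypothetical helper, not part of latent.stats:

```python
import os

def write_step_summary(md: str) -> bool:
    """Append markdown to the GitHub Actions step summary, if available.

    GITHUB_STEP_SUMMARY is set by the Actions runner and points at a
    file whose markdown contents render on the workflow run page.
    Outside CI the variable is unset, so this is a no-op.
    """
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    with open(path, "a") as fh:
        fh.write(md)
    return True

# In a workflow step: write_step_summary(render_markdown(report))
```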

For a more detailed CI script that also logs results:

import sys
import mlflow
from latent.stats import analyze, render_markdown, log_to_mlflow

report = analyze(
    scores={
        "accuracy": accuracy_scores,
        "faithfulness": faithfulness_scores,
    },
    comparison_scores={
        "accuracy": baseline_accuracy,
        "faithfulness": baseline_faithfulness,
    },
    gates={
        "accuracy": 0.85,
        "faithfulness": 3.5,
    },
    seed=42,
)

# Always log -- even failures are valuable for tracking trends
with mlflow.start_run():
    log_to_mlflow(report)

# Gate check
failed = [g for g in report.gates if not g.passed]
if failed:
    print("QUALITY GATE FAILED")
    print(render_markdown(report))
    for g in failed:
        print(f"  {g.metric_name}: {g.actual_value:.4f} (threshold {g.threshold:.4f})")
    sys.exit(1)

print("All gates passed.")

Warning

Gate strictness defaults to "lower_ci" -- the gate passes only when the lower bound of the confidence interval exceeds the threshold. This is intentionally conservative. A point estimate of 0.86 on 50 samples can easily mask a true accuracy below 0.80.
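
To make that concrete, consider 43 correct out of 50, a point estimate of exactly 0.86. The 95% Wilson lower bound sits near 0.74, so a 0.85 gate fails under "lower_ci" despite the point estimate clearing it. A worked sketch of the arithmetic (analyze() performs this internally):

```python
import math

def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom

wilson_lower(43, 50)  # ~0.738: well below a 0.85 threshold
```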