Reporting

The analyze() orchestrator, markdown summaries, and MLflow integration

latent.stats provides a single entry point -- analyze() -- that runs a full statistical analysis pipeline and returns a structured StatisticalReport. From there, render the results as markdown tables or log them to MLflow for experiment tracking.

from latent.stats import analyze, render_markdown, log_to_mlflow

The analyze() Function

analyze(
    scores: dict[str, np.ndarray],
    score_types: dict[str, str] | None = None,
    rubrics: dict[str, MetricRubric] | None = None,
    calibration_data: dict | None = None,
    comparison_scores: dict[str, np.ndarray] | None = None,
    gates: dict[str, float] | None = None,
    n_resamples: int = 10_000,
    confidence_level: float = 0.95,
    seed: int | None = None,
) -> StatisticalReport

The main entry point that orchestrates a full statistical analysis pipeline. It:

  1. Auto-detects score types if score_types is not provided (binary, ordinal, or continuous)
  2. Computes confidence intervals for each metric using the appropriate method
  3. Builds ordinal distributions when rubrics are provided
  4. Applies PPI bias correction when calibration_data is available
  5. Runs paired comparisons when comparison_scores are provided
  6. Evaluates quality gates against the thresholds in gates
  7. Explains which methods were used and why for each metric

Basic Usage

import numpy as np
from latent.stats import analyze

report = analyze(
    scores={"accuracy": np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])},
    score_types={"accuracy": "binary"},
)

for m in report.metrics:
    print(f"{m.name}: {m.point_estimate:.3f} [{m.ci_lower:.3f}, {m.ci_upper:.3f}]")

Full Pipeline

from latent.stats import analyze
from latent.stats.config import MetricRubric

report = analyze(
    scores={
        "accuracy": accuracy_scores,
        "faithfulness": faithfulness_scores,
    },
    score_types={
        "accuracy": "binary",
        "faithfulness": "ordinal",
    },
    rubrics={
        "faithfulness": MetricRubric(
            type="ordinal",
            scale=[1, 2, 3, 4, 5],
            labels={
                1: "Hallucinated",
                2: "Mostly wrong",
                3: "Partial",
                4: "Minor issues",
                5: "Faithful",
            },
            pass_threshold=4,
        ),
    },
    calibration_data={
        "judge": cal_judge_scores,
        "human": cal_human_scores,
    },
    comparison_scores={
        "accuracy": baseline_accuracy,
        "faithfulness": baseline_faithfulness,
    },
    gates={
        "accuracy": 0.85,
        "faithfulness": 3.5,
    },
    seed=42,
)

Score type auto-detection

When score_types is omitted, analyze() infers types from the data: arrays containing only 0s and 1s are classified as "binary", integer arrays with 10 or fewer unique values as "ordinal", and everything else as "continuous". Pass score_types explicitly if you need to override this heuristic.
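
The heuristic can be sketched as follows. This is an illustrative reimplementation of the documented rules only, not the library's actual code:

```python
import numpy as np

def detect_score_type(values: np.ndarray) -> str:
    """Sketch of the documented auto-detection heuristic."""
    unique = np.unique(values)
    # Only 0s and 1s -> binary
    if set(unique.tolist()) <= {0, 1}:
        return "binary"
    # Integer-valued with at most 10 distinct levels -> ordinal
    if np.issubdtype(values.dtype, np.integer) and unique.size <= 10:
        return "ordinal"
    return "continuous"

detect_score_type(np.array([0, 1, 1, 0]))        # "binary"
detect_score_type(np.array([1, 2, 3, 4, 5]))     # "ordinal"
detect_score_type(np.array([0.91, 0.34, 0.77]))  # "continuous"
```

Note the edge case this heuristic implies: a float array containing only 0.0 and 1.0 still counts as binary, while judge scores recorded as floats (e.g. 4.0 on a 1-5 scale) fall through to continuous, which is one reason to pass score_types explicitly.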

explain_method

explain_method(
    metric_name: str,
    score_type: str,
    sample_size: int,
    has_calibration: bool = False,
    is_comparison: bool = False,
) -> str

Returns a human-readable explanation of which statistical method was used for a metric and why it was chosen. analyze() calls it internally to populate the "Methods Used" section, but it is also available for custom pipelines.

from latent.stats import explain_method

print(explain_method("accuracy", "binary", sample_size=200))
# "**accuracy** (binary, n=200): Used Wilson score interval. Unlike the normal
#  approximation (p ± z·√(p(1−p)/n)), Wilson intervals are asymmetric and never
#  produce bounds outside [0, 1]. This matters most when the proportion is near
#  0% or 100%."

print(explain_method("f1", "continuous", sample_size=15, is_comparison=True))
# "**f1** (continuous, n=15): Used a permutation test because the sample size
#  is below 30, where bootstrap confidence intervals can have poor coverage.
#  Permutation tests make fewer distributional assumptions and provide reliable
#  p-values regardless of sample size."

StatisticalReport Model

analyze() returns a StatisticalReport -- a Pydantic model that holds every result from the pipeline. All fields default to empty lists or dicts, so you only get what you asked for.

class StatisticalReport(BaseModel):
    metrics: list[MetricResult] = []
    comparisons: list[ComparisonResult] = []
    gates: list[GatingResult] = []
    ordinal_distributions: list[OrdinalDistribution] = []
    drift: list[DriftResult] = []
    power_analyses: list[PowerAnalysis] = []
    method_explanations: list[str] = []
    metadata: dict[str, str] = {}

| Field | Populated when | Contains |
|-------|----------------|----------|
| metrics | Always | One MetricResult per entry in scores |
| comparisons | comparison_scores provided | One ComparisonResult per matched metric |
| gates | gates provided | One GatingResult per matched metric |
| ordinal_distributions | Ordinal metrics with rubrics | Full level-by-level breakdown |
| drift | Populated by drift_report() | Severity, delta, p-value per metric |
| power_analyses | Populated by post_hoc_power() | MDE, power, warnings |
| method_explanations | Always | One explanation per metric of the method used and why |
| metadata | User-provided or tool-generated | Free-form key-value pairs |

Because StatisticalReport is a Pydantic model, serialization is built in:

# To JSON
report.model_dump_json(indent=2)

# To dict
report.model_dump()

Markdown Reports

render_markdown(report) -> str

Produces formatted markdown tables covering every populated section of the report: metrics, comparisons, quality gates, ordinal distributions, drift results, and method explanations.

from latent.stats import analyze, render_markdown

report = analyze(scores={"f1": f1_scores})
print(render_markdown(report))

Output:

# Statistical Report

## Metrics

| metric | estimate | 95% CI | method | n |
|--------|----------|--------|--------|---|
| f1 | 0.8470 | [0.7920, 0.9010] | bootstrap_bca | 100 |

## Methods Used

- **f1** (continuous, n=100): Used BCa (bias-corrected and accelerated) bootstrap CI. BCa adjusts for both bias and skewness in the bootstrap distribution, producing more accurate intervals than the basic percentile method. It works for any statistic without distributional assumptions.

When comparisons or gates are present, additional tables are rendered automatically:

report = analyze(
    scores={"accuracy": acc_scores},
    comparison_scores={"accuracy": baseline_scores},
    gates={"accuracy": 0.85},
    seed=42,
)
md = render_markdown(report)

Output:

## Comparisons

| metric | delta | CI | p-value | effect size | method |
|--------|-------|----|---------|-------------|--------|
| accuracy | +0.0450 | [0.0120, 0.0780] | 0.0082 | 0.3200 | paired_bootstrap |

## Quality Gates

| metric | passed | threshold | actual | CI | strictness |
|--------|--------|-----------|--------|----|------------|
| accuracy | PASS | 0.8500 | 0.8950 | [0.8610, 0.9290] | lower_ci |

Tip

render_markdown is purely functional -- it takes a StatisticalReport and returns a string. Pipe the output to a file, print to stdout, or embed it in a PR comment.


MLflow Integration

log_to_mlflow(report) -> None

Logs a StatisticalReport to the active MLflow run. This is a no-op if MLflow is not installed or not configured, so it is always safe to call.

What gets logged:

| MLflow concept | What is stored |
|----------------|----------------|
| Metrics | Point estimates and CI bounds for each metric (e.g. accuracy_estimate, accuracy_ci_lower, accuracy_ci_upper) |
| Params | Method, sample size, confidence level, calibration status, comparison p-values, gate results |
| Artifacts | statistical_report.json (full Pydantic dump) and statistical_report.md (rendered markdown) |

import mlflow
from latent.stats import analyze, log_to_mlflow

with mlflow.start_run():
    report = analyze(scores={"accuracy": scores}, seed=42)
    log_to_mlflow(report)

After the run completes, the MLflow UI will show:

  • Point estimates and CI bounds as tracked metrics (chartable over time)
  • Statistical method and configuration as run parameters
  • A downloadable JSON report and human-readable markdown summary as artifacts

Note

Metric names are sanitized before logging: spaces and slashes are replaced with underscores. A score named "f1 / macro" becomes f1___macro_estimate in MLflow.
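
The sanitization rule described above amounts to a one-line character substitution. This is an illustrative sketch consistent with the documented behavior; the library's real rule may cover additional characters:

```python
import re

def sanitize_metric_name(name: str) -> str:
    # Replace each space and slash with an underscore, per the
    # documented behavior. "f1 / macro" has a space, a slash, and
    # another space, hence the triple underscore.
    return re.sub(r"[ /]", "_", name)

sanitize_metric_name("f1 / macro") + "_estimate"  # "f1___macro_estimate"
```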


CI/CD Integration

Combine analyze() with quality gates to make automated pass/fail decisions in your deployment pipeline. The pattern is straightforward: run the analysis, check the gates, exit non-zero on failure.

import sys
from latent.stats import analyze, render_markdown

report = analyze(
    scores=scores,
    gates={"accuracy": 0.85, "safety": 0.95},
)

failed_gates = [g for g in report.gates if not g.passed]
if failed_gates:
    print(render_markdown(report))
    sys.exit(1)
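
The same exit-code pattern extends to CI systems that surface markdown. On GitHub Actions, for instance, the rendered report can double as the job's step summary. A sketch; write_step_summary is a hypothetical helper, not part of latent.stats:

```python
import os

def write_step_summary(md: str) -> bool:
    """Append markdown to the GitHub Actions step summary, if available.

    GITHUB_STEP_SUMMARY is set by the Actions runner and points at a
    file whose markdown contents render on the workflow run page.
    Outside CI the variable is unset, so this is a no-op.
    """
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    with open(path, "a") as fh:
        fh.write(md)
    return True

# In a workflow step: write_step_summary(render_markdown(report))
```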

For a more detailed CI script that also logs results:

import sys
import mlflow
from latent.stats import analyze, render_markdown, log_to_mlflow

report = analyze(
    scores={
        "accuracy": accuracy_scores,
        "faithfulness": faithfulness_scores,
    },
    comparison_scores={
        "accuracy": baseline_accuracy,
        "faithfulness": baseline_faithfulness,
    },
    gates={
        "accuracy": 0.85,
        "faithfulness": 3.5,
    },
    seed=42,
)

# Always log -- even failures are valuable for tracking trends
with mlflow.start_run():
    log_to_mlflow(report)

# Gate check
failed = [g for g in report.gates if not g.passed]
if failed:
    print("QUALITY GATE FAILED")
    print(render_markdown(report))
    for g in failed:
        print(f"  {g.metric_name}: {g.actual_value:.4f} (threshold {g.threshold:.4f})")
    sys.exit(1)

print("All gates passed.")

Warning

Gate strictness defaults to "lower_ci" -- the gate passes only when the lower bound of the confidence interval exceeds the threshold. This is intentionally conservative. A point estimate of 0.86 on 50 samples can easily mask a true accuracy below 0.80.
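
To make that concrete, consider 43 correct out of 50, a point estimate of exactly 0.86. The 95% Wilson lower bound sits near 0.74, so a 0.85 gate fails under "lower_ci" despite the point estimate clearing it. A worked sketch of the arithmetic (analyze() performs this internally):

```python
import math

def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom

wilson_lower(43, 50)  # ~0.738: well below a 0.85 threshold
```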