Reporting
The analyze() orchestrator, markdown summaries, and MLflow integration
latent.stats provides a single entry point -- analyze() -- that runs a full statistical analysis pipeline and returns a structured StatisticalReport. From there, render the results as markdown tables or log them to MLflow for experiment tracking.
The analyze() Function
analyze(
    scores: dict[str, np.ndarray],
    score_types: dict[str, str] | None = None,
    rubrics: dict[str, MetricRubric] | None = None,
    calibration_data: dict | None = None,
    comparison_scores: dict[str, np.ndarray] | None = None,
    gates: dict[str, float] | None = None,
    n_resamples: int = 10_000,
    confidence_level: float = 0.95,
    seed: int | None = None,
) -> StatisticalReport
The main entry point that orchestrates a full statistical analysis pipeline. It:
- Auto-detects score types if `score_types` is not provided (binary, ordinal, or continuous)
- Computes confidence intervals for each metric using the appropriate method
- Builds ordinal distributions when `rubrics` are provided
- Applies PPI bias correction when `calibration_data` is available
- Runs paired comparisons when `comparison_scores` are provided
- Evaluates quality gates against the thresholds in `gates`
- Explains which methods were used and why for each metric
Basic Usage
import numpy as np
from latent.stats import analyze
report = analyze(
    scores={"accuracy": np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])},
    score_types={"accuracy": "binary"},
)

for m in report.metrics:
    print(f"{m.name}: {m.point_estimate:.3f} [{m.ci_lower:.3f}, {m.ci_upper:.3f}]")
Full Pipeline
from latent.stats import analyze
from latent.stats.config import MetricRubric
report = analyze(
    scores={
        "accuracy": accuracy_scores,
        "faithfulness": faithfulness_scores,
    },
    score_types={
        "accuracy": "binary",
        "faithfulness": "ordinal",
    },
    rubrics={
        "faithfulness": MetricRubric(
            type="ordinal",
            scale=[1, 2, 3, 4, 5],
            labels={
                1: "Hallucinated",
                2: "Mostly wrong",
                3: "Partial",
                4: "Minor issues",
                5: "Faithful",
            },
            pass_threshold=4,
        ),
    },
    calibration_data={
        "judge": cal_judge_scores,
        "human": cal_human_scores,
    },
    comparison_scores={
        "accuracy": baseline_accuracy,
        "faithfulness": baseline_faithfulness,
    },
    gates={
        "accuracy": 0.85,
        "faithfulness": 3.5,
    },
    seed=42,
)
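When `calibration_data` is supplied, the PPI correction follows the prediction-powered inference idea: the judge's raw mean is debiased by the average judge-vs-human gap measured on the calibration split. A minimal sketch of the point-estimate side of that correction (illustrative only; `ppi_point_estimate` is not part of the library, and the real pipeline also adjusts the confidence interval):

```python
import numpy as np

def ppi_point_estimate(judge_scores, cal_judge, cal_human):
    """Debias the judge's mean by its average error on labeled calibration data.

    Illustrative sketch of the PPI idea -- not the library's actual function.
    """
    bias = np.mean(np.asarray(cal_judge, dtype=float)) - np.mean(np.asarray(cal_human, dtype=float))
    return float(np.mean(np.asarray(judge_scores, dtype=float)) - bias)
```

If the judge systematically over-scores by 0.5 on the calibration set, the corrected estimate is pulled down by that same 0.5.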
Score type auto-detection
When score_types is omitted, analyze() infers types from the data: arrays
containing only 0s and 1s are classified as "binary", integer arrays with 10
or fewer unique values as "ordinal", and everything else as "continuous".
Pass score_types explicitly if you need to override this heuristic.
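The heuristic described above can be sketched in a few lines (`detect_score_type` is a hypothetical name for illustration, not part of the public API):

```python
import numpy as np

def detect_score_type(scores: np.ndarray) -> str:
    """Mirror the auto-detection rules described above (illustrative)."""
    unique = np.unique(scores)
    if set(unique.tolist()) <= {0, 1}:  # only 0s and 1s
        return "binary"
    if np.issubdtype(scores.dtype, np.integer) and unique.size <= 10:
        return "ordinal"  # small integer scale
    return "continuous"
```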
explain_method
explain_method(
    metric_name: str,
    score_type: str,
    sample_size: int,
    has_calibration: bool = False,
    is_comparison: bool = False,
) -> str
Returns a human-readable explanation of which statistical method was used for a metric and why it was chosen. Called internally by analyze() to populate the "Methods Used" section, but available for custom pipelines.
from latent.stats import explain_method
print(explain_method("accuracy", "binary", sample_size=200))
# "**accuracy** (binary, n=200): Used Wilson score interval. Unlike the normal
# approximation (p ± z·√(p(1−p)/n)), Wilson intervals are asymmetric and never
# produce bounds outside [0, 1]. This matters most when the proportion is near
# 0% or 100%."
print(explain_method("f1", "continuous", sample_size=15, is_comparison=True))
# "**f1** (continuous, n=15): Used a permutation test because the sample size
# is below 30, where bootstrap confidence intervals can have poor coverage.
# Permutation tests make fewer distributional assumptions and provide reliable
# p-values regardless of sample size."
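The Wilson interval named in the first explanation is straightforward to compute; a self-contained sketch using the standard formula (not the library's internal code):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin
```

Unlike the normal approximation, the bounds cannot fall outside [0, 1], even when `successes` is 0 or equals `n`.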
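Likewise, the paired permutation test used at small n can be sketched with a sign-flip scheme (illustrative; the library's resampling details may differ):

```python
import numpy as np

def paired_permutation_pvalue(a, b, n_resamples=10_000, seed=0) -> float:
    """Two-sided sign-flip permutation test on the mean paired difference."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(diffs.mean())
    # Under the null hypothesis, each paired difference is equally likely
    # to have either sign, so we flip signs at random and re-measure.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    null = np.abs((signs * diffs).mean(axis=1))
    return float((null >= observed).mean())
```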
StatisticalReport Model
analyze() returns a StatisticalReport -- a Pydantic model that holds every result from the pipeline. All fields default to empty lists or dicts, so you only get what you asked for.
class StatisticalReport(BaseModel):
    metrics: list[MetricResult] = []
    comparisons: list[ComparisonResult] = []
    gates: list[GatingResult] = []
    ordinal_distributions: list[OrdinalDistribution] = []
    drift: list[DriftResult] = []
    power_analyses: list[PowerAnalysis] = []
    method_explanations: list[str] = []
    metadata: dict[str, str] = {}
| Field | Populated when | Contains |
|---|---|---|
| `metrics` | Always | One `MetricResult` per entry in `scores` |
| `comparisons` | `comparison_scores` provided | One `ComparisonResult` per matched metric |
| `gates` | `gates` provided | One `GatingResult` per matched metric |
| `ordinal_distributions` | Ordinal metrics with rubrics | Full level-by-level breakdown |
| `drift` | Populated by `drift_report()` | Severity, delta, p-value per metric |
| `power_analyses` | Populated by `post_hoc_power()` | MDE, power, warnings |
| `method_explanations` | Always | One explanation of method used and why, per metric |
| `metadata` | User-provided or tool-generated | Free-form key-value pairs |
Because StatisticalReport is a Pydantic model, serialization is built in:
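For example, with a minimal stand-in model (a sketch assuming Pydantic v2's `model_dump_json` / `model_validate_json` API; on v1 the equivalents are `.json()` and `.parse_raw()`):

```python
from pydantic import BaseModel

class MiniReport(BaseModel):
    # Stand-in with the same default-empty-fields pattern as StatisticalReport
    metrics: list[str] = []
    metadata: dict[str, str] = {}

report = MiniReport(metrics=["accuracy"], metadata={"run": "nightly"})
json_str = report.model_dump_json()                  # serialize to a JSON string
restored = MiniReport.model_validate_json(json_str)  # lossless round-trip
```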
Markdown Reports
render_markdown(report) -> str
Produces formatted markdown tables covering every populated section of the report: metrics, comparisons, quality gates, ordinal distributions, drift results, and method explanations.
from latent.stats import analyze, render_markdown
report = analyze(scores={"f1": f1_scores})
print(render_markdown(report))
Output:
# Statistical Report
## Metrics
| metric | estimate | 95% CI | method | n |
|--------|----------|--------|--------|---|
| f1 | 0.8470 | [0.7920, 0.9010] | bootstrap_bca | 100 |
## Methods Used
- **f1** (continuous, n=100): Used BCa (bias-corrected and accelerated) bootstrap CI. BCa adjusts for both bias and skewness in the bootstrap distribution, producing more accurate intervals than the basic percentile method. It works for any statistic without distributional assumptions.
When comparisons or gates are present, additional tables are rendered automatically:
report = analyze(
    scores={"accuracy": acc_scores},
    comparison_scores={"accuracy": baseline_scores},
    gates={"accuracy": 0.85},
    seed=42,
)
md = render_markdown(report)
## Comparisons
| metric | delta | CI | p-value | effect size | method |
|--------|-------|----|---------|-------------|--------|
| accuracy | +0.0450 | [0.0120, 0.0780] | 0.0082 | 0.3200 | paired_bootstrap |
## Quality Gates
| metric | passed | threshold | actual | CI | strictness |
|--------|--------|-----------|--------|----|------------|
| accuracy | PASS | 0.8500 | 0.8950 | [0.8610, 0.9290] | lower_ci |
Tip
render_markdown is purely functional -- it takes a StatisticalReport and
returns a string. Pipe the output to a file, print to stdout, or embed it in
a PR comment.
MLflow Integration
log_to_mlflow(report) -> None
Logs a StatisticalReport to the active MLflow run. This is a no-op if MLflow is not installed or not configured, so it is always safe to call.
What gets logged:
| MLflow concept | What is stored |
|---|---|
| Metrics | Point estimates and CI bounds for each metric (e.g. accuracy_estimate, accuracy_ci_lower, accuracy_ci_upper) |
| Params | Method, sample size, confidence level, calibration status, comparison p-values, gate results |
| Artifacts | statistical_report.json (full Pydantic dump) and statistical_report.md (rendered markdown) |
import mlflow
from latent.stats import analyze, log_to_mlflow
with mlflow.start_run():
    report = analyze(scores={"accuracy": scores}, seed=42)
    log_to_mlflow(report)
After the run completes, the MLflow UI will show:
- Point estimates and CI bounds as tracked metrics (chartable over time)
- Statistical method and configuration as run parameters
- A downloadable JSON report and human-readable markdown summary as artifacts
Note
Metric names are sanitized before logging: spaces and slashes are replaced
with underscores. A score named "f1 / macro" becomes f1___macro_estimate
in MLflow.
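The sanitization rule can be reproduced in a couple of lines (hypothetical helper name; the library applies this internally before logging):

```python
def sanitize_metric_name(name: str) -> str:
    """Replace spaces and slashes with underscores, mirroring the rule above."""
    return name.replace(" ", "_").replace("/", "_")
```

So `sanitize_metric_name("f1 / macro") + "_estimate"` yields `"f1___macro_estimate"`, matching what appears in the MLflow UI.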
CI/CD Integration
Combine analyze() with quality gates to make automated pass/fail decisions in your deployment pipeline. The pattern is straightforward: run the analysis, check the gates, exit non-zero on failure.
import sys
from latent.stats import analyze, render_markdown
report = analyze(
    scores=scores,
    gates={"accuracy": 0.85, "safety": 0.95},
)

failed_gates = [g for g in report.gates if not g.passed]
if failed_gates:
    print(render_markdown(report))
    sys.exit(1)
For a more detailed CI script that also logs results:
import sys
import mlflow
from latent.stats import analyze, render_markdown, log_to_mlflow
report = analyze(
    scores={
        "accuracy": accuracy_scores,
        "faithfulness": faithfulness_scores,
    },
    comparison_scores={
        "accuracy": baseline_accuracy,
        "faithfulness": baseline_faithfulness,
    },
    gates={
        "accuracy": 0.85,
        "faithfulness": 3.5,
    },
    seed=42,
)

# Always log -- even failures are valuable for tracking trends
with mlflow.start_run():
    log_to_mlflow(report)

# Gate check
failed = [g for g in report.gates if not g.passed]
if failed:
    print("QUALITY GATE FAILED")
    print(render_markdown(report))
    for g in failed:
        print(f"  {g.metric_name}: {g.actual_value:.4f} < {g.threshold:.4f}")
    sys.exit(1)

print("All gates passed.")
Warning
Gate strictness defaults to "lower_ci" -- the gate passes only when the
lower bound of the confidence interval exceeds the threshold. This is
intentionally conservative. A point estimate of 0.86 on 50 samples can easily
mask a true accuracy below 0.80.
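The `"lower_ci"` semantics can be expressed compactly (an illustrative sketch; any strictness mode name other than `"lower_ci"` is hypothetical here):

```python
def gate_passes(point: float, ci_lower: float, threshold: float,
                strictness: str = "lower_ci") -> bool:
    """Sketch of the conservative gate check described above.

    Ties are treated as passing; the library's exact tie behavior may differ.
    """
    if strictness == "lower_ci":
        return ci_lower >= threshold  # the whole CI must clear the bar
    return point >= threshold         # hypothetical point-estimate mode, for contrast
```

With threshold 0.85, an estimate of 0.895 with CI [0.861, 0.929] passes, while 0.86 with CI [0.79, 0.93] does not.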