Classification, Structured Output & Ordinal Score Metrics

Metrics for evaluating discrete predictions, structured LLM outputs, and ordinal scoring rubrics. All functions return MetricResult objects with point estimates and confidence intervals.


Classification Metrics

classification_metrics

from latent.stats import classification_metrics

metrics = classification_metrics(
    y_pred=predictions,
    y_true=actuals,
    labels=None,           # inferred from data when None
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

Computes accuracy, macro/micro/weighted F1, macro precision, macro recall, and MCC. Every metric comes with a BCa bootstrap confidence interval.

Parameters

Name Type Default Description
y_pred list \| np.ndarray required Predicted labels
y_true list \| np.ndarray required Ground-truth labels
labels list[str] \| None None Ordered label names. Inferred from data when None.
n_resamples int 10_000 Bootstrap resamples
confidence_level float 0.95 CI confidence level
seed int \| None None Random seed for reproducibility

Returns: dict[str, MetricResult] with keys accuracy, precision_macro, recall_macro, f1_macro, f1_micro, f1_weighted, mcc.

Example -- intent classification with 5 classes:

from latent.stats import classification_metrics

predictions = ["book_flight", "weather", "book_flight", "music", "alarm", ...]
actuals     = ["book_flight", "weather", "alarm",       "music", "alarm", ...]

metrics = classification_metrics(y_pred=predictions, y_true=actuals)

for name, result in metrics.items():
    print(f"{name}: {result.point_estimate:.3f} "
          f"[{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
accuracy: 0.823 [0.791, 0.854]
precision_macro: 0.810 [0.772, 0.845]
recall_macro: 0.805 [0.768, 0.840]
f1_macro: 0.807 [0.770, 0.842]
f1_micro: 0.823 [0.791, 0.854]
f1_weighted: 0.821 [0.789, 0.852]
mcc: 0.776 [0.735, 0.814]

per_class_metrics

from latent.stats import per_class_metrics

per_class = per_class_metrics(
    y_pred=predictions,
    y_true=actuals,
    labels=None,
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

Returns per-class precision, recall, and F1 -- each with bootstrap CIs.

Returns: dict[str, dict[str, MetricResult]] -- outer key is the class label, inner keys are precision, recall, f1.

Example -- finding which intent class underperforms:

for label, label_metrics in per_class.items():
    f1 = label_metrics["f1"]
    flag = " << low" if f1.point_estimate < 0.70 else ""
    print(f"  {label:20s} F1={f1.point_estimate:.3f} "
          f"[{f1.ci_lower:.3f}, {f1.ci_upper:.3f}]{flag}")
  alarm                F1=0.912 [0.871, 0.946]
  book_flight          F1=0.854 [0.810, 0.893]
  music                F1=0.790 [0.738, 0.836]
  play_podcast         F1=0.643 [0.581, 0.702] << low
  weather              F1=0.838 [0.793, 0.878]

confusion_matrix_with_ci

from latent.stats import confusion_matrix_with_ci

cm = confusion_matrix_with_ci(
    y_pred=predictions,
    y_true=actuals,
    labels=None,
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

Builds a standard confusion matrix and adds per-cell bootstrap CIs.

Returns: ConfusionMatrixResult with fields:

Field Type Description
matrix list[list[int]] Raw confusion matrix counts
ci_lower list[list[float]] Per-cell lower CI bounds
ci_upper list[list[float]] Per-cell upper CI bounds
labels list[str] Ordered class labels
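For intuition about where the per-cell bounds come from, a plain percentile bootstrap over resampled (true, pred) pairs reproduces the idea. This is a self-contained sketch using numpy, not the library's implementation (which may differ in resampling and interval method):

```python
import numpy as np

def confusion_with_ci(y_true, y_pred, labels, n_resamples=2000,
                      confidence_level=0.95, seed=0):
    """Confusion matrix counts plus percentile-bootstrap CIs per cell.

    Rows are true labels, columns are predicted labels.
    """
    rng = np.random.default_rng(seed)
    idx = {lab: i for i, lab in enumerate(labels)}
    t = np.array([idx[y] for y in y_true])
    p = np.array([idx[y] for y in y_pred])
    k, n = len(labels), len(t)

    def counts(ti, pi):
        m = np.zeros((k, k), dtype=int)
        np.add.at(m, (ti, pi), 1)  # accumulate one count per (true, pred) pair
        return m

    matrix = counts(t, p)
    # Resample example indices with replacement and recount each time
    boots = np.empty((n_resamples, k, k))
    for b in range(n_resamples):
        r = rng.integers(0, n, size=n)
        boots[b] = counts(t[r], p[r])
    alpha = (1 - confidence_level) / 2
    lo = np.quantile(boots, alpha, axis=0)
    hi = np.quantile(boots, 1 - alpha, axis=0)
    return matrix, lo, hi
```

With small samples the per-cell intervals are wide; the point of the bootstrap is to make that visible rather than reporting bare counts.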

Structured Output Metrics

Use these when your LLM produces JSON, Pydantic models, or other structured data.

schema_compliance

from latent.stats import schema_compliance

result = schema_compliance(
    outputs=llm_outputs,   # list of dicts
    schema=json_schema,    # JSON Schema dict
)

Validates each output against a JSON schema and returns the compliance rate with a Wilson CI.

Example -- checking whether an LLM returns valid JSON:

schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["intent", "confidence"],
}

result = schema_compliance(outputs=llm_outputs, schema=schema)
print(f"Schema compliance: {result.point_estimate:.1%} "
      f"[{result.ci_lower:.1%}, {result.ci_upper:.1%}]")
Schema compliance: 94.2% [91.8%, 96.1%]
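The Wilson score interval used here is cheap to compute directly. A minimal stdlib sketch (not the library's implementation; z-values are hardcoded for common confidence levels):

```python
import math

def wilson_interval(successes, n, confidence_level=0.95):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0, 0.0, 1.0
    # Two-sided normal quantile for the chosen level
    z = {0.90: 1.644854, 0.95: 1.959964, 0.99: 2.575829}[confidence_level]
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, centre - half, centre + half
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the compliance rate is near 0% or 100%, which is exactly the regime schema-compliance checks tend to live in.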

field_level_accuracy

from latent.stats import field_level_accuracy

results = field_level_accuracy(
    outputs=predicted_dicts,
    expected=ground_truth_dicts,
    fields=None,  # evaluate all fields when None
)

Per-field exact-match accuracy with Wilson CIs. Useful for comparing extracted entities field by field.

Returns: dict[str, MetricResult] keyed by field name.

Example:

predicted = [
    {"name": "Alice", "city": "NYC",    "age": 30},
    {"name": "Bob",   "city": "Boston", "age": 25},
]
expected = [
    {"name": "Alice", "city": "NYC",    "age": 31},
    {"name": "Bob",   "city": "LA",     "age": 25},
]

for field, m in field_level_accuracy(predicted, expected).items():
    print(f"  {field}: {m.point_estimate:.0%}")
  age: 50%
  city: 50%
  name: 100%

composite_accuracy

from latent.stats import composite_accuracy

result = composite_accuracy(
    outputs=predicted_dicts,
    expected=ground_truth_dicts,
    fields=None,
)

For each row, computes the fraction of evaluated fields that match exactly, then aggregates across rows with a bootstrap CI. A row counts as fully correct only when every evaluated field matches.
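The row-level aggregation can be sketched with a stdlib percentile bootstrap. `composite_accuracy_sketch` below is a hypothetical stand-in to illustrate the idea, not the library function:

```python
import random

def composite_accuracy_sketch(outputs, expected, fields=None,
                              n_resamples=2000, seed=0):
    """Mean per-row fraction of matching fields, with a percentile-bootstrap CI."""
    fields = fields or sorted(expected[0])
    # Per-row fraction of fields that match exactly
    fractions = [
        sum(o.get(f) == e.get(f) for f in fields) / len(fields)
        for o, e in zip(outputs, expected)
    ]
    point = sum(fractions) / len(fractions)
    # Resample rows with replacement and recompute the mean each time
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(fractions, k=len(fractions))) / len(fractions)
        for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples)]
    return point, lo, hi
```

On the two-row example from field_level_accuracy above, each row matches 2 of 3 fields, so the composite point estimate is 2/3 even though neither row is fully correct.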


result_set_match

from latent.stats import result_set_match

result = result_set_match(
    actual_sets=predicted_sets,    # list[list[dict]]
    expected_sets=ground_truth_sets,
)

Order-independent set matching. Each pair of (actual, expected) is compared as an unordered set of rows. Returns a Wilson CI on the exact-match rate.

Example -- comparing SQL query result sets:

predicted_sets = [
    [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
    [{"id": 3, "name": "C"}],
]
expected_sets = [
    [{"id": 2, "name": "B"}, {"id": 1, "name": "A"}],  # same rows, different order
    [{"id": 3, "name": "D"}],                            # mismatch
]

m = result_set_match(predicted_sets, expected_sets)
print(f"Set match rate: {m.point_estimate:.0%}")  # 50%
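The order-independent comparison only needs a canonical form per row. A minimal sketch of that core check, assuming row values are comparable so rows can be sorted (hypothetical helper, not the library function):

```python
def rows_match_unordered(actual_rows, expected_rows):
    """Compare two lists of dict rows as unordered multisets."""
    # Canonicalise each row as a sorted tuple of (key, value) pairs,
    # then compare the sorted lists of canonical rows
    key = lambda row: tuple(sorted(row.items()))
    return sorted(map(key, actual_rows)) == sorted(map(key, expected_rows))
```

Sorting rather than hashing into a set keeps duplicate rows significant, which matters for SQL result sets that can legitimately contain repeated rows.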

Ordinal Scores

Ordinal metrics handle Likert-style or rubric-graded scores (e.g. 1--5 faithfulness).

ordinal_distribution

from latent.stats import ordinal_distribution
from latent.stats.config import MetricRubric

rubric = MetricRubric(
    type="ordinal",
    scale=[1, 2, 3, 4, 5],
    labels={
        1: "Hallucinated",
        2: "Mostly wrong",
        3: "Partial",
        4: "Minor issues",
        5: "Faithful",
    },
    pass_threshold=4,
)

dist = ordinal_distribution(
    scores=scores,
    rubric=rubric,
    metric_name="faithfulness",
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

print(f"Pass rate: {dist.pass_rate:.1%}")
print(f"Median: {dist.median} [{dist.median_ci_lower}, {dist.median_ci_upper}]")
for level in dist.levels:
    marker = "  *" if level.is_passing else ""
    print(f"  {level.label}: {level.proportion:.1%} "
          f"[{level.ci_lower:.1%}, {level.ci_upper:.1%}]{marker}")

Returns: OrdinalDistribution with:

Field Type Description
pass_rate float Proportion of scores >= pass_threshold
pass_rate_ci_lower / pass_rate_ci_upper float Wilson CI on pass rate
pass_threshold int Threshold used
levels list[OrdinalLevel] Per-level proportion, CI, label, is_passing flag
median int Median score
median_ci_lower / median_ci_upper int Bootstrap CI on median
cumulative dict[int, MetricResult] Cumulative proportions (% >= X) per level

Tip

If you omit the rubric, one is inferred from the observed score values with generic labels. You will see a warning recommending explicit configuration.


binarize

from latent.stats import binarize

binary = binarize(scores, threshold=4)
# array([0., 0., 1., 1., 1., ...])

Converts ordinal scores to binary pass/fail (1 if >= threshold, else 0). Returns a float np.ndarray.


median_ci

from latent.stats import median_ci

med, ci_lo, ci_hi = median_ci(scores, n_resamples=10_000, confidence_level=0.95, seed=None)
print(f"Median: {med} [{ci_lo}, {ci_hi}]")

Bootstrap CI on the median. Returns a tuple of three ints (median, ci_lower, ci_upper).
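A percentile bootstrap is one standard way to build such an interval. A stdlib-only sketch of the idea (not the library's implementation, which may use a different bootstrap variant):

```python
import random
import statistics

def median_ci_sketch(scores, n_resamples=5000, confidence_level=0.95, seed=0):
    """Percentile-bootstrap CI on the median of ordinal scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Median of each resample; note statistics.median can return a
    # half-integer for even-length samples of ints
    medians = sorted(
        statistics.median(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - confidence_level) / 2
    lo = medians[int(alpha * n_resamples)]
    hi = medians[min(int((1 - alpha) * n_resamples), n_resamples - 1)]
    return statistics.median(scores), lo, hi
```

Because ordinal scores are discrete, the bootstrap distribution of the median is lumpy; with a strongly dominant level, the interval can legitimately collapse to a single value.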


cumulative_proportions

from latent.stats import cumulative_proportions

cumul = cumulative_proportions(
    scores,
    scale=[1, 2, 3, 4, 5],
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

for level, m in cumul.items():
    print(f"  >= {level}: {m.point_estimate:.1%} "
          f"[{m.ci_lower:.1%}, {m.ci_upper:.1%}]")

Answers "what percentage scored >= X?" for each level. Returns dict[int, MetricResult] with bootstrap CIs.


Rubric Configuration

Define rubrics in a YAML file to keep scoring criteria out of code:

metrics:
  faithfulness:
    type: ordinal
    scale: [1, 2, 3, 4, 5]
    labels:
      1: Hallucinated
      2: Mostly wrong
      3: Partially faithful
      4: Minor issues
      5: Fully faithful
    pass_threshold: 4

  helpfulness:
    type: ordinal
    scale: [1, 2, 3, 4, 5]
    labels:
      1: Unhelpful
      2: Slightly helpful
      3: Moderately helpful
      4: Helpful
      5: Very helpful
    pass_threshold: 3

Load with load_rubric_config:

from latent.stats.config import load_rubric_config

config = load_rubric_config("rubrics.yaml")
faithfulness_rubric = config.metrics["faithfulness"]

dist = ordinal_distribution(scores, rubric=faithfulness_rubric, metric_name="faithfulness")

The RubricConfig object exposes a metrics dict mapping each metric name to a MetricRubric. Each rubric has:

Field Type Description
type str "ordinal" or "binary"
scale list[int] \| None Ordered list of valid score values
labels dict[int, str] \| None Human-readable label per level
pass_threshold int \| None Minimum score that counts as passing
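The pass_threshold semantics are simple enough to sanity-check by hand. A sketch using a plain dict as a stand-in for a MetricRubric (hypothetical helper, not part of the library):

```python
def pass_rate(scores, rubric):
    """Fraction of scores at or above the rubric's pass_threshold."""
    threshold = rubric["pass_threshold"]
    return sum(s >= threshold for s in scores) / len(scores)

# Plain-dict stand-in mirroring the faithfulness rubric above
faithfulness = {"type": "ordinal", "scale": [1, 2, 3, 4, 5], "pass_threshold": 4}
pass_rate([5, 4, 3, 2, 4], faithfulness)  # 3 of 5 scores pass -> 0.6
```

Keeping the threshold in the rubric file rather than in code means a graded dataset can be re-scored under a stricter or looser bar without touching the evaluation pipeline.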