Classification, Structured Output & Ordinal Score Metrics

Metrics for evaluating discrete predictions, structured LLM outputs, and ordinal scoring rubrics. All functions return MetricResult objects with point estimates and confidence intervals.


Classification Metrics

classification_metrics

from latent.stats import classification_metrics

metrics = classification_metrics(
    y_pred=predictions,
    y_true=actuals,
    labels=None,           # inferred from data when None
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

Computes accuracy, macro/micro/weighted F1, macro precision, macro recall, and MCC. Every metric comes with a BCa bootstrap confidence interval.

Parameters

Name Type Default Description
y_pred list \| np.ndarray required Predicted labels
y_true list \| np.ndarray required Ground-truth labels
labels list[str] \| None None Ordered label names. Inferred from data when None.
n_resamples int 10_000 Bootstrap resamples
confidence_level float 0.95 CI confidence level
seed int \| None None Random seed for reproducibility

Returns: dict[str, MetricResult] with keys accuracy, precision_macro, recall_macro, f1_macro, f1_micro, f1_weighted, mcc.

Example -- intent classification with 5 classes:

from latent.stats import classification_metrics

predictions = ["book_flight", "weather", "book_flight", "music", "alarm", ...]
actuals     = ["book_flight", "weather", "alarm",       "music", "alarm", ...]

metrics = classification_metrics(y_pred=predictions, y_true=actuals)

for name, result in metrics.items():
    print(f"{name}: {result.point_estimate:.3f} "
          f"[{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
accuracy: 0.823 [0.791, 0.854]
precision_macro: 0.810 [0.772, 0.845]
recall_macro: 0.805 [0.768, 0.840]
f1_macro: 0.807 [0.770, 0.842]
f1_micro: 0.823 [0.791, 0.854]
f1_weighted: 0.821 [0.789, 0.852]
mcc: 0.776 [0.735, 0.814]

per_class_metrics

from latent.stats import per_class_metrics

per_class = per_class_metrics(
    y_pred=predictions,
    y_true=actuals,
    labels=None,
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

Returns per-class precision, recall, and F1 -- each with bootstrap CIs.

Returns: dict[str, dict[str, MetricResult]] -- outer key is the class label, inner keys are precision, recall, f1.

Example -- finding which intent class underperforms:

for label, label_metrics in per_class.items():
    f1 = label_metrics["f1"]
    flag = " << low" if f1.point_estimate < 0.70 else ""
    print(f"  {label:20s} F1={f1.point_estimate:.3f} "
          f"[{f1.ci_lower:.3f}, {f1.ci_upper:.3f}]{flag}")
  alarm                F1=0.912 [0.871, 0.946]
  book_flight          F1=0.854 [0.810, 0.893]
  music                F1=0.790 [0.738, 0.836]
  play_podcast         F1=0.643 [0.581, 0.702] << low
  weather              F1=0.838 [0.793, 0.878]

confusion_matrix_with_ci

from latent.stats import confusion_matrix_with_ci

cm = confusion_matrix_with_ci(
    y_pred=predictions,
    y_true=actuals,
    labels=None,
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

Builds a standard confusion matrix and adds per-cell bootstrap CIs.

Returns: ConfusionMatrixResult with fields:

Field Type Description
matrix list[list[int]] Raw confusion matrix counts
ci_lower list[list[float]] Per-cell lower CI bounds
ci_upper list[list[float]] Per-cell upper CI bounds
labels list[str] Ordered class labels
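For intuition about where the per-cell bounds come from, a plain percentile bootstrap over resampled (true, pred) pairs reproduces the idea. This is a self-contained sketch using numpy, not the library's implementation (which may differ in resampling and interval method):

```python
import numpy as np

def confusion_with_ci(y_true, y_pred, labels, n_resamples=2000,
                      confidence_level=0.95, seed=0):
    """Confusion matrix counts plus percentile-bootstrap CIs per cell.

    Rows are true labels, columns are predicted labels.
    """
    rng = np.random.default_rng(seed)
    idx = {lab: i for i, lab in enumerate(labels)}
    t = np.array([idx[y] for y in y_true])
    p = np.array([idx[y] for y in y_pred])
    k, n = len(labels), len(t)

    def counts(ti, pi):
        m = np.zeros((k, k), dtype=int)
        np.add.at(m, (ti, pi), 1)  # accumulate one count per (true, pred) pair
        return m

    matrix = counts(t, p)
    # Resample example indices with replacement and recount each time
    boots = np.empty((n_resamples, k, k))
    for b in range(n_resamples):
        r = rng.integers(0, n, size=n)
        boots[b] = counts(t[r], p[r])
    alpha = (1 - confidence_level) / 2
    lo = np.quantile(boots, alpha, axis=0)
    hi = np.quantile(boots, 1 - alpha, axis=0)
    return matrix, lo, hi
```

With small samples the per-cell intervals are wide; the point of the bootstrap is to make that visible rather than reporting bare counts.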

Structured Output Metrics

Use these when your LLM produces JSON, Pydantic models, or other structured data.

schema_compliance

from latent.stats import schema_compliance

result = schema_compliance(
    outputs=llm_outputs,   # list of dicts
    schema=json_schema,    # JSON Schema dict
)

Validates each output against a JSON schema and returns the compliance rate with a Wilson CI.

Example -- checking whether an LLM returns valid JSON:

schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["intent", "confidence"],
}

result = schema_compliance(outputs=llm_outputs, schema=schema)
print(f"Schema compliance: {result.point_estimate:.1%} "
      f"[{result.ci_lower:.1%}, {result.ci_upper:.1%}]")
Schema compliance: 94.2% [91.8%, 96.1%]
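The Wilson score interval used here is cheap to compute directly. A minimal stdlib sketch (not the library's implementation; z-values are hardcoded for common confidence levels):

```python
import math

def wilson_interval(successes, n, confidence_level=0.95):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0, 0.0, 1.0
    # Two-sided normal quantile for the chosen level
    z = {0.90: 1.644854, 0.95: 1.959964, 0.99: 2.575829}[confidence_level]
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, centre - half, centre + half
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the compliance rate is near 0% or 100%, which is exactly the regime schema-compliance checks tend to live in.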

field_level_accuracy

from latent.stats import field_level_accuracy

results = field_level_accuracy(
    outputs=predicted_dicts,
    expected=ground_truth_dicts,
    fields=None,  # evaluate all fields when None
)

Per-field exact-match accuracy with Wilson CIs. Useful for comparing extracted entities field by field.

Returns: dict[str, MetricResult] keyed by field name.

Example:

predicted = [
    {"name": "Alice", "city": "NYC",    "age": 30},
    {"name": "Bob",   "city": "Boston", "age": 25},
]
expected = [
    {"name": "Alice", "city": "NYC",    "age": 31},
    {"name": "Bob",   "city": "LA",     "age": 25},
]

for field, m in field_level_accuracy(predicted, expected).items():
    print(f"  {field}: {m.point_estimate:.0%}")
  age: 50%
  city: 50%
  name: 100%

composite_accuracy

from latent.stats import composite_accuracy

result = composite_accuracy(
    outputs=predicted_dicts,
    expected=ground_truth_dicts,
    fields=None,
)

For each row, computes the fraction of evaluated fields that match exactly, then aggregates across rows with a bootstrap CI. A row counts as fully correct only when every evaluated field matches.
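The row-level aggregation can be sketched with a stdlib percentile bootstrap. `composite_accuracy_sketch` below is a hypothetical stand-in to illustrate the idea, not the library function:

```python
import random

def composite_accuracy_sketch(outputs, expected, fields=None,
                              n_resamples=2000, seed=0):
    """Mean per-row fraction of matching fields, with a percentile-bootstrap CI."""
    fields = fields or sorted(expected[0])
    # Per-row fraction of fields that match exactly
    fractions = [
        sum(o.get(f) == e.get(f) for f in fields) / len(fields)
        for o, e in zip(outputs, expected)
    ]
    point = sum(fractions) / len(fractions)
    # Resample rows with replacement and recompute the mean each time
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(fractions, k=len(fractions))) / len(fractions)
        for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples)]
    return point, lo, hi
```

On the two-row example from field_level_accuracy above, each row matches 2 of 3 fields, so the composite point estimate is 2/3 even though neither row is fully correct.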


result_set_match

from latent.stats import result_set_match

result = result_set_match(
    actual_sets=predicted_sets,    # list[list[dict]]
    expected_sets=ground_truth_sets,
)

Order-independent set matching. Each pair of (actual, expected) is compared as an unordered set of rows. Returns a Wilson CI on the exact-match rate.

Example -- comparing SQL query result sets:

predicted_sets = [
    [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
    [{"id": 3, "name": "C"}],
]
expected_sets = [
    [{"id": 2, "name": "B"}, {"id": 1, "name": "A"}],  # same rows, different order
    [{"id": 3, "name": "D"}],                            # mismatch
]

m = result_set_match(predicted_sets, expected_sets)
print(f"Set match rate: {m.point_estimate:.0%}")  # 50%
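The order-independent comparison only needs a canonical form per row. A minimal sketch of that core check, assuming row values are comparable so rows can be sorted (hypothetical helper, not the library function):

```python
def rows_match_unordered(actual_rows, expected_rows):
    """Compare two lists of dict rows as unordered multisets."""
    # Canonicalise each row as a sorted tuple of (key, value) pairs,
    # then compare the sorted lists of canonical rows
    key = lambda row: tuple(sorted(row.items()))
    return sorted(map(key, actual_rows)) == sorted(map(key, expected_rows))
```

Sorting rather than hashing into a set keeps duplicate rows significant, which matters for SQL result sets that can legitimately contain repeated rows.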

Ordinal Scores

Ordinal metrics handle Likert-style or rubric-graded scores (e.g. 1--5 faithfulness).

ordinal_distribution

from latent.stats import ordinal_distribution
from latent.stats.config import MetricRubric

rubric = MetricRubric(
    type="ordinal",
    scale=[1, 2, 3, 4, 5],
    labels={
        1: "Hallucinated",
        2: "Mostly wrong",
        3: "Partial",
        4: "Minor issues",
        5: "Faithful",
    },
    pass_threshold=4,
)

dist = ordinal_distribution(
    scores=scores,
    rubric=rubric,
    metric_name="faithfulness",
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

print(f"Pass rate: {dist.pass_rate:.1%}")
print(f"Median: {dist.median} [{dist.median_ci_lower}, {dist.median_ci_upper}]")
for level in dist.levels:
    marker = "  *" if level.is_passing else ""
    print(f"  {level.label}: {level.proportion:.1%} "
          f"[{level.ci_lower:.1%}, {level.ci_upper:.1%}]{marker}")

Returns: OrdinalDistribution with:

Field Type Description
pass_rate float Proportion of scores >= pass_threshold
pass_rate_ci_lower / pass_rate_ci_upper float Wilson CI on pass rate
pass_threshold int Threshold used
levels list[OrdinalLevel] Per-level proportion, CI, label, is_passing flag
median int Median score
median_ci_lower / median_ci_upper int Bootstrap CI on median
cumulative dict[int, MetricResult] Cumulative proportions (% >= X) per level

Tip

If you omit the rubric, one is inferred from the observed score values with generic labels. You will see a warning recommending explicit configuration.


binarize

from latent.stats import binarize

binary = binarize(scores, threshold=4)
# array([0., 0., 1., 1., 1., ...])

Converts ordinal scores to binary pass/fail (1 if >= threshold, else 0). Returns a float np.ndarray.


median_ci

from latent.stats import median_ci

med, ci_lo, ci_hi = median_ci(scores, n_resamples=10_000, confidence_level=0.95, seed=None)
print(f"Median: {med} [{ci_lo}, {ci_hi}]")

Bootstrap CI on the median. Returns a tuple of three ints (median, ci_lower, ci_upper).
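A percentile bootstrap is one standard way to build such an interval. A stdlib-only sketch of the idea (not the library's implementation, which may use a different bootstrap variant):

```python
import random
import statistics

def median_ci_sketch(scores, n_resamples=5000, confidence_level=0.95, seed=0):
    """Percentile-bootstrap CI on the median of ordinal scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Median of each resample; note statistics.median can return a
    # half-integer for even-length samples of ints
    medians = sorted(
        statistics.median(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - confidence_level) / 2
    lo = medians[int(alpha * n_resamples)]
    hi = medians[min(int((1 - alpha) * n_resamples), n_resamples - 1)]
    return statistics.median(scores), lo, hi
```

Because ordinal scores are discrete, the bootstrap distribution of the median is lumpy; with a strongly dominant level, the interval can legitimately collapse to a single value.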


cumulative_proportions

from latent.stats import cumulative_proportions

cumul = cumulative_proportions(
    scores,
    scale=[1, 2, 3, 4, 5],
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

for level, m in cumul.items():
    print(f"  >= {level}: {m.point_estimate:.1%} "
          f"[{m.ci_lower:.1%}, {m.ci_upper:.1%}]")

Answers "what percentage scored >= X?" for each level. Returns dict[int, MetricResult] with bootstrap CIs.


Rubric Configuration

Define rubrics in a YAML file to keep scoring criteria out of code:

metrics:
  faithfulness:
    type: ordinal
    scale: [1, 2, 3, 4, 5]
    labels:
      1: Hallucinated
      2: Mostly wrong
      3: Partially faithful
      4: Minor issues
      5: Fully faithful
    pass_threshold: 4

  helpfulness:
    type: ordinal
    scale: [1, 2, 3, 4, 5]
    labels:
      1: Unhelpful
      2: Slightly helpful
      3: Moderately helpful
      4: Helpful
      5: Very helpful
    pass_threshold: 3

Load with load_rubric_config:

from latent.stats.config import load_rubric_config

config = load_rubric_config("rubrics.yaml")
faithfulness_rubric = config.metrics["faithfulness"]

dist = ordinal_distribution(scores, rubric=faithfulness_rubric, metric_name="faithfulness")

The RubricConfig object exposes a metrics dict mapping each metric name to a MetricRubric. Each rubric has:

Field Type Description
type str "ordinal" or "binary"
scale list[int] \| None Ordered list of valid score values
labels dict[int, str] \| None Human-readable label per level
pass_threshold int \| None Minimum score that counts as passing
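The pass_threshold semantics are simple enough to sanity-check by hand. A sketch using a plain dict as a stand-in for a MetricRubric (hypothetical helper, not part of the library):

```python
def pass_rate(scores, rubric):
    """Fraction of scores at or above the rubric's pass_threshold."""
    threshold = rubric["pass_threshold"]
    return sum(s >= threshold for s in scores) / len(scores)

# Plain-dict stand-in mirroring the faithfulness rubric above
faithfulness = {"type": "ordinal", "scale": [1, 2, 3, 4, 5], "pass_threshold": 4}
pass_rate([5, 4, 3, 2, 4], faithfulness)  # 3 of 5 scores pass -> 0.6
```

Keeping the threshold in the rubric file rather than in code means a graded dataset can be re-scored under a stricter or looser bar without touching the evaluation pipeline.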