Classification, Structured Output & Ordinal Score Metrics¶
Metrics for evaluating discrete predictions, structured LLM outputs, and ordinal scoring rubrics.
All functions return MetricResult objects with point estimates and confidence intervals.
Classification Metrics¶
classification_metrics¶
```python
from latent.stats import classification_metrics

metrics = classification_metrics(
    y_pred=predictions,
    y_true=actuals,
    labels=None,            # inferred from data when None
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)
```
Computes accuracy, macro/micro/weighted F1, macro precision, macro recall, and MCC. Every metric comes with a BCa bootstrap confidence interval.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| y_pred | list \| np.ndarray | required | Predicted labels |
| y_true | list \| np.ndarray | required | Ground-truth labels |
| labels | list[str] \| None | None | Ordered label names. Inferred from data when None. |
| n_resamples | int | 10_000 | Bootstrap resamples |
| confidence_level | float | 0.95 | CI confidence level |
| seed | int \| None | None | Random seed for reproducibility |
Returns: dict[str, MetricResult] with keys accuracy, precision_macro, recall_macro, f1_macro, f1_micro, f1_weighted, mcc.
Example -- intent classification with 5 classes:
```python
from latent.stats import classification_metrics

predictions = ["book_flight", "weather", "book_flight", "music", "alarm", ...]
actuals = ["book_flight", "weather", "alarm", "music", "alarm", ...]

metrics = classification_metrics(y_pred=predictions, y_true=actuals)
for name, result in metrics.items():
    print(f"{name}: {result.point_estimate:.3f} "
          f"[{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
```

```
accuracy: 0.823 [0.791, 0.854]
precision_macro: 0.810 [0.772, 0.845]
recall_macro: 0.805 [0.768, 0.840]
f1_macro: 0.807 [0.770, 0.842]
f1_micro: 0.823 [0.791, 0.854]
f1_weighted: 0.821 [0.789, 0.852]
mcc: 0.776 [0.735, 0.814]
```
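To make the macro/micro distinction concrete, here is a minimal NumPy sketch (illustrative only, not the library's implementation) of both F1 variants; note that for single-label multiclass data, micro F1 coincides with accuracy:

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Illustrative macro and micro F1 from two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    yt, yp = np.asarray(y_true), np.asarray(y_pred)
    per_class = []
    tp_total = fp_total = fn_total = 0
    for c in labels:
        tp = int(np.sum((yp == c) & (yt == c)))
        fp = int(np.sum((yp == c) & (yt != c)))
        fn = int(np.sum((yp != c) & (yt == c)))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)  # per-class F1
        tp_total += tp; fp_total += fp; fn_total += fn
    macro = float(np.mean(per_class))                        # unweighted class mean
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)  # pooled counts
    return macro, micro
```

Macro F1 weights every class equally, so a rare, badly-predicted class drags it down; micro F1 pools counts and tracks overall accuracy.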
per_class_metrics¶
```python
from latent.stats import per_class_metrics

per_class = per_class_metrics(
    y_pred=predictions,
    y_true=actuals,
    labels=None,
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)
```
Returns per-class precision, recall, and F1 -- each with bootstrap CIs.
Returns: dict[str, dict[str, MetricResult]] -- outer key is the class label, inner keys are precision, recall, f1.
Example -- finding which intent class underperforms:
```python
for label, label_metrics in per_class.items():
    f1 = label_metrics["f1"]
    flag = " << low" if f1.point_estimate < 0.70 else ""
    print(f" {label:20s} F1={f1.point_estimate:.3f} "
          f"[{f1.ci_lower:.3f}, {f1.ci_upper:.3f}]{flag}")
```

```
 alarm                F1=0.912 [0.871, 0.946]
 book_flight          F1=0.854 [0.810, 0.893]
 music                F1=0.790 [0.738, 0.836]
 play_podcast         F1=0.643 [0.581, 0.702] << low
 weather              F1=0.838 [0.793, 0.878]
```
confusion_matrix_with_ci¶
```python
from latent.stats import confusion_matrix_with_ci

cm = confusion_matrix_with_ci(
    y_pred=predictions,
    y_true=actuals,
    labels=None,
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)
```
Builds a standard confusion matrix and adds per-cell bootstrap CIs.
Returns: ConfusionMatrixResult with fields:
| Field | Type | Description |
|---|---|---|
| matrix | list[list[int]] | Raw confusion matrix counts |
| ci_lower | list[list[float]] | Per-cell lower CI bounds |
| ci_upper | list[list[float]] | Per-cell upper CI bounds |
| labels | list[str] | Ordered class labels |
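The per-cell resampling idea can be sketched as follows (an illustrative percentile-bootstrap version in NumPy; the library's internal resampling scheme may differ):

```python
import numpy as np

def confusion_matrix_bootstrap(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Confusion matrix counts plus per-cell percentile-bootstrap CIs (sketch)."""
    rng = np.random.default_rng(seed)
    labels = sorted(set(y_true) | set(y_pred))
    idx = {c: i for i, c in enumerate(labels)}
    yt = np.array([idx[c] for c in y_true])
    yp = np.array([idx[c] for c in y_pred])
    k, n = len(labels), len(yt)

    def counts(t, p):
        m = np.zeros((k, k), dtype=int)
        np.add.at(m, (t, p), 1)   # rows = true label, cols = predicted label
        return m

    matrix = counts(yt, yp)
    boot = np.empty((n_resamples, k, k))
    for b in range(n_resamples):
        r = rng.integers(0, n, size=n)        # resample (true, pred) pairs
        boot[b] = counts(yt[r], yp[r])
    lo = np.percentile(boot, 100 * alpha / 2, axis=0)
    hi = np.percentile(boot, 100 * (1 - alpha / 2), axis=0)
    return labels, matrix, lo, hi
```

Resampling whole (true, pred) pairs preserves the dependence between rows and columns, which is why the cell CIs are computed jointly rather than per cell in isolation.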
Structured Output Metrics¶
Use these when your LLM produces JSON, Pydantic models, or other structured data.
schema_compliance¶
```python
from latent.stats import schema_compliance

result = schema_compliance(
    outputs=llm_outputs,    # list of dicts
    schema=json_schema,     # JSON Schema dict
)
```
Validates each output against a JSON schema and returns the compliance rate with a Wilson CI.
Example -- checking whether an LLM returns valid JSON:
```python
schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["intent", "confidence"],
}

result = schema_compliance(outputs=llm_outputs, schema=schema)
print(f"Schema compliance: {result.point_estimate:.1%} "
      f"[{result.ci_lower:.1%}, {result.ci_upper:.1%}]")
```
field_level_accuracy¶
```python
from latent.stats import field_level_accuracy

results = field_level_accuracy(
    outputs=predicted_dicts,
    expected=ground_truth_dicts,
    fields=None,            # evaluate all fields when None
)
```
Per-field exact-match accuracy with Wilson CIs. Useful for comparing extracted entities field by field.
Returns: dict[str, MetricResult] keyed by field name.
Example:
```python
predicted = [
    {"name": "Alice", "city": "NYC", "age": 30},
    {"name": "Bob", "city": "Boston", "age": 25},
]
expected = [
    {"name": "Alice", "city": "NYC", "age": 31},
    {"name": "Bob", "city": "LA", "age": 25},
]

for field, m in field_level_accuracy(predicted, expected).items():
    print(f" {field}: {m.point_estimate:.0%}")
```
composite_accuracy¶
```python
from latent.stats import composite_accuracy

result = composite_accuracy(
    outputs=predicted_dicts,
    expected=ground_truth_dicts,
    fields=None,
)
```
For each row, computes the fraction of fields that match exactly, then aggregates with a bootstrap CI. A row counts as fully correct only when every evaluated field matches.
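The per-row scoring can be sketched like this (illustrative; the library then bootstraps a CI over the mean of these per-row scores):

```python
def composite_rows(outputs, expected, fields=None):
    """Per-row fraction of exactly matching fields (sketch)."""
    fields = fields or sorted({k for row in expected for k in row})
    return [
        sum(o.get(f) == e.get(f) for f in fields) / len(fields)
        for o, e in zip(outputs, expected)
    ]
```

For the name/city/age example above, both rows score 2/3, so the composite accuracy point estimate is about 67% even though no row is fully correct.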
result_set_match¶
```python
from latent.stats import result_set_match

result = result_set_match(
    actual_sets=predicted_sets,    # list[list[dict]]
    expected_sets=ground_truth_sets,
)
```
Order-independent set matching. Each pair of (actual, expected) is compared as an unordered set of rows. Returns a Wilson CI on the exact-match rate.
Example -- comparing SQL query result sets:
```python
predicted_sets = [
    [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
    [{"id": 3, "name": "C"}],
]
expected_sets = [
    [{"id": 2, "name": "B"}, {"id": 1, "name": "A"}],  # same rows, different order
    [{"id": 3, "name": "D"}],                          # mismatch
]

m = result_set_match(predicted_sets, expected_sets)
print(f"Set match rate: {m.point_estimate:.0%}")  # 50%
```
Ordinal Scores¶
Ordinal metrics handle Likert-style or rubric-graded scores (e.g. 1--5 faithfulness).
ordinal_distribution¶
```python
from latent.stats import ordinal_distribution
from latent.stats.config import MetricRubric

rubric = MetricRubric(
    type="ordinal",
    scale=[1, 2, 3, 4, 5],
    labels={
        1: "Hallucinated",
        2: "Mostly wrong",
        3: "Partial",
        4: "Minor issues",
        5: "Faithful",
    },
    pass_threshold=4,
)

dist = ordinal_distribution(
    scores=scores,
    rubric=rubric,
    metric_name="faithfulness",
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

print(f"Pass rate: {dist.pass_rate:.1%}")
print(f"Median: {dist.median} [{dist.median_ci_lower}, {dist.median_ci_upper}]")
for level in dist.levels:
    marker = " *" if level.is_passing else ""
    print(f" {level.label}: {level.proportion:.1%} "
          f"[{level.ci_lower:.1%}, {level.ci_upper:.1%}]{marker}")
```
Returns: OrdinalDistribution with:
| Field | Type | Description |
|---|---|---|
| pass_rate | float | Proportion of scores >= pass_threshold |
| pass_rate_ci_lower / pass_rate_ci_upper | float | Wilson CI on pass rate |
| pass_threshold | int | Threshold used |
| levels | list[OrdinalLevel] | Per-level proportion, CI, label, is_passing flag |
| median | int | Median score |
| median_ci_lower / median_ci_upper | int | Bootstrap CI on median |
| cumulative | dict[int, MetricResult] | Cumulative proportions (% >= X) per level |
Tip
If you omit the rubric, one is inferred from the observed score values with generic labels. You will see a warning recommending explicit configuration.
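The point estimates in the distribution reduce to simple counting, sketched here in NumPy (illustrative only; the library wraps each value with the CIs listed above):

```python
import numpy as np

def ordinal_summary(scores, scale, pass_threshold):
    """Point estimates for an ordinal score distribution (sketch)."""
    s = np.asarray(scores)
    return {
        "levels": {lvl: float(np.mean(s == lvl)) for lvl in scale},  # per-level share
        "pass_rate": float(np.mean(s >= pass_threshold)),
        "median": float(np.median(s)),
    }
```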
binarize¶
```python
from latent.stats import binarize

binary = binarize(scores, threshold=4)
# array([0., 0., 1., 1., 1., ...])
```
Converts ordinal scores to binary pass/fail (1 if >= threshold, else 0). Returns a float np.ndarray.
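The documented behavior is equivalent to this one-liner (a sketch, not the library source):

```python
import numpy as np

def binarize_sketch(scores, threshold):
    """1.0 where score >= threshold, else 0.0, as a float array."""
    return (np.asarray(scores) >= threshold).astype(float)
```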
median_ci¶
```python
from latent.stats import median_ci

med, ci_lo, ci_hi = median_ci(scores, n_resamples=10_000, confidence_level=0.95, seed=None)
print(f"Median: {med} [{ci_lo}, {ci_hi}]")
```
Bootstrap CI on the median. Returns a tuple of three ints (median, ci_lower, ci_upper).
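A percentile-bootstrap version of this can be sketched in NumPy (illustrative; the library may use a different bootstrap variant):

```python
import numpy as np

def median_ci_sketch(scores, n_resamples=2000, confidence_level=0.95, seed=0):
    """Percentile-bootstrap CI on the median (sketch)."""
    rng = np.random.default_rng(seed)
    s = np.asarray(scores)
    # Each bootstrap row resamples the data with replacement, then takes the median
    meds = np.median(s[rng.integers(0, len(s), size=(n_resamples, len(s)))], axis=1)
    alpha = 1 - confidence_level
    lo, hi = np.percentile(meds, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.median(s)), float(lo), float(hi)
```

Because ordinal scores take few distinct values, the bootstrap median distribution is lumpy; the CI endpoints will land on scale values rather than interpolating smoothly.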
cumulative_proportions¶
```python
from latent.stats import cumulative_proportions

cumul = cumulative_proportions(
    scores,
    scale=[1, 2, 3, 4, 5],
    n_resamples=10_000,
    confidence_level=0.95,
    seed=None,
)

for level, m in cumul.items():
    print(f" >= {level}: {m.point_estimate:.1%} "
          f"[{m.ci_lower:.1%}, {m.ci_upper:.1%}]")
```
Answers "what percentage scored >= X?" for each level. Returns dict[int, MetricResult] with bootstrap CIs.
Rubric Configuration¶
Define rubrics in a YAML file to keep scoring criteria out of code:
```yaml
metrics:
  faithfulness:
    type: ordinal
    scale: [1, 2, 3, 4, 5]
    labels:
      1: Hallucinated
      2: Mostly wrong
      3: Partially faithful
      4: Minor issues
      5: Fully faithful
    pass_threshold: 4
  helpfulness:
    type: ordinal
    scale: [1, 2, 3, 4, 5]
    labels:
      1: Unhelpful
      2: Slightly helpful
      3: Moderately helpful
      4: Helpful
      5: Very helpful
    pass_threshold: 3
```
Load with load_rubric_config:
```python
from latent.stats.config import load_rubric_config

config = load_rubric_config("rubrics.yaml")
faithfulness_rubric = config.metrics["faithfulness"]

dist = ordinal_distribution(scores, rubric=faithfulness_rubric, metric_name="faithfulness")
```
The RubricConfig object exposes a metrics dict mapping each metric name to a MetricRubric. Each rubric has:
| Field | Type | Description |
|---|---|---|
| type | str | "ordinal" or "binary" |
| scale | list[int] \| None | Ordered list of valid score values |
| labels | dict[int, str] \| None | Human-readable label per level |
| pass_threshold | int \| None | Minimum score that counts as passing |