# Bias Correction & Inter-Judge Agreement
## The Problem: Judge Bias
LLM judges (GPT-4, Claude, etc.) have systematic biases. Some are consistently lenient, others are harsh, and most are inconsistent across categories. If you use raw judge scores as ground truth, every downstream metric inherits that bias.
Prediction-Powered Inference (PPI) fixes this. The idea: score a small calibration set with both the judge and a human, measure the gap, and apply that correction to the full dataset. You get a bias-corrected estimate with valid confidence intervals -- even when 95% of your labels come from an automated judge.
## Prediction-Powered Inference (PPI)
### `ppi_mean`
Corrects judge bias using a calibration set where both the judge and a human scored the same items.
Formula: `theta_ppi = mean(judge_unlabeled) + mean(human_calibration - judge_calibration)`
The correction term mean(human - judge) on the calibration set estimates the systematic bias and shifts the full-dataset judge mean accordingly. The returned confidence interval accounts for uncertainty in both the judge scores and the calibration correction.
```python
from latent.stats import ppi_mean

# 500 items scored by judge, 50 scored by both judge and human
result = ppi_mean(
    judge_scores=all_judge_scores,       # shape (500,)
    calibration_judge=cal_judge_scores,  # shape (50,)
    calibration_human=cal_human_scores,  # shape (50,)
)
print(f"Bias-corrected mean: {result.point_estimate:.3f}")
print(f"CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
print(f"Status: {result.calibration_status}")  # "calibrated"
```
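To make the formula concrete, here is a minimal plain-numpy sketch of the point estimate on synthetic data — not the library implementation (which also produces the confidence interval), just the arithmetic behind it, with a simulated judge that over-scores by a constant amount:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ground truth, and a judge with a constant +0.12 bias plus noise.
human_true = rng.binomial(1, 0.6, size=500).astype(float)
judge_all = human_true + 0.12 + rng.normal(0, 0.05, size=500)

# Calibration subset: 50 items scored by both judge and human.
cal_judge, cal_human = judge_all[:50], human_true[:50]

# theta_ppi = mean(judge) + mean(human_cal - judge_cal)
correction = np.mean(cal_human - cal_judge)
theta_ppi = np.mean(judge_all) + correction

print(f"raw judge mean: {np.mean(judge_all):.3f}")
print(f"PPI-corrected:  {theta_ppi:.3f}  (true mean: {np.mean(human_true):.3f})")
```

The correction term estimated from only 50 paired labels removes almost all of the systematic +0.12 bias; what remains is ordinary sampling noise.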
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `judge_scores` | `np.ndarray` | Judge scores on the full (unlabeled) dataset |
| `calibration_judge` | `np.ndarray` | Judge scores on the calibration subset |
| `calibration_human` | `np.ndarray` | Human scores on the same calibration subset |
| `confidence_level` | `float` | Confidence level for the interval (default 0.95) |
| `seed` | `int \| None` | Random seed for reproducibility |
Returns: `MetricResult` with `calibration_status="calibrated"`.
!!! tip
    You only need 50-200 human labels to meaningfully correct bias across thousands of judge evaluations. The key requirement is that the calibration subset is randomly sampled from the same distribution as the full dataset.
### `stratified_ppi`
When judge accuracy varies by category -- for example, a judge might be accurate on "billing" questions but systematically off on "refund" questions -- unstratified PPI leaves performance on the table. `stratified_ppi` applies PPI within each stratum and combines the results, producing tighter confidence intervals.
```python
from latent.stats import stratified_ppi

result = stratified_ppi(
    judge_scores=scores,
    calibration_judge=cal_judge,
    calibration_human=cal_human,
    strata=categories,  # e.g., ["billing", "refund", ...]
    calibration_strata=cal_categories,
)
print(f"Stratified bias-corrected mean: {result.point_estimate:.3f}")
print(f"CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
```
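Conceptually, the stratified estimate is a per-stratum bias correction combined by stratum weight. A simplified plain-numpy sketch of the point estimate (a hypothetical helper, ignoring the confidence-interval machinery):

```python
import numpy as np

def stratified_ppi_point(judge, cal_judge, cal_human, strata, cal_strata):
    """Per-stratum PPI point estimates, combined by full-dataset stratum weights n_s / N."""
    judge, strata = np.asarray(judge), np.asarray(strata)
    cal_judge, cal_human = np.asarray(cal_judge), np.asarray(cal_human)
    cal_strata = np.asarray(cal_strata)
    estimate = 0.0
    for s in np.unique(strata):
        in_stratum = strata == s
        in_cal = cal_strata == s
        # Bias correction estimated only from this stratum's calibration items.
        correction = np.mean(cal_human[in_cal] - cal_judge[in_cal])
        theta_s = judge[in_stratum].mean() + correction
        estimate += in_stratum.mean() * theta_s  # weight n_s / N
    return estimate
```

Because each stratum gets its own correction, a judge that is lenient on one category and harsh on another is fixed in both places rather than averaged out.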
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `judge_scores` | `np.ndarray` | Judge scores on the full dataset |
| `calibration_judge` | `np.ndarray` | Judge scores on the calibration subset |
| `calibration_human` | `np.ndarray` | Human scores on the calibration subset |
| `strata` | `np.ndarray` | Category labels for each item in the full dataset |
| `calibration_strata` | `np.ndarray` | Category labels for each calibration item |
| `confidence_level` | `float` | Confidence level (default 0.95) |
| `seed` | `int \| None` | Random seed for reproducibility |
Returns: `MetricResult` with `calibration_status="calibrated"`.
!!! note
    Each stratum must have at least a few calibration samples. If a stratum has zero calibration data, `stratified_ppi` falls back to the unstratified correction for that stratum.
### `uncalibrated_estimate`
Fallback when no calibration data exists. Computes a standard bootstrap confidence interval on the raw judge scores, but marks the result as uncalibrated so downstream consumers know the estimate may be biased.
```python
from latent.stats import uncalibrated_estimate

result = uncalibrated_estimate(scores=judge_scores)
print(f"Raw mean: {result.point_estimate:.3f}")
print(f"Status: {result.calibration_status}")  # "uncalibrated"
```
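For intuition, the uncalibrated interval is conceptually a percentile bootstrap on the raw scores. A sketch of that idea (a hypothetical helper, not the library's exact resampling scheme):

```python
import numpy as np

def bootstrap_mean_ci(scores, n_boot=2000, confidence_level=0.95, seed=0):
    """Percentile-bootstrap CI for the mean: sampling variance only, no bias correction."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    # Resample with replacement and record the mean of each resample.
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    tail = (1 - confidence_level) / 2
    return scores.mean(), np.quantile(means, tail), np.quantile(means, 1 - tail)
```

Note what this interval does *not* contain: if the judge is systematically 0.1 too lenient, the whole interval is shifted by 0.1 and nothing in the resampling can detect it.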
!!! warning
    An uncalibrated estimate is better than nothing, but the confidence interval only captures sampling variance -- not judge bias. Treat these results as provisional until you collect calibration data.
## Calibration Statistics
### `compute_calibration`
Measures how well a judge's binary labels align with human labels. Reports sensitivity (true positive rate) and specificity (true negative rate), giving you a clear picture of where the judge succeeds and where it fails.
```python
from latent.stats import compute_calibration

cal = compute_calibration(judge_labels, human_labels)
print(f"Sensitivity (TPR): {cal.sensitivity:.2f}")
print(f"Specificity (TNR): {cal.specificity:.2f}")
print(f"Calibration samples: {cal.n_samples}")
```
Returns: `CalibrationStats` with fields `sensitivity`, `specificity`, and `n_samples`.
!!! info
    `compute_calibration` requires at least 20 calibration samples to produce stable estimates. It raises `InsufficientDataError` if the sample count is below this threshold.
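The two rates are simple to compute by hand, treating the human labels as ground truth. A plain-numpy sketch of the definitions:

```python
import numpy as np

def sensitivity_specificity(judge_labels, human_labels):
    """Sensitivity (TPR) and specificity (TNR) of binary judge labels vs. human labels."""
    judge = np.asarray(judge_labels, dtype=bool)
    human = np.asarray(human_labels, dtype=bool)
    sensitivity = np.sum(judge & human) / np.sum(human)     # recall on human-positive items
    specificity = np.sum(~judge & ~human) / np.sum(~human)  # recall on human-negative items
    return sensitivity, specificity
```

A judge with high sensitivity but low specificity is lenient (it rarely misses a true pass, but waves through failures); the reverse pattern indicates a harsh judge.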
## Inter-Judge Agreement
When multiple judges (human or automated) score the same items, you need to quantify how much they agree -- and whether that agreement exceeds chance.
### `cohens_kappa`
Pairwise agreement between two judges, corrected for chance agreement.
```python
from latent.stats import cohens_kappa

kappa = cohens_kappa(judge_a_labels, judge_b_labels)
print(f"Cohen's kappa: {kappa:.3f}")
```
| Range | Interpretation |
|---|---|
| 0.81 - 1.00 | Almost perfect agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.21 - 0.40 | Fair agreement |
| 0.00 - 0.20 | Slight agreement |
| < 0.00 | Less than chance agreement |
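For reference, the statistic is `(p_o - p_e) / (1 - p_e)`, where `p_o` is the observed agreement rate and `p_e` is the agreement expected by chance given each judge's label frequencies. A minimal sketch of the computation (not the library implementation):

```python
import numpy as np

def cohens_kappa_sketch(a, b):
    """Chance-corrected agreement between two raters over the same items."""
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)  # observed agreement
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(np.mean(a == label) * np.mean(b == label) for label in np.union1d(a, b))
    return (p_o - p_e) / (1 - p_e)
```

The chance correction is what separates kappa from raw percent agreement: two judges who both say "pass" 95% of the time agree 90%+ of the time by luck alone.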
### `fleiss_kappa`
Extends kappa to any number of judges. Each item must be scored by every judge.
```python
from latent.stats import fleiss_kappa

# ratings shape: (n_items, n_judges)
kappa = fleiss_kappa(ratings)
print(f"Fleiss' kappa: {kappa:.3f}")
```
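The computation follows the standard Fleiss construction: average per-item agreement across all judge pairs, chance-corrected by the pooled category frequencies. A plain-numpy sketch (not the library implementation):

```python
import numpy as np

def fleiss_kappa_sketch(ratings):
    """Fleiss' kappa for a (n_items, n_judges) matrix of category labels."""
    ratings = np.asarray(ratings)
    n_items, n_judges = ratings.shape
    # counts[i, k] = number of judges assigning category k to item i
    counts = np.stack([(ratings == c).sum(axis=1) for c in np.unique(ratings)], axis=1)
    # Per-item agreement: fraction of agreeing judge pairs.
    p_i = (np.sum(counts ** 2, axis=1) - n_judges) / (n_judges * (n_judges - 1))
    p_bar = p_i.mean()
    p_j = counts.sum(axis=0) / (n_items * n_judges)  # pooled category frequencies
    p_e = np.sum(p_j ** 2)                           # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```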
### `krippendorffs_alpha`
The most flexible agreement metric. Handles missing data (judges can skip items) and supports different measurement levels.
```python
from latent.stats import krippendorffs_alpha

# Ordinal scores (e.g., 1-5 rubric)
alpha = krippendorffs_alpha(ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```
Measurement levels:
| Level | Use case | Example |
|---|---|---|
| `"nominal"` | Unordered categories | pass/fail, intent classes |
| `"ordinal"` | Ordered categories | 1-5 quality rubric |
| `"interval"` | Equal-spaced numeric | Temperature, year |
| `"ratio"` | Numeric with true zero | Latency, word count |
!!! tip
    Use `krippendorffs_alpha` when judges may skip items or when you have ordinal scores. It is the most robust choice for real-world annotation setups where data is messy.
### `majority_vote`
Aggregates binary scores from multiple judges via majority vote. Ties resolve to 0 (conservative).
```python
from latent.stats import majority_vote

# judge_scores shape: (n_items, n_judges), values 0 or 1
aggregated = majority_vote(judge_scores)
print(f"Aggregated labels: {aggregated}")  # shape (n_items,)
```
### `disagreement_flags`
Flags items where judges disagree above a given threshold. Useful for triaging ambiguous items that need human review.
```python
from latent.stats import disagreement_flags

# Flag items where less than 50% of judges agree
flags = disagreement_flags(judge_scores, threshold=0.5)
print(f"Items needing review: {flags.sum()}")
```
Items where `flags[i]` is `True` had high disagreement and should be routed to a human reviewer.
## Practical Workflow
Putting it all together -- from raw judge scores to a bias-corrected estimate with confidence intervals:
### Step 1: Collect calibration data
Have N judges (human or automated) score a shared calibration set. This is your ground truth.
```python
import numpy as np
from latent.stats import fleiss_kappa, majority_vote

# 3 judges scored 100 calibration items (binary: 0 or 1)
calibration_ratings = np.array(...)  # shape (100, 3)
```
### Step 2: Measure agreement
Compute inter-judge agreement to verify your human labels are reliable.
```python
kappa = fleiss_kappa(calibration_ratings)
print(f"Fleiss' kappa: {kappa:.3f}")

if kappa < 0.4:
    print("Warning: low agreement. Review annotation guidelines.")
```
### Step 3: Aggregate human labels
If agreement is acceptable, aggregate the human ratings into a single label per item.
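For binary ratings, a majority vote does the aggregation. Here is a plain-numpy stand-in (assuming the same tie-to-0 convention as `majority_vote`), on a toy 4-judge matrix so the tie case is visible:

```python
import numpy as np

# Toy matrix: 3 items x 4 judges (binary labels).
calibration_ratings = np.array([
    [1, 1, 1, 0],  # 3 of 4 say pass -> 1
    [1, 1, 0, 0],  # 2-2 tie        -> 0 (conservative)
    [0, 0, 0, 1],  # 1 of 4 say pass -> 0
])
n_judges = calibration_ratings.shape[1]
# Strict majority of 1s wins; ties resolve to 0.
calibration_human = (calibration_ratings.sum(axis=1) * 2 > n_judges).astype(int)
print(calibration_human)  # [1 0 0]
```

In practice you would call the library's `majority_vote` on the full rating matrix and use its output as `calibration_human` in the next step.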
### Step 4: Apply PPI correction
Use the aggregated human labels as the calibration reference to correct the judge on the full dataset.
```python
from latent.stats import ppi_mean

result = ppi_mean(
    judge_scores=all_judge_scores,        # 5000 items scored by LLM judge
    calibration_judge=cal_judge_scores,   # same 100 items scored by LLM judge
    calibration_human=calibration_human,  # aggregated human labels for those 100
)
print(f"Bias-corrected mean: {result.point_estimate:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
```
### Step 5: Flag disagreements for review
Identify items in the calibration set where judges disagreed -- these often reveal ambiguous cases or annotation guideline gaps.
```python
from latent.stats import disagreement_flags

flags = disagreement_flags(calibration_ratings, threshold=0.5)
ambiguous_items = np.where(flags)[0]
print(f"{len(ambiguous_items)} items flagged for review")
```
**How many calibration samples do I need?**

As few as 50 randomly sampled items can meaningfully reduce bias; 100-200 is a good target for most use cases. The calibration set must be a random sample from the same distribution as the full dataset.
## See Also
- Core Primitives -- Bootstrap CIs, Wilson intervals, permutation tests
- System Comparison -- A/B testing two systems with paired bootstrap
- Reporting -- Rendering results to markdown and MLflow