Bias Correction & Inter-Judge Agreement

The Problem: Judge Bias

LLM judges (GPT-4, Claude, etc.) have systematic biases. Some are consistently lenient, others are harsh, and most are inconsistent across categories. If you use raw judge scores as ground truth, every downstream metric inherits that bias.

Prediction-Powered Inference (PPI) fixes this. The idea: score a small calibration set with both the judge and a human, measure the gap, and apply that correction to the full dataset. You get a bias-corrected estimate with valid confidence intervals -- even when 95% of your labels come from an automated judge.

Prediction-Powered Inference (PPI)

ppi_mean

Corrects judge bias using a calibration set where both the judge and a human scored the same items.

Formula: theta_ppi = mean(judge_unlabeled) + mean(human_calibration - judge_calibration)

The correction term mean(human - judge) on the calibration set estimates the systematic bias and shifts the full-dataset judge mean accordingly. The returned confidence interval accounts for uncertainty in both the judge scores and the calibration correction.

from latent.stats import ppi_mean

# 500 items scored by judge, 50 scored by both judge and human
result = ppi_mean(
    judge_scores=all_judge_scores,       # shape (500,)
    calibration_judge=cal_judge_scores,   # shape (50,)
    calibration_human=cal_human_scores,   # shape (50,)
)

print(f"Bias-corrected mean: {result.point_estimate:.3f}")
print(f"CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
print(f"Status: {result.calibration_status}")  # "calibrated"
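Under the hood, the classical PPI mean estimator is just the judge mean shifted by the average human-judge gap, with the two sources of uncertainty added in the variance. A minimal sketch (hypothetical `ppi_mean_sketch`, not the library's implementation; assumes a normal-approximation interval):

```python
import numpy as np
from statistics import NormalDist

def ppi_mean_sketch(judge_scores, cal_judge, cal_human, confidence_level=0.95):
    """Classical PPI mean with a normal-approximation confidence interval."""
    n, m = len(judge_scores), len(cal_judge)
    diffs = cal_human - cal_judge
    point = np.mean(judge_scores) + np.mean(diffs)  # judge mean + bias correction
    # Uncertainty from the judge scores and from the correction term add up.
    var = np.var(judge_scores, ddof=1) / n + np.var(diffs, ddof=1) / m
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half = z * np.sqrt(var)
    return point, point - half, point + half
```

Note the 1/m term: a small calibration set widens the interval, so the CI remains honest about how well the bias was estimated.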

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `judge_scores` | `np.ndarray` | Judge scores on the full (unlabeled) dataset |
| `calibration_judge` | `np.ndarray` | Judge scores on the calibration subset |
| `calibration_human` | `np.ndarray` | Human scores on the same calibration subset |
| `confidence_level` | `float` | Confidence level for the interval (default 0.95) |
| `seed` | `int \| None` | Random seed for reproducibility |

Returns: MetricResult with calibration_status="calibrated".

Tip

You only need 50-200 human labels to meaningfully correct bias across thousands of judge evaluations. The key requirement is that the calibration subset is randomly sampled from the same distribution as the full dataset.

stratified_ppi

When judge accuracy varies by category -- for example, a judge might be accurate on "billing" questions but systematically off on "refund" questions -- unstratified PPI leaves performance on the table. stratified_ppi applies PPI within each stratum and combines the results, producing tighter confidence intervals.

from latent.stats import stratified_ppi

result = stratified_ppi(
    judge_scores=scores,
    calibration_judge=cal_judge,
    calibration_human=cal_human,
    strata=categories,                 # e.g., ["billing", "refund", ...]
    calibration_strata=cal_categories,
)

print(f"Stratified bias-corrected mean: {result.point_estimate:.3f}")
print(f"CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `judge_scores` | `np.ndarray` | Judge scores on the full dataset |
| `calibration_judge` | `np.ndarray` | Judge scores on the calibration subset |
| `calibration_human` | `np.ndarray` | Human scores on the calibration subset |
| `strata` | `np.ndarray` | Category labels for each item in the full dataset |
| `calibration_strata` | `np.ndarray` | Category labels for each calibration item |
| `confidence_level` | `float` | Confidence level (default 0.95) |
| `seed` | `int \| None` | Random seed for reproducibility |

Returns: MetricResult with calibration_status="calibrated".

Note

Each stratum must have at least a few calibration samples. If a stratum has zero calibration data, stratified_ppi falls back to the unstratified correction for that stratum.
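The combination step can be sketched as a weighted sum of per-stratum corrected means, with weights equal to each stratum's share of the full dataset. A hypothetical sketch of the point estimate only (the library's CI handling and exact fallback behavior may differ):

```python
import numpy as np

def stratified_ppi_point(judge_scores, cal_judge, cal_human, strata, cal_strata):
    """Per-stratum PPI corrections, weighted by each stratum's share
    of the full dataset. Point estimate only; no confidence interval."""
    global_corr = np.mean(cal_human - cal_judge)  # unstratified fallback
    estimate = 0.0
    for s in np.unique(strata):
        mask = strata == s
        cal_mask = cal_strata == s
        if cal_mask.any():
            corr = np.mean(cal_human[cal_mask] - cal_judge[cal_mask])
        else:
            corr = global_corr  # no calibration data for this stratum
        weight = mask.mean()  # stratum's share of the full dataset
        estimate += weight * (judge_scores[mask].mean() + corr)
    return estimate
```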

uncalibrated_estimate

Fallback when no calibration data exists. Computes a standard bootstrap confidence interval on the raw judge scores, but marks the result as uncalibrated so downstream consumers know the estimate may be biased.

from latent.stats import uncalibrated_estimate

result = uncalibrated_estimate(scores=judge_scores)

print(f"Raw mean: {result.point_estimate:.3f}")
print(f"Status: {result.calibration_status}")  # "uncalibrated"
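The fallback amounts to a bootstrap on the raw judge mean. A minimal sketch, assuming a percentile (rather than BCa) interval:

```python
import numpy as np

def uncalibrated_sketch(scores, confidence_level=0.95, n_boot=2000, seed=0):
    """Percentile bootstrap CI on the raw judge mean. Captures sampling
    variance only; any systematic judge bias stays in the estimate."""
    rng = np.random.default_rng(seed)
    # Resample with replacement and record the mean of each resample.
    means = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    alpha = 1 - confidence_level
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi
```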

Warning

An uncalibrated estimate is better than nothing, but the confidence interval only captures sampling variance -- not judge bias. Treat these results as provisional until you collect calibration data.

Calibration Statistics

compute_calibration

Measures how well a judge's binary labels align with human labels. Reports sensitivity (true positive rate) and specificity (true negative rate), giving you a clear picture of where the judge succeeds and where it fails.

from latent.stats import compute_calibration

cal = compute_calibration(judge_labels, human_labels)

print(f"Sensitivity (TPR): {cal.sensitivity:.2f}")
print(f"Specificity (TNR): {cal.specificity:.2f}")
print(f"Calibration samples: {cal.n_samples}")

Returns: CalibrationStats with fields sensitivity, specificity, and n_samples.
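Both rates are simple counts over the 2x2 confusion table, with human labels treated as ground truth. A sketch of the computation (hypothetical function name):

```python
import numpy as np

def sensitivity_specificity(judge, human):
    """Judge binary labels scored against human labels as ground truth."""
    judge, human = np.asarray(judge), np.asarray(human)
    tp = np.sum((judge == 1) & (human == 1))  # judge caught a true positive
    fn = np.sum((judge == 0) & (human == 1))  # judge missed a positive
    tn = np.sum((judge == 0) & (human == 0))  # judge caught a true negative
    fp = np.sum((judge == 1) & (human == 0))  # judge false-alarmed
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity
```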

Info

compute_calibration requires at least 20 calibration samples to produce stable estimates. It raises InsufficientDataError if the sample count is below this threshold.

Inter-Judge Agreement

When multiple judges (human or automated) score the same items, you need to quantify how much they agree -- and whether that agreement exceeds chance.

cohens_kappa

Pairwise agreement between two judges, corrected for chance agreement.

from latent.stats import cohens_kappa

kappa = cohens_kappa(judge_a_labels, judge_b_labels)
print(f"Cohen's kappa: {kappa:.3f}")

| Range | Interpretation |
| --- | --- |
| 0.81 - 1.00 | Almost perfect agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.21 - 0.40 | Fair agreement |
| 0.00 - 0.20 | Slight agreement |
| < 0.00 | Less than chance agreement |
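Cohen's kappa compares observed agreement against the agreement expected from each judge's marginal label rates. A self-contained sketch:

```python
import numpy as np

def cohens_kappa_sketch(a, b):
    """Chance-corrected pairwise agreement for categorical labels."""
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)  # observed agreement rate
    cats = np.union1d(a, b)
    # Expected agreement if the two judges labeled independently
    # according to their own marginal category rates.
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in cats)
    return (p_o - p_e) / (1 - p_e)
```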

fleiss_kappa

Extends kappa to any number of judges. Each item must be scored by every judge.

from latent.stats import fleiss_kappa

# ratings shape: (n_items, n_judges)
kappa = fleiss_kappa(ratings)
print(f"Fleiss' kappa: {kappa:.3f}")
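Fleiss' kappa averages per-item pairwise agreement and corrects for chance using overall category prevalence. A sketch for complete ratings (every judge scores every item):

```python
import numpy as np

def fleiss_kappa_sketch(ratings):
    """ratings: (n_items, n_judges) categorical labels, no missing data."""
    ratings = np.asarray(ratings)
    n_items, n_judges = ratings.shape
    cats = np.unique(ratings)
    # counts[i, j] = how many judges assigned category j to item i
    counts = np.stack([(ratings == c).sum(axis=1) for c in cats], axis=1)
    p_j = counts.sum(axis=0) / (n_items * n_judges)  # category prevalence
    # Per-item agreement: fraction of judge pairs agreeing on the item.
    p_i = ((counts ** 2).sum(axis=1) - n_judges) / (n_judges * (n_judges - 1))
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)
```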

krippendorffs_alpha

The most flexible agreement metric. Handles missing data (judges can skip items) and supports different measurement levels.

from latent.stats import krippendorffs_alpha

# Ordinal scores (e.g., 1-5 rubric)
alpha = krippendorffs_alpha(ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")

Measurement levels:

| Level | Use case | Example |
| --- | --- | --- |
| `"nominal"` | Unordered categories | pass/fail, intent classes |
| `"ordinal"` | Ordered categories | 1-5 quality rubric |
| `"interval"` | Equal-spaced numeric | Temperature, year |
| `"ratio"` | Numeric with true zero | Latency, word count |

Tip

Use krippendorffs_alpha when judges may skip items or when you have ordinal scores. It is the most robust choice for real-world annotation setups where data is messy.

majority_vote

Aggregates binary scores from multiple judges via majority vote. Ties resolve to 0 (conservative).

from latent.stats import majority_vote

# judge_scores shape: (n_items, n_judges), values 0 or 1
aggregated = majority_vote(judge_scores)
print(f"Aggregated labels: {aggregated}")  # shape (n_items,)
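For binary scores, majority vote reduces to thresholding the per-item mean; a strict `> 0.5` comparison sends exact ties to 0, matching the conservative tie rule above. A one-function sketch:

```python
import numpy as np

def majority_vote_sketch(judge_scores):
    """Per-item majority over binary judges; exact ties resolve to 0."""
    judge_scores = np.asarray(judge_scores)
    # Strict > 0.5 means a 50/50 tie yields 0 (conservative).
    return (judge_scores.mean(axis=1) > 0.5).astype(int)
```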

disagreement_flags

Flags items where judges disagree above a given threshold. Useful for triaging ambiguous items that need human review.

from latent.stats import disagreement_flags

# Flag items where less than 50% of judges agree
flags = disagreement_flags(judge_scores, threshold=0.5)
print(f"Items needing review: {flags.sum()}")

Items where flags[i] is True had high disagreement and should be routed to a human reviewer.
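The docs leave the exact disagreement measure unspecified; one plausible reading is the pairwise judge-agreement rate (the fraction of judge pairs assigning the same label), under which a 2-1 split among three judges scores 1/3 and gets flagged at threshold=0.5. A hedged sketch under that assumption:

```python
import numpy as np

def disagreement_flags_sketch(judge_scores, threshold=0.5):
    """Flag items whose pairwise judge-agreement rate falls below
    `threshold`. One plausible definition; the library may differ."""
    judge_scores = np.asarray(judge_scores)
    n_judges = judge_scores.shape[1]
    ones = judge_scores.sum(axis=1)
    zeros = n_judges - ones
    # Agreeing judge pairs (both 1 or both 0) out of all pairs, per item.
    agree_pairs = ones * (ones - 1) / 2 + zeros * (zeros - 1) / 2
    total_pairs = n_judges * (n_judges - 1) / 2
    return (agree_pairs / total_pairs) < threshold
```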

Practical Workflow

Putting it all together -- from raw judge scores to a bias-corrected estimate with confidence intervals:

Step 1: Collect calibration data

Have N judges (human or automated) score a shared calibration set. This is your ground truth.

import numpy as np
from latent.stats import fleiss_kappa, majority_vote

# 3 judges scored 100 calibration items (binary: 0 or 1)
calibration_ratings = np.array(...)  # shape (100, 3)

Step 2: Measure agreement

Compute inter-judge agreement to verify your human labels are reliable.

kappa = fleiss_kappa(calibration_ratings)
print(f"Fleiss' kappa: {kappa:.3f}")

if kappa < 0.4:
    print("Warning: low agreement. Review annotation guidelines.")

Step 3: Aggregate human labels

If agreement is acceptable, aggregate the human ratings into a single label per item.

calibration_human = majority_vote(calibration_ratings)

Step 4: Apply PPI correction

Use the aggregated human labels as the calibration reference to correct the judge on the full dataset.

from latent.stats import ppi_mean

result = ppi_mean(
    judge_scores=all_judge_scores,         # 5000 items scored by LLM judge
    calibration_judge=cal_judge_scores,    # same 100 items scored by LLM judge
    calibration_human=calibration_human,   # aggregated human labels for those 100
)

print(f"Bias-corrected mean: {result.point_estimate:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")

Step 5: Flag disagreements for review

Identify items in the calibration set where judges disagreed -- these often reveal ambiguous cases or annotation guideline gaps.

from latent.stats import disagreement_flags

flags = disagreement_flags(calibration_ratings, threshold=0.5)
ambiguous_items = np.where(flags)[0]
print(f"{len(ambiguous_items)} items flagged for review")

How many calibration samples do I need?

As few as 50 randomly sampled items can meaningfully reduce bias; 100-200 is a good target for most use cases. The calibration set must be a random sample from the same distribution as the full dataset.

See Also