Prompt Optimizers

Latent ships four prompt optimizers that improve agent and judge prompts through automated experimentation. Each optimizer records every trial via ExperimentTracker, integrates with MLflow, and produces an OptimizationResult you can analyze, compare, and persist.

When to Use Which Optimizer

| Optimizer | Best for | Method | Extras |
| --- | --- | --- | --- |
| DSPyOptimizer | Few-shot selection, instruction tuning | DSPy teleprompters | latent[optimizers] |
| ACEOptimizer | Section-level prompt refinement | ACE skillbook adaptation | latent[optimizers] |
| CombinedOptimizer | End-to-end two-phase optimization | DSPy candidate generation, then ACE calibration | latent[optimizers] |
| AutoResearchOptimizer | Codebase-level optimization (prompts, tools, schemas) | Claude Agent SDK + git keep/discard loop | latent[autoresearch] |

Start with DSPyOptimizer

For most prompt-tuning tasks, DSPyOptimizer with mipro_v2 is the fastest path to measurable improvement. Graduate to CombinedOptimizer or AutoResearchOptimizer when you need structural changes beyond prompt wording.

Installation

# DSPy, ACE, and Combined optimizers
pip install "latent[optimizers]"

# AutoResearch optimizer (Claude Agent SDK, macOS only)
pip install "latent[autoresearch]"

# Both
pip install "latent[optimizers,autoresearch]"

Core Data Model

All optimizers share these types, defined in latent.optimize.base and importable from latent.optimize:

from latent.optimize import (
    Optimizer,              # Protocol: any class with .optimize()
    OptimizationResult,     # Full run: experiments, best_score, best_label
    ExperimentResult,       # Single trial: label, score, kept, error, metadata
    OptimizedPrompt,        # Serializable artifact: prompt_template, few_shot_demos
    apply_optimized_prompt, # Apply OptimizedPrompt to an agent in-place
    ExperimentTracker,      # Thread-safe tracker with MLflow + checkpoint support
)

OptimizationResult (aliased as OptimizationReport) carries trial-level detail:

| Field | Type | Description |
| --- | --- | --- |
| backend | str | Which optimizer produced this ("dspy", "ace", "combined", "autoresearch") |
| experiments | list[ExperimentResult] | Every trial with score, kept/discarded status, metadata |
| best_score | float | Best score achieved |
| best_label | str | Label of the best experiment |
| best_config | dict | Configuration that produced the best result |
| duration_s | float | Wall-clock time in seconds |

DSPyOptimizer

Wraps a BaseAgent or Judge in a DSPy module and runs a teleprompter to optimize instructions and few-shot demonstrations.

Supported Teleprompters

| Teleprompter | Strategy | Key parameter |
| --- | --- | --- |
| mipro_v2 | Bayesian instruction + demo search | num_trials |
| copro | Cooperative prompt optimization | breadth |
| bootstrap_fewshot | Bootstrap demonstrations from teacher | max_bootstrapped_demos |
| simba | Step-wise instruction bootstrapping | num_steps |
| gepa | Genetic prompt algorithm | num_iterations |

Usage

from latent.optimize import DSPyOptimizer, apply_optimized_prompt, OptimizedPrompt
from latent.agents import Judge
from latent.agents.scores import ScoredModel, OrdinalScore
from typing import Annotated

class Scores(ScoredModel):
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]

judge = Judge("qa_judge", model="gpt-4o", output_type=Scores)

opt = DSPyOptimizer(
    judge,
    teleprompter="mipro_v2",
    teacher_model="gpt-4o",      # optional: model for generating demos
    student_model="gpt-4o-mini", # optional: model for the student
    train_split=0.8,
    seed=42,
)

def metric(example, prediction, trace=None) -> float:
    return float(prediction.quality >= 3)

result = opt.optimize(
    train_data=data,  # list[dict] with input/output pairs
    metric=metric,
    max_iterations=10,
)

print(f"Best score: {result.best_score}")

# Extract and apply the optimized prompt
best_prompt = OptimizedPrompt.from_best(result)
apply_optimized_prompt(judge, best_prompt)

Configuration

Pass additional teleprompter-specific keyword arguments directly to the constructor:

opt = DSPyOptimizer(
    agent,
    teleprompter="mipro_v2",
    # Extra kwargs forwarded to dspy.MIPROv2(...)
    teacher_settings={"lm": my_teacher_lm},
)

Metric signature

DSPy metrics use the signature (example, prediction, trace) -> float, whereas ACE metrics use (expected, predicted) -> float. When using CombinedOptimizer, pass dspy_metric and ace_metric separately.


ACEOptimizer

Uses the ACE (Automatic Calibration Engine) framework for section-level prompt refinement. The prompt is decomposed into a skillbook -- a collection of named sections that ACE adapts independently based on evaluation feedback.

Skillbook Concept

A skillbook is a structured prompt divided into sections. Each SectionConfig defines one section:

from latent.optimize.ace.skillbook import SectionConfig

sections = [
    SectionConfig(
        name="instructions",
        description="Core task instructions",
        initial_content="You are a quality scorer...",
        max_skills=10,
    ),
    SectionConfig(
        name="rubric",
        description="Scoring rubric and criteria",
        initial_content="Score from 1-5 based on...",
    ),
    SectionConfig(
        name="examples",
        description="Few-shot examples",
        initial_content="Example 1: ...",
    ),
]

Each round, ACE evaluates the agent, then calls skillbook.adapt(feedback) to refine individual sections based on what worked and what did not.
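To make the section-level idea concrete, here is a toy sketch of an evaluate-then-adapt step. ToySection and ToySkillbook are hypothetical stand-ins written for this illustration. They do not model the real ACE skillbook, and the render format and the shape of the feedback dict are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ToySection:
    """Hypothetical stand-in for one skillbook section."""
    name: str
    content: str
    notes: list[str] = field(default_factory=list)


@dataclass
class ToySkillbook:
    """Hypothetical stand-in; the real ACE skillbook is not modeled here."""
    sections: dict[str, ToySection]

    def render(self) -> str:
        # Join sections under simple headers to form a single prompt string.
        parts = []
        for section in self.sections.values():
            body = "\n".join([section.content, *section.notes])
            parts.append(f"## {section.name}\n{body}")
        return "\n\n".join(parts)

    def adapt(self, feedback: dict[str, str]) -> None:
        # Apply feedback only to the sections it names; others stay untouched.
        for name, note in feedback.items():
            if name in self.sections:
                self.sections[name].notes.append(note)


book = ToySkillbook({
    "instructions": ToySection("instructions", "You are a quality scorer..."),
    "rubric": ToySection("rubric", "Score from 1-5 based on..."),
})
before = book.render()
book.adapt({"rubric": "- Penalize answers that omit units."})
after = book.render()
```

The point of the sketch is only that feedback aimed at one section leaves the others unchanged, which is what "adapts independently" suggests.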

Usage

from latent.optimize import ACEOptimizer
from latent.optimize.ace.adapter import ACEAdapterConfig

opt = ACEOptimizer(
    agent,
    sections=sections,  # optional: defaults to single "instructions" section
    adapter_config=ACEAdapterConfig(
        cache_enabled=True,
        max_retries=3,
        timeout=30.0,
    ),
)

def metric(expected: str, predicted: str) -> float:
    return 1.0 if expected.strip() == predicted.strip() else 0.0

result = opt.optimize(
    train_data=data,
    metric=metric,          # (expected, predicted) -> float
    max_iterations=10,
)

ACE framework dependency

If ace-framework is not installed, the optimizer runs with a mock skillbook that does not adapt between rounds. Install ace-framework separately for real adaptation.


CombinedOptimizer

Chains DSPy (candidate generation) and ACE (calibration) in a two-phase pipeline. A single ExperimentTracker spans both phases.

  • Phase 1 -- DSPy: generates candidate prompts via teleprompter optimization.
  • Phase 2 -- ACE: takes the best DSPy candidate and refines it section by section.

Iterations are split evenly between the two phases (max_iterations // 2 each).
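As a quick check of the arithmetic, the split can be sketched as below. The helper name split_iterations is hypothetical, and the handling of an odd budget follows a literal reading of max_iterations // 2. Whether the library assigns the leftover iteration to either phase is not stated here.

```python
def split_iterations(max_iterations: int) -> tuple[int, int]:
    """Return (dspy_iterations, ace_iterations) under an even floor split."""
    per_phase = max_iterations // 2
    return per_phase, per_phase


even_budget = split_iterations(20)  # 10 iterations per phase
odd_budget = split_iterations(21)   # floor division leaves one iteration unassigned
```

Under this reading, an even max_iterations avoids any ambiguity about the remainder.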

Usage

from latent.optimize import CombinedOptimizer

opt = CombinedOptimizer(
    agent,
    dspy_config={
        "teleprompter": "mipro_v2",
        "teacher_model": "gpt-4o",
    },
    ace_config={
        "sections": [
            {"name": "instructions", "description": "Core instructions", "initial_content": "..."},
        ],
        "adapter_config": {"cache_enabled": True},
    },
)

result = opt.optimize(
    train_data=data,
    dspy_metric=dspy_metric,  # (example, prediction, trace) -> float
    ace_metric=ace_metric,    # (expected_str, predicted_str) -> float
    max_iterations=20,        # 10 DSPy + 10 ACE
)

Separate metrics

DSPy and ACE use incompatible metric signatures. Pass dspy_metric and ace_metric separately. If you pass a single metric, it must satisfy both calling conventions.
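If a single metric is unavoidable, one possible shape is a function that dispatches on argument type. This is a sketch under assumptions: the answer attribute is a hypothetical field name, and SimpleNamespace objects stand in for real DSPy examples and predictions.

```python
from types import SimpleNamespace


def dual_metric(first, second, trace=None) -> float:
    """Accept either (example, prediction, trace) or (expected, predicted)."""
    if isinstance(first, str) and isinstance(second, str):
        # ACE-style call: both arguments are already plain strings.
        expected, predicted = first, second
    else:
        # DSPy-style call: read the strings off the objects.
        # `answer` is a hypothetical field name used only in this sketch.
        expected, predicted = first.answer, second.answer
    return float(expected.strip() == predicted.strip())


# DSPy-style invocation with stand-in objects.
dspy_score = dual_metric(SimpleNamespace(answer="42"), SimpleNamespace(answer=" 42 "), None)

# ACE-style invocation with plain strings.
ace_score = dual_metric("42", "41")
```

Two separate metrics are usually easier to read and test, so this pattern is best kept for cases where the scoring logic really is identical in both phases.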


AutoResearchOptimizer

An autonomous code optimizer that uses the Claude Agent SDK to read your codebase, propose changes, and keep or discard them based on metric evaluation. Unlike the other optimizers, which tune prompt text, AutoResearch modifies actual source files (prompts, tool definitions, schemas) in a git-managed loop.

How It Works

Each iteration:

  1. Agent session -- Claude reads the codebase and applies one targeted change.
  2. Quality checks -- runs pytest or other shell commands (controlled by the optimizer, not the agent).
  3. Metric evaluation -- calls your async metric_fn on a dataset sample.
  4. Progressive confirmation -- optionally re-evaluates on a larger sample if the initial result improves.
  5. Git decision -- commits (keep) or hard-resets (discard).
  6. Insight tracking -- classifies each experiment and injects analysis into the next iteration.
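The keep/discard logic in steps 2 through 5 can be sketched as a pure decision function. This is an illustration, not the optimizer's implementation: the function name decide, its parameters, and the rule that a change must strictly beat the current best are all assumptions.

```python
from typing import Optional


def decide(
    checks_passed: bool,
    sample_score: Optional[float],
    best_score: float,
    confirm_score: Optional[float] = None,
    higher_is_better: bool = True,
) -> str:
    """Return 'failed', 'discard', or 'keep' for one iteration."""

    def improves(score: float) -> bool:
        return score > best_score if higher_is_better else score < best_score

    if not checks_passed or sample_score is None:
        return "failed"   # step 2 failed, so the metric never ran
    if not improves(sample_score):
        return "discard"  # step 3 showed no gain: hard reset
    if confirm_score is not None and not improves(confirm_score):
        return "discard"  # step 4: the gain did not hold on the larger sample
    return "keep"         # step 5: commit the change


outcomes = [
    decide(False, None, best_score=0.50),
    decide(True, 0.40, best_score=0.50),
    decide(True, 0.60, best_score=0.50, confirm_score=0.45),
    decide(True, 0.60, best_score=0.50, confirm_score=0.55),
]
```

Keeping the decision separate from the git and evaluation side effects makes the policy easy to unit test.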

Research Brief

The ResearchBrief defines the optimization objective:

from latent.optimize.autoresearch.brief import ResearchBrief

brief = ResearchBrief(
    objective="Improve the text-to-SQL agent's query accuracy",
    metric_name="exact_match",
    direction="higher_is_better",
    relevant_files=[
        "src/agents/sql_agent.py",
        "src/tools/schema_lookup.py",
        "prompts/sql_system.md",
    ],
    constraints=[
        "Do not modify test files",
        "Keep latency under 5 seconds per query",
    ],
    prior_context="The agent currently struggles with JOIN queries.",
    ideas=["Try adding schema examples to the prompt"],
)

Usage

from latent.optimize.autoresearch.optimizer import AutoResearchOptimizer
from pathlib import Path

async def eval_fn(sample: list[dict]) -> dict[str, float]:
    # Run your evaluation and return a dict keyed by metric name.
    if not sample:
        return {"exact_match": 0.0}  # avoid dividing by zero on an empty sample
    correct = sum(1 for row in sample if row["predicted"] == row["expected"])
    return {"exact_match": correct / len(sample)}

optimizer = AutoResearchOptimizer(
    brief=brief,
    metric_fn=eval_fn,
    dataset=eval_df,                  # pandas DataFrame
    stratify_column="category",       # stratified sampling
    sample_size=50,                   # tier-1 sample size
    confirmation_size=200,            # tier-2 confirmation sample
    repo_root=Path("."),
    max_iterations=20,
    patience=5,                       # stop after N iterations without improvement
    checks=["pytest tests/unit/ -x"], # shell commands run before eval
    agent_model="claude-opus-4-6",
    agent_max_turns=15,
    allowed_tools=["Read", "Write", "Edit", "Glob", "Grep"],
    checkpoint_path=Path(".autoresearch/checkpoint.json"),
    scope_paths=["src/agents/", "prompts/"],  # limit git ops to these paths
)

# `await` needs an async context (an async function or a notebook cell).
result = await optimizer.optimize()

Key Features

Stratified subsampling -- uses stratify_column to ensure balanced evaluation across categories. Progressive difficulty oversamples from historically weak categories.
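A minimal standard-library sketch of stratified sampling follows. It assumes proportional allocation with at least one row per category. The optimizer's actual allocation rule and its progressive-difficulty weighting are not shown, and stratified_sample is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_sample(rows, column, sample_size, seed=0):
    """Sample rows so each category keeps roughly its share of the data."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)

    total = len(rows)
    sample = []
    for key in sorted(groups):
        members = groups[key]
        # Proportional share, with at least one row per category.
        quota = max(1, round(sample_size * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


rows = [{"category": "join", "id": i} for i in range(80)]
rows += [{"category": "agg", "id": i} for i in range(20)]
sample = stratified_sample(rows, "category", sample_size=10)
join_count = sum(1 for row in sample if row["category"] == "join")
agg_count = sum(1 for row in sample if row["category"] == "agg")
```

Because of rounding and the one-row minimum, the returned sample can differ slightly from sample_size when categories are very uneven.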

Experiment insights -- each result is classified as worked/failed/promising and injected into the next iteration's prompt so the agent learns from history.

Failure analysis -- per-category score breakdowns are computed and fed back to the agent.

Rollback on deep regression -- if scores drop by more than 15% for 3 consecutive iterations, reverts to the best checkpoint.
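The rollback trigger might look like the sketch below. It assumes a higher-is-better metric and reads "15%" as a relative drop from the best score. The text does not say whether the drop is relative or absolute, and should_rollback is a hypothetical name.

```python
def should_rollback(recent_scores, best_score, drop=0.15, window=3):
    """True when the last `window` scores all sit more than `drop` below best."""
    if len(recent_scores) < window or best_score <= 0:
        return False
    floor = best_score * (1.0 - drop)  # relative threshold (assumption)
    return all(score < floor for score in recent_scores[-window:])


sustained_drop = should_rollback([0.80, 0.60, 0.60, 0.60], best_score=0.80)
brief_dip = should_rollback([0.80, 0.60, 0.70, 0.60], best_score=0.80)
```

Requiring several consecutive low scores keeps a single noisy evaluation from triggering a revert.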

Checkpoint resume -- pass checkpoint_path to resume interrupted runs. The optimizer rebuilds experiment insights from history on resume.

Scope isolation -- scope_paths limits git keep/discard to specific directories, protecting other agents' concurrent work.

macOS only

AutoResearchOptimizer requires the Claude Agent SDK, which is currently macOS-only. The agent runs without a shell tool by default -- quality checks are executed by the optimizer process, not the agent.


Analyze and Compare Results

Trajectory Analysis

from latent.optimize import analyze_trajectory

report = analyze_trajectory(result)
# report.metrics includes:
#   - improvement: delta between first and last kept score
#   - kept_rate: fraction of experiments kept (with Wilson CI)
#   - convergence_delta: score change over last 20% of experiments
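For readers unfamiliar with the Wilson interval mentioned for kept_rate, here is a standalone sketch of the standard formula. It is independent of analyze_trajectory, and the default z of 1.96 (roughly 95% confidence) is an assumption about the level used.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0, 0.0
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Ten hypothetical experiments, five of them kept.
kept_flags = [True, False, True, True, False, False, True, False, False, True]
kept_rate = sum(kept_flags) / len(kept_flags)
low, high = wilson_interval(sum(kept_flags), len(kept_flags))
```

With only ten trials the interval is wide (roughly 0.24 to 0.76 here), which is a useful reminder not to over-read kept_rate on short runs.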

Compare Two Optimizers

from latent.optimize import compare_optimizers

comparison = compare_optimizers(result_dspy, result_ace)
# ComparisonResult with delta, CI, p-value, effect size
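To give a feel for two of the reported quantities, the sketch below computes a mean delta and Cohen's d from two lists of per-experiment scores. Which statistics compare_optimizers actually uses is not stated here, so treat the choice of Cohen's d with a pooled standard deviation as an assumption. The score lists are made-up illustration values.

```python
import statistics


def score_delta_and_effect(scores_a, scores_b):
    """Return (mean difference, Cohen's d) for b relative to a."""
    mean_a, mean_b = statistics.fmean(scores_a), statistics.fmean(scores_b)
    var_a, var_b = statistics.variance(scores_a), statistics.variance(scores_b)
    n_a, n_b = len(scores_a), len(scores_b)
    # Pooled sample standard deviation across both runs.
    pooled = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    delta = mean_b - mean_a
    return delta, (delta / pooled if pooled > 0 else 0.0)


# Hypothetical per-experiment scores from two runs.
delta, effect = score_delta_and_effect([0.5, 0.6, 0.7], [0.7, 0.8, 0.9])
```

Each list needs at least two scores, since the sample variance is undefined for a single value.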

Visualization

from latent.optimize import plot_optimization_progress, plot_compare_optimizers

# Single run: green=kept, grey=discarded, red=failed, navy line=running best
fig = plot_optimization_progress(result, title="DSPy MIPROv2")

# Overlay multiple runs
fig = plot_compare_optimizers(
    [result_dspy, result_ace, result_combined],
    labels=["DSPy", "ACE", "Combined"],
)

Common Pattern: Optimize, Apply, Evaluate

The standard workflow across all optimizers:

from pathlib import Path

from latent.optimize import DSPyOptimizer, OptimizedPrompt, apply_optimized_prompt

# 1. Optimize
opt = DSPyOptimizer(agent, teleprompter="mipro_v2")
result = opt.optimize(train_data=train, metric=my_metric, max_iterations=10)

# 2. Extract best prompt
best = OptimizedPrompt.from_best(result)

# 3. Save artifact for reproducibility
best.save(Path("artifacts/optimized_prompt.json"))

# 4. Apply to agent
apply_optimized_prompt(agent, best)

# 5. Evaluate on held-out data
from latent.flows.judge_flow import judge_flow
eval_result = judge_flow(eval_data=test_df, judge=agent, gates={"quality": 3.5})

Loading a Saved Prompt

prompt = OptimizedPrompt.from_file(Path("artifacts/optimized_prompt.json"))
apply_optimized_prompt(agent, prompt)

ExperimentTracker

All optimizers use ExperimentTracker for thread-safe recording, checkpointing, and MLflow integration:

from pathlib import Path

from latent.optimize import DSPyOptimizer, ExperimentTracker

tracker = ExperimentTracker(
    lower_is_better=False,
    checkpoint_path=Path("checkpoints/run.json"),
    auto_checkpoint=True,
    on_experiment=lambda exp: print(f"{exp.label}: {exp.score:.4f}"),
)

# Share a tracker across optimizers
opt = DSPyOptimizer(agent, tracker=tracker)

Resume from a checkpoint:

tracker = ExperimentTracker.from_checkpoint(Path("checkpoints/run.json"))

AutoResearch: Advanced Features

Completion Report

After an optimization run, generate a structured markdown report covering what worked, what failed, and parked ideas:

from latent.optimize.autoresearch.report import render_completion_report

report_md = render_completion_report(
    result,
    brief=brief,
    insights=optimizer.insights,  # collected during the run
)
# Sections: Executive Summary, Trial Table, What Worked, What Failed, Parked Ideas, Configuration

The report includes a trial table with per-experiment scores and kept/discarded/failed status, along with improvement deltas and run duration.

Strategy Document

Provide a strategy_md field on ResearchBrief to inject a high-level strategy document into the agent prompt. This guides the agent's approach without constraining specific changes:

brief = ResearchBrief(
    objective="Improve SQL query accuracy",
    metric_name="exact_match",
    direction="higher_is_better",
    relevant_files=["src/agents/sql_agent.py"],
    constraints=["Do not modify test files"],
    strategy_md="""\
Focus on JOIN queries first -- they account for 60% of failures.
Try schema-aware prompting before adding few-shot examples.
The current system prompt mixes directives with SQL examples; separate them.
""",
)

Overfitting Detection

Set validation_split to hold out a portion of the dataset for overfitting detection. The optimizer evaluates each kept experiment on the validation set and warns if the validation score diverges from the train score by more than 10%:

optimizer = AutoResearchOptimizer(
    brief=brief,
    metric_fn=eval_fn,
    dataset=eval_df,
    validation_split=0.2,  # 80% train, 20% validation
    max_iterations=20,
)

Validation split requires sufficient data

With small datasets, the split may produce unreliable validation scores. Use at least 100 examples in the full dataset.
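The divergence check can be pictured as a simple gap test. This sketch assumes the 10% threshold is relative to the train score. The text does not say whether the gap is relative or absolute, and diverges is a hypothetical helper rather than part of the optimizer's API.

```python
def diverges(train_score: float, val_score: float, tolerance: float = 0.10) -> bool:
    """True when validation differs from train by more than `tolerance` (relative)."""
    if train_score == 0:
        return val_score != 0
    return abs(train_score - val_score) / abs(train_score) > tolerance


small_gap = diverges(train_score=0.80, val_score=0.78)  # 2.5% relative gap
large_gap = diverges(train_score=0.80, val_score=0.60)  # 25% relative gap
```

On a 20-row validation set, a single flipped example moves the score by five points, which is why small datasets make this signal noisy.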

Pre-Eval Verification

Quality checks (pytest, linting) run before metric evaluation, ensuring that proposed changes don't break the codebase:

optimizer = AutoResearchOptimizer(
    brief=brief,
    metric_fn=eval_fn,
    dataset=eval_df,
    checks=["pytest tests/unit/ -x", "ruff check src/"],  # must pass before eval
)

If any check command returns a non-zero exit code, the iteration is classified as failed and the change is discarded without running the (potentially expensive) metric evaluation.
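The gate described above amounts to running each command and stopping at the first non-zero exit code. The sketch below uses subprocess with shlex.split rather than a shell, which is an assumption about how the commands are executed. run_checks is a hypothetical helper, and the two commands are stand-ins for pytest or ruff.

```python
import shlex
import subprocess
import sys


def run_checks(commands: list[str]) -> bool:
    """Run each command in order; return False on the first failure."""
    for command in commands:
        completed = subprocess.run(shlex.split(command), capture_output=True)
        if completed.returncode != 0:
            return False  # skip remaining checks and the metric evaluation
    return True


# Stand-in commands so the sketch runs wherever Python is installed.
ok_cmd = f'"{sys.executable}" -c "raise SystemExit(0)"'
bad_cmd = f'"{sys.executable}" -c "raise SystemExit(1)"'

all_pass = run_checks([ok_cmd, ok_cmd])
one_fails = run_checks([ok_cmd, bad_cmd, ok_cmd])
```

Ordering cheap checks (linting) before slow ones (tests) lets a bad change fail as early as possible.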

Prompt Engineering Improvements

The AutoResearch agent prompt incorporates context engineering best practices:

  • KV-cache stability -- static instructions are separated from dynamic per-iteration content.
  • Edge-anchoring -- critical constraints appear at the beginning and end of the prompt.
  • Structured insights -- experiment history is formatted as structured blocks rather than free text.
  • Diff capping -- only the most recent diffs are shown in full; older ones are collapsed to one-liners.
  • Focus headlines -- each iteration gets a clear headline summarizing the current objective.
  • Simplicity tiebreaker -- when two approaches score equally, the simpler one is preferred.
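As one illustration of these practices, diff capping could be sketched as follows. The cap_diffs helper, the keep_full default of 2, and the one-line summary format are hypothetical. The actual prompt builder may collapse diffs differently.

```python
def cap_diffs(diffs: list[str], keep_full: int = 2) -> list[str]:
    """Keep the newest `keep_full` diffs intact; collapse older ones to one line."""
    cutoff = max(0, len(diffs) - keep_full)
    capped = []
    for index, diff in enumerate(diffs):
        if index < cutoff:
            lines = diff.splitlines()
            first = lines[0] if lines else ""
            capped.append(f"{first} ... ({len(lines)} lines)")
        else:
            capped.append(diff)
    return capped


# Diffs ordered oldest to newest.
history = ["a\nb\nc", "d\ne", "f\ng"]
capped = cap_diffs(history, keep_full=2)
```

Collapsing older diffs keeps prompt length roughly constant as the run grows, while the most recent changes stay fully visible to the agent.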