
Prompt Optimizers

Latent ships four prompt optimizers that improve agent and judge prompts through automated experimentation. Each optimizer records every trial via ExperimentTracker, integrates with MLflow, and produces an OptimizationResult you can analyze, compare, and persist.

When to Use Which Optimizer

| Optimizer | Best for | Method | Extras |
| --- | --- | --- | --- |
| DSPyOptimizer | Few-shot selection, instruction tuning | DSPy teleprompters | latent[optimizers] |
| ACEOptimizer | Section-level prompt refinement | ACE skillbook adaptation | latent[optimizers] |
| CombinedOptimizer | End-to-end two-phase optimization | DSPy candidate generation, then ACE calibration | latent[optimizers] |
| AutoResearchOptimizer | Codebase-level optimization (prompts, tools, schemas) | Claude Agent SDK + git keep/discard loop | latent[autoresearch] |

Start with DSPyOptimizer

For most prompt-tuning tasks, DSPyOptimizer with mipro_v2 is the fastest path to measurable improvement. Graduate to CombinedOptimizer or AutoResearchOptimizer when you need structural changes beyond prompt wording.

Installation

# DSPy, ACE, and Combined optimizers
pip install "latent[optimizers]"

# AutoResearch optimizer (Claude Agent SDK, macOS only)
pip install "latent[autoresearch]"

# Both
pip install "latent[optimizers,autoresearch]"

Core Data Model

All optimizers share these types from latent.optimize.base:

from latent.optimize import (
    Optimizer,              # Protocol: any class with .optimize()
    OptimizationResult,     # Full run: experiments, best_score, best_label
    ExperimentResult,       # Single trial: label, score, kept, error, metadata
    OptimizedPrompt,        # Serializable artifact: prompt_template, few_shot_demos
    apply_optimized_prompt, # Apply OptimizedPrompt to an agent in-place
    ExperimentTracker,      # Thread-safe tracker with MLflow + checkpoint support
)

OptimizationResult (aliased as OptimizationReport) carries trial-level detail:

| Field | Type | Description |
| --- | --- | --- |
| backend | str | Which optimizer produced this ("dspy", "ace", "combined", "autoresearch") |
| experiments | list[ExperimentResult] | Every trial with score, kept/discarded status, metadata |
| best_score | float | Best score achieved |
| best_label | str | Label of the best experiment |
| best_config | dict | Configuration that produced the best result |
| duration_s | float | Wall-clock time |

DSPyOptimizer

Wraps a BaseAgent or Judge in a DSPy module and runs a teleprompter to optimize instructions and few-shot demonstrations.

Supported Teleprompters

| Teleprompter | Strategy | Key parameter |
| --- | --- | --- |
| mipro_v2 | Bayesian instruction + demo search | num_trials |
| copro | Cooperative prompt optimization | breadth |
| bootstrap_fewshot | Bootstrap demonstrations from teacher | max_bootstrapped_demos |
| simba | Step-wise instruction bootstrapping | num_steps |
| gepa | Genetic prompt algorithm | num_iterations |

Usage

from latent.optimize import DSPyOptimizer, apply_optimized_prompt, OptimizedPrompt
from latent.agents import Judge
from latent.agents.scores import ScoredModel, OrdinalScore
from typing import Annotated

class Scores(ScoredModel):
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]

judge = Judge("qa_judge", model="gpt-4o", output_type=Scores)

opt = DSPyOptimizer(
    judge,
    teleprompter="mipro_v2",
    teacher_model="gpt-4o",      # optional: model for generating demos
    student_model="gpt-4o-mini", # optional: model for the student
    train_split=0.8,
    seed=42,
)

def metric(example, prediction, trace=None) -> float:
    return float(prediction.quality >= 3)

result = opt.optimize(
    train_data=data,  # list[dict] with input/output pairs
    metric=metric,
    max_iterations=10,
)

print(f"Best score: {result.best_score}")

# Extract and apply the optimized prompt
best_prompt = OptimizedPrompt.from_best(result)
apply_optimized_prompt(judge, best_prompt)

Configuration

Pass additional teleprompter-specific keyword arguments directly to the constructor:

opt = DSPyOptimizer(
    agent,
    teleprompter="mipro_v2",
    # Extra kwargs forwarded to dspy.MIPROv2(...)
    teacher_settings={"lm": my_teacher_lm},
)

Metric signature

DSPy metrics use the signature (example, prediction, trace) -> float. This differs from the ACE convention. When using CombinedOptimizer, pass dspy_metric and ace_metric separately.


ACEOptimizer

Uses the ACE (Automatic Calibration Engine) framework for section-level prompt refinement. The prompt is decomposed into a skillbook -- a collection of named sections that ACE adapts independently based on evaluation feedback.

Skillbook Concept

A skillbook is a structured prompt divided into sections. Each SectionConfig defines one section:

from latent.optimize.ace.skillbook import SectionConfig

sections = [
    SectionConfig(
        name="instructions",
        description="Core task instructions",
        initial_content="You are a quality scorer...",
        max_skills=10,
    ),
    SectionConfig(
        name="rubric",
        description="Scoring rubric and criteria",
        initial_content="Score from 1-5 based on...",
    ),
    SectionConfig(
        name="examples",
        description="Few-shot examples",
        initial_content="Example 1: ...",
    ),
]

Each round, ACE evaluates the agent, then calls skillbook.adapt(feedback) to refine individual sections based on what worked and what did not.
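The round structure can be sketched as follows. This is a hypothetical outline, assuming a `render()` method and a list-of-dicts feedback shape; the actual ACE skillbook API may differ:

```python
# Hypothetical sketch of one ACE round: evaluate the agent with the current
# skillbook, collect per-example feedback, then let the skillbook adapt.
def ace_round(skillbook, agent_fn, metric, batch):
    feedback = []
    for example in batch:
        # agent_fn and render() are illustrative assumptions.
        predicted = agent_fn(skillbook.render(), example["input"])
        # ACE metric convention: (expected, predicted) -> float
        score = metric(example["expected"], predicted)
        feedback.append({"input": example["input"],
                         "predicted": predicted,
                         "score": score})
    skillbook.adapt(feedback)  # refine sections based on what worked
    return sum(f["score"] for f in feedback) / len(feedback)
```

Each section is refined independently, so feedback on the rubric does not overwrite a working set of examples.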

Usage

from latent.optimize import ACEOptimizer
from latent.optimize.ace.adapter import ACEAdapterConfig

opt = ACEOptimizer(
    agent,
    sections=sections,  # optional: defaults to single "instructions" section
    adapter_config=ACEAdapterConfig(
        cache_enabled=True,
        max_retries=3,
        timeout=30.0,
    ),
)

def metric(expected: str, predicted: str) -> float:
    return 1.0 if expected.strip() == predicted.strip() else 0.0

result = opt.optimize(
    train_data=data,
    metric=metric,          # (expected, predicted) -> float
    max_iterations=10,
)

ACE framework dependency

If ace-framework is not installed, the optimizer runs with a mock skillbook that does not adapt between rounds. Install ace-framework separately for real adaptation.


CombinedOptimizer

Chains DSPy (candidate generation) and ACE (calibration) in a two-phase pipeline. A single ExperimentTracker spans both phases.

Phase 1 -- DSPy: generates candidate prompts via teleprompter optimization.
Phase 2 -- ACE: takes the best DSPy candidate and refines it section by section.

Iterations are split evenly between the two phases (max_iterations // 2 each).

Usage

from latent.optimize import CombinedOptimizer

opt = CombinedOptimizer(
    agent,
    dspy_config={
        "teleprompter": "mipro_v2",
        "teacher_model": "gpt-4o",
    },
    ace_config={
        "sections": [
            {"name": "instructions", "description": "Core instructions", "initial_content": "..."},
        ],
        "adapter_config": {"cache_enabled": True},
    },
)

result = opt.optimize(
    train_data=data,
    dspy_metric=dspy_metric,  # (example, prediction, trace) -> float
    ace_metric=ace_metric,    # (expected_str, predicted_str) -> float
    max_iterations=20,        # 10 DSPy + 10 ACE
)

Separate metrics

DSPy and ACE use incompatible metric signatures. Pass dspy_metric and ace_metric separately. If you pass a single metric, it must satisfy both calling conventions.
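If you do need one callable for both phases, it can dispatch on the arguments it receives. This is a hypothetical adapter, not a latent API; passing dspy_metric and ace_metric separately is usually clearer:

```python
# Hypothetical dual-convention metric: ACE calls it as (expected, predicted),
# DSPy as (example, prediction, trace) with trace possibly positional.
def dual_metric(*args, trace=None):
    if len(args) == 2 and all(isinstance(a, str) for a in args):
        expected, predicted = args                      # ACE convention
        return 1.0 if expected.strip() == predicted.strip() else 0.0
    example, prediction = args[:2]                      # DSPy convention
    # Assumes the prediction exposes a `quality` score, as in the
    # DSPyOptimizer example above.
    return float(getattr(prediction, "quality", 0) >= 3)
```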


AutoResearchOptimizer

An autonomous code optimizer that uses the Claude Agent SDK to read your codebase, propose changes, and keep or discard them based on metric evaluation. Unlike the other optimizers which tune prompt text, AutoResearch modifies actual source files (prompts, tool definitions, schemas) in a git-managed loop.

How It Works

Each iteration:

  1. Agent session -- Claude reads the codebase and applies one targeted change.
  2. Quality checks -- runs pytest or other shell commands (controlled by the optimizer, not the agent).
  3. Metric evaluation -- calls your async metric_fn on a dataset sample.
  4. Progressive confirmation -- optionally re-evaluates on a larger sample if the initial result improves.
  5. Git decision -- commits (keep) or hard-resets (discard).
  6. Insight tracking -- classifies each experiment and injects analysis into the next iteration.

Research Brief

The ResearchBrief defines the optimization objective:

from latent.optimize.autoresearch.brief import ResearchBrief

brief = ResearchBrief(
    objective="Improve the text-to-SQL agent's query accuracy",
    metric_name="exact_match",
    direction="higher_is_better",
    relevant_files=[
        "src/agents/sql_agent.py",
        "src/tools/schema_lookup.py",
        "prompts/sql_system.md",
    ],
    constraints=[
        "Do not modify test files",
        "Keep latency under 5 seconds per query",
    ],
    prior_context="The agent currently struggles with JOIN queries.",
    ideas=["Try adding schema examples to the prompt"],
)

Usage

from latent.optimize.autoresearch.optimizer import AutoResearchOptimizer
from pathlib import Path

async def eval_fn(sample: list[dict]) -> dict[str, float]:
    # Run your evaluation, return metric dict
    correct = sum(1 for row in sample if row["predicted"] == row["expected"])
    return {"exact_match": correct / len(sample)}

optimizer = AutoResearchOptimizer(
    brief=brief,
    metric_fn=eval_fn,
    dataset=eval_df,                  # pandas DataFrame
    stratify_column="category",       # stratified sampling
    sample_size=50,                   # tier-1 sample size
    confirmation_size=200,            # tier-2 confirmation sample
    repo_root=Path("."),
    max_iterations=20,
    patience=5,                       # stop after N iterations without improvement
    checks=["pytest tests/unit/ -x"], # shell commands run before eval
    agent_model="claude-opus-4-6",
    agent_max_turns=15,
    allowed_tools=["Read", "Write", "Edit", "Glob", "Grep"],
    checkpoint_path=Path(".autoresearch/checkpoint.json"),
    scope_paths=["src/agents/", "prompts/"],  # limit git ops to these paths
)

result = await optimizer.optimize()

Key Features

Stratified subsampling -- uses stratify_column to ensure balanced evaluation across categories. Progressive difficulty oversamples from historically weak categories.

Experiment insights -- each result is classified as worked/failed/promising and injected into the next iteration's prompt so the agent learns from history.

Failure analysis -- per-category score breakdowns are computed and fed back to the agent.

Rollback on deep regression -- if scores drop by more than 15% for 3 consecutive iterations, the optimizer reverts to the best checkpoint.

Checkpoint resume -- pass checkpoint_path to resume interrupted runs. The optimizer rebuilds experiment insights from history on resume.

Scope isolation -- scope_paths limits git keep/discard to specific directories, protecting other agents' concurrent work.
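The deep-regression trigger can be expressed as a small predicate. The 15%/3-iteration thresholds come from the description above; everything else is an illustrative assumption:

```python
# Hypothetical sketch of the deep-regression check: roll back when the last
# `streak` scores all sit more than `drop` below the best score so far.
def should_rollback(scores: list[float], best: float,
                    drop: float = 0.15, streak: int = 3) -> bool:
    if len(scores) < streak:
        return False
    return all(s < best * (1 - drop) for s in scores[-streak:])
```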

macOS only

AutoResearchOptimizer requires the Claude Agent SDK, which is currently macOS-only. The agent runs without a shell tool by default -- quality checks are executed by the optimizer process, not the agent.


Analyze and Compare Results

Trajectory Analysis

from latent.optimize import analyze_trajectory

report = analyze_trajectory(result)
# report.metrics includes:
#   - improvement: delta between first and last kept score
#   - kept_rate: fraction of experiments kept (with Wilson CI)
#   - convergence_delta: score change over last 20% of experiments
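The interval behind kept_rate is a standard Wilson score interval, which can be computed as follows (a reference formula for illustration, not latent's implementation):

```python
import math

# Wilson score interval for a kept/total proportion at ~95% confidence.
def wilson_ci(kept: int, total: int, z: float = 1.96) -> tuple[float, float]:
    if total == 0:
        return (0.0, 0.0)
    p = kept / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - half, center + half)
```

Unlike the naive proportion, the Wilson interval stays inside [0, 1] and behaves sensibly for the small trial counts typical of an optimization run.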

Compare Two Optimizers

from latent.optimize import compare_optimizers

comparison = compare_optimizers(result_dspy, result_ace)
# ComparisonResult with delta, CI, p-value, effect size

Visualization

from latent.optimize import plot_optimization_progress, plot_compare_optimizers

# Single run: green=kept, grey=discarded, red=failed, navy line=running best
fig = plot_optimization_progress(result, title="DSPy MIPROv2")

# Overlay multiple runs
fig = plot_compare_optimizers(
    [result_dspy, result_ace, result_combined],
    labels=["DSPy", "ACE", "Combined"],
)

Common Pattern: Optimize, Apply, Evaluate

The standard workflow across all optimizers:

from pathlib import Path

from latent.optimize import DSPyOptimizer, OptimizedPrompt, apply_optimized_prompt

# 1. Optimize
opt = DSPyOptimizer(agent, teleprompter="mipro_v2")
result = opt.optimize(train_data=train, metric=my_metric, max_iterations=10)

# 2. Extract best prompt
best = OptimizedPrompt.from_best(result)

# 3. Save artifact for reproducibility
best.save(Path("artifacts/optimized_prompt.json"))

# 4. Apply to agent
apply_optimized_prompt(agent, best)

# 5. Evaluate on held-out data
from latent.flows.judge_flow import judge_flow
eval_result = judge_flow(eval_data=test_df, judge=agent, gates={"quality": 3.5})

Loading a Saved Prompt

from pathlib import Path

from latent.optimize import OptimizedPrompt, apply_optimized_prompt

prompt = OptimizedPrompt.from_file(Path("artifacts/optimized_prompt.json"))
apply_optimized_prompt(agent, prompt)

ExperimentTracker

All optimizers use ExperimentTracker for thread-safe recording, checkpointing, and MLflow integration:

from pathlib import Path

from latent.optimize import ExperimentTracker

tracker = ExperimentTracker(
    lower_is_better=False,
    checkpoint_path=Path("checkpoints/run.json"),
    auto_checkpoint=True,
    on_experiment=lambda exp: print(f"{exp.label}: {exp.score:.4f}"),
)

# Share a tracker across optimizers
opt = DSPyOptimizer(agent, tracker=tracker)

Resume from a checkpoint:

tracker = ExperimentTracker.from_checkpoint(Path("checkpoints/run.json"))