Prompt Optimizers¶

Latent ships four prompt optimizers that improve agent and judge prompts through automated experimentation. Each optimizer records every trial via ExperimentTracker, integrates with MLflow, and produces an OptimizationResult you can analyze, compare, and persist.

When to Use Which Optimizer¶

Optimizer	Best for	Method	Extras
`DSPyOptimizer`	Few-shot selection, instruction tuning	DSPy teleprompters	`latent[optimizers]`
`ACEOptimizer`	Section-level prompt refinement	ACE skillbook adaptation	`latent[optimizers]`
`CombinedOptimizer`	End-to-end two-phase optimization	DSPy candidate generation, then ACE calibration	`latent[optimizers]`
`AutoResearchOptimizer`	Codebase-level optimization (prompts, tools, schemas)	Claude Agent SDK + git keep/discard loop	`latent[autoresearch]`

Start with DSPyOptimizer

For most prompt-tuning tasks, DSPyOptimizer with mipro_v2 is the fastest path to measurable improvement. Graduate to CombinedOptimizer or AutoResearchOptimizer when you need structural changes beyond prompt wording.

Installation¶

# DSPy, ACE, and Combined optimizers
pip install "latent[optimizers]"

# AutoResearch optimizer (Claude Agent SDK, macOS only)
pip install "latent[autoresearch]"

# Both
pip install "latent[optimizers,autoresearch]"

Core Data Model¶

All optimizers share these types from latent.optimize.base:

from latent.optimize import (
    Optimizer,              # Protocol: any class with .optimize()
    OptimizationResult,     # Full run: experiments, best_score, best_label
    ExperimentResult,       # Single trial: label, score, kept, error, metadata
    OptimizedPrompt,        # Serializable artifact: prompt_template, few_shot_demos
    apply_optimized_prompt, # Apply OptimizedPrompt to an agent in-place
    ExperimentTracker,      # Thread-safe tracker with MLflow + checkpoint support
)

OptimizationResult (aliased as OptimizationReport) carries trial-level detail:

Field	Type	Description
`backend`	`str`	Which optimizer produced this (`"dspy"`, `"ace"`, `"combined"`, `"autoresearch"`)
`experiments`	`list[ExperimentResult]`	Every trial with score, kept/discarded status, metadata
`best_score`	`float`	Best score achieved
`best_label`	`str`	Label of the best experiment
`best_config`	`dict`	Configuration that produced the best result
`duration_s`	`float`	Wall-clock time
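To make the relationship between the trial list and the summary fields concrete, here is a minimal sketch built on stand-in types. `TrialSketch` and `summarize` are hypothetical names, not latent classes, and the sketch assumes higher scores are better and that errored trials carry no score.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrialSketch:
    """Hypothetical stand-in for ExperimentResult (not the latent class)."""
    label: str
    score: Optional[float]
    kept: bool
    error: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def summarize(trials):
    """Return (best_label, best_score) over trials that produced a score."""
    scored = [t for t in trials if t.error is None and t.score is not None]
    if not scored:
        return None, None
    best = max(scored, key=lambda t: t.score)
    return best.label, best.score


trials = [
    TrialSketch("baseline", 0.62, kept=True),
    TrialSketch("trial-1", 0.58, kept=False),
    TrialSketch("trial-2", None, kept=False, error="timeout"),
    TrialSketch("trial-3", 0.71, kept=True),
]
best_label, best_score = summarize(trials)
```

Whether latent derives `best_score` from all scored trials or only from kept ones is not stated here, so treat the selection rule as an assumption.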

DSPyOptimizer¶

Wraps a BaseAgent or Judge in a DSPy module and runs a teleprompter to optimize instructions and few-shot demonstrations.

Supported Teleprompters¶

Teleprompter	Strategy	Key parameter
`mipro_v2`	Bayesian instruction + demo search	`num_trials`
`copro`	Cooperative prompt optimization	`breadth`
`bootstrap_fewshot`	Bootstrap demonstrations from teacher	`max_bootstrapped_demos`
`simba`	Stochastic introspective mini-batch ascent	`num_steps`
`gepa`	Reflective prompt evolution (Genetic-Pareto)	`num_iterations`

Usage¶

from latent.optimize import DSPyOptimizer, apply_optimized_prompt, OptimizedPrompt
from latent.agents import Judge
from latent.agents.scores import ScoredModel, OrdinalScore
from typing import Annotated

class Scores(ScoredModel):
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]

judge = Judge("qa_judge", model="gpt-4o", output_type=Scores)

opt = DSPyOptimizer(
    judge,
    teleprompter="mipro_v2",
    teacher_model="gpt-4o",      # optional: model for generating demos
    student_model="gpt-4o-mini", # optional: model for the student
    train_split=0.8,
    seed=42,
)

def metric(example, prediction, trace=None) -> float:
    return float(prediction.quality >= 3)

result = opt.optimize(
    train_data=data,  # list[dict] with input/output pairs
    metric=metric,
    max_iterations=10,
)

print(f"Best score: {result.best_score}")

# Extract and apply the optimized prompt
best_prompt = OptimizedPrompt.from_best(result)
apply_optimized_prompt(judge, best_prompt)

Configuration¶

Pass additional teleprompter-specific keyword arguments directly to the constructor:

opt = DSPyOptimizer(
    agent,
    teleprompter="mipro_v2",
    # Extra kwargs forwarded to dspy.MIPROv2(...)
    teacher_settings={"lm": my_teacher_lm},
)

Metric signature

DSPy metrics use the signature (example, prediction, trace) -> float. This differs from the ACE convention. When using CombinedOptimizer, pass dspy_metric and ace_metric separately.

ACEOptimizer¶

Uses the ACE (Automatic Calibration Engine) framework for section-level prompt refinement. The prompt is decomposed into a skillbook -- a collection of named sections that ACE adapts independently based on evaluation feedback.

Skillbook Concept¶

A skillbook is a structured prompt divided into sections. Each SectionConfig defines one section:

from latent.optimize.ace.skillbook import SectionConfig

sections = [
    SectionConfig(
        name="instructions",
        description="Core task instructions",
        initial_content="You are a quality scorer...",
        max_skills=10,
    ),
    SectionConfig(
        name="rubric",
        description="Scoring rubric and criteria",
        initial_content="Score from 1-5 based on...",
    ),
    SectionConfig(
        name="examples",
        description="Few-shot examples",
        initial_content="Example 1: ...",
    ),
]

Each round, ACE evaluates the agent, then calls skillbook.adapt(feedback) to refine individual sections based on what worked and what did not.

Usage¶

from latent.optimize import ACEOptimizer
from latent.optimize.ace.adapter import ACEAdapterConfig

opt = ACEOptimizer(
    agent,
    sections=sections,  # optional: defaults to single "instructions" section
    adapter_config=ACEAdapterConfig(
        cache_enabled=True,
        max_retries=3,
        timeout=30.0,
    ),
)

def metric(expected: str, predicted: str) -> float:
    return 1.0 if expected.strip() == predicted.strip() else 0.0

result = opt.optimize(
    train_data=data,
    metric=metric,          # (expected, predicted) -> float
    max_iterations=10,
)

ACE framework dependency

If ace-framework is not installed, the optimizer runs with a mock skillbook that does not adapt between rounds. The [optimizers] extra already bundles ace-framework, so pip install "latent[optimizers]" enables real adaptation.

CombinedOptimizer¶

Chains DSPy (candidate generation) and ACE (calibration) in a two-phase pipeline. A single ExperimentTracker spans both phases.

Phase 1 -- DSPy: generates candidate prompts via teleprompter optimization.

Phase 2 -- ACE: takes the best DSPy candidate and refines it section by section.

Iterations are split evenly between the two phases (max_iterations // 2 each).
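The floor division has a practical consequence for odd budgets. The helper below is a hypothetical illustration of the stated rule (`split_iterations` is not a latent function), and it assumes the remainder is not handed to either phase.

```python
def split_iterations(max_iterations: int) -> tuple[int, int]:
    """Split a budget evenly using floor division, as described above."""
    per_phase = max_iterations // 2
    return per_phase, per_phase  # (DSPy iterations, ACE iterations)


even_split = split_iterations(20)  # 10 iterations per phase
odd_split = split_iterations(21)   # still 10 per phase; one iteration goes unused
tiny_split = split_iterations(1)   # zero iterations per phase
```

Under this reading, an even `max_iterations` of at least 2 avoids wasting budget.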

Usage¶

from latent.optimize import CombinedOptimizer

opt = CombinedOptimizer(
    agent,
    dspy_config={
        "teleprompter": "mipro_v2",
        "teacher_model": "gpt-4o",
    },
    ace_config={
        "sections": [
            {"name": "instructions", "description": "Core instructions", "initial_content": "..."},
        ],
        "adapter_config": {"cache_enabled": True},
    },
)

result = opt.optimize(
    train_data=data,
    dspy_metric=dspy_metric,  # (example, prediction, trace) -> float
    ace_metric=ace_metric,    # (expected_str, predicted_str) -> float
    max_iterations=20,        # 10 DSPy + 10 ACE
)

Separate metrics

DSPy and ACE use incompatible metric signatures. Pass dspy_metric and ace_metric separately. If you pass a single metric, it must satisfy both calling conventions.
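One way a single metric could satisfy both conventions is to accept an optional third argument and normalize its inputs to text. This is a sketch under assumptions: the `answer` attribute on DSPy-style objects is hypothetical, and `SimpleNamespace` stands in for real example and prediction objects.

```python
from types import SimpleNamespace


def _as_text(value, attr="answer"):
    """Return a string as-is; otherwise read the (assumed) attribute as text."""
    if isinstance(value, str):
        return value
    return str(getattr(value, attr, ""))


def exact_match(first, second, trace=None) -> float:
    """Works as (expected, predicted) and as (example, prediction, trace)."""
    expected = _as_text(first)
    predicted = _as_text(second)
    return 1.0 if expected.strip() == predicted.strip() else 0.0


# ACE-style call: two plain strings.
ace_score = exact_match("42", " 42 ")

# DSPy-style call: two objects plus a trace argument.
dspy_score = exact_match(
    SimpleNamespace(answer="42"), SimpleNamespace(answer="41"), trace=None
)
```

Separate `dspy_metric` and `ace_metric` functions are usually easier to read and test than a dual-purpose one.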

AutoResearchOptimizer¶

An autonomous code optimizer that uses the Claude Agent SDK to read your codebase, propose changes, and keep or discard them based on metric evaluation. Unlike the other optimizers, which tune prompt text, AutoResearch modifies actual source files (prompts, tool definitions, schemas) in a git-managed loop.

How It Works¶

Each iteration:

1. Agent session -- Claude reads the codebase and applies one targeted change.
2. Quality checks -- runs pytest or other shell commands (controlled by the optimizer, not the agent).
3. Metric evaluation -- calls your async eval_fn on a dataset sample.
4. Progressive confirmation -- optionally re-evaluates on a larger sample if the initial result improves.
5. Git decision -- commits (keep) or hard-resets (discard).
6. Insight tracking -- classifies each experiment and injects analysis into the next iteration.
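The control flow of the iteration steps above can be sketched without git or an LLM. Everything here is a hypothetical stand-in: `run_loop`, `propose`, `run_checks`, and `evaluate` are illustrative names, "commit" and "hard reset" are modeled by keeping or dropping an in-memory candidate, and progressive confirmation is omitted. The sketch also assumes failed checks count toward patience.

```python
import asyncio


async def run_loop(propose, run_checks, evaluate, baseline,
                   max_iterations=20, patience=5):
    """Keep a candidate only if checks pass and its score beats the best so far."""
    best_state = baseline
    best_score = await evaluate(baseline)
    history = []  # (iteration, score, status) tuples fed back to propose()
    stale = 0     # iterations since the last improvement
    for iteration in range(max_iterations):
        if stale >= patience:
            break
        candidate = propose(best_state, history)
        if not run_checks(candidate):
            # Checks failed: discard without paying for evaluation.
            history.append((iteration, None, "failed"))
            stale += 1
            continue
        score = await evaluate(candidate)
        if score > best_score:
            best_state, best_score, stale = candidate, score, 0  # "commit"
            history.append((iteration, score, "kept"))
        else:
            history.append((iteration, score, "discarded"))      # "hard reset"
            stale += 1
    return best_state, best_score, history


# Toy problem: candidates are integers, the optimum is 4, candidate 2 breaks tests.
def propose(state, history):
    return len(history) + 1


def run_checks(candidate):
    return candidate != 2


async def evaluate(state):
    return -abs(state - 4)


best_state, best_score, history = asyncio.run(
    run_loop(propose, run_checks, evaluate, baseline=0,
             max_iterations=10, patience=3)
)
statuses = [status for _, _, status in history]
```

In this toy run the loop stops early once three consecutive iterations fail to improve on the best score.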

Research Brief¶

The ResearchBrief defines the optimization objective:

from latent.optimize.autoresearch.brief import ResearchBrief

brief = ResearchBrief(
    objective="Improve the text-to-SQL agent's query accuracy",
    metric_name="exact_match",
    direction="higher_is_better",
    relevant_files=[
        "src/agents/sql_agent.py",
        "src/tools/schema_lookup.py",
        "prompts/sql_system.md",
    ],
    constraints=[
        "Do not modify test files",
        "Keep latency under 5 seconds per query",
    ],
    prior_context="The agent currently struggles with JOIN queries.",
    ideas=["Try adding schema examples to the prompt"],
)

Usage¶

from latent.optimize.autoresearch.optimizer import AutoResearchOptimizer
from pathlib import Path

async def eval_fn(sample: list[dict]) -> dict[str, float]:
    # Run your evaluation, return metric dict
    if not sample:
        return {"exact_match": 0.0}  # avoid dividing by zero on an empty sample
    correct = sum(1 for row in sample if row["predicted"] == row["expected"])
    return {"exact_match": correct / len(sample)}

optimizer = AutoResearchOptimizer(
    brief=brief,
    eval_fn=eval_fn,
    dataset=eval_df,                  # pandas DataFrame
    stratify_column="category",       # stratified sampling
    sample_size=50,                   # tier-1 sample size
    confirmation_size=200,            # tier-2 confirmation sample
    repo_root=Path("."),
    max_iterations=20,
    patience=5,                       # stop after N iterations without improvement
    checks=["pytest tests/unit/ -x"], # shell commands run before eval
    agent_model="claude-opus-4-6",
    agent_max_turns=15,
    allowed_tools=["Read", "Write", "Edit", "Glob", "Grep"],  # custom subset; see note below
    checkpoint_path=Path(".autoresearch/checkpoint.json"),
    scope_paths=["src/agents/", "prompts/"],  # limit git ops to these paths
)

result = await optimizer.optimize()

The allowed_tools value above is a custom subset, not the default. When you omit allowed_tools, the optimizer uses DEFAULT_ALLOWED_TOOLS, which is ["Read", "Write", "Edit", "Glob", "Grep", "WebSearch", "WebFetch"] (seven tools). Bash is excluded by default so the agent cannot run destructive shell commands — quality-gate commands are executed by the optimizer itself via checks.

Key Features¶

Stratified subsampling -- uses stratify_column to ensure balanced evaluation across categories. Progressive difficulty oversamples from historically weak categories.
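As an illustration of stratified subsampling, the sketch below draws an equal quota from each category using only the standard library. `stratified_sample` is a hypothetical helper, and equal allocation is an assumption: the optimizer may instead sample proportionally or reweight toward weak categories.

```python
import random
from collections import defaultdict


def stratified_sample(rows, column, sample_size, seed=0):
    """Draw roughly equal numbers of rows from each category in `column`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    if not groups:
        return []
    per_group, remainder = divmod(sample_size, len(groups))
    sample = []
    for index, key in enumerate(sorted(groups)):
        # The first `remainder` categories absorb the leftover slots.
        quota = per_group + (1 if index < remainder else 0)
        members = groups[key]
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


rows = [
    {"id": i, "category": category}
    for category in ("join", "filter", "aggregate")
    for i in range(10)
]
sample = stratified_sample(rows, "category", sample_size=6)
counts = {}
for row in sample:
    counts[row["category"]] = counts.get(row["category"], 0) + 1
```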

Experiment insights -- each result is classified as worked/failed/promising and injected into the next iteration's prompt so the agent learns from history.

Failure analysis -- per-category score breakdowns are computed and fed back to the agent.

Rollback on deep regression -- if scores drop by more than 15% for 3 consecutive iterations, reverts to the best checkpoint.
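The rollback rule can be read as a simple window test. The sketch below assumes a higher-is-better metric and that "15%" means a relative drop from the best score. Both are assumptions, and `should_rollback` is a hypothetical name.

```python
def should_rollback(scores, best_score, drop=0.15, window=3):
    """True when the last `window` scores all sit more than `drop` below best_score."""
    if len(scores) < window or best_score <= 0:
        return False
    threshold = best_score * (1.0 - drop)
    return all(score < threshold for score in scores[-window:])


# Best score 0.80 gives a threshold of about 0.68.
deep_regression = should_rollback([0.79, 0.66, 0.65, 0.60], best_score=0.80)
recovered = should_rollback([0.66, 0.65, 0.70], best_score=0.80)
too_short = should_rollback([0.50, 0.50], best_score=0.80)
```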

Checkpoint resume -- pass checkpoint_path to resume interrupted runs. The optimizer rebuilds experiment insights from history on resume.

Scope isolation -- scope_paths limits git keep/discard to specific directories, protecting other agents' concurrent work.

macOS only

AutoResearchOptimizer requires the Claude Agent SDK, which is currently macOS-only. The agent runs without a shell tool by default -- quality checks are executed by the optimizer process, not the agent.

Analyze and Compare Results¶

Trajectory Analysis¶

from latent.optimize import analyze_trajectory

report = analyze_trajectory(result)
# report.metrics includes:
#   - improvement: delta between first and last kept score
#   - kept_rate: fraction of experiments kept (with Wilson CI)
#   - convergence_delta: score change over last 20% of experiments
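For readers unfamiliar with the Wilson interval mentioned for kept_rate, here is a standalone sketch of the standard formula. `wilson_interval` is an illustrative helper, not the function latent uses internally, and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


# 6 of 20 experiments kept: the point estimate is 0.30.
low, high = wilson_interval(6, 20)
```

Unlike the normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for small trial counts, which suits short optimization runs.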

Compare Two Optimizers¶

from latent.optimize import compare_optimizers

comparison = compare_optimizers(result_dspy, result_ace)
# ComparisonResult with delta, CI, p-value, effect size

Visualization¶

from latent.optimize import plot_optimization_progress, plot_compare_optimizers

# Single run: green=kept, grey=discarded, red=failed, navy line=running best
fig = plot_optimization_progress(result, title="DSPy MIPROv2")

# Overlay multiple runs
fig = plot_compare_optimizers(
    [result_dspy, result_ace, result_combined],
    labels=["DSPy", "ACE", "Combined"],
)

Common Pattern: Optimize, Apply, Evaluate¶

The standard workflow across all optimizers:

from pathlib import Path

from latent.optimize import DSPyOptimizer, OptimizedPrompt, apply_optimized_prompt

# 1. Optimize
opt = DSPyOptimizer(agent, teleprompter="mipro_v2")
result = opt.optimize(train_data=train, metric=my_metric, max_iterations=10)

# 2. Extract best prompt
best = OptimizedPrompt.from_best(result)

# 3. Save artifact for reproducibility
best.save(Path("artifacts/optimized_prompt.json"))

# 4. Apply to agent
apply_optimized_prompt(agent, best)

# 5. Evaluate on held-out data (judge_flow is async — await it)
from latent.flows.judge_flow import judge_flow
eval_result = await judge_flow(eval_data=test_df, judges=agent, gates={"quality": 3.5})

Loading a Saved Prompt¶

from pathlib import Path

prompt = OptimizedPrompt.from_file(Path("artifacts/optimized_prompt.json"))
apply_optimized_prompt(agent, prompt)

Optimizing inside a flow¶

The end-to-end pattern — build a metric from a judge, optimize the agent's prompt, apply the best result, and re-evaluate — composes cleanly inside a @flow. The flow orchestrates loading data, running the optimizer, and gating the optimized agent.

@flow and @task are async-only in latent-py (the decorator raises if the wrapped function is not async def), so await any async call such as judge_flow. DSPyOptimizer.optimize(), however, is synchronous — call it directly without await.

from typing import Annotated

from latent.prefect import params, logger
from latent.prefect.decorators import flow as latent_flow
from latent.agents import ReActAgent, Judge
from latent.agents.scores import ScoredModel, OrdinalScore
from latent.optimize import DSPyOptimizer, OptimizedPrompt, apply_optimized_prompt
from latent.flows.judge_flow import judge_flow


class Scores(ScoredModel):
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]


@latent_flow("optimize_qa")
async def optimize_qa_flow(train_data, eval_data):
    # Agent under optimization and the judge that scores it.
    agent = ReActAgent("qa_agent", model="gpt-4o")
    judge = Judge("qa_judge", model="gpt-4o", output_type=Scores)

    # DSPy metric: (example, prediction, trace) -> float.
    def metric(example, prediction, trace=None) -> float:
        return float(getattr(prediction, "quality", 0) >= 3)

    # Optimize the agent's prompt. optimize() is SYNC — do not await it.
    opt = DSPyOptimizer(agent, teleprompter="mipro_v2")
    result = opt.optimize(train_data=train_data, metric=metric, max_iterations=10)
    logger.info(f"Best optimization score: {result.best_score:.4f}")

    # Apply the best prompt to the agent in-place.
    best_prompt = OptimizedPrompt.from_best(result)
    apply_optimized_prompt(agent, best_prompt)

    # Re-evaluate the optimized agent — judge_flow is async, so await it.
    eval_result = await judge_flow(eval_data, judge, gates={"quality": 3.5})
    logger.info(f"Post-optimization gates passed: {eval_result['all_passed']}")

    return {
        "best_score": result.best_score,
        "optimized_prompt": best_prompt,
        "all_passed": eval_result["all_passed"],
        "markdown": eval_result["markdown"],
    }

Building the metric from a judge

The DSPy metric above is a plain synchronous function. Scoring with a Judge would require an async metric that awaits the judge, but DSPyOptimizer.optimize() calls the metric synchronously. Reserve judge-backed scoring for the post-optimization judge_flow step (as shown) or for async optimizers. For DSPy, prefer a cheap synchronous metric computed from the agent's structured output.

apply_optimized_prompt takes an OptimizedPrompt

Convert the run's OptimizationResult with OptimizedPrompt.from_best(result) before applying. apply_optimized_prompt sets system_prompt on a ReActAgent (falling back to prompt_template when DSPy only emits a template) and prompt_template on a Judge.

ExperimentTracker¶

All optimizers use ExperimentTracker for thread-safe recording, checkpointing, and MLflow integration:

from pathlib import Path

from latent.optimize import DSPyOptimizer, ExperimentTracker

tracker = ExperimentTracker(
    lower_is_better=False,
    checkpoint_path=Path("checkpoints/run.json"),
    auto_checkpoint=True,
    on_experiment=lambda exp: print(f"{exp.label}: {exp.score:.4f}"),
)

# Share a tracker across optimizers
opt = DSPyOptimizer(agent, tracker=tracker)

Resume from a checkpoint:

tracker = ExperimentTracker.from_checkpoint(Path("checkpoints/run.json"))
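To show what thread-safe recording with checkpoint resume can look like, here is a minimal standalone sketch. `TrackerSketch` is a hypothetical class, not the ExperimentTracker implementation; it assumes a JSON checkpoint rewritten after every record and leaves out MLflow entirely.

```python
import json
import tempfile
import threading
from pathlib import Path


class TrackerSketch:
    """Hypothetical tracker: a lock guards the list, JSON persists it."""

    def __init__(self, checkpoint_path=None, lower_is_better=False):
        self._lock = threading.Lock()
        self._experiments = []
        self._checkpoint_path = Path(checkpoint_path) if checkpoint_path else None
        self._lower_is_better = lower_is_better

    def record(self, label, score):
        with self._lock:
            self._experiments.append({"label": label, "score": score})
            if self._checkpoint_path is not None:
                self._checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
                self._checkpoint_path.write_text(json.dumps(self._experiments))

    def best(self):
        with self._lock:
            if not self._experiments:
                return None
            pick = min if self._lower_is_better else max
            return pick(self._experiments, key=lambda e: e["score"])

    @classmethod
    def from_checkpoint(cls, path, lower_is_better=False):
        tracker = cls(checkpoint_path=path, lower_is_better=lower_is_better)
        tracker._experiments = json.loads(Path(path).read_text())
        return tracker


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoints" / "run.json"
    tracker = TrackerSketch(checkpoint_path=path)
    tracker.record("baseline", 0.62)
    tracker.record("trial-1", 0.71)
    resumed_best = TrackerSketch.from_checkpoint(path).best()
```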

AutoResearch: Advanced Features¶

Completion Report¶

After an optimization run, generate a structured markdown report covering what worked, what failed, and parked ideas:

from latent.optimize.autoresearch.report import render_completion_report

report_md = render_completion_report(
    result,
    brief=brief,
    insights=optimizer._experiment_insights,  # list[str] collected during the run
)
# Sections: Executive Summary, Trial Table, What Worked, What Failed, Parked Ideas, Configuration

The report includes a trial table with per-experiment scores and kept/discarded/failed status, improvement deltas, and duration.

Strategy Document¶

Provide a strategy_md field on ResearchBrief to inject a high-level strategy document into the agent prompt. This guides the agent's approach without constraining specific changes:

brief = ResearchBrief(
    objective="Improve SQL query accuracy",
    metric_name="exact_match",
    direction="higher_is_better",
    relevant_files=["src/agents/sql_agent.py"],
    constraints=["Do not modify test files"],
    strategy_md="""\
Focus on JOIN queries first -- they account for 60% of failures.
Try schema-aware prompting before adding few-shot examples.
The current system prompt mixes directives with SQL examples; separate them.
""",
)

Overfitting Detection¶

Set validation_split to hold out a portion of the dataset for overfitting detection. The optimizer evaluates each kept experiment on the validation set and warns if the validation score diverges from the train score by more than 10%:

optimizer = AutoResearchOptimizer(
    brief=brief,
    eval_fn=eval_fn,
    dataset=eval_df,
    validation_split=0.2,  # 80% train, 20% validation
    max_iterations=20,
)

Validation split requires sufficient data

With small datasets, the split may produce unreliable validation scores. Use at least 100 examples in the full dataset.
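The divergence check can be pictured as a relative-gap comparison. The sketch below is an assumption about how "diverges by more than 10%" might be computed (relative to the train score, in either direction), and `overfit_warning` is a hypothetical name.

```python
def overfit_warning(train_score, val_score, tolerance=0.10):
    """Return a warning string when the relative train/validation gap exceeds tolerance."""
    if train_score == 0:
        return None  # a relative gap is undefined for a zero train score
    gap = abs(train_score - val_score) / abs(train_score)
    if gap > tolerance:
        return f"validation diverges from train by {gap:.0%}"
    return None


close_scores = overfit_warning(0.80, 0.78)   # 2.5% gap: no warning
diverged = overfit_warning(0.80, 0.60)       # 25% gap: warning
```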

Pre-Eval Verification¶

Quality checks (pytest, linting) run before metric evaluation, ensuring that proposed changes don't break the codebase:

optimizer = AutoResearchOptimizer(
    brief=brief,
    eval_fn=eval_fn,
    dataset=eval_df,
    checks=["pytest tests/unit/ -x", "ruff check src/"],  # must pass before eval
)

If any check command returns a non-zero exit code, the iteration is classified as failed and the change is discarded without running the (potentially expensive) metric evaluation.
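The gate described above amounts to running each command and stopping at the first non-zero exit code. This is a sketch with the standard library only. `run_checks` is a hypothetical helper, not the optimizer's internal function, and the `exit` commands stand in for real checks such as pytest.

```python
import subprocess


def run_checks(commands, cwd="."):
    """Run shell commands in order; return (passed, first_failing_command)."""
    for command in commands:
        completed = subprocess.run(
            command, shell=True, cwd=cwd, capture_output=True, text=True
        )
        if completed.returncode != 0:
            return False, command  # skip the remaining checks and the evaluation
    return True, None


all_passed = run_checks(["exit 0", "exit 0"])
first_failure = run_checks(["exit 0", "exit 3", "exit 0"])
```

Ordering cheap checks (linting) before slow ones (test suites) keeps failed iterations inexpensive.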

Prompt Engineering Improvements¶

The AutoResearch agent prompt incorporates context engineering best practices:

KV-cache stability -- static instructions are separated from dynamic per-iteration content.
Edge-anchoring -- critical constraints appear at the beginning and end of the prompt.
Structured insights -- experiment history is formatted as structured blocks rather than free text.
Diff capping -- only the most recent diffs are shown in full; older ones are collapsed to one-liners.
Focus headlines -- each iteration gets a clear headline summarizing the current objective.
Simplicity tiebreaker -- when two approaches score equally, the simpler one is preferred.
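Diff capping, for example, can be sketched in a few lines. `cap_diffs` and the `[collapsed]` marker are hypothetical, and the choice of two full diffs is an assumption rather than the optimizer's documented setting.

```python
def cap_diffs(diffs, keep_full=2):
    """Show the newest `keep_full` diffs in full; collapse older ones to their first line."""
    cutoff = max(len(diffs) - keep_full, 0)
    rendered = []
    for index, diff in enumerate(diffs):
        if index < cutoff:
            lines = diff.splitlines()
            first_line = lines[0] if lines else ""
            rendered.append(f"[collapsed] {first_line}")
        else:
            rendered.append(diff)
    return rendered


diffs = [
    "prompts/sql_system.md: add schema examples\n+ example 1\n+ example 2",
    "src/agents/sql_agent.py: tighten JOIN hint\n- old hint\n+ new hint",
    "prompts/sql_system.md: separate directives\n- mixed block\n+ directives",
]
capped = cap_diffs(diffs, keep_full=2)
```

Capping keeps the dynamic part of the prompt bounded as the experiment history grows.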
