# Prompt Optimizers
Latent ships four prompt optimizers that improve agent and judge prompts through automated experimentation. Each optimizer records every trial via ExperimentTracker, integrates with MLflow, and produces an OptimizationResult you can analyze, compare, and persist.
## When to Use Which Optimizer

| Optimizer | Best for | Method | Extras |
|---|---|---|---|
| `DSPyOptimizer` | Few-shot selection, instruction tuning | DSPy teleprompters | `latent[optimizers]` |
| `ACEOptimizer` | Section-level prompt refinement | ACE skillbook adaptation | `latent[optimizers]` |
| `CombinedOptimizer` | End-to-end two-phase optimization | DSPy candidate generation, then ACE calibration | `latent[optimizers]` |
| `AutoResearchOptimizer` | Codebase-level optimization (prompts, tools, schemas) | Claude Agent SDK + git keep/discard loop | `latent[autoresearch]` |

> **Start with DSPyOptimizer**
>
> For most prompt-tuning tasks, `DSPyOptimizer` with `mipro_v2` is the fastest path to measurable improvement. Graduate to `CombinedOptimizer` or `AutoResearchOptimizer` when you need structural changes beyond prompt wording.
## Installation

```bash
# DSPy, ACE, and Combined optimizers
pip install "latent[optimizers]"

# AutoResearch optimizer (Claude Agent SDK, macOS only)
pip install "latent[autoresearch]"

# Both
pip install "latent[optimizers,autoresearch]"
```
## Core Data Model

All optimizers share these types from `latent.optimize.base`:

```python
from latent.optimize import (
    Optimizer,               # Protocol: any class with .optimize()
    OptimizationResult,      # Full run: experiments, best_score, best_label
    ExperimentResult,        # Single trial: label, score, kept, error, metadata
    OptimizedPrompt,         # Serializable artifact: prompt_template, few_shot_demos
    apply_optimized_prompt,  # Apply OptimizedPrompt to an agent in-place
    ExperimentTracker,       # Thread-safe tracker with MLflow + checkpoint support
)
```
`OptimizationResult` (aliased as `OptimizationReport`) carries trial-level detail:

| Field | Type | Description |
|---|---|---|
| `backend` | `str` | Which optimizer produced this (`"dspy"`, `"ace"`, `"combined"`, `"autoresearch"`) |
| `experiments` | `list[ExperimentResult]` | Every trial with score, kept/discarded status, metadata |
| `best_score` | `float` | Best score achieved |
| `best_label` | `str` | Label of the best experiment |
| `best_config` | `dict` | Configuration that produced the best result |
| `duration_s` | `float` | Wall-clock time |
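Records of this shape are easy to summarize by hand. A minimal self-contained sketch, using plain dicts in place of the real `ExperimentResult` objects (field names follow the table above):

```python
# Summarize trial records shaped like ExperimentResult (illustrative dicts,
# not the real objects): kept rate and best trial.
trials = [
    {"label": "trial-0", "score": 0.61, "kept": True},
    {"label": "trial-1", "score": 0.58, "kept": False},
    {"label": "trial-2", "score": 0.74, "kept": True},
]

kept = [t for t in trials if t["kept"]]
kept_rate = len(kept) / len(trials)
best = max(trials, key=lambda t: t["score"])

print(f"best={best['label']} score={best['score']:.2f} kept_rate={kept_rate:.2f}")
```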
## DSPyOptimizer

Wraps a `BaseAgent` or `Judge` in a DSPy module and runs a teleprompter to optimize instructions and few-shot demonstrations.

### Supported Teleprompters

| Teleprompter | Strategy | Key parameter |
|---|---|---|
| `mipro_v2` | Bayesian instruction + demo search | `num_trials` |
| `copro` | Cooperative prompt optimization | `breadth` |
| `bootstrap_fewshot` | Bootstrap demonstrations from teacher | `max_bootstrapped_demos` |
| `simba` | Step-wise instruction bootstrapping | `num_steps` |
| `gepa` | Genetic prompt algorithm | `num_iterations` |
### Usage

```python
from latent.optimize import DSPyOptimizer, apply_optimized_prompt, OptimizedPrompt
from latent.agents import Judge
from latent.agents.scores import ScoredModel, OrdinalScore
from typing import Annotated


class Scores(ScoredModel):
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]


judge = Judge("qa_judge", model="gpt-4o", output_type=Scores)

opt = DSPyOptimizer(
    judge,
    teleprompter="mipro_v2",
    teacher_model="gpt-4o",       # optional: model for generating demos
    student_model="gpt-4o-mini",  # optional: model for the student
    train_split=0.8,
    seed=42,
)


def metric(example, prediction, trace=None) -> float:
    return float(prediction.quality >= 3)


result = opt.optimize(
    train_data=data,  # list[dict] with input/output pairs
    metric=metric,
    max_iterations=10,
)

print(f"Best score: {result.best_score}")

# Extract and apply the optimized prompt
best_prompt = OptimizedPrompt.from_best(result)
apply_optimized_prompt(judge, best_prompt)
```
### Configuration

Pass additional teleprompter-specific keyword arguments directly to the constructor:

```python
opt = DSPyOptimizer(
    agent,
    teleprompter="mipro_v2",
    # Extra kwargs forwarded to dspy.MIPROv2(...)
    teacher_settings={"lm": my_teacher_lm},
)
```
> **Metric signature**
>
> DSPy metrics use the signature `(example, prediction, trace) -> float`. This differs from the ACE convention. When using `CombinedOptimizer`, pass `dspy_metric` and `ace_metric` separately.
## ACEOptimizer
Uses the ACE (Automatic Calibration Engine) framework for section-level prompt refinement. The prompt is decomposed into a skillbook -- a collection of named sections that ACE adapts independently based on evaluation feedback.
### Skillbook Concept

A skillbook is a structured prompt divided into sections. Each `SectionConfig` defines one section:

```python
from latent.optimize.ace.skillbook import SectionConfig

sections = [
    SectionConfig(
        name="instructions",
        description="Core task instructions",
        initial_content="You are a quality scorer...",
        max_skills=10,
    ),
    SectionConfig(
        name="rubric",
        description="Scoring rubric and criteria",
        initial_content="Score from 1-5 based on...",
    ),
    SectionConfig(
        name="examples",
        description="Few-shot examples",
        initial_content="Example 1: ...",
    ),
]
```
Each round, ACE evaluates the agent, then calls skillbook.adapt(feedback) to refine individual sections based on what worked and what did not.
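The shape of that evaluate-then-adapt loop can be pictured with a toy stand-in for the skillbook. The dict-based `ToySkillbook` below is purely illustrative, not the real ACE API:

```python
# Illustrative only: a toy stand-in for an ACE skillbook, showing the
# per-round evaluate -> adapt cycle. Not the real ACE API.
class ToySkillbook:
    def __init__(self, sections: dict[str, str]):
        self.sections = dict(sections)

    def adapt(self, feedback: dict[str, str]) -> None:
        # Refine only the sections that received feedback this round.
        for name, note in feedback.items():
            if name in self.sections:
                self.sections[name] += f"\n# revised: {note}"


book = ToySkillbook({
    "instructions": "Score quality 1-5.",
    "rubric": "Prefer cited answers.",
})

for round_idx in range(3):
    # A real run would evaluate the agent here and derive feedback
    # from its mistakes; this feedback is stubbed for illustration.
    feedback = {"rubric": f"round {round_idx}: penalize missing citations"}
    book.adapt(feedback)
```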
### Usage

```python
from latent.optimize import ACEOptimizer
from latent.optimize.ace.adapter import ACEAdapterConfig

opt = ACEOptimizer(
    agent,
    sections=sections,  # optional: defaults to a single "instructions" section
    adapter_config=ACEAdapterConfig(
        cache_enabled=True,
        max_retries=3,
        timeout=30.0,
    ),
)


def metric(expected: str, predicted: str) -> float:
    return 1.0 if expected.strip() == predicted.strip() else 0.0


result = opt.optimize(
    train_data=data,
    metric=metric,  # (expected, predicted) -> float
    max_iterations=10,
)
```
> **ACE framework dependency**
>
> If `ace-framework` is not installed, the optimizer runs with a mock skillbook that does not adapt between rounds. Install `ace-framework` separately for real adaptation.
## CombinedOptimizer
Chains DSPy (candidate generation) and ACE (calibration) in a two-phase pipeline. A single `ExperimentTracker` spans both phases.

1. **Phase 1 -- DSPy**: generates candidate prompts via teleprompter optimization.
2. **Phase 2 -- ACE**: takes the best DSPy candidate and refines it section by section.

Iterations are split evenly between the two phases (`max_iterations // 2` each).
### Usage

```python
from latent.optimize import CombinedOptimizer

opt = CombinedOptimizer(
    agent,
    dspy_config={
        "teleprompter": "mipro_v2",
        "teacher_model": "gpt-4o",
    },
    ace_config={
        "sections": [
            {"name": "instructions", "description": "Core instructions", "initial_content": "..."},
        ],
        "adapter_config": {"cache_enabled": True},
    },
)

result = opt.optimize(
    train_data=data,
    dspy_metric=dspy_metric,  # (example, prediction, trace) -> float
    ace_metric=ace_metric,    # (expected_str, predicted_str) -> float
    max_iterations=20,        # 10 DSPy + 10 ACE
)
```
> **Separate metrics**
>
> DSPy and ACE use incompatible metric signatures. Pass `dspy_metric` and `ace_metric` separately. If you pass a single metric, it must satisfy both calling conventions.
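If you do want one function to serve both phases, one option is a small adapter that dispatches on how it is called. A sketch, assuming the ACE phase passes two strings positionally while the DSPy phase passes example/prediction objects (the `quality` attribute below is a hypothetical prediction field, as in the DSPy example earlier):

```python
def dual_metric(a, b, trace=None) -> float:
    """Serve both metric conventions from one function (illustrative sketch).

    DSPy calls (example, prediction, trace) -> float; ACE calls
    (expected_str, predicted_str) -> float. Dispatch on argument types.
    """
    if isinstance(a, str) and isinstance(b, str):
        # ACE convention: compare normalized strings.
        return 1.0 if a.strip() == b.strip() else 0.0
    # DSPy convention: score the prediction object against the example.
    # "quality" is a hypothetical field on the prediction, for illustration.
    return float(getattr(b, "quality", 0) >= 3)


print(dual_metric(" yes ", "yes"))  # ACE-style call -> 1.0
```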
## AutoResearchOptimizer
An autonomous code optimizer that uses the Claude Agent SDK to read your codebase, propose changes, and keep or discard them based on metric evaluation. Unlike the other optimizers which tune prompt text, AutoResearch modifies actual source files (prompts, tool definitions, schemas) in a git-managed loop.
### How It Works

Each iteration:

1. **Agent session** -- Claude reads the codebase and applies one targeted change.
2. **Quality checks** -- runs pytest or other shell commands (controlled by the optimizer, not the agent).
3. **Metric evaluation** -- calls your async `metric_fn` on a dataset sample.
4. **Progressive confirmation** -- optionally re-evaluates on a larger sample if the initial result improves.
5. **Git decision** -- commits (keep) or hard-resets (discard).
6. **Insight tracking** -- classifies each experiment and injects analysis into the next iteration.
### Research Brief

The `ResearchBrief` defines the optimization objective:

```python
from latent.optimize.autoresearch.brief import ResearchBrief

brief = ResearchBrief(
    objective="Improve the text-to-SQL agent's query accuracy",
    metric_name="exact_match",
    direction="higher_is_better",
    relevant_files=[
        "src/agents/sql_agent.py",
        "src/tools/schema_lookup.py",
        "prompts/sql_system.md",
    ],
    constraints=[
        "Do not modify test files",
        "Keep latency under 5 seconds per query",
    ],
    prior_context="The agent currently struggles with JOIN queries.",
    ideas=["Try adding schema examples to the prompt"],
)
```
### Usage

```python
from latent.optimize.autoresearch.optimizer import AutoResearchOptimizer
from pathlib import Path


async def eval_fn(sample: list[dict]) -> dict[str, float]:
    # Run your evaluation, return a metric dict
    correct = sum(1 for row in sample if row["predicted"] == row["expected"])
    return {"exact_match": correct / len(sample)}


optimizer = AutoResearchOptimizer(
    brief=brief,
    metric_fn=eval_fn,
    dataset=eval_df,                   # pandas DataFrame
    stratify_column="category",        # stratified sampling
    sample_size=50,                    # tier-1 sample size
    confirmation_size=200,             # tier-2 confirmation sample
    repo_root=Path("."),
    max_iterations=20,
    patience=5,                        # stop after N iterations without improvement
    checks=["pytest tests/unit/ -x"],  # shell commands run before eval
    agent_model="claude-opus-4-6",
    agent_max_turns=15,
    allowed_tools=["Read", "Write", "Edit", "Glob", "Grep"],
    checkpoint_path=Path(".autoresearch/checkpoint.json"),
    scope_paths=["src/agents/", "prompts/"],  # limit git ops to these paths
)

result = await optimizer.optimize()
```
### Key Features

- **Stratified subsampling** -- uses `stratify_column` to ensure balanced evaluation across categories. Progressive difficulty oversamples from historically weak categories.
- **Experiment insights** -- each result is classified as worked/failed/promising and injected into the next iteration's prompt so the agent learns from history.
- **Failure analysis** -- per-category score breakdowns are computed and fed back to the agent.
- **Rollback on deep regression** -- if scores drop by more than 15% for 3 consecutive iterations, reverts to the best checkpoint.
- **Checkpoint resume** -- pass `checkpoint_path` to resume interrupted runs. The optimizer rebuilds experiment insights from history on resume.
- **Scope isolation** -- `scope_paths` limits git keep/discard to specific directories, protecting other agents' concurrent work.
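Stratified subsampling itself is a simple idea. A self-contained sketch in plain Python (not the optimizer's internal implementation), drawing roughly equal counts from each category:

```python
import random
from collections import defaultdict


def stratified_sample(rows: list[dict], stratify_column: str,
                      sample_size: int, seed: int = 42) -> list[dict]:
    """Sample roughly evenly across categories (illustrative sketch)."""
    rng = random.Random(seed)
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        groups[row[stratify_column]].append(row)

    # Split the budget evenly across categories, capped by group size.
    per_group = max(1, sample_size // len(groups))
    sample: list[dict] = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


rows = [{"category": c, "id": i} for i, c in enumerate("AABBBBCC" * 10)]
picked = stratified_sample(rows, "category", sample_size=9)
```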
> **macOS only**
>
> AutoResearchOptimizer requires the Claude Agent SDK, which is currently macOS-only. The agent runs without a shell tool by default -- quality checks are executed by the optimizer process, not the agent.
## Analyze and Compare Results

### Trajectory Analysis

```python
from latent.optimize import analyze_trajectory

report = analyze_trajectory(result)

# report.metrics includes:
# - improvement: delta between first and last kept score
# - kept_rate: fraction of experiments kept (with Wilson CI)
# - convergence_delta: score change over last 20% of experiments
```
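For reference, the Wilson score interval used for `kept_rate` follows a standard formula, shown here at 95% confidence. This is a sketch of the math, not the library's internal code:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))


# e.g. 7 of 10 experiments kept
lo, hi = wilson_interval(successes=7, n=10)
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small trial counts, which matters for short optimization runs.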
### Compare Two Optimizers

```python
from latent.optimize import compare_optimizers

comparison = compare_optimizers(result_dspy, result_ace)
# ComparisonResult with delta, CI, p-value, effect size
```
### Visualization

```python
from latent.optimize import plot_optimization_progress, plot_compare_optimizers

# Single run: green=kept, grey=discarded, red=failed, navy line=running best
fig = plot_optimization_progress(result, title="DSPy MIPROv2")

# Overlay multiple runs
fig = plot_compare_optimizers(
    [result_dspy, result_ace, result_combined],
    labels=["DSPy", "ACE", "Combined"],
)
```
## Common Pattern: Optimize, Apply, Evaluate

The standard workflow across all optimizers:

```python
from pathlib import Path

from latent.optimize import DSPyOptimizer, OptimizedPrompt, apply_optimized_prompt

# 1. Optimize
opt = DSPyOptimizer(agent, teleprompter="mipro_v2")
result = opt.optimize(train_data=train, metric=my_metric, max_iterations=10)

# 2. Extract best prompt
best = OptimizedPrompt.from_best(result)

# 3. Save artifact for reproducibility
best.save(Path("artifacts/optimized_prompt.json"))

# 4. Apply to agent
apply_optimized_prompt(agent, best)

# 5. Evaluate on held-out data
from latent.flows.judge_flow import judge_flow

eval_result = judge_flow(eval_data=test_df, judge=agent, gates={"quality": 3.5})
```
### Loading a Saved Prompt

```python
from pathlib import Path

from latent.optimize import OptimizedPrompt, apply_optimized_prompt

prompt = OptimizedPrompt.from_file(Path("artifacts/optimized_prompt.json"))
apply_optimized_prompt(agent, prompt)
```
## ExperimentTracker

All optimizers use `ExperimentTracker` for thread-safe recording, checkpointing, and MLflow integration:

```python
from pathlib import Path

from latent.optimize import ExperimentTracker

tracker = ExperimentTracker(
    lower_is_better=False,
    checkpoint_path=Path("checkpoints/run.json"),
    auto_checkpoint=True,
    on_experiment=lambda exp: print(f"{exp.label}: {exp.score:.4f}"),
)

# Share a tracker across optimizers
opt = DSPyOptimizer(agent, tracker=tracker)
```
Resume from a checkpoint: