# Prompt Optimizers
Latent ships four prompt optimizers that improve agent and judge prompts through automated experimentation. Each optimizer records every trial via ExperimentTracker, integrates with MLflow, and produces an OptimizationResult you can analyze, compare, and persist.
## When to Use Which Optimizer

| Optimizer | Best for | Method | Extras |
|---|---|---|---|
| `DSPyOptimizer` | Few-shot selection, instruction tuning | DSPy teleprompters | `latent[optimizers]` |
| `ACEOptimizer` | Section-level prompt refinement | ACE skillbook adaptation | `latent[optimizers]` |
| `CombinedOptimizer` | End-to-end two-phase optimization | DSPy candidate generation, then ACE calibration | `latent[optimizers]` |
| `AutoResearchOptimizer` | Codebase-level optimization (prompts, tools, schemas) | Claude Agent SDK + git keep/discard loop | `latent[autoresearch]` |

> **Start with DSPyOptimizer**
>
> For most prompt-tuning tasks, `DSPyOptimizer` with `mipro_v2` is the fastest path to measurable improvement. Graduate to `CombinedOptimizer` or `AutoResearchOptimizer` when you need structural changes beyond prompt wording.
## Installation

```bash
# DSPy, ACE, and Combined optimizers
pip install "latent[optimizers]"

# AutoResearch optimizer (Claude Agent SDK, macOS only)
pip install "latent[autoresearch]"

# Both
pip install "latent[optimizers,autoresearch]"
```
## Core Data Model

All optimizers share these types from `latent.optimize.base`:

```python
from latent.optimize import (
    Optimizer,               # Protocol: any class with .optimize()
    OptimizationResult,      # Full run: experiments, best_score, best_label
    ExperimentResult,        # Single trial: label, score, kept, error, metadata
    OptimizedPrompt,         # Serializable artifact: prompt_template, few_shot_demos
    apply_optimized_prompt,  # Apply OptimizedPrompt to an agent in-place
    ExperimentTracker,       # Thread-safe tracker with MLflow + checkpoint support
)
```
`OptimizationResult` (aliased as `OptimizationReport`) carries trial-level detail:

| Field | Type | Description |
|---|---|---|
| `backend` | `str` | Which optimizer produced this (`"dspy"`, `"ace"`, `"combined"`, `"autoresearch"`) |
| `experiments` | `list[ExperimentResult]` | Every trial with score, kept/discarded status, metadata |
| `best_score` | `float` | Best score achieved |
| `best_label` | `str` | Label of the best experiment |
| `best_config` | `dict` | Configuration that produced the best result |
| `duration_s` | `float` | Wall-clock time |
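Records of this shape are easy to summarize by hand. A minimal self-contained sketch, using plain dicts in place of the real `ExperimentResult` objects (field names follow the table above):

```python
# Summarize trial records shaped like ExperimentResult (illustrative dicts,
# not the real objects): kept rate and best trial.
trials = [
    {"label": "trial-0", "score": 0.61, "kept": True},
    {"label": "trial-1", "score": 0.58, "kept": False},
    {"label": "trial-2", "score": 0.74, "kept": True},
]

kept = [t for t in trials if t["kept"]]
kept_rate = len(kept) / len(trials)
best = max(trials, key=lambda t: t["score"])

print(f"best={best['label']} score={best['score']:.2f} kept_rate={kept_rate:.2f}")
```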
## DSPyOptimizer

Wraps a `BaseAgent` or `Judge` in a DSPy module and runs a teleprompter to optimize instructions and few-shot demonstrations.

### Supported Teleprompters

| Teleprompter | Strategy | Key parameter |
|---|---|---|
| `mipro_v2` | Bayesian instruction + demo search | `num_trials` |
| `copro` | Cooperative prompt optimization | `breadth` |
| `bootstrap_fewshot` | Bootstrap demonstrations from teacher | `max_bootstrapped_demos` |
| `simba` | Step-wise instruction bootstrapping | `num_steps` |
| `gepa` | Genetic prompt algorithm | `num_iterations` |
### Usage

```python
from latent.optimize import DSPyOptimizer, apply_optimized_prompt, OptimizedPrompt
from latent.agents import Judge
from latent.agents.scores import ScoredModel, OrdinalScore
from typing import Annotated


class Scores(ScoredModel):
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]


judge = Judge("qa_judge", model="gpt-4o", output_type=Scores)

opt = DSPyOptimizer(
    judge,
    teleprompter="mipro_v2",
    teacher_model="gpt-4o",       # optional: model for generating demos
    student_model="gpt-4o-mini",  # optional: model for the student
    train_split=0.8,
    seed=42,
)


def metric(example, prediction, trace=None) -> float:
    return float(prediction.quality >= 3)


result = opt.optimize(
    train_data=data,  # list[dict] with input/output pairs
    metric=metric,
    max_iterations=10,
)

print(f"Best score: {result.best_score}")

# Extract and apply the optimized prompt
best_prompt = OptimizedPrompt.from_best(result)
apply_optimized_prompt(judge, best_prompt)
```
### Configuration

Pass additional teleprompter-specific keyword arguments directly to the constructor:

```python
opt = DSPyOptimizer(
    agent,
    teleprompter="mipro_v2",
    # Extra kwargs forwarded to dspy.MIPROv2(...)
    teacher_settings={"lm": my_teacher_lm},
)
```
> **Metric signature**
>
> DSPy metrics use the signature `(example, prediction, trace) -> float`. This differs from the ACE convention. When using `CombinedOptimizer`, pass `dspy_metric` and `ace_metric` separately.
## ACEOptimizer
Uses the ACE (Automatic Calibration Engine) framework for section-level prompt refinement. The prompt is decomposed into a skillbook -- a collection of named sections that ACE adapts independently based on evaluation feedback.
### Skillbook Concept

A skillbook is a structured prompt divided into sections. Each `SectionConfig` defines one section:

```python
from latent.optimize.ace.skillbook import SectionConfig

sections = [
    SectionConfig(
        name="instructions",
        description="Core task instructions",
        initial_content="You are a quality scorer...",
        max_skills=10,
    ),
    SectionConfig(
        name="rubric",
        description="Scoring rubric and criteria",
        initial_content="Score from 1-5 based on...",
    ),
    SectionConfig(
        name="examples",
        description="Few-shot examples",
        initial_content="Example 1: ...",
    ),
]
```
Each round, ACE evaluates the agent, then calls skillbook.adapt(feedback) to refine individual sections based on what worked and what did not.
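The shape of that evaluate-then-adapt loop can be pictured with a toy stand-in for the skillbook. The dict-based `ToySkillbook` below is purely illustrative, not the real ACE API:

```python
# Illustrative only: a toy stand-in for an ACE skillbook, showing the
# per-round evaluate -> adapt cycle. Not the real ACE API.
class ToySkillbook:
    def __init__(self, sections: dict[str, str]):
        self.sections = dict(sections)

    def adapt(self, feedback: dict[str, str]) -> None:
        # Refine only the sections that received feedback this round.
        for name, note in feedback.items():
            if name in self.sections:
                self.sections[name] += f"\n# revised: {note}"


book = ToySkillbook({
    "instructions": "Score quality 1-5.",
    "rubric": "Prefer cited answers.",
})

for round_idx in range(3):
    # A real run would evaluate the agent here and derive feedback
    # from its mistakes; this feedback is stubbed for illustration.
    feedback = {"rubric": f"round {round_idx}: penalize missing citations"}
    book.adapt(feedback)
```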
### Usage

```python
from latent.optimize import ACEOptimizer
from latent.optimize.ace.adapter import ACEAdapterConfig

opt = ACEOptimizer(
    agent,
    sections=sections,  # optional: defaults to a single "instructions" section
    adapter_config=ACEAdapterConfig(
        cache_enabled=True,
        max_retries=3,
        timeout=30.0,
    ),
)


def metric(expected: str, predicted: str) -> float:
    return 1.0 if expected.strip() == predicted.strip() else 0.0


result = opt.optimize(
    train_data=data,
    metric=metric,  # (expected, predicted) -> float
    max_iterations=10,
)
```
> **ACE framework dependency**
>
> If `ace-framework` is not installed, the optimizer runs with a mock skillbook that does not adapt between rounds. Install `ace-framework` separately for real adaptation.
## CombinedOptimizer
Chains DSPy (candidate generation) and ACE (calibration) in a two-phase pipeline. A single `ExperimentTracker` spans both phases.

1. **Phase 1 -- DSPy**: generates candidate prompts via teleprompter optimization.
2. **Phase 2 -- ACE**: takes the best DSPy candidate and refines it section by section.

Iterations are split evenly between the two phases (`max_iterations // 2` each).
### Usage

```python
from latent.optimize import CombinedOptimizer

opt = CombinedOptimizer(
    agent,
    dspy_config={
        "teleprompter": "mipro_v2",
        "teacher_model": "gpt-4o",
    },
    ace_config={
        "sections": [
            {"name": "instructions", "description": "Core instructions", "initial_content": "..."},
        ],
        "adapter_config": {"cache_enabled": True},
    },
)

result = opt.optimize(
    train_data=data,
    dspy_metric=dspy_metric,  # (example, prediction, trace) -> float
    ace_metric=ace_metric,    # (expected_str, predicted_str) -> float
    max_iterations=20,        # 10 DSPy + 10 ACE
)
```
> **Separate metrics**
>
> DSPy and ACE use incompatible metric signatures. Pass `dspy_metric` and `ace_metric` separately. If you pass a single metric, it must satisfy both calling conventions.
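If you do want one function to serve both phases, one option is a small adapter that dispatches on how it is called. A sketch, assuming the ACE phase passes two strings positionally while the DSPy phase passes example/prediction objects (the `quality` attribute below is a hypothetical prediction field, as in the DSPy example earlier):

```python
def dual_metric(a, b, trace=None) -> float:
    """Serve both metric conventions from one function (illustrative sketch).

    DSPy calls (example, prediction, trace) -> float; ACE calls
    (expected_str, predicted_str) -> float. Dispatch on argument types.
    """
    if isinstance(a, str) and isinstance(b, str):
        # ACE convention: compare normalized strings.
        return 1.0 if a.strip() == b.strip() else 0.0
    # DSPy convention: score the prediction object against the example.
    # "quality" is a hypothetical field on the prediction, for illustration.
    return float(getattr(b, "quality", 0) >= 3)


print(dual_metric(" yes ", "yes"))  # ACE-style call -> 1.0
```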
## AutoResearchOptimizer
An autonomous code optimizer that uses the Claude Agent SDK to read your codebase, propose changes, and keep or discard them based on metric evaluation. Unlike the other optimizers which tune prompt text, AutoResearch modifies actual source files (prompts, tool definitions, schemas) in a git-managed loop.
### How It Works

Each iteration:

1. **Agent session** -- Claude reads the codebase and applies one targeted change.
2. **Quality checks** -- runs pytest or other shell commands (controlled by the optimizer, not the agent).
3. **Metric evaluation** -- calls your async `metric_fn` on a dataset sample.
4. **Progressive confirmation** -- optionally re-evaluates on a larger sample if the initial result improves.
5. **Git decision** -- commits (keep) or hard-resets (discard).
6. **Insight tracking** -- classifies each experiment and injects analysis into the next iteration.
### Research Brief

The `ResearchBrief` defines the optimization objective:

```python
from latent.optimize.autoresearch.brief import ResearchBrief

brief = ResearchBrief(
    objective="Improve the text-to-SQL agent's query accuracy",
    metric_name="exact_match",
    direction="higher_is_better",
    relevant_files=[
        "src/agents/sql_agent.py",
        "src/tools/schema_lookup.py",
        "prompts/sql_system.md",
    ],
    constraints=[
        "Do not modify test files",
        "Keep latency under 5 seconds per query",
    ],
    prior_context="The agent currently struggles with JOIN queries.",
    ideas=["Try adding schema examples to the prompt"],
)
```
### Usage

```python
from latent.optimize.autoresearch.optimizer import AutoResearchOptimizer
from pathlib import Path


async def eval_fn(sample: list[dict]) -> dict[str, float]:
    # Run your evaluation, return a metric dict
    correct = sum(1 for row in sample if row["predicted"] == row["expected"])
    return {"exact_match": correct / len(sample)}


optimizer = AutoResearchOptimizer(
    brief=brief,
    metric_fn=eval_fn,
    dataset=eval_df,                   # pandas DataFrame
    stratify_column="category",        # stratified sampling
    sample_size=50,                    # tier-1 sample size
    confirmation_size=200,             # tier-2 confirmation sample
    repo_root=Path("."),
    max_iterations=20,
    patience=5,                        # stop after N iterations without improvement
    checks=["pytest tests/unit/ -x"],  # shell commands run before eval
    agent_model="claude-opus-4-6",
    agent_max_turns=15,
    allowed_tools=["Read", "Write", "Edit", "Glob", "Grep"],
    checkpoint_path=Path(".autoresearch/checkpoint.json"),
    scope_paths=["src/agents/", "prompts/"],  # limit git ops to these paths
)

result = await optimizer.optimize()
```
### Key Features

- **Stratified subsampling** -- uses `stratify_column` to ensure balanced evaluation across categories. Progressive difficulty oversamples from historically weak categories.
- **Experiment insights** -- each result is classified as worked/failed/promising and injected into the next iteration's prompt so the agent learns from history.
- **Failure analysis** -- per-category score breakdowns are computed and fed back to the agent.
- **Rollback on deep regression** -- if scores drop by more than 15% for 3 consecutive iterations, reverts to the best checkpoint.
- **Checkpoint resume** -- pass `checkpoint_path` to resume interrupted runs. The optimizer rebuilds experiment insights from history on resume.
- **Scope isolation** -- `scope_paths` limits git keep/discard to specific directories, protecting other agents' concurrent work.
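Stratified subsampling itself is a simple idea. A self-contained sketch in plain Python (not the optimizer's internal implementation), drawing roughly equal counts from each category:

```python
import random
from collections import defaultdict


def stratified_sample(rows: list[dict], stratify_column: str,
                      sample_size: int, seed: int = 42) -> list[dict]:
    """Sample roughly evenly across categories (illustrative sketch)."""
    rng = random.Random(seed)
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        groups[row[stratify_column]].append(row)

    # Split the budget evenly across categories, capped by group size.
    per_group = max(1, sample_size // len(groups))
    sample: list[dict] = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


rows = [{"category": c, "id": i} for i, c in enumerate("AABBBBCC" * 10)]
picked = stratified_sample(rows, "category", sample_size=9)
```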
> **macOS only**
>
> AutoResearchOptimizer requires the Claude Agent SDK, which is currently macOS-only. The agent runs without a shell tool by default -- quality checks are executed by the optimizer process, not the agent.
## Analyze and Compare Results

### Trajectory Analysis

```python
from latent.optimize import analyze_trajectory

report = analyze_trajectory(result)

# report.metrics includes:
# - improvement: delta between first and last kept score
# - kept_rate: fraction of experiments kept (with Wilson CI)
# - convergence_delta: score change over last 20% of experiments
```
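For reference, the Wilson score interval used for `kept_rate` follows a standard formula, shown here at 95% confidence. This is a sketch of the math, not the library's internal code:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))


# e.g. 7 of 10 experiments kept
lo, hi = wilson_interval(successes=7, n=10)
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small trial counts, which matters for short optimization runs.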
### Compare Two Optimizers

```python
from latent.optimize import compare_optimizers

comparison = compare_optimizers(result_dspy, result_ace)
# ComparisonResult with delta, CI, p-value, effect size
```
### Visualization

```python
from latent.optimize import plot_optimization_progress, plot_compare_optimizers

# Single run: green=kept, grey=discarded, red=failed, navy line=running best
fig = plot_optimization_progress(result, title="DSPy MIPROv2")

# Overlay multiple runs
fig = plot_compare_optimizers(
    [result_dspy, result_ace, result_combined],
    labels=["DSPy", "ACE", "Combined"],
)
```
## Common Pattern: Optimize, Apply, Evaluate

The standard workflow across all optimizers:

```python
from pathlib import Path

from latent.optimize import DSPyOptimizer, OptimizedPrompt, apply_optimized_prompt

# 1. Optimize
opt = DSPyOptimizer(agent, teleprompter="mipro_v2")
result = opt.optimize(train_data=train, metric=my_metric, max_iterations=10)

# 2. Extract best prompt
best = OptimizedPrompt.from_best(result)

# 3. Save artifact for reproducibility
best.save(Path("artifacts/optimized_prompt.json"))

# 4. Apply to agent
apply_optimized_prompt(agent, best)

# 5. Evaluate on held-out data
from latent.flows.judge_flow import judge_flow

eval_result = judge_flow(eval_data=test_df, judge=agent, gates={"quality": 3.5})
```
### Loading a Saved Prompt

```python
from pathlib import Path

from latent.optimize import OptimizedPrompt, apply_optimized_prompt

prompt = OptimizedPrompt.from_file(Path("artifacts/optimized_prompt.json"))
apply_optimized_prompt(agent, prompt)
```
## ExperimentTracker

All optimizers use `ExperimentTracker` for thread-safe recording, checkpointing, and MLflow integration:

```python
from pathlib import Path

from latent.optimize import ExperimentTracker

tracker = ExperimentTracker(
    lower_is_better=False,
    checkpoint_path=Path("checkpoints/run.json"),
    auto_checkpoint=True,
    on_experiment=lambda exp: print(f"{exp.label}: {exp.score:.4f}"),
)

# Share a tracker across optimizers
opt = DSPyOptimizer(agent, tracker=tracker)
```
Resume from a checkpoint: