# Latent
Opinionated framework for AI agent evaluation workflows.
Latent is a batteries-included Python platform for building, evaluating, and optimizing AI agents. It brings together flow orchestration, LLM-as-judge evaluation, guardrails, prompt optimization, RAG pipelines, and rigorous statistical analysis under one coherent API -- all wired to MLflow for experiment tracking out of the box.
## Installation

```bash
pip install latent
```

Or with uv (recommended):

```bash
uv add latent
```
Optional extras unlock additional capabilities:
```bash
pip install latent[eval]        # Full eval platform (Prefect, pandas, scipy)
pip install latent[tracking]    # MLflow experiment tracking
pip install latent[rag]         # RAG + RAPTOR retrieval
pip install latent[optimizers]  # DSPy + ACE prompt optimization
pip install latent[guardrails]  # Input/output guardrail scanners
pip install latent[chat]        # Interactive TUI agent chat
```
## Quick Example
Score free-text outputs with an LLM judge and get statistical confidence intervals:
```python
from typing import Annotated

from latent.agents import Judge
from latent.agents.scores import ScoredModel, OrdinalScore
from latent.flows.judge_flow import judge_flow

class QAScores(ScoredModel):
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]
    faithfulness: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]

judge = Judge("qa_scorer", model="gpt-4o", output_type=QAScores)

# df: a DataFrame of rows to score (see the eval data catalog docs)
result = judge_flow(eval_data=df, judge=judge, gates={"quality": 3.5})
print(result["markdown"])  # Statistical report with bootstrap CIs
```
## Platform Capabilities

### Orchestration
- `@flow`/`@task` decorators -- declarative evaluation pipelines with auto-config, caching, retries, and structured logging
- Data catalog -- YAML-driven dataset management with schema validation and cross-flow references
- Infrastructure CLI -- `latent infra up` starts PostgreSQL, Prefect, MLflow, and workers in one command
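Latent's `@flow`/`@task` decorators are its own API; as a rough sketch of the retry behavior such a decorator typically provides (the names below are illustrative stand-ins, not Latent's implementation):

```python
import functools
import time

def task(retries: int = 2, delay: float = 0.0):
    """Illustrative retry decorator (hypothetical; not Latent's actual @task)."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise          # out of retries: surface the error
                    time.sleep(delay)  # back off before the next attempt
        return inner
    return wrap

calls = {"n": 0}

@task(retries=2)
def flaky():
    """Fails twice, then succeeds -- the decorator absorbs the failures."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"
```

The real decorators layer caching, config, and structured logging on the same wrapping pattern.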
### Eval Agents

- `Judge` -- score any data row against a Pydantic output schema with auto-generated rubrics
- `Classifier` -- semantic classification with accuracy, F1, precision, and recall metrics
- `ScoredModel` -- composable score annotations (ordinal, binary, continuous) with automatic rationale fields
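The score-annotation pattern builds on `typing.Annotated`. A simplified, self-contained sketch of how pass-threshold metadata can be read back from annotations (the classes here are stand-ins, not Latent's):

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints

@dataclass(frozen=True)
class OrdinalScore:
    """Stand-in for an ordinal score annotation with a pass threshold."""
    scale: tuple
    pass_threshold: int

class QAScores:
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]
    faithfulness: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]

def pass_fail(model_cls, values: dict) -> dict:
    """Compare each scored field against its annotated pass threshold."""
    hints = get_type_hints(model_cls, include_extras=True)
    return {
        name: values[name] >= hint.__metadata__[0].pass_threshold
        for name, hint in hints.items()
    }
```

Keeping thresholds in the type annotation lets one schema drive both the judge's rubric and the downstream pass/fail gating.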
### Pre-built Eval Flows
Ready-to-use pipelines for common evaluation patterns:
| Flow | Purpose |
|---|---|
| `judge_flow` | Score free-text outputs with an LLM judge |
| `classification_flow` | Classify and compute accuracy/F1/precision/recall |
| `comparison_flow` | Compare model A vs B with statistical tests |
| `conversation_scoring_flow` | Turn-level conversation quality scoring |
| `conversation_sop_flow` | SOP compliance checking |
| `drift_flow` | Distribution drift detection across runs |
### Guardrails

- `@guardrail` decorator -- attach pre/post scanning rules directly to agent classes
- Middleware pipeline -- composable `GuardrailMiddleware` wrapping any agent
- Built-in scanners -- language detection, invisible text, token limits, LLM-backed custom rules
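A minimal sketch of the composable-scanner idea, assuming toy scanner classes (these are illustrative, not Latent's built-in scanners):

```python
class TokenLimitScanner:
    """Toy scanner: flag outputs longer than a whitespace-token budget."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens

    def scan(self, text: str) -> bool:
        return len(text.split()) <= self.max_tokens

class AsciiScanner:
    """Toy scanner: require ASCII-only text as a crude language check."""
    def scan(self, text: str) -> bool:
        return text.isascii()

class GuardrailPipeline:
    """Run every scanner; the text passes only if all of them pass."""
    def __init__(self, scanners):
        self.scanners = scanners

    def check(self, text: str) -> bool:
        return all(s.scan(text) for s in self.scanners)
```

Because each scanner exposes the same `scan` interface, pipelines compose freely and the same scanners can run pre-input or post-output.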
### Prompt Optimization

- `DSPyOptimizer` -- automatic prompt and few-shot demo optimization via DSPy teleprompters
- `ACEOptimizer` -- section-level prompt refinement using the ACE framework
- `AutoResearchOptimizer` -- full codebase optimization with Claude Agent SDK and git-based keep/discard loops
### RAG Pipelines

- `ChromaAdapter` -- vector store integration with automatic embedding
- `HybridRetriever` -- combined dense + BM25 sparse retrieval
- `RAGAgent` -- config-driven agent with pluggable retrieval pipeline
- `RAPTOR` -- recursive abstractive processing for tree-organized retrieval
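Hybrid retrieval has to merge the dense and sparse result lists into one ranking. One common fusion rule is reciprocal rank fusion, shown here as a generic sketch (not necessarily what `HybridRetriever` uses internally):

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Merge several ranked doc-id lists: each document earns 1/(k + rank)
    from every list it appears in, so agreement near the top dominates."""
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant `k` damps the influence of low-ranked hits; 60 is the value from the original RRF paper and works well without tuning.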
### Statistical Analysis
Rigorous statistical toolkit for evaluation results:
- Bootstrap confidence intervals and paired comparisons
- Wilson intervals for binary proportions, Bayesian posteriors
- McNemar's test, non-inferiority testing, effect sizes
- Classification metrics with CIs, confusion matrices
- Quality gates, drift detection, inter-rater reliability
- Conversation trajectory analysis and SOP compliance scoring
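Two of the staples above, sketched self-contained so the mechanics are visible (simplified illustrations, not Latent's implementations):

```python
import math
import random

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean: resample with replacement,
    then take the empirical alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binary pass rate -- better behaved
    than the normal approximation at small n or extreme rates."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

Wilson intervals matter for eval work because pass rates near 0 or 1 on small samples are exactly where naive intervals collapse or escape [0, 1].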
### Experiment Tracking
- MLflow integration -- automatic experiment creation, metric logging, and artifact storage
- Safe wrapper -- all MLflow calls are no-ops when tracking is disabled
- Flow-aware spans -- nested task execution traced automatically
## Next Steps
- Getting Started Guide -- full setup walkthrough and first evaluation pipeline
- Flow Building and Style -- project structure, colocation, and best practices
- Workspace Configuration -- configure paths and environment
- Examples -- real-world usage patterns
- API Reference -- complete module documentation