Latent

Opinionated framework for AI agent evaluation workflows.

Latent is a batteries-included Python platform for building, evaluating, and optimizing AI agents. It brings together flow orchestration, LLM-as-judge evaluation, guardrails, prompt optimization, RAG pipelines, and rigorous statistical analysis under one coherent API -- all wired to MLflow for experiment tracking out of the box.

Installation

pip install latent

Or with uv (recommended):

uv add latent

Optional extras unlock additional capabilities:

pip install latent[eval]        # Full eval platform (Prefect, pandas, scipy)
pip install latent[tracking]    # MLflow experiment tracking
pip install latent[rag]         # RAG + RAPTOR retrieval
pip install latent[optimizers]  # DSPy + ACE prompt optimization
pip install latent[guardrails]  # Input/output guardrail scanners
pip install latent[chat]        # Interactive TUI agent chat

Quick Example

Score free-text outputs with an LLM judge and get statistical confidence intervals:

import pandas as pd
from typing import Annotated

from latent.agents import Judge
from latent.agents.scores import ScoredModel, OrdinalScore
from latent.flows.judge_flow import judge_flow

class QAScores(ScoredModel):
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]
    faithfulness: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]

# Evaluation data: any DataFrame whose rows the judge should score
# (column names here are illustrative).
df = pd.DataFrame({
    "question": ["What does Latent do?"],
    "answer": ["It evaluates AI agents."],
})

judge = Judge("qa_scorer", model="gpt-4o", output_type=QAScores)
result = judge_flow(eval_data=df, judge=judge, gates={"quality": 3.5})

print(result["markdown"])  # Statistical report with bootstrap CIs

Platform Capabilities

Orchestration

  • @flow / @task decorators -- declarative evaluation pipelines with auto-config, caching, retries, and structured logging
  • Data catalog -- YAML-driven dataset management with schema validation and cross-flow references
  • Infrastructure CLI -- latent infra up starts PostgreSQL, Prefect, MLflow, and workers in one command
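
Latent layers this behavior on Prefect, and its actual decorator internals are not shown here. As a plain-Python sketch of what caching and retries buy you, here is a minimal, hypothetical task decorator (not Latent's implementation):

```python
import functools

def task(retries=0, cache=True):
    """Minimal sketch of a @task decorator: memoizes results and retries on failure."""
    def decorate(fn):
        store = {}
        @functools.wraps(fn)
        def wrapper(*args):
            if cache and args in store:
                return store[args]          # cached: skip re-execution
            for attempt in range(retries + 1):
                try:
                    result = fn(*args)
                    break
                except Exception:
                    if attempt == retries:  # retries exhausted
                        raise
            if cache:
                store[args] = result
            return result
        return wrapper
    return decorate

calls = {"n": 0}

@task(retries=2)
def flaky_double(x):
    calls["n"] += 1
    if calls["n"] == 1:   # fail once, then succeed
        raise RuntimeError("transient")
    return 2 * x

print(flaky_double(21))   # retried once, then cached
print(flaky_double(21))   # served from cache, no new call
```

A real orchestrator adds persistence, structured logging, and distributed workers on top of this same keep-the-result, retry-on-failure pattern.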

Eval Agents

  • Judge -- score any data row against a Pydantic output schema with auto-generated rubrics
  • Classifier -- semantic classification with accuracy, F1, precision, and recall metrics
  • ScoredModel -- composable score annotations (ordinal, binary, continuous) with automatic rationale fields
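
The score annotations follow the standard `typing.Annotated` pattern: metadata attached to a field type, read back at runtime. A hypothetical sketch of how a pass/fail gate can be derived from such annotations (names mirror the docs above, but this is not Latent's code):

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints

@dataclass(frozen=True)
class OrdinalScore:
    """Illustrative stand-in for a score annotation with a pass threshold."""
    scale: tuple
    pass_threshold: int

class Scores:
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]
    faithfulness: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]

def passes(model_cls, row):
    """Check each annotated field of a scored row against its pass threshold."""
    hints = get_type_hints(model_cls, include_extras=True)  # keep Annotated metadata
    return {
        name: row[name] >= hint.__metadata__[0].pass_threshold
        for name, hint in hints.items()
    }

print(passes(Scores, {"quality": 4, "faithfulness": 2}))
```

Because the thresholds live in the type annotations, the same schema drives both the judge's rubric and the downstream gate logic.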

Pre-built Eval Flows

Ready-to-use pipelines for common evaluation patterns:

Flow                         Purpose
judge_flow                   Score free-text outputs with an LLM judge
classification_flow          Classify and compute accuracy, F1, precision, and recall
comparison_flow              Compare model A vs. model B with statistical tests
conversation_scoring_flow    Turn-level conversation quality scoring
conversation_sop_flow        SOP compliance checking
drift_flow                   Distribution drift detection across runs
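
The statistics behind drift_flow are not specified on this page; as an illustration of the kind of check a drift flow performs, here is a population stability index (PSI) over two samples of ordinal scores (the 0.2 threshold is a common rule of thumb, not a Latent default):

```python
import math

def psi(expected, actual, bins=(1, 2, 3, 4, 5)):
    """Population Stability Index between two ordinal score samples.
    Rule of thumb: PSI > 0.2 signals meaningful distribution drift."""
    def dist(values):
        total = len(values)
        # small floor avoids log(0) for empty bins
        return [max(values.count(b) / total, 1e-6) for b in bins]
    e, a = dist(expected), dist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [3, 4, 4, 5, 3, 4, 5, 4]   # scores from a previous run
current  = [2, 2, 3, 2, 1, 3, 2, 2]   # scores shifted down in a new run
print(round(psi(baseline, current), 3))
```

Comparing score distributions run-over-run like this catches regressions that a single aggregate mean can hide.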

Guardrails

  • @guardrail decorator -- attach pre/post scanning rules directly to agent classes
  • Middleware pipeline -- composable GuardrailMiddleware wrapping any agent
  • Built-in scanners -- language detection, invisible text, token limits, LLM-backed custom rules
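
The middleware pattern itself is straightforward to sketch in plain Python. This hypothetical `Middleware` class (not Latent's `GuardrailMiddleware`) shows the composable pre/post scanning idea with a toy token-limit scanner:

```python
class GuardrailViolation(Exception):
    pass

def token_limit(max_tokens):
    """Input scanner: reject prompts over a whitespace-token budget."""
    def scan(text):
        if len(text.split()) > max_tokens:
            raise GuardrailViolation(f"input exceeds {max_tokens} tokens")
    return scan

class Middleware:
    """Wraps any callable agent with pre- and post-call scanners."""
    def __init__(self, agent, pre=(), post=()):
        self.agent, self.pre, self.post = agent, pre, post
    def __call__(self, prompt):
        for scan in self.pre:
            scan(prompt)              # raise before the agent ever runs
        output = self.agent(prompt)
        for scan in self.post:
            scan(output)              # raise before the output escapes
        return output

echo_agent = lambda prompt: prompt.upper()
guarded = Middleware(echo_agent, pre=[token_limit(5)])

print(guarded("hello there"))         # passes the scanner
try:
    guarded("one two three four five six")
except GuardrailViolation as e:
    print("blocked:", e)
```

Because scanners are plain callables, the same list can be attached per-agent via a decorator or shared across agents via middleware.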

Prompt Optimization

  • DSPyOptimizer -- automatic prompt and few-shot demo optimization via DSPy teleprompters
  • ACEOptimizer -- section-level prompt refinement using the ACE framework
  • AutoResearchOptimizer -- full codebase optimization with Claude Agent SDK and git-based keep/discard loops
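
All three optimizers share a propose-evaluate-keep/discard loop. The following toy sketch (with an invented mutation set and scoring function, not any of the optimizers above) shows the accept-or-revert pattern that the git-based loop generalizes:

```python
import random

def optimize(prompt, mutate, score, steps=20, seed=0):
    """Greedy keep/discard loop: propose a variant, keep it only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = prompt, score(prompt)
    for _ in range(steps):
        candidate = mutate(best, rng)
        s = score(candidate)
        if s > best_score:            # keep the change
            best, best_score = candidate, s
        # otherwise discard and continue from the current best
    return best, best_score

PHRASES = ["Be concise.", "Cite sources.", "Answer step by step."]

def mutate(prompt, rng):
    return prompt + " " + rng.choice(PHRASES)

def score(prompt):
    # toy objective: reward covering each phrase, lightly penalize length
    return sum(p in prompt for p in PHRASES) - 0.01 * len(prompt)

best, s = optimize("You are a helpful QA judge.", mutate, score)
print(best)
```

In a real optimizer the scorer is an eval flow and the mutation is an LLM edit, but the keep-if-better control flow is the same.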

RAG Pipelines

  • ChromaAdapter -- vector store integration with automatic embedding
  • HybridRetriever -- combined dense + BM25 sparse retrieval
  • RAGAgent -- config-driven agent with pluggable retrieval pipeline
  • RAPTOR -- recursive abstractive processing for tree-organized retrieval
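
Hybrid retrieval has to merge two differently-scored rankings; how HybridRetriever does so is not documented here, but reciprocal rank fusion is a standard way to combine a dense ranking with a BM25 ranking, sketched below:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of document ids.
    Each list contributes 1 / (k + rank), so documents ranked highly by
    either dense or sparse retrieval float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d2"]   # embedding-similarity order
sparse = ["d1", "d4", "d3"]   # BM25 order
print(rrf([dense, sparse]))   # ['d1', 'd3', 'd4', 'd2']
```

Rank fusion needs no score normalization, which is why it is a common default when dense and sparse scores live on incompatible scales.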

Statistical Analysis

Rigorous statistical toolkit for evaluation results:

  • Bootstrap confidence intervals and paired comparisons
  • Wilson intervals for binary proportions, Bayesian posteriors
  • McNemar's test, non-inferiority testing, effect sizes
  • Classification metrics with CIs, confusion matrices
  • Quality gates, drift detection, inter-rater reliability
  • Conversation trajectory analysis and SOP compliance scoring
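
As a concrete example of the first item, a percentile bootstrap confidence interval takes only a few lines of standard-library Python (a sketch of the method, not Latent's implementation):

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any statistic of a sample."""
    rng = random.Random(seed)
    n = len(values)
    # resample with replacement, compute the statistic each time, sort
    stats = sorted(stat([rng.choice(values) for _ in range(n)])
                   for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [3, 4, 4, 5, 2, 4, 3, 5, 4, 3]
lo, hi = bootstrap_ci(scores)
print(f"mean = {statistics.mean(scores):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Because the bootstrap makes no distributional assumptions, the same function works for means, medians, or pass rates by swapping the `stat` argument.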

Experiment Tracking

  • MLflow integration -- automatic experiment creation, metric logging, and artifact storage
  • Safe wrapper -- all MLflow calls are no-ops when tracking is disabled
  • Flow-aware spans -- nested task execution traced automatically
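
The safe-wrapper idea is a small proxy pattern: forward calls when tracking is enabled, swallow them otherwise. A hypothetical sketch with a fake client (not Latent's wrapper or the real MLflow API):

```python
class SafeTracker:
    """Proxy that forwards calls to a tracking client when enabled,
    and turns every call into a silent no-op when disabled."""
    def __init__(self, client=None, enabled=False):
        self._client = client if enabled else None

    def __getattr__(self, name):
        if self._client is None:
            return lambda *args, **kwargs: None   # no-op stand-in
        return getattr(self._client, name)

class FakeClient:
    """Stand-in for an MLflow-like client, for illustration only."""
    def __init__(self):
        self.metrics = {}
    def log_metric(self, key, value):
        self.metrics[key] = value

client = FakeClient()
on  = SafeTracker(client, enabled=True)
off = SafeTracker(client, enabled=False)

on.log_metric("accuracy", 0.91)    # forwarded to the client
off.log_metric("accuracy", 0.0)    # silently dropped
print(client.metrics)
```

Flow code can then log unconditionally; whether anything reaches the tracking server is a single configuration switch.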

Next Steps