Latent

Opinionated framework for AI agent evaluation workflows.

Latent is a batteries-included Python platform for building, evaluating, and optimizing AI agents. It brings together flow orchestration, LLM-as-judge evaluation, guardrails, prompt optimization, RAG pipelines, and rigorous statistical analysis under one coherent API -- all wired to MLflow for experiment tracking out of the box.

Installation

pip install latent

Or with uv (recommended):

uv add latent

Optional extras unlock additional capabilities:

pip install latent[eval]        # Full eval platform (Prefect, pandas, scipy)
pip install latent[tracking]    # MLflow experiment tracking
pip install latent[rag]         # RAG + RAPTOR retrieval
pip install latent[optimizers]  # DSPy + ACE prompt optimization
pip install latent[guardrails]  # Input/output guardrail scanners
pip install latent[chat]        # Interactive TUI agent chat

Quick Example

Score free-text outputs with an LLM judge and get statistical confidence intervals:

import pandas as pd
from typing import Annotated

from latent.agents import Judge
from latent.agents.scores import ScoredModel, OrdinalScore
from latent.flows.judge_flow import judge_flow

class QAScores(ScoredModel):
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]
    faithfulness: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]

# Evaluation data: any DataFrame whose rows the judge should score
# (column names here are illustrative).
df = pd.DataFrame({
    "question": ["What does Latent do?"],
    "answer": ["It evaluates AI agents."],
})

judge = Judge("qa_scorer", model="gpt-4o", output_type=QAScores)
result = judge_flow(eval_data=df, judge=judge, gates={"quality": 3.5})

print(result["markdown"])  # Statistical report with bootstrap CIs

Platform Capabilities

Orchestration

  • @flow / @task decorators -- declarative evaluation pipelines with auto-config, caching, retries, and structured logging
  • Data catalog -- YAML-driven dataset management with schema validation and cross-flow references
  • Infrastructure CLI -- latent infra up starts PostgreSQL, Prefect, MLflow, and workers in one command
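
Latent layers this behavior on Prefect, and its actual decorator internals are not shown here. As a plain-Python sketch of what caching and retries buy you, here is a minimal, hypothetical task decorator (not Latent's implementation):

```python
import functools

def task(retries=0, cache=True):
    """Minimal sketch of a @task decorator: memoizes results and retries on failure."""
    def decorate(fn):
        store = {}
        @functools.wraps(fn)
        def wrapper(*args):
            if cache and args in store:
                return store[args]          # cached: skip re-execution
            for attempt in range(retries + 1):
                try:
                    result = fn(*args)
                    break
                except Exception:
                    if attempt == retries:  # retries exhausted
                        raise
            if cache:
                store[args] = result
            return result
        return wrapper
    return decorate

calls = {"n": 0}

@task(retries=2)
def flaky_double(x):
    calls["n"] += 1
    if calls["n"] == 1:   # fail once, then succeed
        raise RuntimeError("transient")
    return 2 * x

print(flaky_double(21))   # retried once, then cached
print(flaky_double(21))   # served from cache, no new call
```

A real orchestrator adds persistence, structured logging, and distributed workers on top of this same keep-the-result, retry-on-failure pattern.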

Eval Agents

  • Judge -- score any data row against a Pydantic output schema with auto-generated rubrics
  • Classifier -- semantic classification with accuracy, F1, precision, and recall metrics
  • ScoredModel -- composable score annotations (ordinal, binary, continuous) with automatic rationale fields
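
The score annotations follow the standard `typing.Annotated` pattern: metadata attached to a field type, read back at runtime. A hypothetical sketch of how a pass/fail gate can be derived from such annotations (names mirror the docs above, but this is not Latent's code):

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints

@dataclass(frozen=True)
class OrdinalScore:
    """Illustrative stand-in for a score annotation with a pass threshold."""
    scale: tuple
    pass_threshold: int

class Scores:
    quality: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]
    faithfulness: Annotated[int, OrdinalScore(scale=(1, 2, 3, 4, 5), pass_threshold=3)]

def passes(model_cls, row):
    """Check each annotated field of a scored row against its pass threshold."""
    hints = get_type_hints(model_cls, include_extras=True)  # keep Annotated metadata
    return {
        name: row[name] >= hint.__metadata__[0].pass_threshold
        for name, hint in hints.items()
    }

print(passes(Scores, {"quality": 4, "faithfulness": 2}))
```

Because the thresholds live in the type annotations, the same schema drives both the judge's rubric and the downstream gate logic.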

Pre-built Eval Flows

Ready-to-use pipelines for common evaluation patterns:

Flow                         Purpose
judge_flow                   Score free-text outputs with an LLM judge
classification_flow          Classify and compute accuracy, F1, precision, and recall
comparison_flow              Compare model A vs. model B with statistical tests
conversation_scoring_flow    Turn-level conversation quality scoring
conversation_sop_flow        SOP compliance checking
drift_flow                   Distribution drift detection across runs
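
The statistics behind drift_flow are not specified on this page; as an illustration of the kind of check a drift flow performs, here is a population stability index (PSI) over two samples of ordinal scores (the 0.2 threshold is a common rule of thumb, not a Latent default):

```python
import math

def psi(expected, actual, bins=(1, 2, 3, 4, 5)):
    """Population Stability Index between two ordinal score samples.
    Rule of thumb: PSI > 0.2 signals meaningful distribution drift."""
    def dist(values):
        total = len(values)
        # small floor avoids log(0) for empty bins
        return [max(values.count(b) / total, 1e-6) for b in bins]
    e, a = dist(expected), dist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [3, 4, 4, 5, 3, 4, 5, 4]   # scores from a previous run
current  = [2, 2, 3, 2, 1, 3, 2, 2]   # scores shifted down in a new run
print(round(psi(baseline, current), 3))
```

Comparing score distributions run-over-run like this catches regressions that a single aggregate mean can hide.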

Guardrails

  • @guardrail decorator -- attach pre/post scanning rules directly to agent classes
  • Middleware pipeline -- composable GuardrailMiddleware wrapping any agent
  • Built-in scanners -- language detection, invisible text, token limits, LLM-backed custom rules
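
The middleware pattern itself is straightforward to sketch in plain Python. This hypothetical `Middleware` class (not Latent's `GuardrailMiddleware`) shows the composable pre/post scanning idea with a toy token-limit scanner:

```python
class GuardrailViolation(Exception):
    pass

def token_limit(max_tokens):
    """Input scanner: reject prompts over a whitespace-token budget."""
    def scan(text):
        if len(text.split()) > max_tokens:
            raise GuardrailViolation(f"input exceeds {max_tokens} tokens")
    return scan

class Middleware:
    """Wraps any callable agent with pre- and post-call scanners."""
    def __init__(self, agent, pre=(), post=()):
        self.agent, self.pre, self.post = agent, pre, post
    def __call__(self, prompt):
        for scan in self.pre:
            scan(prompt)              # raise before the agent ever runs
        output = self.agent(prompt)
        for scan in self.post:
            scan(output)              # raise before the output escapes
        return output

echo_agent = lambda prompt: prompt.upper()
guarded = Middleware(echo_agent, pre=[token_limit(5)])

print(guarded("hello there"))         # passes the scanner
try:
    guarded("one two three four five six")
except GuardrailViolation as e:
    print("blocked:", e)
```

Because scanners are plain callables, the same list can be attached per-agent via a decorator or shared across agents via middleware.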

Prompt Optimization

  • DSPyOptimizer -- automatic prompt and few-shot demo optimization via DSPy teleprompters
  • ACEOptimizer -- section-level prompt refinement using the ACE framework
  • AutoResearchOptimizer -- full codebase optimization with Claude Agent SDK and git-based keep/discard loops
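
All three optimizers share a propose-evaluate-keep/discard loop. The following toy sketch (with an invented mutation set and scoring function, not any of the optimizers above) shows the accept-or-revert pattern that the git-based loop generalizes:

```python
import random

def optimize(prompt, mutate, score, steps=20, seed=0):
    """Greedy keep/discard loop: propose a variant, keep it only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = prompt, score(prompt)
    for _ in range(steps):
        candidate = mutate(best, rng)
        s = score(candidate)
        if s > best_score:            # keep the change
            best, best_score = candidate, s
        # otherwise discard and continue from the current best
    return best, best_score

PHRASES = ["Be concise.", "Cite sources.", "Answer step by step."]

def mutate(prompt, rng):
    return prompt + " " + rng.choice(PHRASES)

def score(prompt):
    # toy objective: reward covering each phrase, lightly penalize length
    return sum(p in prompt for p in PHRASES) - 0.01 * len(prompt)

best, s = optimize("You are a helpful QA judge.", mutate, score)
print(best)
```

In a real optimizer the scorer is an eval flow and the mutation is an LLM edit, but the keep-if-better control flow is the same.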

RAG Pipelines

  • ChromaAdapter -- vector store integration with automatic embedding
  • HybridRetriever -- combined dense + BM25 sparse retrieval
  • RAGAgent -- config-driven agent with pluggable retrieval pipeline
  • RAPTOR -- recursive abstractive processing for tree-organized retrieval
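
Hybrid retrieval has to merge two differently-scored rankings; how HybridRetriever does so is not documented here, but reciprocal rank fusion is a standard way to combine a dense ranking with a BM25 ranking, sketched below:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of document ids.
    Each list contributes 1 / (k + rank), so documents ranked highly by
    either dense or sparse retrieval float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d2"]   # embedding-similarity order
sparse = ["d1", "d4", "d3"]   # BM25 order
print(rrf([dense, sparse]))   # ['d1', 'd3', 'd4', 'd2']
```

Rank fusion needs no score normalization, which is why it is a common default when dense and sparse scores live on incompatible scales.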

Statistical Analysis

Rigorous statistical toolkit for evaluation results:

  • Bootstrap confidence intervals and paired comparisons
  • Wilson intervals for binary proportions, Bayesian posteriors
  • McNemar's test, non-inferiority testing, effect sizes
  • Classification metrics with CIs, confusion matrices
  • Quality gates, drift detection, inter-rater reliability
  • Conversation trajectory analysis and SOP compliance scoring
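
As a concrete example of the first item, a percentile bootstrap confidence interval takes only a few lines of standard-library Python (a sketch of the method, not Latent's implementation):

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any statistic of a sample."""
    rng = random.Random(seed)
    n = len(values)
    # resample with replacement, compute the statistic each time, sort
    stats = sorted(stat([rng.choice(values) for _ in range(n)])
                   for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [3, 4, 4, 5, 2, 4, 3, 5, 4, 3]
lo, hi = bootstrap_ci(scores)
print(f"mean = {statistics.mean(scores):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Because the bootstrap makes no distributional assumptions, the same function works for means, medians, or pass rates by swapping the `stat` argument.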

Experiment Tracking

  • MLflow integration -- automatic experiment creation, metric logging, and artifact storage
  • Safe wrapper -- all MLflow calls are no-ops when tracking is disabled
  • Flow-aware spans -- nested task execution traced automatically
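
The safe-wrapper idea is a small proxy pattern: forward calls when tracking is enabled, swallow them otherwise. A hypothetical sketch with a fake client (not Latent's wrapper or the real MLflow API):

```python
class SafeTracker:
    """Proxy that forwards calls to a tracking client when enabled,
    and turns every call into a silent no-op when disabled."""
    def __init__(self, client=None, enabled=False):
        self._client = client if enabled else None

    def __getattr__(self, name):
        if self._client is None:
            return lambda *args, **kwargs: None   # no-op stand-in
        return getattr(self._client, name)

class FakeClient:
    """Stand-in for an MLflow-like client, for illustration only."""
    def __init__(self):
        self.metrics = {}
    def log_metric(self, key, value):
        self.metrics[key] = value

client = FakeClient()
on  = SafeTracker(client, enabled=True)
off = SafeTracker(client, enabled=False)

on.log_metric("accuracy", 0.91)    # forwarded to the client
off.log_metric("accuracy", 0.0)    # silently dropped
print(client.metrics)
```

Flow code can then log unconditionally; whether anything reaches the tracking server is a single configuration switch.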

Next Steps