Skip to content

Context Engineering

Context engineering is the discipline of managing what goes into an LLM's context window -- and when. Token budgets are finite, attention is U-shaped (models attend strongly to the beginning and end of their context but weakly to the middle), and once you exceed the limit, performance doesn't degrade gracefully -- it falls off a cliff.

Latent provides a full toolkit for monitoring, compacting, and validating LLM context:

  • Static validators run every turn in under 5 ms -- no LLM calls.
  • Active compactors transform context when token budgets tighten.
  • Semantic validators use an LLM to audit context quality (periodic or offline).
  • Utility functions for token estimation, budget breakdown, and history compaction.
  • review_context() is an offline linter that runs all static validators and returns a structured report.

Quick Start

Two ways to add context checks to your agents: the @context_check decorator (on agent subclasses) or GuardrailMiddleware (programmatic wrapping).

from latent.agents import LiteLLMAgent
from latent.context import context_check
from latent.guardrails.scanners.context import (
    TokenBudgetAuditor,
    ObservationMaskingScanner,
)

class MyAgent(LiteLLMAgent):
    @context_check(timing="pre", outcome="passive", tier="static")
    def budget_audit(self):
        return TokenBudgetAuditor(model_limit=128_000)

    @context_check(timing="pre", outcome="active", tier="static")
    def auto_compact(self):
        return ObservationMaskingScanner(keep_last_n=3)
from latent.agents import LiteLLMAgent
from latent.guardrails import GuardrailMiddleware
from latent.guardrails.scanners.context import (
    TokenBudgetAuditor,
    CompactionScanner,
)

agent = LiteLLMAgent(name="assistant", model="gpt-4o")

wrapped = GuardrailMiddleware(
    agent,
    pre_scanners=[
        TokenBudgetAuditor(model_limit=128_000),
        CompactionScanner(
            trigger_utilization=0.8,
            target_utilization=0.6,
            model_limit=128_000,
        ),
    ],
)

@context_check Decorator

Marks a method on a BaseAgent subclass as a context engineering check. Works like @guardrail but adds tier and trigger parameters and emits ContextCheckViolation / ContextCheckEvent instead of guardrail events.

from latent.context import context_check

class MyAgent(LiteLLMAgent):
    @context_check(
        timing="pre",
        outcome="active",
        tier="static",
        trigger=0.8,
        message="Context too large, compacting.",
    )
    def budget_gate(self, messages):
        total = sum(len(m.get("content", "").encode("utf-8")) // 4 for m in messages)
        return total > 100_000  # True = violation

Parameters

Parameter Type Default Description
timing "pre" / "post" "pre" When to run: before or after generation
outcome "active" / "passive" "active" Active blocks/compacts; passive logs only
threshold float 0.5 Score threshold for float-returning rules
on_error "ignore" / "raise" "ignore" Error handling strategy
message str "Request blocked by context check." User-facing message when blocked
every int 0 Post rules: 0 = end-of-stream only, N = every N tokens
tier "static" / "semantic" "static" Classifies the check type for tracing
trigger float 0.5 Trigger threshold (stored on scanner for downstream use)

Signature Detection

The decorator infers the check type from the method signature:

Signature Type Behavior
(self) Factory Called once at init; must return a scanner instance
(self, messages: list) Pre inline Runs on each input (receives the full message array)
(self, messages, output) Post inline Runs on each output

Return Types

Return type Semantics
-> bool True = violation, False = pass
-> float Score compared against threshold; >= threshold = violation
-> ScanResult Full control -- pass through directly

Static Validators

Fast checks (<5 ms, no LLM calls) designed to run on every turn.

TokenBudgetAuditor

Checks overall context utilization against model limits. Reports severity at three thresholds.

from latent.guardrails.scanners.context import TokenBudgetAuditor

scanner = TokenBudgetAuditor(
    model_limit=200_000,    # context window size
    warn_pct=0.7,           # log warning above 70%
    compact_pct=0.8,        # fail (trigger compaction) above 80%
    critical_pct=0.9,       # critical severity above 90%
)
result = scanner.scan_messages(messages, tools)
# result["score"] = utilization ratio (0.0-1.0)
# result["metadata"]["severity"] = "ok" | "warn" | "compact" | "critical"
Parameter Type Default Description
model_limit int 200_000 Context window size in tokens
warn_pct float 0.7 Warning threshold
compact_pct float 0.8 Compaction trigger threshold
critical_pct float 0.9 Critical threshold

MiddleContentDetector

Flags critical instructions that fall in the attention trough (the middle 80% of the context, between the 10th and 90th percentile by character position). LLMs attend most strongly to the beginning and end.

from latent.guardrails.scanners.context import MiddleContentDetector

scanner = MiddleContentDetector(
    critical_patterns=["must", "never", "always", "constraint", "do not", "important"],
    min_total_tokens=4000,  # skip check for short contexts
)
result = scanner.scan_messages(messages)
# result["metadata"]["findings"] = [{"pattern": "must", "position_pct": 0.45, "message_role": "system"}]
Parameter Type Default Description
critical_patterns list[str] ["must", "never", "always", "constraint", "do not", "important"] Patterns to search for
min_total_tokens int 4000 Minimum context size before checking

HistoryBloatDetector

Fails if user/assistant history consumes more than max_history_pct of total tokens. A bloated history crowds out system prompts and tool definitions.

from latent.guardrails.scanners.context import HistoryBloatDetector

scanner = HistoryBloatDetector(max_history_pct=0.6)
result = scanner.scan_messages(messages, tools)
# result["metadata"]["history_pct"] = 0.72
Parameter Type Default Description
max_history_pct float 0.6 Maximum acceptable history proportion

KVCacheStabilityAuditor

Detects dynamic values in the system prompt that break KV-cache reuse across requests. Scans for ISO timestamps, UUIDs, session IDs, request counters, and version strings with patch components.

from latent.guardrails.scanners.context import KVCacheStabilityAuditor

scanner = KVCacheStabilityAuditor()
result = scanner.scan_messages(messages)
# result["metadata"]["cache_breakers"] = [{"pattern": "iso_timestamp", "matched_text": "2025-01-15T10:30:00"}]

Fix: move dynamic values to user messages

Instead of injecting timestamps or session IDs into the system prompt, pass them in a user message or tool result. The system prompt prefix then stays identical across requests, enabling KV-cache hits.

ToolDescriptionLinter

Validates tool definitions for quality. Checks each tool for:

  • Description exists and is substantial (>10 chars)
  • Description mentions what it returns
  • Parameter count is 8 or fewer
  • Name follows verb_noun convention
from latent.guardrails.scanners.context import ToolDescriptionLinter

scanner = ToolDescriptionLinter()
result = scanner.scan_messages(messages, tools)
# result["metadata"]["tool_findings"] = [{"tool_name": "getData", "issues": ["Name does not follow verb_noun convention"]}]

SystemPromptStructureAuditor

Checks system prompt structure for best practices:

  • Identity statement in the first 200 characters (e.g., "You are a...")
  • Edge-anchored constraints -- critical directives should appear in both the first 10% and last 10% of the prompt
  • Altitude consistency -- paragraphs should not mix high-level directives ("You must always...") with implementation details (code fences, URLs)
from latent.guardrails.scanners.context import SystemPromptStructureAuditor

scanner = SystemPromptStructureAuditor()
result = scanner.scan_messages(messages)
# result["metadata"]["issues"] = ["Critical constraints not edge-anchored (missing at end of prompt)"]

Active Compactors

Scanners that transform context by returning rewritten_messages on ScanResult. The middleware replaces the original message array with the rewritten version before forwarding to the agent.

ObservationMaskingScanner

Replaces old tool outputs with one-line summaries, keeping the most recent keep_last_n tool results intact.

from latent.guardrails.scanners.context import ObservationMaskingScanner

scanner = ObservationMaskingScanner(keep_last_n=3)
result = scanner.scan_messages(messages)
# Older tool outputs become: "[Tool output masked -- 1250 tokens]"
# result["rewritten_messages"] contains the compacted message array
Parameter Type Default Description
keep_last_n int 3 Number of recent tool outputs to preserve verbatim

CompactionScanner

Triggers full context compaction when token utilization exceeds a threshold. Keeps system messages intact, preserves the most recent non-system messages that fit within the target budget, and injects a summary of dropped messages.

from latent.guardrails.scanners.context import CompactionScanner

scanner = CompactionScanner(
    trigger_utilization=0.8,    # compact when above 80%
    target_utilization=0.6,     # compact down to 60%
    model_limit=200_000,
)
result = scanner.scan_messages(messages)
# result["metadata"]["tokens_before"] and result["metadata"]["tokens_after"]
Parameter Type Default Description
trigger_utilization float 0.8 Utilization ratio that triggers compaction
target_utilization float 0.6 Target utilization after compaction
model_limit int 200_000 Context window size in tokens

ToolOutputOffloadScanner

Saves large tool outputs to scratch files and replaces them with a summary and file path reference in the message array.

from latent.guardrails.scanners.context import ToolOutputOffloadScanner

scanner = ToolOutputOffloadScanner(
    max_output_tokens=2000,          # offload outputs larger than this
    scratch_dir="/tmp/agent_scratch", # where to save files
)
result = scanner.scan_messages(messages)
# Large outputs become: "[Output saved to /tmp/agent_scratch/search_0.txt. 5200 tokens. Summary: ...]"
Parameter Type Default Description
max_output_tokens int 2000 Token threshold for offloading
scratch_dir str \| None auto (temp dir) Directory for offloaded files

SummaryInjectionScanner

Injects a summary system message into long conversations. Activates after every_n_messages non-system messages and injects a one-per-conversation summary (idempotent -- skips if a summary already exists).

from latent.guardrails.scanners.context import SummaryInjectionScanner

scanner = SummaryInjectionScanner(every_n_messages=20)
result = scanner.scan_messages(messages)
# Injects a "[Conversation summary]" system message after existing system messages
Parameter Type Default Description
every_n_messages int 20 Minimum non-system messages before injecting

Semantic Validators

LLM-backed validators for deeper analysis. These make API calls and should be used periodically, offline, or in CI -- not on every turn.

PoisoningDetector

Detects hallucinated facts that re-enter the context. Compares tool outputs against assistant messages to find unverified claims being repeated.

from latent.guardrails.scanners.context import PoisoningDetector

scanner = PoisoningDetector(model="gpt-4o-mini")
result = scanner.scan_messages(messages)
# result["metadata"]["unverified_claims"] = 2
# result["metadata"]["examples"] = ["The API supports batch mode"]
Parameter Type Default Description
model str "gpt-4o-mini" Model for fact-verification

DistractionScorer

Scores each message for relevance to the current task objective. High distraction scores indicate off-topic content that could confuse the model.

from latent.guardrails.scanners.context import DistractionScorer

scanner = DistractionScorer(model="gpt-4o-mini")
result = scanner.scan_messages(messages)
# result["score"] = average distraction score (0.0 = relevant, 1.0 = off-topic)
# result["metadata"]["per_message_scores"] = [0.1, 0.0, 0.8, ...]
Parameter Type Default Description
model str "gpt-4o-mini" Model for relevance scoring

ContradictionDetector

Finds factual contradictions between system instructions and tool outputs. Returns severity-scored contradiction pairs.

from latent.guardrails.scanners.context import ContradictionDetector

scanner = ContradictionDetector(model="gpt-4o-mini")
result = scanner.scan_messages(messages)
# result["metadata"]["contradiction_pairs"] = [
#   {"system_claim": "...", "tool_claim": "...", "severity": 0.9}
# ]
Parameter Type Default Description
model str "gpt-4o-mini" Model for contradiction detection

CompressionQualityAuditor

Evaluates quality of compressed context using probe questions across four dimensions:

  • recall -- Can specific facts from earlier be recalled?
  • artifact -- Are references to files, URLs, and code still traceable?
  • continuation -- Is there enough context to continue coherently?
  • decision -- Can key decisions and their rationale be identified?
from latent.guardrails.scanners.context import CompressionQualityAuditor

scanner = CompressionQualityAuditor(
    model="gpt-4o-mini",
    probes=["recall", "artifact", "continuation", "decision"],
)
result = scanner.scan_messages(messages)
# result["metadata"]["dimension_scores"] = {"recall": 4, "artifact": 3, ...}
# result["metadata"]["rationale"] = {"recall": "Key facts are preserved...", ...}
Parameter Type Default Description
model str "gpt-4o-mini" Model for quality probing
probes list[str] ["recall", "artifact", "continuation", "decision"] Probe dimensions

Utility Functions

Lower-level functions for token estimation and context manipulation. All functions return new lists and never mutate input.

estimate_tokens

UTF-8 byte estimation: len(text.encode("utf-8")) // 4. Handles multilingual text better than character counting (Hebrew characters are 2 bytes each, so char-count would under-estimate by ~2x).

from latent.context import estimate_tokens

tokens = estimate_tokens("Hello, world!")  # ~3
tokens = estimate_tokens("shalom olam")  # correct for Hebrew

budget_breakdown

Compute per-component token allocation from a messages array.

from latent.context import budget_breakdown

breakdown = budget_breakdown(messages, tools, model_limit=128_000)
print(f"System: {breakdown.system_tokens}")
print(f"History: {breakdown.history_tokens}")
print(f"Tool defs: {breakdown.tool_def_tokens}")
print(f"Tool output: {breakdown.tool_output_tokens}")
print(f"Available: {breakdown.available_tokens}")
print(f"Utilization: {breakdown.utilization:.1%}")

mask_observations

Replace old tool-result messages with one-line summaries, keeping the last keep_last_n intact.

from latent.context import mask_observations

compacted = mask_observations(messages, keep_last_n=3)
# Older tool results become: "[Tool output summarized: 4500 chars from search_web]"

compact_history

Reduce message history to fit within a target token budget. Two strategies:

  • tool_results_first (default) -- mask tool outputs oldest-first, then drop oldest user/assistant pairs.
  • oldest_first -- drop oldest non-system messages first.

System messages are never dropped.

from latent.context import compact_history

compacted = compact_history(
    messages,
    target_tokens=50_000,
    strategy="tool_results_first",
)

anchored_summarize

Structured iterative summarization with anchored sections. Keeps system messages and the last keep_last_n messages intact. Middle messages are summarized into mandatory sections:

  • Session Intent
  • Files Modified
  • Decisions Made
  • Current State
  • Next Steps
from latent.context import anchored_summarize

compacted = anchored_summarize(
    messages,
    keep_last_n=5,
    sections=None,       # use defaults
    summary_fn=my_llm,   # optional LLM-backed summarizer
)

LLM-backed summarization

Pass a summary_fn(text) -> str for higher-quality summaries. Without it, a simple extractive approach is used (first and last lines of the middle block).

compact_diffs

Cap recent diffs in iterative context, compacting older ones to one-liners. Useful for optimization loops where diffs accumulate.

from latent.context import compact_diffs

compacted = compact_diffs(context_markdown, max_recent_full=3)
# Older diffs become: "iter 1: +15/-3 lines in agent.py, tools.py"

review_context() -- Offline Linter

The CI / offline entry point. Runs all six static validators against a message array and returns a structured ContextReport.

from latent.context import review_context

report = review_context(
    messages,
    tools=tool_definitions,
    model_limit=128_000,
    include_semantic=False,  # set True for LLM-backed checks
    model="gpt-4o-mini",     # model for semantic validators
)

print(report.has_critical)   # bool
print(report.has_warnings)   # bool
print(report.render_markdown())

for finding in report.findings:
    print(f"[{finding.severity}] {finding.scanner_name}: {finding.message}")
    if finding.suggestion:
        print(f"  Fix: {finding.suggestion}")

ContextReport

Field Type Description
findings list[Finding] All findings from scanners
budget BudgetBreakdown \| None Token budget breakdown
has_critical bool Any critical-severity findings
has_warnings bool Any warning-severity findings

Finding

Field Type Description
scanner_name str Which scanner produced the finding
severity "info" / "warning" / "critical" Severity level
message str Human-readable description
suggestion str Remediation hint
metadata dict Scanner-specific data

CI Integration

Use review_context() in your test suite to catch context issues before deployment:

import pytest
from latent.context import review_context

def test_agent_context_quality():
    messages = build_test_messages()
    tools = get_agent_tools()

    report = review_context(messages, tools, model_limit=128_000)

    assert not report.has_critical, report.render_markdown()
    for finding in report.findings:
        assert finding.severity != "critical", finding.message

Events and Tracing

Context checks emit their own event types, distinct from guardrail events, so traces are clear about what triggered a check.

ContextCheckViolation

Yielded in the agent event stream when a context check fires. Extends AgentEvent.

Field Type Description
check_name str Name of the check
timing "pre" / "post" When it fired
outcome "active" / "passive" Whether it blocks
tier "static" / "semantic" Check classification
score float Check score
message str User-facing message
tokens_before int \| None Token count before compaction
tokens_after int \| None Token count after compaction
findings list[dict] Detailed findings

ContextCheckEvent

Emitted to sinks (logging/tracing backends). Carries timing, latency, and metadata for observability.

Field Type Description
event_type str "check_result" / "compaction" / "warning" / "critical"
check_name str Name of the check
tier "static" / "semantic" Check classification
score float \| None Check score
passed bool \| None Whether the check passed
tokens_before int \| None Token count before
tokens_after int \| None Token count after
latency_ms float \| None Check execution time

Consuming Events

from latent.agents.events import TextDelta
from latent.context.events import ContextCheckViolation

async for event in agent.stream(messages):
    if isinstance(event, ContextCheckViolation):
        if event.outcome == "active":
            print(f"[CONTEXT] {event.check_name}: {event.message}")
            if event.tokens_before and event.tokens_after:
                print(f"  Compacted: {event.tokens_before:,} -> {event.tokens_after:,} tokens")
        else:
            print(f"[CONTEXT WARNING] {event.check_name}: {event.message}")
    elif isinstance(event, TextDelta):
        print(event.text, end="", flush=True)

Production Architecture

A recommended layering for production agents:

Every turn (<5 ms)     Threshold-triggered        Periodic / CI
---------------------  -------------------------  -------------------------
TokenBudgetAuditor     CompactionScanner          PoisoningDetector
MiddleContentDetector  ObservationMaskingScanner   DistractionScorer
HistoryBloatDetector   ToolOutputOffloadScanner    ContradictionDetector
KVCacheStabilityAuditor SummaryInjectionScanner    CompressionQualityAuditor
ToolDescriptionLinter
SystemPromptStructureAuditor

Static validators are passive -- they log findings but don't block. Run them on every turn for continuous monitoring.

Active compactors fire only when utilization crosses a threshold. Set them as active/pre scanners so they rewrite the message array before the LLM call.

Semantic validators are expensive (LLM calls). Run them offline, in CI, or on a schedule -- not in the hot path.

from latent.agents import LiteLLMAgent
from latent.context import context_check
from latent.guardrails.scanners.context import (
    TokenBudgetAuditor,
    MiddleContentDetector,
    HistoryBloatDetector,
    KVCacheStabilityAuditor,
    ObservationMaskingScanner,
    CompactionScanner,
)

class ProductionAgent(LiteLLMAgent):
    # --- Static validators (passive, every turn) ---
    @context_check(timing="pre", outcome="passive", tier="static")
    def budget_monitor(self):
        return TokenBudgetAuditor(model_limit=128_000)

    @context_check(timing="pre", outcome="passive", tier="static")
    def middle_content(self):
        return MiddleContentDetector()

    @context_check(timing="pre", outcome="passive", tier="static")
    def history_bloat(self):
        return HistoryBloatDetector(max_history_pct=0.6)

    @context_check(timing="pre", outcome="passive", tier="static")
    def kv_cache(self):
        return KVCacheStabilityAuditor()

    # --- Active compactors (trigger at threshold) ---
    @context_check(timing="pre", outcome="active", tier="static")
    def auto_mask(self):
        return ObservationMaskingScanner(keep_last_n=5)

    @context_check(timing="pre", outcome="active", tier="static")
    def auto_compact(self):
        return CompactionScanner(
            trigger_utilization=0.8,
            target_utilization=0.6,
            model_limit=128_000,
        )

Integration with Agent Types

LiteLLMAgent

Use @context_check decorators on a subclass (shown above) or wrap with GuardrailMiddleware.

PipelineAgent

Pipeline agents use the same GuardrailMiddleware infrastructure. Context checks run before each phase's LLM call:

from latent.agents.pipeline import PipelineAgent, phase
from latent.context import context_check
from latent.guardrails.scanners.context import TokenBudgetAuditor

class MyPipeline(PipelineAgent):
    @context_check(timing="pre", outcome="passive", tier="static")
    def budget_check(self):
        return TokenBudgetAuditor(model_limit=128_000)

    @phase("classify")
    async def classify(self, state):
        ...

Programmatic Wrapping

For agents you don't control, wrap them:

from latent.guardrails import GuardrailMiddleware
from latent.guardrails.scanners.context import (
    TokenBudgetAuditor,
    ObservationMaskingScanner,
    CompactionScanner,
)

wrapped = GuardrailMiddleware(
    third_party_agent,
    pre_scanners=[
        TokenBudgetAuditor(model_limit=128_000),
        ObservationMaskingScanner(keep_last_n=3),
        CompactionScanner(
            trigger_utilization=0.8,
            target_utilization=0.6,
            model_limit=128_000,
        ),
    ],
)

Scanner ordering matters

Pre-scanners run in order. Place validators before compactors so monitoring metrics reflect the pre-compaction state.