Context Engineering¶

Context engineering is the discipline of managing what goes into an LLM's context window -- and when. Token budgets are finite, attention is U-shaped (models attend strongly to the beginning and end of their context but weakly to the middle), and once you exceed the limit, performance doesn't degrade gracefully -- it falls off a cliff.

Latent provides a full toolkit for monitoring, compacting, and validating LLM context:

Static validators run every turn in under 5 ms -- no LLM calls.
Active compactors transform context when token budgets tighten.
Semantic validators are defined as scanner classes that use an LLM to audit context quality. They can be run directly (instantiate and await scanner.scan_messages(...)), but are not yet wired into review_context() — see the note below.
Utility functions for token estimation, budget breakdown, and history compaction.
review_context() is an offline linter that runs all static validators and returns a structured report.

Semantic scanners are not run by review_context()

review_context(..., include_semantic=True) does not currently execute the semantic (LLM-backed) scanners. The _run_semantic_scanners hook is a stub that returns [] ("not yet implemented") — only the static validators run. The semantic scanner classes documented below are fully functional when called directly, but passing include_semantic=True is currently a no-op.

Quick Start¶

Two ways to add context checks to your agents: the @context_check decorator (on agent subclasses) or GuardrailMiddleware (programmatic wrapping).

DecoratorMiddleware

from latent.agents import ReActAgent
from latent.context import context_check
from latent.guardrails.scanners.context import (
    TokenBudgetAuditor,
    ObservationMaskingScanner,
)

class MyAgent(ReActAgent):
    @context_check(timing="pre", outcome="passive", tier="static")
    def budget_audit(self):
        return TokenBudgetAuditor(model_limit=128_000)

    @context_check(timing="pre", outcome="active", tier="static")
    def auto_compact(self):
        return ObservationMaskingScanner(keep_last_n=3)

from latent.agents import ReActAgent
from latent.guardrails import GuardrailMiddleware
from latent.guardrails.scanners.context import (
    TokenBudgetAuditor,
    CompactionScanner,
)

agent = ReActAgent(name="assistant", model="gpt-4o")

wrapped = GuardrailMiddleware(
    agent,
    pre_scanners=[
        TokenBudgetAuditor(model_limit=128_000),
        CompactionScanner(
            trigger_utilization=0.8,
            target_utilization=0.6,
            model_limit=128_000,
        ),
    ],
)

`@context_check` Decorator¶

Marks a method on a BaseAgent subclass as a context engineering check. Works like @guardrail but adds tier and trigger parameters and emits ContextCheckViolation / ContextCheckEvent instead of guardrail events.

from latent.context import context_check

class MyAgent(ReActAgent):
    @context_check(
        timing="pre",
        outcome="active",
        tier="static",
        trigger=0.8,
        message="Context too large, compacting.",
    )
    async def budget_gate(self, messages):
        total = sum(len(m.get("content", "").encode("utf-8")) // 4 for m in messages)
        return total > 100_000  # True = violation

Parameters¶

Parameter	Type	Default	Description
`timing`	`"pre"` / `"post"`	`"pre"`	When to run: before or after generation
`outcome`	`"active"` / `"passive"`	`"active"`	Active blocks/compacts; passive logs only
`threshold`	`float`	`0.5`	Score threshold for float-returning rules
`on_error`	`"ignore"` / `"raise"`	`"ignore"`	Error handling strategy
`message`	`str`	`"Request blocked by context check."`	User-facing message when blocked
`every`	`int`	`0`	Post rules: 0 = end-of-stream only, N = every N streamed chunks (TextDelta events)
`tier`	`"static"` / `"semantic"`	`"static"`	Classifies the check type for tracing
`trigger`	`float`	`0.5`	Trigger threshold (stored on scanner for downstream use)

Signature Detection¶

The decorator infers the check type from the method signature:

Signature	Type	Behavior
`(self)`	Factory	Called once at init; must return a scanner instance
`(self, messages: list)`	Pre inline	Runs on each input (receives the full message array)
`(self, messages, output)`	Post inline	Runs on each output

Return Types¶

Return type	Semantics
`-> bool`	`True` = violation, `False` = pass
`-> float`	Score compared against `threshold`; `>= threshold` = violation
`-> ScanResult`	Full control -- pass through directly

Static Validators¶

Fast checks (<5 ms, no LLM calls) designed to run on every turn.

`TokenBudgetAuditor`¶

Checks overall context utilization against model limits. Reports severity at three thresholds.

from latent.guardrails.scanners.context import TokenBudgetAuditor

scanner = TokenBudgetAuditor(
    model_limit=200_000,    # context window size
    warn_pct=0.7,           # log warning above 70%
    compact_pct=0.8,        # fail (trigger compaction) above 80%
    critical_pct=0.9,       # critical severity above 90%
)
result = await scanner.scan_messages(messages, tools)
# result["score"] = utilization ratio (0.0-1.0)
# result["metadata"]["severity"] = "ok" | "warn" | "compact" | "critical"

Parameter	Type	Default	Description
`model_limit`	`int`	`200_000`	Context window size in tokens
`warn_pct`	`float`	`0.7`	Warning threshold
`compact_pct`	`float`	`0.8`	Compaction trigger threshold
`critical_pct`	`float`	`0.9`	Critical threshold

`MiddleContentDetector`¶

Flags critical instructions that fall in the attention trough (the middle 80% of the context, between the 10th and 90th percentile by character position). LLMs attend most strongly to the beginning and end.

from latent.guardrails.scanners.context import MiddleContentDetector

scanner = MiddleContentDetector(
    critical_patterns=["must", "never", "always", "constraint", "do not", "important"],
    min_total_tokens=4000,  # skip check for short contexts
)
result = await scanner.scan_messages(messages)
# result["metadata"]["findings"] = [{"pattern": "must", "position_pct": 0.45, "message_role": "system"}]

Parameter	Type	Default	Description
`critical_patterns`	`list[str]`	`["must", "never", "always", "constraint", "do not", "important"]`	Patterns to search for
`min_total_tokens`	`int`	`4000`	Minimum context size before checking

`HistoryBloatDetector`¶

Fails if user/assistant history consumes more than max_history_pct of total tokens. A bloated history crowds out system prompts and tool definitions.

from latent.guardrails.scanners.context import HistoryBloatDetector

scanner = HistoryBloatDetector(max_history_pct=0.6)
result = await scanner.scan_messages(messages, tools)
# result["metadata"]["history_pct"] = 0.72

Parameter	Type	Default	Description
`max_history_pct`	`float`	`0.6`	Maximum acceptable history proportion

`KVCacheStabilityAuditor`¶

Detects dynamic values in the system prompt that break KV-cache reuse across requests. Scans for ISO timestamps, UUIDs, session IDs, request counters, and version strings with patch components.

from latent.guardrails.scanners.context import KVCacheStabilityAuditor

scanner = KVCacheStabilityAuditor()
result = await scanner.scan_messages(messages)
# result["metadata"]["cache_breakers"] = [{"pattern": "iso_timestamp", "matched_text": "2025-01-15T10:30:00"}]

Fix: move dynamic values to user messages

Instead of injecting timestamps or session IDs into the system prompt, pass them in a user message or tool result. The system prompt prefix then stays identical across requests, enabling KV-cache hits.

`ToolDescriptionLinter`¶

Validates tool definitions for quality. Checks each tool for:

Description exists and is substantial (>10 chars)
Description mentions what it returns
Parameter count is 8 or fewer
Name follows verb_noun convention

from latent.guardrails.scanners.context import ToolDescriptionLinter

scanner = ToolDescriptionLinter()
result = await scanner.scan_messages(messages, tools)
# result["metadata"]["tool_findings"] = [{"tool_name": "getData", "issues": ["Name does not follow verb_noun convention"]}]

`SystemPromptStructureAuditor`¶

Checks system prompt structure for best practices:

Identity statement in the first 200 characters (e.g., "You are a...")
Edge-anchored constraints -- critical directives should appear in both the first 10% and last 10% of the prompt
Altitude consistency -- paragraphs should not mix high-level directives ("You must always...") with implementation details (code fences, URLs)

from latent.guardrails.scanners.context import SystemPromptStructureAuditor

scanner = SystemPromptStructureAuditor()
result = await scanner.scan_messages(messages)
# result["metadata"]["issues"] = ["Critical constraints not edge-anchored (missing at end of prompt)"]

Active Compactors¶

Scanners that transform context by returning rewritten_messages on ScanResult. The middleware replaces the original message array with the rewritten version before forwarding to the agent.

`ObservationMaskingScanner`¶

Replaces old tool outputs with one-line summaries, keeping the most recent keep_last_n tool results intact.

from latent.guardrails.scanners.context import ObservationMaskingScanner

scanner = ObservationMaskingScanner(keep_last_n=3)
result = await scanner.scan_messages(messages)
# Older tool outputs become: "[Tool output masked -- 1250 tokens]"
# result["rewritten_messages"] contains the compacted message array

Parameter	Type	Default	Description
`keep_last_n`	`int`	`3`	Number of recent tool outputs to preserve verbatim

`CompactionScanner`¶

Triggers full context compaction when token utilization exceeds a threshold. Keeps system messages intact, preserves the most recent non-system messages that fit within the target budget, and injects a summary of dropped messages.

from latent.guardrails.scanners.context import CompactionScanner

scanner = CompactionScanner(
    trigger_utilization=0.8,    # compact when above 80%
    target_utilization=0.6,     # compact down to 60%
    model_limit=200_000,
)
result = await scanner.scan_messages(messages)
# result["metadata"]["tokens_before"] and result["metadata"]["tokens_after"]

Parameter	Type	Default	Description
`trigger_utilization`	`float`	`0.8`	Utilization ratio that triggers compaction
`target_utilization`	`float`	`0.6`	Target utilization after compaction
`model_limit`	`int`	`200_000`	Context window size in tokens

`ToolOutputOffloadScanner`¶

Saves large tool outputs to scratch files and replaces them with a summary and file path reference in the message array.

from latent.guardrails.scanners.context import ToolOutputOffloadScanner

scanner = ToolOutputOffloadScanner(
    max_output_tokens=2000,          # offload outputs larger than this
    scratch_dir="/tmp/agent_scratch", # where to save files
)
result = await scanner.scan_messages(messages)
# Large outputs become: "[Output saved to /tmp/agent_scratch/search_0.txt. 5200 tokens. Summary: ...]"

Parameter	Type	Default	Description
`max_output_tokens`	`int`	`2000`	Token threshold for offloading
`scratch_dir`	`str \\| None`	auto (temp dir)	Directory for offloaded files

`SummaryInjectionScanner`¶

Injects a summary system message into long conversations. Activates after every_n_messages non-system messages and injects a one-per-conversation summary (idempotent -- skips if a summary already exists).

from latent.guardrails.scanners.context import SummaryInjectionScanner

scanner = SummaryInjectionScanner(every_n_messages=20)
result = await scanner.scan_messages(messages)
# Injects a "[Conversation summary]" system message after existing system messages

Parameter	Type	Default	Description
`every_n_messages`	`int`	`20`	Minimum non-system messages before injecting

Semantic Validators¶

LLM-backed validators for deeper analysis. These make API calls and should be used periodically, offline, or in CI -- not on every turn.

Not wired into review_context() yet

These scanner classes exist and work when invoked directly (await scanner.scan_messages(messages)), but review_context(..., include_semantic=True) does not run them — the _run_semantic_scanners hook is currently a stub that returns []. To use them today, instantiate and call them yourself.

`PoisoningDetector`¶

Detects hallucinated facts that re-enter the context. Compares tool outputs against assistant messages to find unverified claims being repeated.

from latent.guardrails.scanners.context import PoisoningDetector

scanner = PoisoningDetector(model="gpt-4o-mini")
result = await scanner.scan_messages(messages)
# result["metadata"]["unverified_claims"] = 2
# result["metadata"]["examples"] = ["The API supports batch mode"]

Parameter	Type	Default	Description
`model`	`str`	`"gpt-4o-mini"`	Model for fact-verification

`DistractionScorer`¶

Scores each message for relevance to the current task objective. High distraction scores indicate off-topic content that could confuse the model.

from latent.guardrails.scanners.context import DistractionScorer

scanner = DistractionScorer(model="gpt-4o-mini")
result = await scanner.scan_messages(messages)
# result["score"] = average distraction score (0.0 = relevant, 1.0 = off-topic)
# result["metadata"]["per_message_scores"] = [0.1, 0.0, 0.8, ...]

Parameter	Type	Default	Description
`model`	`str`	`"gpt-4o-mini"`	Model for relevance scoring

`ContradictionDetector`¶

Finds factual contradictions between system instructions and tool outputs. Returns severity-scored contradiction pairs.

from latent.guardrails.scanners.context import ContradictionDetector

scanner = ContradictionDetector(model="gpt-4o-mini")
result = await scanner.scan_messages(messages)
# result["metadata"]["contradiction_pairs"] = [
#   {"system_claim": "...", "tool_claim": "...", "severity": 0.9}
# ]

Parameter	Type	Default	Description
`model`	`str`	`"gpt-4o-mini"`	Model for contradiction detection

`CompressionQualityAuditor`¶

Evaluates quality of compressed context using probe questions across four dimensions:

recall -- Can specific facts from earlier be recalled?
artifact -- Are references to files, URLs, and code still traceable?
continuation -- Is there enough context to continue coherently?
decision -- Can key decisions and their rationale be identified?

from latent.guardrails.scanners.context import CompressionQualityAuditor

scanner = CompressionQualityAuditor(
    model="gpt-4o-mini",
    probes=["recall", "artifact", "continuation", "decision"],
)
result = await scanner.scan_messages(messages)
# result["metadata"]["dimension_scores"] = {"recall": 4, "artifact": 3, ...}
# result["metadata"]["rationale"] = {"recall": "Key facts are preserved...", ...}

Parameter	Type	Default	Description
`model`	`str`	`"gpt-4o-mini"`	Model for quality probing
`probes`	`list[str]`	`["recall", "artifact", "continuation", "decision"]`	Probe dimensions

Utility Functions¶

Lower-level functions for token estimation and context manipulation. All functions return new lists and never mutate input.

`estimate_tokens`¶

UTF-8 byte estimation: len(text.encode("utf-8")) // 4. Handles multilingual text better than character counting (Hebrew characters are 2 bytes each, so char-count would under-estimate by ~2x).

from latent.context import estimate_tokens

tokens = estimate_tokens("Hello, world!")  # ~3
tokens = estimate_tokens("shalom olam")  # correct for Hebrew

`budget_breakdown`¶

Compute per-component token allocation from a messages array.

from latent.context import budget_breakdown

breakdown = budget_breakdown(messages, tools, model_limit=128_000)
print(f"System: {breakdown.system_tokens}")
print(f"History: {breakdown.history_tokens}")
print(f"Tool defs: {breakdown.tool_def_tokens}")
print(f"Tool output: {breakdown.tool_output_tokens}")
print(f"Available: {breakdown.available_tokens}")
print(f"Utilization: {breakdown.utilization:.1%}")

`mask_observations`¶

Replace old tool-result messages with one-line summaries, keeping the last keep_last_n intact.

from latent.context import mask_observations

compacted = mask_observations(messages, keep_last_n=3)
# Older tool results become: "[Tool output summarized: 4500 chars from search_web]"

`compact_history`¶

Reduce message history to fit within a target token budget. Two strategies:

tool_results_first (default) -- mask tool outputs oldest-first, then drop oldest user/assistant pairs.
oldest_first -- drop oldest non-system messages first.

System messages are never dropped.

from latent.context import compact_history

compacted = compact_history(
    messages,
    target_tokens=50_000,
    strategy="tool_results_first",
)

`anchored_summarize`¶

Structured iterative summarization with anchored sections. Keeps system messages and the last keep_last_n messages intact. Middle messages are summarized into mandatory sections:

Session Intent
Files Modified
Decisions Made
Current State
Next Steps

from latent.context import anchored_summarize

compacted = anchored_summarize(
    messages,
    keep_last_n=5,
    sections=None,       # use defaults
    summary_fn=my_llm,   # optional LLM-backed summarizer
)

LLM-backed summarization

Pass a summary_fn(text) -> str for higher-quality summaries. Without it, a simple extractive approach is used (first and last lines of the middle block).

`compact_diffs`¶

Cap recent diffs in iterative context, compacting older ones to one-liners. Useful for optimization loops where diffs accumulate.

from latent.context import compact_diffs

compacted = compact_diffs(context_markdown, max_recent_full=3)
# Older diffs become: "iter 1: +15/-3 lines in agent.py, tools.py"

`review_context()` -- Offline Linter¶

The CI / offline entry point. Runs all six static validators against a message array and returns a structured ContextReport. review_context() is async def, so await it (or wrap in asyncio.run).

include_semantic is currently a no-op

include_semantic=True does not yet run the semantic (LLM-backed) scanners — the _run_semantic_scanners hook is a stub returning []. Only the six static validators run regardless of this flag.

from latent.context import review_context

report = await review_context(
    messages,
    tools=tool_definitions,
    model_limit=128_000,
    include_semantic=False,  # set True for LLM-backed checks (not yet wired in)
    model="gpt-4o-mini",     # model for semantic validators
)

print(report.has_critical)   # bool
print(report.has_warnings)   # bool
print(report.render_markdown())

for finding in report.findings:
    print(f"[{finding.severity}] {finding.scanner_name}: {finding.message}")
    if finding.suggestion:
        print(f"  Fix: {finding.suggestion}")

`ContextReport`¶

Field	Type	Description
`findings`	`list[Finding]`	All findings from scanners
`budget`	`BudgetBreakdown \\| None`	Token budget breakdown
`has_critical`	`bool`	Any critical-severity findings
`has_warnings`	`bool`	Any warning-severity findings

`Finding`¶

Field	Type	Description
`scanner_name`	`str`	Which scanner produced the finding
`severity`	`"info"` / `"warning"` / `"critical"`	Severity level
`message`	`str`	Human-readable description
`suggestion`	`str`	Remediation hint
`metadata`	`dict`	Scanner-specific data

CI Integration¶

Use review_context() in your test suite to catch context issues before deployment:

import pytest
from latent.context import review_context

@pytest.mark.asyncio
async def test_agent_context_quality():
    messages = build_test_messages()
    tools = get_agent_tools()

    report = await review_context(messages, tools, model_limit=128_000)

    assert not report.has_critical, report.render_markdown()
    for finding in report.findings:
        assert finding.severity != "critical", finding.message

Events and Tracing¶

Context checks emit their own event types, distinct from guardrail events, so traces are clear about what triggered a check.

`ContextCheckViolation`¶

Yielded in the agent event stream when a context check fires. Extends AgentEvent.

Field	Type	Description
`check_name`	`str`	Name of the check
`timing`	`"pre"` / `"post"`	When it fired
`outcome`	`"active"` / `"passive"`	Whether it blocks
`tier`	`"static"` / `"semantic"`	Check classification
`score`	`float`	Check score
`message`	`str`	User-facing message
`tokens_before`	`int \\| None`	Token count before compaction
`tokens_after`	`int \\| None`	Token count after compaction
`findings`	`list[dict]`	Detailed findings

`ContextCheckEvent`¶

Emitted to sinks (logging/tracing backends). Carries timing, latency, and metadata for observability.

Field	Type	Description
`event_type`	`str`	`"check_result"` / `"compaction"` / `"warning"` / `"critical"`
`check_name`	`str`	Name of the check
`tier`	`"static"` / `"semantic"`	Check classification
`score`	`float \\| None`	Check score
`passed`	`bool \\| None`	Whether the check passed
`tokens_before`	`int \\| None`	Token count before
`tokens_after`	`int \\| None`	Token count after
`latency_ms`	`float \\| None`	Check execution time

Consuming Events¶

from latent.agents.events import TextDelta
from latent.context.events import ContextCheckViolation

async for event in agent.stream(messages):
    if isinstance(event, ContextCheckViolation):
        if event.outcome == "active":
            print(f"[CONTEXT] {event.check_name}: {event.message}")
            if event.tokens_before and event.tokens_after:
                print(f"  Compacted: {event.tokens_before:,} -> {event.tokens_after:,} tokens")
        else:
            print(f"[CONTEXT WARNING] {event.check_name}: {event.message}")
    elif isinstance(event, TextDelta):
        print(event.text, end="", flush=True)

Production Architecture¶

A recommended layering for production agents:

Every turn (<5 ms)     Threshold-triggered        Periodic / CI
---------------------  -------------------------  -------------------------
TokenBudgetAuditor     CompactionScanner          PoisoningDetector
MiddleContentDetector  ObservationMaskingScanner   DistractionScorer
HistoryBloatDetector   ToolOutputOffloadScanner    ContradictionDetector
KVCacheStabilityAuditor SummaryInjectionScanner    CompressionQualityAuditor
ToolDescriptionLinter
SystemPromptStructureAuditor

Static validators are passive -- they log findings but don't block. Run them on every turn for continuous monitoring.

Active compactors fire only when utilization crosses a threshold. Set them as active/pre scanners so they rewrite the message array before the LLM call.

Semantic validators are expensive (LLM calls). Run them offline, in CI, or on a schedule -- not in the hot path. Note they are not yet executed by review_context(); invoke them directly (await scanner.scan_messages(...)) until they are wired in.

from latent.agents import ReActAgent
from latent.context import context_check
from latent.guardrails.scanners.context import (
    TokenBudgetAuditor,
    MiddleContentDetector,
    HistoryBloatDetector,
    KVCacheStabilityAuditor,
    ObservationMaskingScanner,
    CompactionScanner,
)

class ProductionAgent(ReActAgent):
    # --- Static validators (passive, every turn) ---
    @context_check(timing="pre", outcome="passive", tier="static")
    def budget_monitor(self):
        return TokenBudgetAuditor(model_limit=128_000)

    @context_check(timing="pre", outcome="passive", tier="static")
    def middle_content(self):
        return MiddleContentDetector()

    @context_check(timing="pre", outcome="passive", tier="static")
    def history_bloat(self):
        return HistoryBloatDetector(max_history_pct=0.6)

    @context_check(timing="pre", outcome="passive", tier="static")
    def kv_cache(self):
        return KVCacheStabilityAuditor()

    # --- Active compactors (trigger at threshold) ---
    @context_check(timing="pre", outcome="active", tier="static")
    def auto_mask(self):
        return ObservationMaskingScanner(keep_last_n=5)

    @context_check(timing="pre", outcome="active", tier="static")
    def auto_compact(self):
        return CompactionScanner(
            trigger_utilization=0.8,
            target_utilization=0.6,
            model_limit=128_000,
        )

Integration with Agent Types¶

ReActAgent¶

Use @context_check decorators on a subclass (shown above) or wrap with GuardrailMiddleware.

PipelineAgent¶

Pipeline agents use the same GuardrailMiddleware infrastructure. Context checks run before each phase's LLM call:

from latent.agents.pipeline import PipelineAgent, phase
from latent.context import context_check
from latent.guardrails.scanners.context import TokenBudgetAuditor

class MyPipeline(PipelineAgent):
    @context_check(timing="pre", outcome="passive", tier="static")
    def budget_check(self):
        return TokenBudgetAuditor(model_limit=128_000)

    @phase("classify")
    async def classify(self, state):
        ...

Programmatic Wrapping¶

For agents you don't control, wrap them:

from latent.guardrails import GuardrailMiddleware
from latent.guardrails.scanners.context import (
    TokenBudgetAuditor,
    ObservationMaskingScanner,
    CompactionScanner,
)

wrapped = GuardrailMiddleware(
    third_party_agent,
    pre_scanners=[
        TokenBudgetAuditor(model_limit=128_000),
        ObservationMaskingScanner(keep_last_n=3),
        CompactionScanner(
            trigger_utilization=0.8,
            target_utilization=0.6,
            model_limit=128_000,
        ),
    ],
)

Scanner ordering matters

Pre-scanners run in order. Place validators before compactors so monitoring metrics reflect the pre-compaction state.

Context Engineering¶

Quick Start¶

@context_check Decorator¶

Parameters¶

Signature Detection¶

Return Types¶

Static Validators¶

TokenBudgetAuditor¶

MiddleContentDetector¶

HistoryBloatDetector¶

KVCacheStabilityAuditor¶

ToolDescriptionLinter¶

SystemPromptStructureAuditor¶

Active Compactors¶

ObservationMaskingScanner¶

CompactionScanner¶

ToolOutputOffloadScanner¶

SummaryInjectionScanner¶

Semantic Validators¶

PoisoningDetector¶

DistractionScorer¶

ContradictionDetector¶

CompressionQualityAuditor¶