Context Engineering¶
Context engineering is the discipline of managing what goes into an LLM's context window -- and when. Token budgets are finite, attention is U-shaped (models attend strongly to the beginning and end of their context but weakly to the middle), and once you exceed the limit, performance doesn't degrade gracefully -- it falls off a cliff.
Latent provides a full toolkit for monitoring, compacting, and validating LLM context:
- Static validators run every turn in under 5 ms -- no LLM calls.
- Active compactors transform context when token budgets tighten.
- Semantic validators use an LLM to audit context quality (periodic or offline).
- Utility functions for token estimation, budget breakdown, and history compaction.
review_context()is an offline linter that runs all static validators and returns a structured report.
Quick Start¶
Two ways to add context checks to your agents: the @context_check decorator (on agent subclasses) or GuardrailMiddleware (programmatic wrapping).
from latent.agents import LiteLLMAgent
from latent.context import context_check
from latent.guardrails.scanners.context import (
TokenBudgetAuditor,
ObservationMaskingScanner,
)
class MyAgent(LiteLLMAgent):
@context_check(timing="pre", outcome="passive", tier="static")
def budget_audit(self):
return TokenBudgetAuditor(model_limit=128_000)
@context_check(timing="pre", outcome="active", tier="static")
def auto_compact(self):
return ObservationMaskingScanner(keep_last_n=3)
from latent.agents import LiteLLMAgent
from latent.guardrails import GuardrailMiddleware
from latent.guardrails.scanners.context import (
TokenBudgetAuditor,
CompactionScanner,
)
agent = LiteLLMAgent(name="assistant", model="gpt-4o")
wrapped = GuardrailMiddleware(
agent,
pre_scanners=[
TokenBudgetAuditor(model_limit=128_000),
CompactionScanner(
trigger_utilization=0.8,
target_utilization=0.6,
model_limit=128_000,
),
],
)
@context_check Decorator¶
Marks a method on a BaseAgent subclass as a context engineering check. Works like @guardrail but adds tier and trigger parameters and emits ContextCheckViolation / ContextCheckEvent instead of guardrail events.
from latent.context import context_check
class MyAgent(LiteLLMAgent):
@context_check(
timing="pre",
outcome="active",
tier="static",
trigger=0.8,
message="Context too large, compacting.",
)
def budget_gate(self, messages):
total = sum(len(m.get("content", "").encode("utf-8")) // 4 for m in messages)
return total > 100_000 # True = violation
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
timing |
"pre" / "post" |
"pre" |
When to run: before or after generation |
outcome |
"active" / "passive" |
"active" |
Active blocks/compacts; passive logs only |
threshold |
float |
0.5 |
Score threshold for float-returning rules |
on_error |
"ignore" / "raise" |
"ignore" |
Error handling strategy |
message |
str |
"Request blocked by context check." |
User-facing message when blocked |
every |
int |
0 |
Post rules: 0 = end-of-stream only, N = every N tokens |
tier |
"static" / "semantic" |
"static" |
Classifies the check type for tracing |
trigger |
float |
0.5 |
Trigger threshold (stored on scanner for downstream use) |
Signature Detection¶
The decorator infers the check type from the method signature:
| Signature | Type | Behavior |
|---|---|---|
(self) |
Factory | Called once at init; must return a scanner instance |
(self, messages: list) |
Pre inline | Runs on each input (receives the full message array) |
(self, messages, output) |
Post inline | Runs on each output |
Return Types¶
| Return type | Semantics |
|---|---|
-> bool |
True = violation, False = pass |
-> float |
Score compared against threshold; >= threshold = violation |
-> ScanResult |
Full control -- pass through directly |
Static Validators¶
Fast checks (<5 ms, no LLM calls) designed to run on every turn.
TokenBudgetAuditor¶
Checks overall context utilization against model limits. Reports severity at three thresholds.
from latent.guardrails.scanners.context import TokenBudgetAuditor
scanner = TokenBudgetAuditor(
model_limit=200_000, # context window size
warn_pct=0.7, # log warning above 70%
compact_pct=0.8, # fail (trigger compaction) above 80%
critical_pct=0.9, # critical severity above 90%
)
result = scanner.scan_messages(messages, tools)
# result["score"] = utilization ratio (0.0-1.0)
# result["metadata"]["severity"] = "ok" | "warn" | "compact" | "critical"
| Parameter | Type | Default | Description |
|---|---|---|---|
model_limit |
int |
200_000 |
Context window size in tokens |
warn_pct |
float |
0.7 |
Warning threshold |
compact_pct |
float |
0.8 |
Compaction trigger threshold |
critical_pct |
float |
0.9 |
Critical threshold |
MiddleContentDetector¶
Flags critical instructions that fall in the attention trough (the middle 80% of the context, between the 10th and 90th percentile by character position). LLMs attend most strongly to the beginning and end.
from latent.guardrails.scanners.context import MiddleContentDetector
scanner = MiddleContentDetector(
critical_patterns=["must", "never", "always", "constraint", "do not", "important"],
min_total_tokens=4000, # skip check for short contexts
)
result = scanner.scan_messages(messages)
# result["metadata"]["findings"] = [{"pattern": "must", "position_pct": 0.45, "message_role": "system"}]
| Parameter | Type | Default | Description |
|---|---|---|---|
critical_patterns |
list[str] |
["must", "never", "always", "constraint", "do not", "important"] |
Patterns to search for |
min_total_tokens |
int |
4000 |
Minimum context size before checking |
HistoryBloatDetector¶
Fails if user/assistant history consumes more than max_history_pct of total tokens. A bloated history crowds out system prompts and tool definitions.
from latent.guardrails.scanners.context import HistoryBloatDetector
scanner = HistoryBloatDetector(max_history_pct=0.6)
result = scanner.scan_messages(messages, tools)
# result["metadata"]["history_pct"] = 0.72
| Parameter | Type | Default | Description |
|---|---|---|---|
max_history_pct |
float |
0.6 |
Maximum acceptable history proportion |
KVCacheStabilityAuditor¶
Detects dynamic values in the system prompt that break KV-cache reuse across requests. Scans for ISO timestamps, UUIDs, session IDs, request counters, and version strings with patch components.
from latent.guardrails.scanners.context import KVCacheStabilityAuditor
scanner = KVCacheStabilityAuditor()
result = scanner.scan_messages(messages)
# result["metadata"]["cache_breakers"] = [{"pattern": "iso_timestamp", "matched_text": "2025-01-15T10:30:00"}]
Fix: move dynamic values to user messages
Instead of injecting timestamps or session IDs into the system prompt, pass them in a user message or tool result. The system prompt prefix then stays identical across requests, enabling KV-cache hits.
ToolDescriptionLinter¶
Validates tool definitions for quality. Checks each tool for:
- Description exists and is substantial (>10 chars)
- Description mentions what it returns
- Parameter count is 8 or fewer
- Name follows
verb_nounconvention
from latent.guardrails.scanners.context import ToolDescriptionLinter
scanner = ToolDescriptionLinter()
result = scanner.scan_messages(messages, tools)
# result["metadata"]["tool_findings"] = [{"tool_name": "getData", "issues": ["Name does not follow verb_noun convention"]}]
SystemPromptStructureAuditor¶
Checks system prompt structure for best practices:
- Identity statement in the first 200 characters (e.g., "You are a...")
- Edge-anchored constraints -- critical directives should appear in both the first 10% and last 10% of the prompt
- Altitude consistency -- paragraphs should not mix high-level directives ("You must always...") with implementation details (code fences, URLs)
from latent.guardrails.scanners.context import SystemPromptStructureAuditor
scanner = SystemPromptStructureAuditor()
result = scanner.scan_messages(messages)
# result["metadata"]["issues"] = ["Critical constraints not edge-anchored (missing at end of prompt)"]
Active Compactors¶
Scanners that transform context by returning rewritten_messages on ScanResult. The middleware replaces the original message array with the rewritten version before forwarding to the agent.
ObservationMaskingScanner¶
Replaces old tool outputs with one-line summaries, keeping the most recent keep_last_n tool results intact.
from latent.guardrails.scanners.context import ObservationMaskingScanner
scanner = ObservationMaskingScanner(keep_last_n=3)
result = scanner.scan_messages(messages)
# Older tool outputs become: "[Tool output masked -- 1250 tokens]"
# result["rewritten_messages"] contains the compacted message array
| Parameter | Type | Default | Description |
|---|---|---|---|
keep_last_n |
int |
3 |
Number of recent tool outputs to preserve verbatim |
CompactionScanner¶
Triggers full context compaction when token utilization exceeds a threshold. Keeps system messages intact, preserves the most recent non-system messages that fit within the target budget, and injects a summary of dropped messages.
from latent.guardrails.scanners.context import CompactionScanner
scanner = CompactionScanner(
trigger_utilization=0.8, # compact when above 80%
target_utilization=0.6, # compact down to 60%
model_limit=200_000,
)
result = scanner.scan_messages(messages)
# result["metadata"]["tokens_before"] and result["metadata"]["tokens_after"]
| Parameter | Type | Default | Description |
|---|---|---|---|
trigger_utilization |
float |
0.8 |
Utilization ratio that triggers compaction |
target_utilization |
float |
0.6 |
Target utilization after compaction |
model_limit |
int |
200_000 |
Context window size in tokens |
ToolOutputOffloadScanner¶
Saves large tool outputs to scratch files and replaces them with a summary and file path reference in the message array.
from latent.guardrails.scanners.context import ToolOutputOffloadScanner
scanner = ToolOutputOffloadScanner(
max_output_tokens=2000, # offload outputs larger than this
scratch_dir="/tmp/agent_scratch", # where to save files
)
result = scanner.scan_messages(messages)
# Large outputs become: "[Output saved to /tmp/agent_scratch/search_0.txt. 5200 tokens. Summary: ...]"
| Parameter | Type | Default | Description |
|---|---|---|---|
max_output_tokens |
int |
2000 |
Token threshold for offloading |
scratch_dir |
str \| None |
auto (temp dir) | Directory for offloaded files |
SummaryInjectionScanner¶
Injects a summary system message into long conversations. Activates after every_n_messages non-system messages and injects a one-per-conversation summary (idempotent -- skips if a summary already exists).
from latent.guardrails.scanners.context import SummaryInjectionScanner
scanner = SummaryInjectionScanner(every_n_messages=20)
result = scanner.scan_messages(messages)
# Injects a "[Conversation summary]" system message after existing system messages
| Parameter | Type | Default | Description |
|---|---|---|---|
every_n_messages |
int |
20 |
Minimum non-system messages before injecting |
Semantic Validators¶
LLM-backed validators for deeper analysis. These make API calls and should be used periodically, offline, or in CI -- not on every turn.
PoisoningDetector¶
Detects hallucinated facts that re-enter the context. Compares tool outputs against assistant messages to find unverified claims being repeated.
from latent.guardrails.scanners.context import PoisoningDetector
scanner = PoisoningDetector(model="gpt-4o-mini")
result = scanner.scan_messages(messages)
# result["metadata"]["unverified_claims"] = 2
# result["metadata"]["examples"] = ["The API supports batch mode"]
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str |
"gpt-4o-mini" |
Model for fact-verification |
DistractionScorer¶
Scores each message for relevance to the current task objective. High distraction scores indicate off-topic content that could confuse the model.
from latent.guardrails.scanners.context import DistractionScorer
scanner = DistractionScorer(model="gpt-4o-mini")
result = scanner.scan_messages(messages)
# result["score"] = average distraction score (0.0 = relevant, 1.0 = off-topic)
# result["metadata"]["per_message_scores"] = [0.1, 0.0, 0.8, ...]
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str |
"gpt-4o-mini" |
Model for relevance scoring |
ContradictionDetector¶
Finds factual contradictions between system instructions and tool outputs. Returns severity-scored contradiction pairs.
from latent.guardrails.scanners.context import ContradictionDetector
scanner = ContradictionDetector(model="gpt-4o-mini")
result = scanner.scan_messages(messages)
# result["metadata"]["contradiction_pairs"] = [
# {"system_claim": "...", "tool_claim": "...", "severity": 0.9}
# ]
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str |
"gpt-4o-mini" |
Model for contradiction detection |
CompressionQualityAuditor¶
Evaluates quality of compressed context using probe questions across four dimensions:
- recall -- Can specific facts from earlier be recalled?
- artifact -- Are references to files, URLs, and code still traceable?
- continuation -- Is there enough context to continue coherently?
- decision -- Can key decisions and their rationale be identified?
from latent.guardrails.scanners.context import CompressionQualityAuditor
scanner = CompressionQualityAuditor(
model="gpt-4o-mini",
probes=["recall", "artifact", "continuation", "decision"],
)
result = scanner.scan_messages(messages)
# result["metadata"]["dimension_scores"] = {"recall": 4, "artifact": 3, ...}
# result["metadata"]["rationale"] = {"recall": "Key facts are preserved...", ...}
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str |
"gpt-4o-mini" |
Model for quality probing |
probes |
list[str] |
["recall", "artifact", "continuation", "decision"] |
Probe dimensions |
Utility Functions¶
Lower-level functions for token estimation and context manipulation. All functions return new lists and never mutate input.
estimate_tokens¶
UTF-8 byte estimation: len(text.encode("utf-8")) // 4. Handles multilingual text better than character counting (Hebrew characters are 2 bytes each, so char-count would under-estimate by ~2x).
from latent.context import estimate_tokens
tokens = estimate_tokens("Hello, world!") # ~3
tokens = estimate_tokens("shalom olam") # correct for Hebrew
budget_breakdown¶
Compute per-component token allocation from a messages array.
from latent.context import budget_breakdown
breakdown = budget_breakdown(messages, tools, model_limit=128_000)
print(f"System: {breakdown.system_tokens}")
print(f"History: {breakdown.history_tokens}")
print(f"Tool defs: {breakdown.tool_def_tokens}")
print(f"Tool output: {breakdown.tool_output_tokens}")
print(f"Available: {breakdown.available_tokens}")
print(f"Utilization: {breakdown.utilization:.1%}")
mask_observations¶
Replace old tool-result messages with one-line summaries, keeping the last keep_last_n intact.
from latent.context import mask_observations
compacted = mask_observations(messages, keep_last_n=3)
# Older tool results become: "[Tool output summarized: 4500 chars from search_web]"
compact_history¶
Reduce message history to fit within a target token budget. Two strategies:
tool_results_first(default) -- mask tool outputs oldest-first, then drop oldest user/assistant pairs.oldest_first-- drop oldest non-system messages first.
System messages are never dropped.
from latent.context import compact_history
compacted = compact_history(
messages,
target_tokens=50_000,
strategy="tool_results_first",
)
anchored_summarize¶
Structured iterative summarization with anchored sections. Keeps system messages and the last keep_last_n messages intact. Middle messages are summarized into mandatory sections:
- Session Intent
- Files Modified
- Decisions Made
- Current State
- Next Steps
from latent.context import anchored_summarize
compacted = anchored_summarize(
messages,
keep_last_n=5,
sections=None, # use defaults
summary_fn=my_llm, # optional LLM-backed summarizer
)
LLM-backed summarization
Pass a summary_fn(text) -> str for higher-quality summaries. Without it, a simple extractive approach is used (first and last lines of the middle block).
compact_diffs¶
Cap recent diffs in iterative context, compacting older ones to one-liners. Useful for optimization loops where diffs accumulate.
from latent.context import compact_diffs
compacted = compact_diffs(context_markdown, max_recent_full=3)
# Older diffs become: "iter 1: +15/-3 lines in agent.py, tools.py"
review_context() -- Offline Linter¶
The CI / offline entry point. Runs all six static validators against a message array and returns a structured ContextReport.
from latent.context import review_context
report = review_context(
messages,
tools=tool_definitions,
model_limit=128_000,
include_semantic=False, # set True for LLM-backed checks
model="gpt-4o-mini", # model for semantic validators
)
print(report.has_critical) # bool
print(report.has_warnings) # bool
print(report.render_markdown())
for finding in report.findings:
print(f"[{finding.severity}] {finding.scanner_name}: {finding.message}")
if finding.suggestion:
print(f" Fix: {finding.suggestion}")
ContextReport¶
| Field | Type | Description |
|---|---|---|
findings |
list[Finding] |
All findings from scanners |
budget |
BudgetBreakdown \| None |
Token budget breakdown |
has_critical |
bool |
Any critical-severity findings |
has_warnings |
bool |
Any warning-severity findings |
Finding¶
| Field | Type | Description |
|---|---|---|
scanner_name |
str |
Which scanner produced the finding |
severity |
"info" / "warning" / "critical" |
Severity level |
message |
str |
Human-readable description |
suggestion |
str |
Remediation hint |
metadata |
dict |
Scanner-specific data |
CI Integration¶
Use review_context() in your test suite to catch context issues before deployment:
import pytest
from latent.context import review_context
def test_agent_context_quality():
messages = build_test_messages()
tools = get_agent_tools()
report = review_context(messages, tools, model_limit=128_000)
assert not report.has_critical, report.render_markdown()
for finding in report.findings:
assert finding.severity != "critical", finding.message
Events and Tracing¶
Context checks emit their own event types, distinct from guardrail events, so traces are clear about what triggered a check.
ContextCheckViolation¶
Yielded in the agent event stream when a context check fires. Extends AgentEvent.
| Field | Type | Description |
|---|---|---|
check_name |
str |
Name of the check |
timing |
"pre" / "post" |
When it fired |
outcome |
"active" / "passive" |
Whether it blocks |
tier |
"static" / "semantic" |
Check classification |
score |
float |
Check score |
message |
str |
User-facing message |
tokens_before |
int \| None |
Token count before compaction |
tokens_after |
int \| None |
Token count after compaction |
findings |
list[dict] |
Detailed findings |
ContextCheckEvent¶
Emitted to sinks (logging/tracing backends). Carries timing, latency, and metadata for observability.
| Field | Type | Description |
|---|---|---|
event_type |
str |
"check_result" / "compaction" / "warning" / "critical" |
check_name |
str |
Name of the check |
tier |
"static" / "semantic" |
Check classification |
score |
float \| None |
Check score |
passed |
bool \| None |
Whether the check passed |
tokens_before |
int \| None |
Token count before |
tokens_after |
int \| None |
Token count after |
latency_ms |
float \| None |
Check execution time |
Consuming Events¶
from latent.agents.events import TextDelta
from latent.context.events import ContextCheckViolation
async for event in agent.stream(messages):
if isinstance(event, ContextCheckViolation):
if event.outcome == "active":
print(f"[CONTEXT] {event.check_name}: {event.message}")
if event.tokens_before and event.tokens_after:
print(f" Compacted: {event.tokens_before:,} -> {event.tokens_after:,} tokens")
else:
print(f"[CONTEXT WARNING] {event.check_name}: {event.message}")
elif isinstance(event, TextDelta):
print(event.text, end="", flush=True)
Production Architecture¶
A recommended layering for production agents:
Every turn (<5 ms) Threshold-triggered Periodic / CI
--------------------- ------------------------- -------------------------
TokenBudgetAuditor CompactionScanner PoisoningDetector
MiddleContentDetector ObservationMaskingScanner DistractionScorer
HistoryBloatDetector ToolOutputOffloadScanner ContradictionDetector
KVCacheStabilityAuditor SummaryInjectionScanner CompressionQualityAuditor
ToolDescriptionLinter
SystemPromptStructureAuditor
Static validators are passive -- they log findings but don't block. Run them on every turn for continuous monitoring.
Active compactors fire only when utilization crosses a threshold. Set them as active/pre scanners so they rewrite the message array before the LLM call.
Semantic validators are expensive (LLM calls). Run them offline, in CI, or on a schedule -- not in the hot path.
from latent.agents import LiteLLMAgent
from latent.context import context_check
from latent.guardrails.scanners.context import (
TokenBudgetAuditor,
MiddleContentDetector,
HistoryBloatDetector,
KVCacheStabilityAuditor,
ObservationMaskingScanner,
CompactionScanner,
)
class ProductionAgent(LiteLLMAgent):
# --- Static validators (passive, every turn) ---
@context_check(timing="pre", outcome="passive", tier="static")
def budget_monitor(self):
return TokenBudgetAuditor(model_limit=128_000)
@context_check(timing="pre", outcome="passive", tier="static")
def middle_content(self):
return MiddleContentDetector()
@context_check(timing="pre", outcome="passive", tier="static")
def history_bloat(self):
return HistoryBloatDetector(max_history_pct=0.6)
@context_check(timing="pre", outcome="passive", tier="static")
def kv_cache(self):
return KVCacheStabilityAuditor()
# --- Active compactors (trigger at threshold) ---
@context_check(timing="pre", outcome="active", tier="static")
def auto_mask(self):
return ObservationMaskingScanner(keep_last_n=5)
@context_check(timing="pre", outcome="active", tier="static")
def auto_compact(self):
return CompactionScanner(
trigger_utilization=0.8,
target_utilization=0.6,
model_limit=128_000,
)
Integration with Agent Types¶
LiteLLMAgent¶
Use @context_check decorators on a subclass (shown above) or wrap with GuardrailMiddleware.
PipelineAgent¶
Pipeline agents use the same GuardrailMiddleware infrastructure. Context checks run before each phase's LLM call:
from latent.agents.pipeline import PipelineAgent, phase
from latent.context import context_check
from latent.guardrails.scanners.context import TokenBudgetAuditor
class MyPipeline(PipelineAgent):
@context_check(timing="pre", outcome="passive", tier="static")
def budget_check(self):
return TokenBudgetAuditor(model_limit=128_000)
@phase("classify")
async def classify(self, state):
...
Programmatic Wrapping¶
For agents you don't control, wrap them:
from latent.guardrails import GuardrailMiddleware
from latent.guardrails.scanners.context import (
TokenBudgetAuditor,
ObservationMaskingScanner,
CompactionScanner,
)
wrapped = GuardrailMiddleware(
third_party_agent,
pre_scanners=[
TokenBudgetAuditor(model_limit=128_000),
ObservationMaskingScanner(keep_last_n=3),
CompactionScanner(
trigger_utilization=0.8,
target_utilization=0.6,
model_limit=128_000,
),
],
)
Scanner ordering matters
Pre-scanners run in order. Place validators before compactors so monitoring metrics reflect the pre-compaction state.