Conversation Metrics¶
Evaluating multi-turn conversations: turn-level scoring, SOP compliance, and trajectory analysis.
Most LLM evaluations score single prompt-response pairs. Real deployments are multi-turn conversations where quality degrades, agents skip steps, and tool-call sequences diverge from the happy path. latent.stats provides purpose-built primitives for all of these.
Conversation Data Model¶
Turn and Conversation¶
A `Conversation` is a sequence of `Turn` objects. Each turn captures a single message in the dialogue.
```python
from latent.stats import Turn, Conversation

conv = Conversation(turns=[
    Turn(role="user", content="I need help with my order"),
    Turn(role="assistant", content="I'd be happy to help! What's your order number?"),
    Turn(role="user", content="Order #12345"),
    Turn(role="assistant", content="I found your order. It shipped yesterday."),
])
```
Turn fields:

| Field | Type | Required | Description |
|---|---|---|---|
| `role` | `str` | Yes | `"user"`, `"assistant"`, or `"system"` |
| `content` | `str` | Yes | The message text |
| `timestamp` | `str \| None` | No | When the turn occurred |
| `tool_calls` | `list[dict] \| None` | No | Tool calls made during this turn |
| `phase` | `str` | No | SOP phase label (e.g., `"greeting"`, `"resolve"`) |
| `metadata` | `dict` | No | Arbitrary key-value pairs |
Turn-Level Scoring¶
turn_level_scores¶
Apply any scoring function to each turn independently. Returns an array of per-turn scores that you can aggregate, plot, or feed into downstream analysis.
```python
from latent.stats import turn_level_scores

scores = turn_level_scores(conv, score_fn=lambda t: len(t.content) / 100)
# array([0.25, 0.47, 0.12, 0.41])
```
The `score_fn` receives a `Turn` and returns a `float`. Use this with LLM-as-judge scorers, regex detectors, or any custom logic.
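For example, a minimal regex detector works as a `score_fn`. The `SimpleNamespace` stand-in below is only for illustration; any object with a `.content` attribute works the same way:

```python
import re
from types import SimpleNamespace

def hedging_detector(turn) -> float:
    """Score 1.0 if the turn hedges ('I think', 'maybe', ...), else 0.0."""
    pattern = r"\b(I think|maybe|possibly|not sure)\b"
    return 1.0 if re.search(pattern, turn.content, re.IGNORECASE) else 0.0

# Stand-in for a Turn; only `.content` is needed by this scorer
turn = SimpleNamespace(content="Maybe try restarting the router?")
print(hedging_detector(turn))  # 1.0
```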
aggregate_turn_scores¶
Collapse per-turn scores into a single conversation-level number.
```python
import numpy as np
from latent.stats import aggregate_turn_scores

# Simple mean across all turns
overall = aggregate_turn_scores(scores, strategy="mean")

# Worst-turn evaluation -- surface the weakest link
worst = aggregate_turn_scores(scores, strategy="min")

# Final-turn quality -- how did the conversation end?
final = aggregate_turn_scores(scores, strategy="last")

# Weighted -- later turns matter more
weights = np.array([0.1, 0.2, 0.3, 0.4])
weighted = aggregate_turn_scores(scores, strategy="weighted", weights=weights)
```
Available strategies:

| Strategy | Behavior | When to use |
|---|---|---|
| `"mean"` | Arithmetic mean of all turns | General quality |
| `"min"` | Lowest score across turns | Safety, compliance (one bad turn fails) |
| `"max"` | Highest score across turns | Best-case analysis |
| `"weighted"` | Weighted average (requires `weights`) | Later turns or assistant-only emphasis |
| `"last"` | Score of the final turn | Resolution quality |
turn_position_analysis¶
Analyze how scores change by turn position across a corpus of conversations. This is the key tool for detecting quality degradation in long conversations.
```python
from latent.stats import turn_position_analysis

# conversations: list[Conversation]
# score_fn: Turn -> float
results = turn_position_analysis(
    conversations,
    score_fn=my_judge,
    max_turns=10,
)

for r in results:
    print(f"Turn {r.name}: {r.point_estimate:.3f} [{r.ci_lower:.3f}, {r.ci_upper:.3f}]")
```
Returns one MetricResult per position with bootstrap confidence intervals. Plot these to visualize the quality curve across turn depth.
Tip

Filter to assistant-only turns by having your `score_fn` return `None` for user turns; `turn_position_analysis` skips `None` values automatically.
SOP Compliance¶
Standard Operating Procedures define the phases an agent must follow. latent.stats models SOPs as directed acyclic graphs (DAGs) of phases with prerequisites.
SOPDefinition and SOPPhase¶
```python
from latent.stats import SOPDefinition, SOPPhase

sop = SOPDefinition(phases=[
    SOPPhase(name="greeting", required=True),
    SOPPhase(name="identify_issue", required=True, prerequisites=["greeting"]),
    SOPPhase(name="resolve", required=True, prerequisites=["identify_issue"]),
    SOPPhase(name="upsell", required=False),
    SOPPhase(name="close", required=True, prerequisites=["resolve"]),
])
```
Each `SOPPhase` has:

- `name` -- unique identifier
- `required` -- must this phase be completed?
- `prerequisites` -- list of phase names that must come before
- `weight` -- relative importance (default 1.0)
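The `prerequisites` lists define the ordering constraint: a phase is only valid once all of its prerequisites have already appeared. A minimal sketch of that check (a hypothetical helper, not the library's code):

```python
def satisfies_prerequisites(actual: list[str], prereqs: dict[str, list[str]]) -> bool:
    """True if every phase appears only after all of its prerequisites."""
    seen: set[str] = set()
    for phase in actual:
        if any(p not in seen for p in prereqs.get(phase, [])):
            return False
        seen.add(phase)
    return True

prereqs = {
    "identify_issue": ["greeting"],
    "resolve": ["identify_issue"],
    "close": ["resolve"],
}
print(satisfies_prerequisites(["greeting", "identify_issue", "resolve", "close"], prereqs))  # True
print(satisfies_prerequisites(["greeting", "resolve", "identify_issue", "close"], prereqs))  # False
```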
phase_completion_score¶
What percentage of required phases were completed? Weighted by phase weight.
```python
from latent.stats import phase_completion_score

score = phase_completion_score(
    actual_phases=["greeting", "identify_issue", "resolve", "close"],
    sop=sop,
)
print(f"Completion: {score.point_estimate:.0%}")  # 100%
```
Missing a required phase drops the score proportionally. Optional phases (like "upsell" above) do not affect the score when absent.
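"Drops the score proportionally" can be read as: completed required weight divided by total required weight. A minimal sketch of that arithmetic, assuming equal weights of 1.0 (an illustration, not the library's implementation):

```python
def completion(actual: list[str], required_weights: dict[str, float]) -> float:
    """Weight of completed required phases over total required weight."""
    done = set(actual)
    total = sum(required_weights.values())
    got = sum(w for p, w in required_weights.items() if p in done)
    return got / total

required = {"greeting": 1.0, "identify_issue": 1.0, "resolve": 1.0, "close": 1.0}
# "close" never happened: 3 of 4 required phases completed
print(completion(["greeting", "identify_issue", "resolve"], required))  # 0.75
```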
sequence_compliance¶
Did the phases happen in the correct order? Uses edit distance against valid topological orderings of the SOP DAG, normalized to [0, 1] where 1.0 means perfect ordering.
```python
from latent.stats import sequence_compliance

# Correct order
result = sequence_compliance(
    actual_phases=["greeting", "identify_issue", "resolve", "close"],
    sop=sop,
)
print(f"Order compliance: {result.point_estimate:.2f}")  # 1.00

# Out-of-order: resolved before identifying issue
result = sequence_compliance(
    actual_phases=["greeting", "resolve", "identify_issue", "close"],
    sop=sop,
)
print(f"Order compliance: {result.point_estimate:.2f}")  # < 1.0
```
Note
sequence_compliance handles DAGs with multiple valid orderings. For example,
if "upsell" and "close" both depend only on "resolve", either order is
considered correct.
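To build intuition for the metric, here is a minimal sketch of normalized edit distance against a single valid ordering (the library additionally compares against all valid topological orderings and takes the best match):

```python
def levenshtein(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming edit distance over sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def ordering_similarity(actual: list[str], valid_order: list[str]) -> float:
    """1.0 = identical ordering, 0.0 = completely different."""
    d = levenshtein(actual, valid_order)
    return 1 - d / max(len(actual), len(valid_order))

print(ordering_similarity(
    ["greeting", "resolve", "identify_issue", "close"],
    ["greeting", "identify_issue", "resolve", "close"],
))  # 0.5
```

Swapping two adjacent phases costs two substitutions out of four positions, hence 0.5.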
Trajectory Analysis¶
tool_trajectory_distance¶
Compare actual tool-call sequences against an ideal (golden) trajectory using Levenshtein edit distance. Normalized so 1.0 = identical, 0.0 = completely different.
```python
from latent.stats import tool_trajectory_distance

result = tool_trajectory_distance(
    actual=["search", "lookup", "search", "respond"],
    ideal=["search", "lookup", "respond"],
)
print(f"Trajectory similarity: {result.point_estimate:.2f}")
```
Warning
When the actual sequence is more than 2x the ideal length, the result includes
a warning in result.metadata["warnings"]. This typically indicates the agent
is stuck in a retry loop.
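A minimal sketch of the same idea: normalized Levenshtein similarity plus the retry-loop length check (assuming normalization by the longer sequence; an illustration, not the library's implementation):

```python
def levenshtein(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming edit distance over sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def trajectory_similarity(actual: list[str], ideal: list[str]) -> tuple[float, list[str]]:
    sim = 1 - levenshtein(actual, ideal) / max(len(actual), len(ideal))
    warnings = []
    if len(actual) > 2 * len(ideal):
        warnings.append("actual trajectory >2x ideal length; possible retry loop")
    return sim, warnings

print(trajectory_similarity(
    ["search", "lookup", "search", "respond"],
    ["search", "lookup", "respond"],
))  # (0.75, [])
```

The extra `"search"` costs one deletion out of four positions, so the similarity is 0.75.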
conversational_edit_distance¶
Edit distance between two full conversation turn sequences. Useful for comparing agent behavior against a reference conversation.
```python
from latent.stats import conversational_edit_distance

result = conversational_edit_distance(
    actual_turns=conv.turns,
    reference_turns=reference_conv.turns,
)
print(f"Conversation similarity: {result.point_estimate:.2f}")
```
Key behaviors:

- Role mismatches are treated as infinite cost (insert + delete), because substituting a user turn for an assistant turn is never a minor edit.
- With `embedding_fn`: substitution cost is `1 - cosine_similarity(embed(a), embed(b))`, enabling semantic matching.
- Without `embedding_fn`: exact string match (substitution cost = 0 if identical, 1 otherwise).
- Normalized to [0, 1] where 1.0 = identical conversations.
Putting It Together¶
A complete conversation evaluation pipeline: score turns, check SOP compliance, measure trajectory accuracy, and combine into a report.
```python
import numpy as np
from latent.stats import (
    Conversation,
    Turn,
    turn_level_scores,
    aggregate_turn_scores,
    phase_completion_score,
    sequence_compliance,
    tool_trajectory_distance,
    analyze,
    SOPDefinition,
    SOPPhase,
)

# --- 1. Define your SOP ---
sop = SOPDefinition(phases=[
    SOPPhase(name="greeting", required=True),
    SOPPhase(name="identify_issue", required=True, prerequisites=["greeting"]),
    SOPPhase(name="resolve", required=True, prerequisites=["identify_issue"]),
    SOPPhase(name="close", required=True, prerequisites=["resolve"]),
])

# --- 2. Score each turn with a judge ---
def judge(turn: Turn) -> float | None:
    """Your LLM-as-judge or heuristic scorer."""
    if turn.role != "assistant":
        return None  # skip user turns
    # ... call your judge here ...
    return score

conversations: list[Conversation] = load_conversations()

all_scores = []
for conv in conversations:
    scores = turn_level_scores(conv, score_fn=judge)
    # Drop user turns (None scores surface as NaN in the array)
    assistant_scores = scores[~np.isnan(scores)]
    overall = aggregate_turn_scores(assistant_scores, strategy="mean")
    all_scores.append(overall)

# --- 3. Check SOP compliance ---
completion_scores = []
order_scores = []
for conv in conversations:
    phases = [t.phase for t in conv.turns if t.phase is not None]
    completion_scores.append(
        phase_completion_score(phases, sop).point_estimate
    )
    order_scores.append(
        sequence_compliance(phases, sop).point_estimate
    )

# --- 4. Measure trajectory accuracy ---
ideal_tools = ["search_kb", "lookup_order", "send_response"]

trajectory_scores = []
for conv in conversations:
    actual_tools = [
        tc["name"]  # assuming each tool-call dict stores the tool name under "name"
        for t in conv.turns
        for tc in (t.tool_calls or [])
    ]
    trajectory_scores.append(
        tool_trajectory_distance(actual_tools, ideal_tools).point_estimate
    )

# --- 5. Combine into a report ---
report = analyze(
    scores={
        "turn_quality": np.array(all_scores),
        "sop_completion": np.array(completion_scores),
        "sop_ordering": np.array(order_scores),
        "trajectory_accuracy": np.array(trajectory_scores),
    },
    score_types={
        "turn_quality": "continuous",
        "sop_completion": "continuous",
        "sop_ordering": "continuous",
        "trajectory_accuracy": "continuous",
    },
)

for m in report.metrics:
    print(f"{m.name}: {m.point_estimate:.3f} [{m.ci_lower:.3f}, {m.ci_upper:.3f}]")
```
Tip

Use `render_markdown(report)` to generate a human-readable summary, or `log_to_mlflow(report)` to track conversation-level metrics alongside your model experiments. See Reporting for details.