Conversation Metrics

Evaluating multi-turn conversations: turn-level scoring, SOP compliance, and trajectory analysis.

Most LLM evaluations score single prompt-response pairs. Real deployments are multi-turn conversations where quality degrades, agents skip steps, and tool-call sequences diverge from the happy path. latent.stats provides purpose-built primitives for all of these.

Conversation Data Model

Turn and Conversation

A Conversation is a sequence of Turn objects. Each turn captures a single message in the dialogue.

from latent.stats import Turn, Conversation

conv = Conversation(turns=[
    Turn(role="user", content="I need help with my order"),
    Turn(role="assistant", content="I'd be happy to help! What's your order number?"),
    Turn(role="user", content="Order #12345"),
    Turn(role="assistant", content="I found your order. It shipped yesterday."),
])

Turn fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| role | str | Yes | "user", "assistant", or "system" |
| content | str | Yes | The message text |
| timestamp | str \| None | No | When the turn occurred |
| tool_calls | list[dict] \| None | No | Tool calls made during this turn |
| phase | str | No | SOP phase label (e.g., "greeting", "resolve") |
| metadata | dict | No | Arbitrary key-value pairs |

Turn-Level Scoring

turn_level_scores

Apply any scoring function to each turn independently. Returns an array of per-turn scores that you can aggregate, plot, or feed into downstream analysis.

from latent.stats import turn_level_scores

scores = turn_level_scores(conv, score_fn=lambda t: len(t.content) / 100)
# array([0.31, 0.5 , 0.12, 0.47])

The score_fn receives a Turn and returns a float. Use this with LLM-as-judge scorers, regex detectors, or any custom logic.

aggregate_turn_scores

Collapse per-turn scores into a single conversation-level number.

from latent.stats import aggregate_turn_scores

# Simple mean across all turns
overall = aggregate_turn_scores(scores, strategy="mean")

# Worst-turn evaluation -- surface the weakest link
worst = aggregate_turn_scores(scores, strategy="min")

# Final-turn quality -- how did the conversation end?
final = aggregate_turn_scores(scores, strategy="last")

# Weighted -- later turns matter more
import numpy as np
weights = np.array([0.1, 0.2, 0.3, 0.4])
weighted = aggregate_turn_scores(scores, strategy="weighted", weights=weights)

Available strategies:

| Strategy | Behavior | When to use |
| --- | --- | --- |
| "mean" | Arithmetic mean of all turns | General quality |
| "min" | Lowest score across turns | Safety, compliance (one bad turn fails) |
| "max" | Highest score across turns | Best-case analysis |
| "weighted" | Weighted average (requires weights) | Later turns or assistant-only emphasis |
| "last" | Score of the final turn | Resolution quality |

turn_position_analysis

Analyze how scores change by turn position across a corpus of conversations. This is the key tool for detecting quality degradation in long conversations.

from latent.stats import turn_position_analysis

# conversations: list[Conversation]
# score_fn: Turn -> float
results = turn_position_analysis(
    conversations,
    score_fn=my_judge,
    max_turns=10,
)

for r in results:
    print(f"Turn {r.name}: {r.point_estimate:.3f} [{r.ci_lower:.3f}, {r.ci_upper:.3f}]")

Returns one MetricResult per position with bootstrap confidence intervals. Plot these to visualize the quality curve across turn depth.

Tip

To restrict the analysis to assistant turns, have score_fn return None for user turns; turn_position_analysis skips None values automatically.
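The per-position grouping behind this analysis can be sketched in plain NumPy. This illustrates the idea (group scores by turn index across conversations, then bootstrap a CI per position); it is not the library's implementation, and `position_means` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy corpus: one score array per conversation, of varying length
corpus = [rng.uniform(0.5, 1.0, size=n) for n in (4, 6, 5, 6)]

def position_means(corpus, max_turns, n_boot=1000, seed=0):
    """Per-position mean with a percentile-bootstrap 95% CI."""
    rng = np.random.default_rng(seed)
    out = []
    for pos in range(max_turns):
        # Collect the score at this position from every conversation long enough
        vals = np.array([c[pos] for c in corpus if len(c) > pos])
        if vals.size == 0:
            break
        # Resample with replacement and take the mean of each resample
        boots = rng.choice(vals, size=(n_boot, vals.size)).mean(axis=1)
        lo, hi = np.percentile(boots, [2.5, 97.5])
        out.append((pos, vals.mean(), lo, hi))
    return out

for pos, mean, lo, hi in position_means(corpus, max_turns=6):
    print(f"turn {pos}: {mean:.3f} [{lo:.3f}, {hi:.3f}]")
```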

SOP Compliance

Standard Operating Procedures define the phases an agent must follow. latent.stats models SOPs as directed acyclic graphs (DAGs) of phases with prerequisites.

SOPDefinition and SOPPhase

from latent.stats import SOPDefinition, SOPPhase

sop = SOPDefinition(phases=[
    SOPPhase(name="greeting", required=True),
    SOPPhase(name="identify_issue", required=True, prerequisites=["greeting"]),
    SOPPhase(name="resolve", required=True, prerequisites=["identify_issue"]),
    SOPPhase(name="upsell", required=False),
    SOPPhase(name="close", required=True, prerequisites=["resolve"]),
])

Each SOPPhase has:

  • name -- unique identifier
  • required -- must this phase be completed?
  • prerequisites -- list of phase names that must come before
  • weight -- relative importance (default 1.0)

phase_completion_score

What percentage of required phases were completed? Weighted by phase weight.

from latent.stats import phase_completion_score

score = phase_completion_score(
    actual_phases=["greeting", "identify_issue", "resolve", "close"],
    sop=sop,
)
print(f"Completion: {score.point_estimate:.0%}")  # 100%

Missing a required phase drops the score proportionally. Optional phases (like "upsell" above) do not affect the score when absent.
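The weighting can be sketched as completed-required-weight over total-required-weight. This is an assumption about the exact formula (the source only says scores are weighted by phase weight), and `completion` is an illustrative helper, not the library function:

```python
def completion(actual, required_weights):
    """Weighted fraction of required phases present in the conversation."""
    done = sum(w for phase, w in required_weights.items() if phase in actual)
    return done / sum(required_weights.values())

# Required phases and their weights; "resolve" counts double
required = {"greeting": 1.0, "identify_issue": 1.0, "resolve": 2.0, "close": 1.0}
print(completion(["greeting", "identify_issue", "close"], required))  # 0.6
```

Under this formula, skipping the heavily weighted "resolve" phase costs 40% of the score, while skipping "close" would cost only 20%.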

sequence_compliance

Did the phases happen in the correct order? Uses edit distance against valid topological orderings of the SOP DAG, normalized to [0, 1] where 1.0 means perfect ordering.

from latent.stats import sequence_compliance

# Correct order
result = sequence_compliance(
    actual_phases=["greeting", "identify_issue", "resolve", "close"],
    sop=sop,
)
print(f"Order compliance: {result.point_estimate:.2f}")  # 1.00

# Out-of-order: resolved before identifying issue
result = sequence_compliance(
    actual_phases=["greeting", "resolve", "identify_issue", "close"],
    sop=sop,
)
print(f"Order compliance: {result.point_estimate:.2f}")  # < 1.0

Note

sequence_compliance handles DAGs with multiple valid orderings. For example, if "upsell" and "close" both depend only on "resolve", either order is considered correct.
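A prerequisite check of this kind can be sketched as a single pass that verifies each phase appears only after all of its prerequisites. This illustrates the DAG-ordering constraint, not the library's edit-distance scoring; `respects_prerequisites` is a hypothetical helper:

```python
def respects_prerequisites(sequence, prereqs):
    """True if every phase in the sequence appears after all its prerequisites."""
    seen = set()
    for phase in sequence:
        if not set(prereqs.get(phase, [])) <= seen:
            return False  # a prerequisite has not occurred yet
        seen.add(phase)
    return True

# Prerequisite edges from the SOP above
prereqs = {"identify_issue": ["greeting"],
           "resolve": ["identify_issue"],
           "close": ["resolve"]}

print(respects_prerequisites(
    ["greeting", "identify_issue", "resolve", "close"], prereqs))  # True
print(respects_prerequisites(
    ["greeting", "resolve", "identify_issue", "close"], prereqs))  # False
```

A phase absent from `prereqs` (like "upsell") has no ordering constraint, which is why multiple topological orderings can all be valid.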

Trajectory Analysis

tool_trajectory_distance

Compare actual tool-call sequences against an ideal (golden) trajectory using Levenshtein edit distance. Normalized so 1.0 = identical, 0.0 = completely different.

from latent.stats import tool_trajectory_distance

result = tool_trajectory_distance(
    actual=["search", "lookup", "search", "respond"],
    ideal=["search", "lookup", "respond"],
)
print(f"Trajectory similarity: {result.point_estimate:.2f}")

Warning

When the actual sequence is more than 2x the ideal length, the result includes a warning in result.metadata["warnings"]. This typically indicates the agent is stuck in a retry loop.
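One normalization consistent with the description is `1 - distance / max(len(actual), len(ideal))`, which maps identical sequences to 1.0. This is a sketch of the idea, not necessarily the library's exact formula:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def trajectory_similarity(actual, ideal):
    """Normalized similarity: 1.0 = identical, 0.0 = completely different."""
    denom = max(len(actual), len(ideal), 1)
    return 1.0 - levenshtein(actual, ideal) / denom

print(trajectory_similarity(["search", "lookup", "search", "respond"],
                            ["search", "lookup", "respond"]))  # 0.75
```

Here the duplicated "search" call costs one deletion out of four positions, hence 0.75.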

conversational_edit_distance

Edit distance between two full conversation turn sequences. Useful for comparing agent behavior against a reference conversation.

from latent.stats import conversational_edit_distance

result = conversational_edit_distance(
    actual_turns=conv.turns,
    reference_turns=reference_conv.turns,
)
print(f"Conversation similarity: {result.point_estimate:.2f}")

Key behaviors:

  • Role mismatches have infinite substitution cost, forcing an insert + delete instead, because substituting a user turn for an assistant turn is never a minor edit.
  • With embedding_fn: substitution cost is 1 - cosine_similarity(embed(a), embed(b)), enabling semantic matching.
  • Without embedding_fn: exact string match (substitution cost = 0 if identical, 1 otherwise).
  • Normalized to [0, 1] where 1.0 = identical conversations.

Putting It Together

A complete conversation evaluation pipeline: score turns, check SOP compliance, measure trajectory accuracy, and combine into a report.

import numpy as np
from latent.stats import (
    Conversation,
    Turn,
    turn_level_scores,
    aggregate_turn_scores,
    phase_completion_score,
    sequence_compliance,
    tool_trajectory_distance,
    analyze,
    SOPDefinition,
    SOPPhase,
)

# --- 1. Define your SOP ---
sop = SOPDefinition(phases=[
    SOPPhase(name="greeting", required=True),
    SOPPhase(name="identify_issue", required=True, prerequisites=["greeting"]),
    SOPPhase(name="resolve", required=True, prerequisites=["identify_issue"]),
    SOPPhase(name="close", required=True, prerequisites=["resolve"]),
])

# --- 2. Score each turn with a judge ---
def judge(turn: Turn) -> float | None:
    """Your LLM-as-judge or heuristic scorer."""
    if turn.role != "assistant":
        return None  # skip user turns
    # ... call your judge here ...
    return score

conversations: list[Conversation] = load_conversations()

all_scores = []
for conv in conversations:
    scores = turn_level_scores(conv, score_fn=judge)
    # Drop NaN entries (turns where the judge returned None)
    assistant_scores = scores[~np.isnan(scores)]
    overall = aggregate_turn_scores(assistant_scores, strategy="mean")
    all_scores.append(overall)

# --- 3. Check SOP compliance ---
completion_scores = []
order_scores = []
for conv in conversations:
    phases = [t.phase for t in conv.turns if t.phase is not None]
    completion_scores.append(
        phase_completion_score(phases, sop).point_estimate
    )
    order_scores.append(
        sequence_compliance(phases, sop).point_estimate
    )

# --- 4. Measure trajectory accuracy ---
ideal_tools = ["search_kb", "lookup_order", "send_response"]
trajectory_scores = []
for conv in conversations:
    # Assuming each tool-call dict records the tool's name under a "name" key
    actual_tools = [
        tc["name"] for t in conv.turns for tc in (t.tool_calls or [])
    ]
    trajectory_scores.append(
        tool_trajectory_distance(actual_tools, ideal_tools).point_estimate
    )

# --- 5. Combine into a report ---
report = analyze(
    scores={
        "turn_quality": np.array(all_scores),
        "sop_completion": np.array(completion_scores),
        "sop_ordering": np.array(order_scores),
        "trajectory_accuracy": np.array(trajectory_scores),
    },
    score_types={
        "turn_quality": "continuous",
        "sop_completion": "continuous",
        "sop_ordering": "continuous",
        "trajectory_accuracy": "continuous",
    },
)

for m in report.metrics:
    print(f"{m.name}: {m.point_estimate:.3f} [{m.ci_lower:.3f}, {m.ci_upper:.3f}]")

Tip

Use render_markdown(report) to generate a human-readable summary, or log_to_mlflow(report) to track conversation-level metrics alongside your model experiments. See Reporting for details.