Conversation Metrics

Evaluating multi-turn conversations: turn-level scoring, SOP compliance, and trajectory analysis.

Most LLM evaluations score single prompt-response pairs. Real deployments are multi-turn conversations where quality degrades, agents skip steps, and tool-call sequences diverge from the happy path. latent.stats provides purpose-built primitives for all of these.

Conversation Data Model

Turn and Conversation

A Conversation is a sequence of Turn objects. Each turn captures a single message in the dialogue.

from latent.stats import Turn, Conversation

conv = Conversation(turns=[
    Turn(role="user", content="I need help with my order"),
    Turn(role="assistant", content="I'd be happy to help! What's your order number?"),
    Turn(role="user", content="Order #12345"),
    Turn(role="assistant", content="I found your order. It shipped yesterday."),
])

Turn fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| role | str | Yes | "user", "assistant", or "system" |
| content | str | Yes | The message text |
| timestamp | str \| None | No | When the turn occurred |
| tool_calls | list[dict] \| None | No | Tool calls made during this turn |
| phase | str | No | SOP phase label (e.g., "greeting", "resolve") |
| metadata | dict | No | Arbitrary key-value pairs |

Turn-Level Scoring

turn_level_scores

Apply any scoring function to each turn independently. Returns an array of per-turn scores that you can aggregate, plot, or feed into downstream analysis.

from latent.stats import turn_level_scores

scores = turn_level_scores(conv, score_fn=lambda t: len(t.content) / 100)
# array([0.31, 0.5 , 0.12, 0.47])

The score_fn receives a Turn and returns a float. Use this with LLM-as-judge scorers, regex detectors, or any custom logic.

aggregate_turn_scores

Collapse per-turn scores into a single conversation-level number.

from latent.stats import aggregate_turn_scores

# Simple mean across all turns
overall = aggregate_turn_scores(scores, strategy="mean")

# Worst-turn evaluation -- surface the weakest link
worst = aggregate_turn_scores(scores, strategy="min")

# Final-turn quality -- how did the conversation end?
final = aggregate_turn_scores(scores, strategy="last")

# Weighted -- later turns matter more
import numpy as np
weights = np.array([0.1, 0.2, 0.3, 0.4])
weighted = aggregate_turn_scores(scores, strategy="weighted", weights=weights)

Available strategies:

| Strategy | Behavior | When to use |
| --- | --- | --- |
| "mean" | Arithmetic mean of all turns | General quality |
| "min" | Lowest score across turns | Safety, compliance (one bad turn fails) |
| "max" | Highest score across turns | Best-case analysis |
| "weighted" | Weighted average (requires weights) | Later turns or assistant-only emphasis |
| "last" | Score of the final turn | Resolution quality |

turn_position_analysis

Analyze how scores change by turn position across a corpus of conversations. This is the key tool for detecting quality degradation in long conversations.

from latent.stats import turn_position_analysis

# conversations: list[Conversation]
# score_fn: Turn -> float
results = turn_position_analysis(
    conversations,
    score_fn=my_judge,
    max_turns=10,
)

for r in results:
    print(f"Turn {r.name}: {r.point_estimate:.3f} [{r.ci_lower:.3f}, {r.ci_upper:.3f}]")

Returns one MetricResult per position with bootstrap confidence intervals. Plot these to visualize the quality curve across turn depth.

Tip

To restrict the analysis to assistant turns, have score_fn return None for user turns; turn_position_analysis skips None values automatically.
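The per-position grouping behind this analysis can be sketched in plain NumPy. This illustrates the idea (group scores by turn index across conversations, then bootstrap a CI per position); it is not the library's implementation, and `position_means` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy corpus: one score array per conversation, of varying length
corpus = [rng.uniform(0.5, 1.0, size=n) for n in (4, 6, 5, 6)]

def position_means(corpus, max_turns, n_boot=1000, seed=0):
    """Per-position mean with a percentile-bootstrap 95% CI."""
    rng = np.random.default_rng(seed)
    out = []
    for pos in range(max_turns):
        # Collect the score at this position from every conversation long enough
        vals = np.array([c[pos] for c in corpus if len(c) > pos])
        if vals.size == 0:
            break
        # Resample with replacement and take the mean of each resample
        boots = rng.choice(vals, size=(n_boot, vals.size)).mean(axis=1)
        lo, hi = np.percentile(boots, [2.5, 97.5])
        out.append((pos, vals.mean(), lo, hi))
    return out

for pos, mean, lo, hi in position_means(corpus, max_turns=6):
    print(f"turn {pos}: {mean:.3f} [{lo:.3f}, {hi:.3f}]")
```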

SOP Compliance

Standard Operating Procedures define the phases an agent must follow. latent.stats models SOPs as directed acyclic graphs (DAGs) of phases with prerequisites.

SOPDefinition and SOPPhase

from latent.stats import SOPDefinition, SOPPhase

sop = SOPDefinition(phases=[
    SOPPhase(name="greeting", required=True),
    SOPPhase(name="identify_issue", required=True, prerequisites=["greeting"]),
    SOPPhase(name="resolve", required=True, prerequisites=["identify_issue"]),
    SOPPhase(name="upsell", required=False),
    SOPPhase(name="close", required=True, prerequisites=["resolve"]),
])

Each SOPPhase has:

  • name -- unique identifier
  • required -- must this phase be completed?
  • prerequisites -- list of phase names that must come before
  • weight -- relative importance (default 1.0)

phase_completion_score

What percentage of required phases were completed? Weighted by phase weight.

from latent.stats import phase_completion_score

score = phase_completion_score(
    actual_phases=["greeting", "identify_issue", "resolve", "close"],
    sop=sop,
)
print(f"Completion: {score.point_estimate:.0%}")  # 100%

Missing a required phase drops the score proportionally. Optional phases (like "upsell" above) do not affect the score when absent.
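The weighting can be sketched as completed-required-weight over total-required-weight. This is an assumption about the exact formula (the source only says scores are weighted by phase weight), and `completion` is an illustrative helper, not the library function:

```python
def completion(actual, required_weights):
    """Weighted fraction of required phases present in the conversation."""
    done = sum(w for phase, w in required_weights.items() if phase in actual)
    return done / sum(required_weights.values())

# Required phases and their weights; "resolve" counts double
required = {"greeting": 1.0, "identify_issue": 1.0, "resolve": 2.0, "close": 1.0}
print(completion(["greeting", "identify_issue", "close"], required))  # 0.6
```

Under this formula, skipping the heavily weighted "resolve" phase costs 40% of the score, while skipping "close" would cost only 20%.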

sequence_compliance

Did the phases happen in the correct order? Uses edit distance against valid topological orderings of the SOP DAG, normalized to [0, 1] where 1.0 means perfect ordering.

from latent.stats import sequence_compliance

# Correct order
result = sequence_compliance(
    actual_phases=["greeting", "identify_issue", "resolve", "close"],
    sop=sop,
)
print(f"Order compliance: {result.point_estimate:.2f}")  # 1.00

# Out-of-order: resolved before identifying issue
result = sequence_compliance(
    actual_phases=["greeting", "resolve", "identify_issue", "close"],
    sop=sop,
)
print(f"Order compliance: {result.point_estimate:.2f}")  # < 1.0

Note

sequence_compliance handles DAGs with multiple valid orderings. For example, if "upsell" and "close" both depend only on "resolve", either order is considered correct.
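A prerequisite check of this kind can be sketched as a single pass that verifies each phase appears only after all of its prerequisites. This illustrates the DAG-ordering constraint, not the library's edit-distance scoring; `respects_prerequisites` is a hypothetical helper:

```python
def respects_prerequisites(sequence, prereqs):
    """True if every phase in the sequence appears after all its prerequisites."""
    seen = set()
    for phase in sequence:
        if not set(prereqs.get(phase, [])) <= seen:
            return False  # a prerequisite has not occurred yet
        seen.add(phase)
    return True

# Prerequisite edges from the SOP above
prereqs = {"identify_issue": ["greeting"],
           "resolve": ["identify_issue"],
           "close": ["resolve"]}

print(respects_prerequisites(
    ["greeting", "identify_issue", "resolve", "close"], prereqs))  # True
print(respects_prerequisites(
    ["greeting", "resolve", "identify_issue", "close"], prereqs))  # False
```

A phase absent from `prereqs` (like "upsell") has no ordering constraint, which is why multiple topological orderings can all be valid.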

Trajectory Analysis

tool_trajectory_distance

Compare actual tool-call sequences against an ideal (golden) trajectory using Levenshtein edit distance. Normalized so 1.0 = identical, 0.0 = completely different.

from latent.stats import tool_trajectory_distance

result = tool_trajectory_distance(
    actual=["search", "lookup", "search", "respond"],
    ideal=["search", "lookup", "respond"],
)
print(f"Trajectory similarity: {result.point_estimate:.2f}")

Warning

When the actual sequence is more than 2x the ideal length, the result includes a warning in result.metadata["warnings"]. This typically indicates the agent is stuck in a retry loop.
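One normalization consistent with the description is `1 - distance / max(len(actual), len(ideal))`, which maps identical sequences to 1.0. This is a sketch of the idea, not necessarily the library's exact formula:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def trajectory_similarity(actual, ideal):
    """Normalized similarity: 1.0 = identical, 0.0 = completely different."""
    denom = max(len(actual), len(ideal), 1)
    return 1.0 - levenshtein(actual, ideal) / denom

print(trajectory_similarity(["search", "lookup", "search", "respond"],
                            ["search", "lookup", "respond"]))  # 0.75
```

Here the duplicated "search" call costs one deletion out of four positions, hence 0.75.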

conversational_edit_distance

Edit distance between two full conversation turn sequences. Useful for comparing agent behavior against a reference conversation.

from latent.stats import conversational_edit_distance

result = conversational_edit_distance(
    actual_turns=conv.turns,
    reference_turns=reference_conv.turns,
)
print(f"Conversation similarity: {result.point_estimate:.2f}")

Key behaviors:

  • Role mismatches have infinite substitution cost, forcing an insert + delete instead, because substituting a user turn for an assistant turn is never a minor edit.
  • With embedding_fn: substitution cost is 1 - cosine_similarity(embed(a), embed(b)), enabling semantic matching.
  • Without embedding_fn: exact string match (substitution cost = 0 if identical, 1 otherwise).
  • Normalized to [0, 1] where 1.0 = identical conversations.

Putting It Together

A complete conversation evaluation pipeline: score turns, check SOP compliance, measure trajectory accuracy, and combine into a report.

import numpy as np
from latent.stats import (
    Conversation,
    Turn,
    turn_level_scores,
    aggregate_turn_scores,
    phase_completion_score,
    sequence_compliance,
    tool_trajectory_distance,
    analyze,
    SOPDefinition,
    SOPPhase,
)

# --- 1. Define your SOP ---
sop = SOPDefinition(phases=[
    SOPPhase(name="greeting", required=True),
    SOPPhase(name="identify_issue", required=True, prerequisites=["greeting"]),
    SOPPhase(name="resolve", required=True, prerequisites=["identify_issue"]),
    SOPPhase(name="close", required=True, prerequisites=["resolve"]),
])

# --- 2. Score each turn with a judge ---
def judge(turn: Turn) -> float | None:
    """Your LLM-as-judge or heuristic scorer."""
    if turn.role != "assistant":
        return None  # skip user turns
    # ... call your judge here ...
    return score

conversations: list[Conversation] = load_conversations()

all_scores = []
for conv in conversations:
    scores = turn_level_scores(conv, score_fn=judge)
    # Drop NaN entries (turns where the judge returned None)
    assistant_scores = scores[~np.isnan(scores)]
    overall = aggregate_turn_scores(assistant_scores, strategy="mean")
    all_scores.append(overall)

# --- 3. Check SOP compliance ---
completion_scores = []
order_scores = []
for conv in conversations:
    phases = [t.phase for t in conv.turns if t.phase is not None]
    completion_scores.append(
        phase_completion_score(phases, sop).point_estimate
    )
    order_scores.append(
        sequence_compliance(phases, sop).point_estimate
    )

# --- 4. Measure trajectory accuracy ---
ideal_tools = ["search_kb", "lookup_order", "send_response"]
trajectory_scores = []
for conv in conversations:
    # Assuming each tool-call dict records the tool's name under a "name" key
    actual_tools = [
        tc["name"] for t in conv.turns for tc in (t.tool_calls or [])
    ]
    trajectory_scores.append(
        tool_trajectory_distance(actual_tools, ideal_tools).point_estimate
    )

# --- 5. Combine into a report ---
report = analyze(
    scores={
        "turn_quality": np.array(all_scores),
        "sop_completion": np.array(completion_scores),
        "sop_ordering": np.array(order_scores),
        "trajectory_accuracy": np.array(trajectory_scores),
    },
    score_types={
        "turn_quality": "continuous",
        "sop_completion": "continuous",
        "sop_ordering": "continuous",
        "trajectory_accuracy": "continuous",
    },
)

for m in report.metrics:
    print(f"{m.name}: {m.point_estimate:.3f} [{m.ci_lower:.3f}, {m.ci_upper:.3f}]")

Tip

Use render_markdown(report) to generate a human-readable summary, or log_to_mlflow(report) to track conversation-level metrics alongside your model experiments. See Reporting for details.