Most teams track one dialogue metric: containment rate. If the bot handled the conversation without escalating to a human, it counts as success. Containment rate is easy to measure and largely useless for diagnosing quality problems. A bot can achieve 80% containment by confidently giving wrong answers that users eventually accept. What containment does not tell you is where in the dialogue the system is failing, which intents are low-confidence, and whether context tracking is working at all. Here are the metrics that actually predict user frustration before it shows up in support tickets.
Intent Confidence Distribution (Not Just Average)
Average intent confidence is a misleading summary statistic. A system that scores 0.95 on 90% of utterances and 0.45 on the remaining 10% will show an average confidence of 0.90 - which looks fine until you realize that 10% of your turns are producing unreliable intent classifications. Low-confidence intents are where your bot starts saying things that do not match what the user asked, and where users start getting frustrated.
The useful metric is the confidence distribution: what fraction of utterances fall below 0.7, below 0.5, below 0.4? Track this as a histogram by intent class. You will typically find that 2-3 intent classes account for the majority of low-confidence classifications, and those are exactly the intents where your training data is thin or where user phrasing diverges from your training examples. Fixing those intent classes directly reduces fallback rates.
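As a minimal sketch, the per-intent distribution can be computed directly from turn logs. The `(intent, confidence)` pair format here is an assumption - adapt it to whatever shape your NLU logs actually emit:

```python
from collections import defaultdict

# Threshold buckets from the text: what fraction falls below each?
THRESHOLDS = (0.7, 0.5, 0.4)

def confidence_distribution(turns):
    """Per-intent fraction of utterances below each confidence threshold.

    `turns`: iterable of (intent, confidence) pairs - a stand-in for
    your production NLU log format.
    """
    by_intent = defaultdict(list)
    for intent, conf in turns:
        by_intent[intent].append(conf)
    return {
        intent: {t: sum(c < t for c in confs) / len(confs) for t in THRESHOLDS}
        for intent, confs in by_intent.items()
    }
```

Sorting the result by the below-0.7 fraction surfaces the 2-3 intent classes that dominate low-confidence traffic.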
More usefully: track confidence trends over time. If the low-confidence fraction is growing, users are expressing new request types that your NLU model has not seen. This is an early warning that your training data is falling behind user behavior. Surface it before users start abandoning the bot.
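One way to sketch the trend: bucket turns by ISO week and compute the low-confidence fraction per week. The `(date, confidence)` log shape is again an assumption:

```python
from datetime import date

def low_confidence_trend(logged_turns, threshold=0.7):
    """Weekly fraction of turns scoring below `threshold`.

    `logged_turns`: iterable of (date, confidence) pairs.
    Returns {(iso_year, iso_week): fraction}, ordered by week.
    """
    buckets = {}
    for day, conf in logged_turns:
        week = day.isocalendar()[:2]  # (ISO year, ISO week number)
        total, low = buckets.get(week, (0, 0))
        buckets[week] = (total + 1, low + (conf < threshold))
    return {w: low / total for w, (total, low) in sorted(buckets.items())}
```

A monotonically rising series here is the early warning the text describes: training data falling behind live user phrasing.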
Slot Fill Rate by Slot Type
A successfully classified intent that is missing required slot values is not a successful turn. If the BookFlight intent requires origin, destination, and departure_date, but only destination is extracted on 40% of BookFlight turns, the system must ask for the missing slots in clarification turns - adding latency, increasing turn count, and creating the impression that the bot is slow and tedious.
Slot fill rate measures what fraction of required slots are successfully extracted on the first turn an intent is identified. Grouped by slot type, this metric directly shows which slot types your entity extraction is underperforming on. Date expressions are commonly low-fill; relative dates like "next Tuesday," "day after tomorrow," and "end of next month" all require calendar resolution and many systems get these wrong. Location spans with unusual formatting or abbreviations are another common failure.
Cross-referencing slot fill rate with session transcript samples is the fastest way to build improvement priority queues. Find the slot types with fill rates below 80%, read 20 examples where extraction failed, and you will typically identify a small set of linguistic patterns that a targeted data augmentation pass can fix.
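The fill-rate computation and the priority queue can be sketched as below. The `{intent: [slot, ...]}` schema and the per-turn extracted-slot sets are assumed shapes, not a specific framework's API:

```python
def slot_fill_rates(first_turns, required_slots):
    """First-turn fill rate per required slot type.

    `first_turns`: iterable of (intent, extracted_slot_names) for the
    first turn on which each intent was identified.
    `required_slots`: {intent: [slot, ...]} schema.
    """
    counts = {}  # slot -> (filled, required)
    for intent, extracted in first_turns:
        for slot in required_slots.get(intent, ()):
            filled, total = counts.get(slot, (0, 0))
            counts[slot] = (filled + (slot in extracted), total + 1)
    return {slot: filled / total for slot, (filled, total) in counts.items()}

def improvement_queue(rates, floor=0.80):
    """Slot types under the 80% floor from the text, worst first."""
    return sorted((s for s, r in rates.items() if r < floor), key=rates.get)
```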
Context Hit Rate
Context hit rate is the fraction of inter-turn references that resolved successfully to an entity in the active context. This metric is unique to multi-turn dialogue systems and entirely absent from standard NLU benchmark reporting. It directly measures whether your context management is working.
A context hit is when the user says "change that to Friday" and the system correctly identifies "that" as referring to the departure_date entity that was set in a prior turn. A context miss is when the system fails to resolve the reference and either produces a fallback response ("I'm sorry, what would you like to change?") or, worse, applies the change to the wrong entity.
Context miss rate above 15% is a signal that context tracking is broken for a meaningful fraction of references. Context misses are correlated with session abandonment - users who get "I'm not sure what you'd like to change" responses three times in a session are far more likely to abandon the bot than users whose references resolve correctly. In our production data, sessions with context miss rate above 20% showed 3.1x higher abandonment rates than sessions with miss rates below 5%.
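Assuming each session logs one resolved/unresolved flag per inter-turn reference (a hypothetical log shape), the per-session miss rate and the 15% flag can be computed as:

```python
def session_miss_rates(sessions):
    """Per-session context miss rate.

    `sessions`: {session_id: [resolved_bool, ...]}, one flag per
    inter-turn reference. Sessions with no references are skipped.
    """
    return {
        sid: sum(not ok for ok in refs) / len(refs)
        for sid, refs in sessions.items() if refs
    }

def flag_broken_context(sessions, threshold=0.15):
    """Session ids whose miss rate exceeds the 15% line from the text."""
    return [sid for sid, rate in session_miss_rates(sessions).items()
            if rate > threshold]
```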
Clarification Turn Rate
Every time the bot asks "Did you mean X or Y?" or "Could you be more specific?", it is spending a turn recovering from an NLU failure. Clarification turns are necessary when utterances are genuinely ambiguous. They are wasteful when the information needed to resolve the ambiguity was already in context and was simply not used.
Track clarification turn rate both globally and segmented by prior-context availability: what fraction of clarification turns occurred when the information needed for resolution was actually present in the active context? This fraction measures unnecessary clarification - cases where better context utilization would have allowed the system to resolve the utterance without asking.
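The segmented rate reduces to a simple ratio once each clarification turn is annotated with whether the active context could have resolved it. That annotation is the hard part - in practice it comes from replaying the session's entity state - so the boolean input here is an assumption:

```python
def unnecessary_clarification_rate(clarification_turns):
    """Fraction of clarification turns whose resolving information was
    already present in the active context.

    `clarification_turns`: iterable of booleans, True when the turn was
    resolvable from context (i.e., the clarification was unnecessary).
    """
    turns = list(clarification_turns)
    return sum(turns) / len(turns) if turns else 0.0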
In a baseline evaluation of 12 production dialogue systems before Equmenopolis integration, unnecessary clarification rate ranged from 18% to 41% of all clarification turns. Integrating entity state context reduced unnecessary clarification to under 7% in all 12 cases. The improvement does not require better NLU - it requires making the NLU aware of what has already been established in the session.
Turn-to-Task-Completion
How many turns does it take to complete a specific task? This is the most directly actionable efficiency metric for task-oriented dialogue. Establish baselines for your most common task types: account lookup, booking, cancellation, status query. Track median and P90 turn counts per task type over time.
Turn count creep - where tasks are taking more turns to complete over time - indicates model drift, context failures, or increasing user request complexity. Turn count reduction after a system update validates that the improvement achieved its goal. A customer service deployment we worked with saw median turns-to-resolution drop from 9.1 to 4.8 after implementing context-aware slot propagation. That 46% reduction translated to 39% lower cost per resolved ticket.
Repeat Phrase Rate
A repeat phrase rate measures how often users repeat information they have already provided in the current session. "My name is John - I said that already" is a signal captured in the user's phrasing, not in the system logs. To detect it automatically: calculate the semantic similarity of consecutive user utterances and flag sessions where a user provides the same named entity value more than once.
High repeat phrase rate is a direct signal that context tracking is not retaining slot values from prior turns. Users who repeat themselves more than twice in a session show significantly higher frustration signals: shorter final utterance lengths, use of "again," "already," and "I just said," and higher rates of mid-session abandonment without completing the task.
Building a Dialog Quality Dashboard
The six metrics described here form a complete picture of dialogue quality: intent confidence distribution (NLU accuracy), slot fill rate (entity extraction quality), context hit rate (context tracking quality), clarification turn rate (unnecessary friction), turn-to-task-completion (overall efficiency), and repeat phrase rate (user experience signal). Track all six, segment by intent type and session length, and you have a complete diagnosis of where your dialogue system needs work.
Equmenopolis's Dialog Analytics module exports all six metrics in real time. Session replay allows you to drill into any metric spike and read the actual turns that contributed to it. If you are currently tracking only containment rate, you are flying blind on the dimensions that actually determine whether users are frustrated. The Platform page has more details on what the analytics module captures.
Conclusion
Containment rate measures whether the bot survived a conversation. The metrics described here measure whether the bot did the conversation well. Teams that track only containment optimize for bot survival at the expense of user experience. Teams that track intent confidence distribution, slot fill rates, context hit rates, and turn efficiency understand exactly what is broken and have clear improvement targets. The data is already in your production logs. The question is whether you are surfacing it as actionable metrics or letting it go unread.