Ask a product team running a customer support chatbot what their containment rate is, and most can tell you a number. Ask them exactly how they define "containment" and the answers diverge widely. Some count any conversation that didn't explicitly request a human agent. Some count only conversations that ended with the bot marking the issue as resolved. Some count sessions where the user stopped sending messages after a bot response.
These are not the same metric. The differences matter because they predict completely different things about user experience and business outcomes. Getting containment measurement right is the prerequisite for improving it systematically - and the differences between a 55% and an 80% containment rate represent tens of thousands of agent-hours annually for any deployment at scale.
The Right Definition of Containment
Containment should mean: the user's issue was resolved by the automated system without requiring escalation to a human agent, AND the user did not immediately re-contact support for the same issue. Both conditions matter.
The second condition - no immediate re-contact - is what most teams leave out of their measurement. A conversation where the bot gives a confident but wrong answer, the user accepts it, and then submits a new ticket two hours later for the same issue is not a contained conversation. It's a deferred escalation that also wastes the user's time. True containment requires successful resolution, not just non-escalation.
In practice this means measuring containment requires tracking user sessions across a time window - typically 24-48 hours. If a user contacts support again within that window with a similar issue category, the original session retroactively fails containment. This is harder to implement than per-session metrics but produces numbers that correlate meaningfully with user satisfaction scores.
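The windowed check above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `Session` record and its field names (`user_id`, `issue_category`, `started_at`, `resolved`) are hypothetical, and "similar issue" is simplified to an exact category match.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical session record; field names are illustrative.
@dataclass
class Session:
    user_id: str
    issue_category: str
    started_at: datetime
    resolved: bool  # bot marked the issue resolved, no escalation

def contained_sessions(sessions, window=timedelta(hours=24)):
    """Return sessions passing the stricter definition: resolved by the
    bot AND no re-contact by the same user in the same issue category
    within the window. A later re-contact retroactively fails the
    original session."""
    by_user = {}
    for s in sessions:
        by_user.setdefault(s.user_id, []).append(s)
    passed = []
    for s in sessions:
        if not s.resolved:
            continue  # escalated or abandoned: not contained
        recontacted = any(
            other is not s
            and other.issue_category == s.issue_category
            and s.started_at < other.started_at <= s.started_at + window
            for other in by_user[s.user_id]
        )
        if not recontacted:
            passed.append(s)
    return passed
```

Note that the second contact itself can still count as contained if it resolves and is not followed by a third contact; only the session it invalidates fails.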
Where Containment Fails: The Three Root Causes
Fallback rate on intent classification. When the bot can't classify an intent with sufficient confidence, it falls back - either to a generic "I didn't understand" response or directly to escalation. Every fallback is a potential containment failure. Fallback rates above 15% on standard query distributions indicate an intent coverage problem, a training data problem, or an architectural problem with intent handling (see our analysis in building intent graphs instead of intent lists).
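Confidence-gated routing and the fallback-rate measurement it feeds can be sketched as follows. The `classify_intent` callable and the 0.7 threshold are assumptions for illustration; real thresholds should be tuned per deployment.

```python
FALLBACK_THRESHOLD = 0.7  # illustrative value; tune against real traffic

def route(utterance, classify_intent):
    """classify_intent is a hypothetical model call returning
    (intent_name, confidence). Below the threshold we fall back and log
    the utterance so it can be clustered later in a coverage audit."""
    intent, confidence = classify_intent(utterance)
    if confidence < FALLBACK_THRESHOLD:
        return {"action": "fallback", "utterance": utterance}
    return {"action": "handle", "intent": intent}

def fallback_rate(results):
    """Share of routed turns that fell back; above ~0.15 on a standard
    query distribution suggests a coverage or training data problem."""
    falls = sum(1 for r in results if r["action"] == "fallback")
    return falls / len(results)
```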
Context loss in multi-turn conversations. The bot understands each turn individually but fails on turns that require contextual reasoning. User says "fix the issue I mentioned earlier" and the bot has no idea what issue was mentioned earlier. This produces either a fallback or a wrong action - neither of which is contained. Context-related containment failures are disproportionately damaging because they occur later in conversations, after the user has already invested time.
Resolution accuracy on executed actions. The bot correctly identifies the intent, correctly executes the action, but the action doesn't actually fix the user's problem. This is a different kind of failure from the first two - it's not a dialogue management failure but an integration or data quality failure. But from a containment measurement perspective it looks identical: the conversation doesn't result in true resolution.
In our analysis across multiple production deployments, the distribution is roughly: 40% of containment failures come from intent fallbacks, 35% from context loss in multi-turn conversations, and 25% from resolution accuracy issues. The priority order for improvement investments should follow this distribution.
Improving Containment: What Actually Works
Intent coverage audit. Export all fallback conversations from the last 30 days. Cluster them by topic. The largest clusters represent intents you haven't built handlers for. Add intent definitions and training data for the top 5-10 clusters. A single weekend of intent coverage work can reduce fallback rate by 30-40% for deployments that haven't done this before.
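A crude version of the clustering step can be done with word counts alone. This is a sketch: production audits would typically use embedding-based clustering, and the stopword list here is an illustrative stub.

```python
from collections import Counter
import re

# Illustrative stopword stub; extend for real traffic.
STOPWORDS = {"the", "a", "an", "my", "i", "to", "is", "it",
             "can", "how", "do", "you", "of", "for", "please"}

def top_fallback_topics(fallback_utterances, n=5):
    """Count distinct content words per fallback message. The largest
    counts point at topics with no intent handler - candidates for new
    intent definitions and training data."""
    counts = Counter()
    for text in fallback_utterances:
        words = set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS
        counts.update(words)
    return counts.most_common(n)
```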
Context loss diagnosis. Export conversations that ended in escalation or re-contact. Manually review a sample of 50. For each one, identify the turn where the conversation went wrong. If that turn involved a reference to a prior turn ("that," "it," "the issue I mentioned"), the cause is likely a coreference resolution failure. If it involved a topic switch or reference to a prior session, it may be a context persistence issue. Fixing these requires structural changes to context handling, as discussed in our article on why coreference resolution breaks in production chatbots.
Slot-filling completeness check. For intents that require taking an action (booking, cancellation, status check, update), are all required parameters being collected before the action executes? Partially-filled slots that allow an action to proceed with missing information produce wrong outputs that fail resolution. The dialog manager should block action dispatch until all required slots are filled or confirmed.
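The gate described above is simple to express. The slot schema below is an assumption for illustration, not a real product's requirements; the point is that `dispatch` cannot reach `execute` while any required slot is empty.

```python
# Hypothetical required-slot schema per action intent.
REQUIRED_SLOTS = {
    "cancel_order": ["order_id", "cancel_reason"],
    "status_check": ["order_id"],
}

def missing_slots(intent, filled):
    """Required slots not yet collected; dispatch only when empty."""
    return [s for s in REQUIRED_SLOTS.get(intent, []) if filled.get(s) is None]

def dispatch(intent, filled, execute):
    gaps = missing_slots(intent, filled)
    if gaps:
        # Block the action and elicit the first missing parameter
        # instead of executing with incomplete information.
        return {"action": "elicit", "slot": gaps[0]}
    return {"action": "execute", "result": execute(intent, filled)}
```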
Clarification quality. When the bot can't proceed with high confidence, how does it ask for clarification? Vague clarification requests ("I'm not sure I understood - can you rephrase?") have low success rates. Specific clarification requests ("Are you asking about your order from March 12, or a different order?") have much higher success rates. Improving clarification generation requires knowing exactly what information is missing - which requires the entity tracking and intent state representation described above.
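The difference between vague and specific clarification can be made concrete. A minimal sketch, assuming the dialogue state already exposes the missing slot and any candidate entities (the prerequisites named above); the generic prompt is the last resort, not the default.

```python
def clarification(missing_slot=None, candidates=None):
    """Prefer the most specific question the dialogue state allows:
    disambiguate between known candidates, else ask for the one missing
    parameter, else fall back to a generic rephrase request."""
    if candidates:
        options = ", or ".join(candidates)
        return f"Are you asking about {options}?"
    if missing_slot:
        return f"Could you tell me the {missing_slot.replace('_', ' ')}?"
    return "I'm not sure I understood - can you rephrase?"
```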
A Realistic Containment Improvement Timeline
For a deployment starting at 55-60% containment (which is roughly average for chatbot deployments without structured dialogue management), realistic improvements with focused effort:
- Month 1: Intent coverage audit and expansion. Expected improvement: 5-8 percentage points. Brings most deployments to 63-68%.
- Month 2-3: Context loss analysis and coreference resolution improvements. Expected improvement: 6-10 percentage points. Brings most deployments to 70-78%.
- Month 3-4: Resolution accuracy improvements (integration fixes, data quality). Expected improvement: 3-5 percentage points. Brings most deployments to 74-82%.
These numbers assume the deployment is already using a reasonable base model and has basic intent handling in place. Starting from a worse baseline, or adding structured dialogue management from scratch, can produce larger jumps. The 58% to 79% improvement we described on our about page came from a combination of proper coreference resolution, entity tracking, and intent graph implementation over roughly four months of focused development.

Monitoring Containment Without Overcomplicating It
Containment measurement doesn't need to be a complex data pipeline. The minimum viable measurement system:
- Log every conversation start, escalation event, and explicit resolution event
- For each user, check whether they re-contacted within 24 hours on the same issue category
- Compute: contained = (conversations ending in resolution with no re-contact within 24h) / (all conversations)
- Break down by intent category, conversation length, and first-turn fallback rate to identify where losses are concentrated
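The minimum viable system above reduces to one aggregation pass over logged conversations. The record keys (`intent_category`, `resolved`, `recontact_24h`) are illustrative; any log schema that captures those three facts works.

```python
from collections import defaultdict

def containment_report(conversations):
    """conversations: dicts with hypothetical keys 'intent_category',
    'resolved' (bool), 'recontact_24h' (bool). Returns the overall
    containment rate and a per-category breakdown to show where
    losses concentrate."""
    totals = defaultdict(lambda: [0, 0])  # category -> [contained, total]
    for c in conversations:
        contained = c["resolved"] and not c["recontact_24h"]
        bucket = totals[c["intent_category"]]
        bucket[0] += int(contained)
        bucket[1] += 1
    overall = (sum(v[0] for v in totals.values())
               / sum(v[1] for v in totals.values()))
    by_category = {k: v[0] / v[1] for k, v in totals.items()}
    return overall, by_category
```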
This produces a number you can act on. Tie it to your weekly deployment metrics. When containment drops, the breakdown by category tells you exactly where to look. When you make a change, the next week's metric tells you whether it worked.
Conclusion
Containment rate is the right primary metric for conversational AI deployments, but only when measured correctly. True containment requires successful resolution and no re-contact, not just non-escalation. The three root causes - intent fallbacks, context loss, and resolution accuracy failures - have different remediation strategies that should be pursued in priority order. Teams that measure containment correctly and work through these root causes systematically reach 75%+ within three to four months of focused effort. That's the threshold above which users start reporting positive experiences with chatbot support rather than tolerating it.