Why Coreference Resolution Breaks in Production Chatbots

Coreference resolution fails quietly. No exception is thrown. The API returns a 200. The user gets an answer - just the wrong one. We analyzed 6,000 production dialogue turns across three customer deployments where users complained about bots that "seemed to forget what we were talking about." Three root causes appeared in over 80% of the failures, and they all trace back to one architectural decision made at the start of every dialogue stack: how the system represents conversational context.

What Coreference Resolution Actually Means in Dialogue

In NLP literature, coreference resolution refers to identifying when two expressions in a text refer to the same entity. "Sarah called the office. She said she'd be late." - "She" refers to Sarah in both cases. Most introductory NLP tutorials handle this as a document-level task. Dialogue is different.

In a multi-turn conversation, the referent may have been introduced three turns ago, may have been modified by a subsequent clarification, or may never have been stated explicitly at all - it could be implied by the slot values the user previously filled. A user who said "I need to fly from JFK to LAX on Tuesday" and then says "what if I delay it by a day?" is using "it" to refer to the departure date, not the entire trip, not the arrival time, not the flight itself. Flat intent classification cannot resolve this without a proper entity state representation.

This is not a model quality problem. It is an architecture problem. You cannot train your way out of a missing entity graph.

Root Cause 1: Sliding Window Context

The most common failure pattern we observed was in systems using a sliding window of N prior turns as the model's context input. The window approach is computationally cheap and works well for short exchanges. It breaks systematically when a referenced entity falls outside the window.

In our dataset, 34% of pronoun resolution failures occurred in turns 8-15 of a conversation, precisely the range where entities introduced in turns 1-5 drop out of a standard 5-turn window. The user had been gradually refining their request, the entity was clearly established, but the system lost track of it because the window moved on.

The fix is not a longer window. Longer windows increase latency and hallucination risk, because the model must attend to irrelevant history. The correct solution is a structured entity state: a typed representation of the entities that have been introduced, their current attribute values, and the turn of their last mention. This state object is lightweight - typically under 2KB for a 30-turn conversation - and is passed to the model explicitly alongside the current utterance.
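A minimal sketch of what such a structured entity state might look like. The class and field names here are illustrative assumptions, not the implementation described in this post; the point is that entities carry a type, attribute values, and a last-mention turn instead of living implicitly in a text buffer.

```python
from dataclasses import dataclass, field

@dataclass
class EntityState:
    entity_id: str
    entity_type: str                       # e.g. "flight", "hotel"
    attributes: dict = field(default_factory=dict)
    last_mention_turn: int = 0

@dataclass
class ConversationState:
    turn: int = 0
    entities: dict = field(default_factory=dict)  # entity_id -> EntityState

    def mention(self, entity_id, entity_type, **attrs):
        # Create or update an entity; every mention refreshes its turn stamp,
        # so nothing "falls out of the window" no matter how long the dialogue runs.
        ent = self.entities.setdefault(
            entity_id, EntityState(entity_id, entity_type))
        ent.attributes.update(attrs)
        ent.last_mention_turn = self.turn
        return ent

# The flight example from earlier: "what if I delay it by a day?" can now
# resolve "it" against the departure_date attribute of a tracked entity.
state = ConversationState(turn=1)
state.mention("flight_1", "flight",
              origin="JFK", destination="LAX", departure_date="Tuesday")
```

Because the state is a small typed object rather than raw history, it can be serialized and handed to the model on every turn at negligible cost.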

Root Cause 2: Entity Overwriting Instead of Entity Updating

The second failure pattern appeared in systems that do maintain entity state, but implement it as a simple key-value store with overwrite semantics. When a user says "change my seat to the aisle" and then says "actually, move it to the window instead," a naive state store sets seat=aisle, then seat=window. So far so good.

The problem emerges when the user then asks "what did you change?" The system has no record of the intermediate state. It cannot say "I first set your seat to aisle, then you asked me to switch to window." It can only report the current value. In customer support workflows, this failure destroys trust - users expect the bot to know the conversation history, not just the current state snapshot.

The correct implementation treats entity updates as an event log, not a mutable state snapshot. Each modification is appended with a timestamp and the triggering utterance. The current value is derived by replaying the log. This enables full history queries, revision queries ("undo that last change"), and differential reporting ("here's what changed since we started").
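The event-log pattern can be sketched in a few lines. The `SlotEvent` shape and helper names below are hypothetical; what matters is append-only writes and a derived current value.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class SlotEvent:
    slot: str
    value: str
    utterance: str     # the user turn that triggered the change
    timestamp: str

log: list[SlotEvent] = []

def record(slot, value, utterance):
    # Append-only: no update ever destroys a prior one.
    log.append(SlotEvent(slot, value, utterance,
                         datetime.now(timezone.utc).isoformat()))

def current(slot):
    # The current value is derived by replaying the log; newest entry wins.
    for event in reversed(log):
        if event.slot == slot:
            return event.value
    return None

record("seat", "aisle", "change my seat to the aisle")
record("seat", "window", "actually, move it to the window instead")

# "what did you change?" is now answerable from the full history,
# including the intermediate aisle assignment a mutable store would erase.
history = [(e.slot, e.value) for e in log]
```

Undo is equally cheap under this scheme: drop the last event and replay.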

Root Cause 3: Pronoun Attachment to the Wrong Antecedent

The third failure pattern is the hardest to fix and the most common in practice. When a user introduces multiple entities in a single utterance - "I want to book a hotel and a car; make the car a midsize" - the system must determine that "the car" in the follow-up refers to the car entity, not the hotel. When the user says "cancel it," the system must determine which entity "it" attaches to based on recency, salience, and conversational focus.

Most production systems use simple recency heuristics: attach the pronoun to the most recently mentioned entity. This works for clear cases and fails badly for compound or ambiguous ones. In our 6,000-turn dataset, recency-only attachment produced the wrong antecedent in 22% of cases where two entities had been mentioned in the prior two turns.

Salience-based resolution significantly outperforms recency alone. Entities that are the subject of the current conversational focus - typically the one that was last acted on, the one being questioned, or the one that triggered the current clarification - should be weighted higher regardless of recency. We implement this through a focus stack that tracks which entity is "in scope" at each turn, separate from which entity was mentioned last.
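A toy version of salience-weighted attachment, assuming a simple additive score; the weights and scoring rule are illustrative, not the production heuristic.

```python
def resolve_pronoun(candidates, focus_stack, current_turn,
                    focus_weight=2.0, recency_weight=1.0):
    """candidates: list of (entity_id, last_mention_turn) pairs.
    focus_stack: entity ids, most recently acted-on entity last."""
    def score(entity_id, last_turn):
        # Recency decays with distance in turns...
        s = recency_weight / (1 + current_turn - last_turn)
        # ...but the entity currently in conversational focus gets a
        # bonus large enough to override a more recent mere mention.
        if focus_stack and focus_stack[-1] == entity_id:
            s += focus_weight
        return s
    return max(candidates, key=lambda c: score(*c))[0]

# "I want to book a hotel and a car; make the car a midsize."
# The hotel was mentioned more recently (turn 4), but the car was the
# last entity acted on, so it sits on top of the focus stack.
candidates = [("hotel_1", 4), ("car_1", 3)]
focus = ["hotel_1", "car_1"]
resolved = resolve_pronoun(candidates, focus, current_turn=5)
# "cancel it" attaches to the car despite the hotel's fresher mention.
```

A recency-only resolver would pick the hotel here; the focus stack is what flips the decision.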

The False Promise of LLM-as-Dialogue-Manager

The response to these failures in recent years has been to replace structured dialogue systems with large language models used directly as dialogue managers. The logic is appealing: LLMs are trained on massive corpora of human dialogue and should handle pronoun resolution implicitly without explicit entity state.

This works in demos and fails in production for one specific reason: consistency. An LLM resolving coreferences implicitly will produce different resolutions for the same input depending on small variations in context formatting, temperature settings, or even which tokens appeared earlier in the prompt. A structured entity state produces deterministic resolutions. When a customer service SLA requires that the system not contradict itself across turns, determinism matters more than raw resolution accuracy.

The production-grade approach combines both: use the LLM for understanding and generation, but maintain explicit entity state outside the model. The model reads the state, the model writes proposed updates, and a deterministic state engine validates and commits those updates. Coreference resolution happens at the state engine level, not inside the LLM's attention mechanism.
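One way to sketch that division of labor. The schema, update format, and class names are assumptions for illustration: the LLM emits a structured update proposal, and a deterministic engine validates it against a slot schema before committing.

```python
# Hypothetical slot schema: which attributes each entity type may carry.
SCHEMA = {"flight": {"origin", "destination", "departure_date", "seat"}}

class StateEngine:
    def __init__(self):
        self.entities = {}   # entity_id -> {"type": ..., "attrs": {...}}

    def commit(self, proposed):
        """proposed: {"entity_id": ..., "type": ..., "updates": {...}},
        as the LLM might emit it via structured output."""
        allowed = SCHEMA.get(proposed["type"])
        if allowed is None:
            raise ValueError(f"unknown entity type {proposed['type']!r}")
        bad = set(proposed["updates"]) - allowed
        if bad:
            # Deterministic rejection: the model cannot invent slots.
            raise ValueError(f"rejected unknown slots: {sorted(bad)}")
        ent = self.entities.setdefault(
            proposed["entity_id"], {"type": proposed["type"], "attrs": {}})
        ent["attrs"].update(proposed["updates"])
        return ent

engine = StateEngine()
engine.commit({"entity_id": "flight_1", "type": "flight",
               "updates": {"seat": "window"}})
```

The same proposal always produces the same committed state, which is what makes the combined system deterministic even though the model upstream is not.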

Measuring Coreference Resolution Quality

Standard NLP benchmarks like CoNLL-2012 measure coreference at the document level. They are not useful for evaluating production dialogue systems. The metrics that matter in production are different: per-turn antecedent accuracy (what percentage of pronouns and definite descriptions attached to the correct entity), cross-turn resolution rate (what percentage of references to entities from three or more turns ago resolved correctly), and correction handling rate (when a user says "no, I meant the other one," does the system update the correct entity).
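The three metrics above reduce to simple ratios over annotated turns. A sketch, with field names assumed for illustration; each turn record compares the system's resolved antecedent against a human-annotated gold label.

```python
def coref_metrics(turns):
    """turns: list of dicts with keys 'predicted', 'gold',
    'antecedent_distance' (turns back to the antecedent),
    and 'is_correction' (user issued a 'no, I meant...' repair)."""
    def rate(subset):
        if not subset:
            return None
        return sum(t["predicted"] == t["gold"] for t in subset) / len(subset)
    return {
        # All pronouns / definite descriptions, any distance.
        "antecedent_accuracy": rate(turns),
        # Only references reaching back three or more turns.
        "cross_turn_resolution": rate(
            [t for t in turns if t["antecedent_distance"] >= 3]),
        # Only turns where the user corrected a prior resolution.
        "correction_handling": rate(
            [t for t in turns if t["is_correction"]]),
    }
```

Slicing one annotated sample three ways like this is what lets a weekly 500-turn audit produce all three numbers at once.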

We measure all three continuously in production. Antecedent accuracy in our stack runs at 94.2% across all active deployments. Cross-turn resolution, which most competing systems do not measure at all, runs at 89.7%. Correction handling - the hardest case - runs at 91.1%. These numbers come from comparing system-resolved references against human-annotated ground truth on a 500-turn weekly sample.

What to Look for When Evaluating a Dialogue Platform

If you are evaluating conversational AI platforms and coreference quality matters for your use case, ask three questions. First: does the platform maintain explicit entity state, or does it rely entirely on the LLM's implicit context? Second: is entity state an event log or a mutable snapshot? Third: what is the platform's measured cross-turn resolution rate on conversations longer than ten turns?

If the vendor cannot answer the third question with a number, they have not measured it. If they have not measured it, they do not know where it fails. As we described in our article on building intent graphs, structured representations outperform statistical correlations precisely in the edge cases that matter most in production.

Conclusion

Three root causes account for most coreference failures in production: sliding window context that drops early entities, overwrite semantics that erase conversation history, and recency-only pronoun attachment. All three trace back to the same architectural shortcut: treating context as a text buffer instead of a structured state object. Fix the architecture, and resolution quality follows. Add more parameters to a broken architecture, and you get better-sounding failures.

If you want to see how Equmenopolis's entity state engine handles these cases, the Platform page walks through the implementation. The developer plan includes full access to entity state inspection at no cost.