
Building Intent Graphs Instead of Intent Lists

[Figure: intent graph architecture diagram]

Every dialogue system tutorial starts with intent classification: train a model to map an utterance to one of N predefined categories. "Book a flight" maps to BookFlight. "Cancel my reservation" maps to CancelReservation. This works for single-action utterances. The moment users start combining requests in one turn - which they do constantly in practice - flat intent classification produces wrong answers with high confidence. The fix is not better training data. It is a different representation: intent as a graph, not a list.

The Problem with Flat Intent Classification

Flat intent lists assume one intent per utterance. Every multi-class classifier produces exactly one label, or at most a ranked list of labels with scores. When a user says "Book me a flight to Denver and a hotel for three nights," a classifier might return BookFlight with 0.87 confidence, which is arguably correct - but it drops the hotel booking entirely. The system responds to the flight request, books it, confirms, and moves on. The user repeats the hotel request. The system treats it as a new turn.

This is not a low-probability edge case. In customer service dialogues, 28% of utterances in our production dataset contained more than one distinct action request in a single turn. Calls handling travel, commerce, and account management are particularly dense with compound requests. Flat classification fails on more than a quarter of turns by design.

Multi-label classification is the obvious first attempt at a fix. Instead of returning one label, return all labels with scores above a threshold. This handles "book a flight AND a hotel" correctly. It breaks on utterances where the intents are not independent: "Cancel my flight and rebook it for Thursday" requires understanding that the rebooking is contingent on the cancellation, not a parallel independent action. Multi-label classification has no mechanism to represent dependency or ordering between co-occurring intents.
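To make the limitation concrete, here is a minimal sketch of thresholded multi-label output for the cancel-and-rebook utterance. The scores and the 0.5 threshold are illustrative, not taken from any model described in this article:

```python
# Hypothetical multi-label scores for "Cancel my flight and rebook it for Thursday".
scores = {"CancelFlight": 0.91, "BookFlight": 0.84, "BookHotel": 0.02}
predicted = {label for label, s in scores.items() if s >= 0.5}

# Both intents are detected, but the result is an unordered set:
# nothing records that BookFlight is contingent on CancelFlight succeeding.
assert predicted == {"CancelFlight", "BookFlight"}
```

The set is the ceiling of what multi-label classification can express: membership, with no ordering, dependency, or exclusivity between members.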

What an Intent Graph Looks Like

An intent graph represents an utterance as a directed graph where nodes are intent instances and edges are typed relations between them. Each intent node carries the full slot values extracted for that intent. Edges carry relation types: SEQUENTIAL (A must happen before B), CONDITIONAL (B happens only if A succeeds), COORDINATED (A and B happen in parallel), and ALTERNATIVE (A or B, not both).
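A minimal sketch of this representation in Python, with illustrative class and field names (the article does not specify an implementation):

```python
from dataclasses import dataclass, field
from enum import Enum

class Relation(Enum):
    SEQUENTIAL = "SEQUENTIAL"    # A must happen before B
    CONDITIONAL = "CONDITIONAL"  # B happens only if A succeeds
    COORDINATED = "COORDINATED"  # A and B happen in parallel
    ALTERNATIVE = "ALTERNATIVE"  # A or B, not both

@dataclass
class IntentNode:
    intent: str                           # e.g. "BookFlight"
    slots: dict = field(default_factory=dict)

@dataclass
class Edge:
    source: IntentNode
    target: IntentNode
    relation: Relation

@dataclass
class IntentGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

# "Cancel my flight and rebook it for Thursday"
cancel = IntentNode("CancelFlight")
rebook = IntentNode("BookFlight", {"departure_date": "Thursday"})
graph = IntentGraph([cancel, rebook],
                    [Edge(cancel, rebook, Relation.SEQUENTIAL)])
```

Each node carries its own slot dictionary, so the structure holds both what to do and with which arguments, while the typed edge holds how the two actions relate.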

"Cancel my flight and rebook it for Thursday" produces two nodes: CancelFlight and BookFlight. The edge between them is typed SEQUENTIAL. The BookFlight node carries a slot {departure_date: Thursday}. The CancelFlight node carries the slot values from the current booking context. The system executes cancellation first, then, on success, executes the rebooking with the inherited context.

"Book a flight to Denver and a hotel for three nights" produces two nodes: BookFlight with {destination: Denver} and BookHotel with {duration: 3 nights}. The edge is typed COORDINATED. Both can be processed in parallel, and neither depends on the other's success.

"Book a flight or just get me a train ticket if there are no seats" produces two nodes with an ALTERNATIVE edge. The system tries the first; only on failure does it attempt the second. This cannot be expressed as multi-label classification at all.
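The execution semantics the three examples imply can be sketched as a small interpreter. This is a left-to-right pass with edges as (source, target, relation) tuples and string node ids, assuming nodes arrive in dependency order; the production scheduler is surely more involved:

```python
def execute(nodes, edges, run):
    """Execute intent nodes honoring typed-edge semantics.

    `run(node)` performs one intent and returns True on success.
    Edges are (source, target, relation) tuples.
    """
    results = {}
    blocked = set()
    for node in nodes:
        if node in blocked:          # skipped by a successful ALTERNATIVE
            continue
        # SEQUENTIAL / CONDITIONAL: do not run if a predecessor failed.
        preds = [e for e in edges
                 if e[1] == node and e[2] in ("SEQUENTIAL", "CONDITIONAL")]
        if any(not results.get(e[0], True) for e in preds):
            results[node] = False
            continue
        ok = run(node)
        results[node] = ok
        # ALTERNATIVE: a success skips the fallback node entirely.
        if ok:
            for e in edges:
                if e[0] == node and e[2] == "ALTERNATIVE":
                    blocked.add(e[1])
    return results
```

With the flight-or-train example, a successful BookFlight leaves BookTrain untouched; with the cancel-then-rebook example, a failed cancellation blocks the rebooking.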

Graph Extraction from Natural Language

Building the intent graph from a raw utterance is a two-stage process. The first stage is span detection: identify the segments of the utterance that correspond to distinct action requests. This is a token-level sequence labeling problem, not a sentence-level classification problem. A fine-tuned token classifier identifies intent-bearing spans and labels them with intent type. For "cancel my flight and rebook it for Thursday," the model identifies two spans: "cancel my flight" labeled CancelFlight and "rebook it for Thursday" labeled BookFlight.
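Assuming the token classifier emits standard BIO labels (an assumption; the article does not specify the tagging scheme), decoding its output into intent-bearing spans is straightforward:

```python
def decode_spans(tokens, labels):
    """Group BIO token labels (B-Intent / I-Intent / O) into
    (intent, span_text) pairs."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append((current[0], " ".join(current[1])))
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current[1].append(tok)
        else:                         # "O" or a label break: close the span
            if current:
                spans.append((current[0], " ".join(current[1])))
            current = None
    if current:
        spans.append((current[0], " ".join(current[1])))
    return spans

tokens = "cancel my flight and rebook it for Thursday".split()
labels = ["B-CancelFlight", "I-CancelFlight", "I-CancelFlight", "O",
          "B-BookFlight", "I-BookFlight", "I-BookFlight", "I-BookFlight"]
print(decode_spans(tokens, labels))
# [('CancelFlight', 'cancel my flight'), ('BookFlight', 'rebook it for Thursday')]
```

The "and" connective falls outside both spans, which is exactly the material the second stage uses to type the edge.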

The second stage is relation classification: given the identified spans, determine the edge types between them. Discourse connectives like "and," "but first," "or," and "if that works" are strong signals. "And" typically signals COORDINATED. "And then" or "then" signals SEQUENTIAL. "Or" signals ALTERNATIVE. "If" signals CONDITIONAL. These are heuristics; the relation classifier handles ambiguous cases where connectives are absent or misleading.
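The connective priors listed above can be sketched as a first-pass heuristic, with the trained relation classifier (not shown) handling everything this function declines to decide:

```python
def relation_from_connective(between_text):
    """Heuristic edge typing from the discourse connective between two spans.

    Order matters: "and then" must resolve before bare "and".
    Returns None to defer to the trained relation classifier.
    """
    words = between_text.lower().split()
    if "if" in words:
        return "CONDITIONAL"
    if "or" in words or "otherwise" in words:
        return "ALTERNATIVE"
    if "then" in words:
        return "SEQUENTIAL"      # covers both "then" and "and then"
    if "and" in words:
        return "COORDINATED"
    return None
```

A heuristic like this is cheap and covers the unambiguous majority; the classifier earns its keep on connective-free and misleading cases.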

End-to-end accuracy on our internal benchmark: 91.3% for span detection, 87.6% for full graph structure including edge types. The hardest cases are utterances with implicit sequencing ("book the flight then the hotel" - SEQUENTIAL even without "first") and implicit alternatives ("see if there's a direct flight, otherwise connect through Chicago").

Slot Propagation Across Intent Nodes

One advantage of the graph representation is natural slot propagation. When a user says "rebook it," the pronoun "it" refers to the existing booking in context. The rebooking intent node inherits the passenger details, origin, and any other slot values from the prior booking entity. This propagation follows the entity state mechanism we described in our article on coreference resolution: the entity graph tracks which slot values are active in context and which intent nodes should inherit them.

Slot propagation fails in flat architectures because there is no graph to propagate through. Each intent is classified independently, and slot filling starts from scratch for each. The user must re-specify every piece of information the system already knows. With intent graph propagation, the user specifies only the delta - what changed - not the full specification.
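The delta-only principle reduces, at its core, to a merge where inherited values are the base and the user's new utterance overrides. A minimal sketch with illustrative slot names:

```python
def propagate_slots(prior_slots, new_slots):
    """Inherit slot values from the prior booking entity; the user's
    delta overrides. Slot names here are illustrative."""
    merged = dict(prior_slots)   # everything the system already knows
    merged.update(new_slots)     # the user specifies only what changed
    return merged

prior = {"passenger": "A. Rivera", "origin": "SFO",
         "destination": "DEN", "departure_date": "Tuesday"}
# "rebook it for Thursday": only the date is re-specified
rebook = propagate_slots(prior, {"departure_date": "Thursday"})
assert rebook["departure_date"] == "Thursday"
assert rebook["origin"] == "SFO"   # inherited, never re-asked
```

The real mechanism runs through the entity graph described in the coreference article, but the contract is the same: known values flow forward unless overridden.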

This produces measurable efficiency gains in production. In our travel booking deployment, average turns-to-completion dropped from 8.3 to 5.1 after switching from flat intent classification to intent graph processing. Users provided the same information but were not required to repeat already-known values when modifying prior requests.

Handling Intent Correction and Revision

Intent graphs also handle mid-dialogue correction more cleanly than flat architectures. "Actually, change that to Friday instead" is a revision of a prior intent node, not a new intent. In a flat system, this often gets classified as a new BookFlight intent, causing the system to start a new booking flow rather than modifying the in-progress one. In the graph model, revision utterances attach to an existing node and update its slot values rather than creating a new root node.

The graph model also handles intent cancellation: "forget the hotel, just the flight" removes the BookHotel node and the COORDINATED edge attached to it from the graph. The system continues processing only the flight request.

Implementing this cleanly requires that the dialogue state tracks the active intent graph explicitly and that the NLU component can produce graph modifications (add node, remove node, update slot, change edge type) in addition to new graph construction. These modification operations add complexity but eliminate the class of confusion bugs that occur when a user tries to correct a prior request in a stateless flat intent system.
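The four modification operations named above can be sketched as methods on a dialogue-state holder. Node ids and method names are illustrative, not the article's API:

```python
class ActiveIntentGraph:
    """Minimal dialogue-state holder exposing the four graph
    modification operations: add node, remove node, update slot,
    change edge type."""

    def __init__(self):
        self.nodes = {}   # node_id -> {"intent": str, "slots": dict}
        self.edges = {}   # (src_id, tgt_id) -> relation type

    def add_node(self, node_id, intent, slots=None):
        self.nodes[node_id] = {"intent": intent, "slots": dict(slots or {})}

    def remove_node(self, node_id):
        # "forget the hotel": drop the node and every edge touching it
        self.nodes.pop(node_id, None)
        self.edges = {k: v for k, v in self.edges.items() if node_id not in k}

    def update_slot(self, node_id, slot, value):
        # "actually, change that to Friday": revise in place, don't re-create
        self.nodes[node_id]["slots"][slot] = value

    def change_edge(self, src_id, tgt_id, relation):
        self.edges[(src_id, tgt_id)] = relation

g = ActiveIntentGraph()
g.add_node("n1", "BookFlight", {"departure_date": "Thursday"})
g.add_node("n2", "BookHotel", {"duration": "3 nights"})
g.change_edge("n1", "n2", "COORDINATED")
g.update_slot("n1", "departure_date", "Friday")   # mid-dialogue revision
g.remove_node("n2")                               # "forget the hotel"
```

Because every correction maps to one of these operations on the existing graph, the NLU layer never has to guess whether "change that to Friday" is a new booking.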

Performance Considerations

The main objection to intent graph processing is latency. Two-stage extraction (span detection + relation classification) adds a sequential dependency that lengthens the critical path compared to single-pass flat classification. In our production stack, the full pipeline runs in 18ms at P95 for utterances with up to three intent nodes. Single-pass flat classification runs in 4ms. The 14ms overhead is real.

Whether that overhead matters depends on the deployment context. For voice interfaces where total latency budgets are 300ms, an extra 14ms of NLU latency leaves enough headroom. For high-frequency chatbot deployments where you are processing thousands of turns per second and cost-optimizing, the overhead is significant and the flat architecture may be preferable for the 72% of single-intent utterances. We expose both paths in the API: the caller can specify whether to use full graph extraction or fast flat classification based on their latency-accuracy tradeoff.
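One plausible way to wire up such a dual-path API is a thin router. The `mode` parameter, the callback names, and the cheap connective pre-check are all assumptions for illustration; the article only says both paths are exposed and the caller chooses:

```python
def parse(utterance, mode="auto", flat_classify=None, graph_extract=None):
    """Route between fast flat classification and full graph extraction.

    mode="flat"  -> always single-pass classification (4ms path)
    mode="graph" -> always two-stage graph extraction (18ms path)
    mode="auto"  -> cheap connective check before paying the graph cost
    """
    if mode == "flat":
        return flat_classify(utterance)
    if mode == "graph":
        return graph_extract(utterance)
    # "auto": compound-request markers suggest the graph path is worth it.
    compound_markers = {"and", "then", "or", "otherwise", "if"}
    if set(utterance.lower().split()) & compound_markers:
        return graph_extract(utterance)
    return flat_classify(utterance)
```

A pre-check like this lets the majority of single-intent traffic stay on the 4ms path while compound-looking utterances pay for the structure they need.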

When Intent Graphs Are Not the Right Tool

Intent graphs are the right architecture for task-oriented dialogue in domains where compound requests are common and order dependency matters. They are not the right tool for open-domain conversational AI where utterances do not map to structured actions at all. "What do you think about the weather?" has no intent graph. Trying to force open-domain conversation through a task intent classifier produces worse results than a direct language model response.

The practical rule: if your dialogue system needs to execute actions in an external system (book, cancel, query, update), use intent graphs. If your dialogue system is primarily answering questions or generating responses, use a language model with retrieval-augmented generation. The two architectures solve different problems, and the choice is made by the application, not the vendor.

Conclusion

Flat intent classification was the right starting point when dialogue systems were simple command-response engines. Production conversations are not simple. Users combine requests, chain actions, correct prior statements, and expect the system to track all of it. Intent graphs represent these structures correctly. Flat lists do not. The 14ms latency overhead for full graph extraction is a fair price for eliminating an entire category of coordination and sequencing failures that flat architectures cannot handle at all.

Equmenopolis's Platform page includes a live demo of intent graph visualization - you can see the graph structure for your own test utterances before writing a single line of integration code.