The LLM Narrates, the Core Decides: Anatomy of an AI Checkout

By Sam Wen, CEO & Founder of XInfer.AI — July 2, 2026

How we built a checkout that runs the same protocol in a chat window, a voice session, and a live phone call.

One word, many moves

On the AI storefront platform I work on, a customer can type "checkout" into a chat window — or say it to a voice agent or a phone line — and, a short conversation later, they have an order number. The transcript looks unremarkable: a question about special instructions, maybe a contact form, a quick "Is everything correct?", done.

Beneath that brief exchange lies the most carefully engineered path in the system. Checkout is where a language model, several tools, an order engine, and a human who may be walking down the street all have to agree on what just happened — and where any disagreement costs real money.

This post explains our approach: what it is, why the constraints force this particular shape, and how the pieces move together in text and in voice. Then, because a design only means something relative to its alternatives, I'll compare it with the two other ways this is usually built.

Part I: The approach

The whole design reduces to one sentence: the language model conducts the conversation, but a deterministic core makes every decision.

Concretely, checking out involves two tools the model can call:

shopping_cart — cart operations, with submit as the consequential one.
collect_user_info — a small sub-protocol for contact details, with three actions: show (open a form, possibly prefilled), update (fill in fields), and confirm (save and close).

When the model calls shopping_cart(action="submit"), the request lands in a single server-side function shared by every channel. That core resolves who is buying, decides whether anything still needs to be asked, and answers with a typed outcome signal rather than prose:

NEED_ORDER_DETAILS — you submitted before asking about the order itself; go ask.
NEED_CUSTOMER_INFO — no usable contact information; open the form.
CONFIRM_CUSTOMER_INFO — contact is on file from before; read it back and get a yes.
SUBMITTED — the order exists; here is its number.
A failure — with a reason the model can relay.

The model's system prompt contains no customer data at all, only the protocol for reacting to each signal. And every resubmission carries accumulated flags: orderDetailsConfirmed=true once the order questions have been asked, contactConfirmed=true once the customer has approved their on-file details. The flags are how a multi-turn negotiation avoids going in circles: anything already settled travels forward, so no step is repeated, and no step can be skipped, because the core simply won't proceed without the flag.
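To make the signal-and-flag contract concrete, here is a minimal sketch of a submit decision in Python. It is an illustration under stated assumptions, not the platform's actual code: the names Signal, Outcome, submit, FAILED, and the place_order hook are hypothetical, and the order of the checks is simply the order in which the signals are listed above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Signal(Enum):
    NEED_ORDER_DETAILS = "NEED_ORDER_DETAILS"
    NEED_CUSTOMER_INFO = "NEED_CUSTOMER_INFO"
    CONFIRM_CUSTOMER_INFO = "CONFIRM_CUSTOMER_INFO"
    SUBMITTED = "SUBMITTED"
    FAILED = "FAILED"  # hypothetical name for the failure outcome


@dataclass(frozen=True)
class Outcome:
    signal: Signal
    payload: dict = field(default_factory=dict)


def submit(
    order_details_confirmed: bool,
    contact_confirmed: bool,
    contact: Optional[dict],
    contact_needs_confirmation: bool,
    place_order: Callable[[dict], str],  # hypothetical order-engine hook
) -> Outcome:
    """Answer a submit call with a typed outcome; the flags gate each step."""
    if not order_details_confirmed:
        return Outcome(Signal.NEED_ORDER_DETAILS)
    if contact is None:
        return Outcome(Signal.NEED_CUSTOMER_INFO)
    if contact_needs_confirmation and not contact_confirmed:
        # The details travel in the result so the model can read them back.
        return Outcome(Signal.CONFIRM_CUSTOMER_INFO, dict(contact))
    try:
        order_number = place_order(contact)
    except Exception as exc:
        # A failure carries a reason the model can relay.
        return Outcome(Signal.FAILED, {"reason": str(exc)})
    return Outcome(Signal.SUBMITTED, {"order_number": order_number})
```

Because every branch returns a value instead of prose, each rule can be pinned by a plain unit test, and a step cannot be skipped unless its flag arrives with the call.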

That's the approach. The model narrates; the core decides. Everything else in this post is the reasoning behind that split and the machinery that makes it work.

Part II: Why this shape

Four constraints pushed us here, and I think at least three of them apply to anyone building an agent that transacts.

1. Channels multiply everything you fork. The same store sells through web chat, an embeddable widget, a mobile app, in-app voice, real phone calls, and SMS. Early on, each channel had its own glue between the model and the business logic — and the glue drifted. An audit found the kinds of bugs drift produces: an order lookup that skipped its ownership check on one channel but not another, phone-only customers who could buy in text but were refused by voice, and one channel that reported failures as if they had succeeded. None of these were design mistakes; they were synchronization mistakes. The fix wasn't more review discipline — it was making sure a decision exists in exactly one place, so there is nothing to keep in sync.

2. The voice prompt is frozen. Our voice agents run on third-party conversational AI suppliers, and their system prompts are compiled into configurations that persist on the suppliers' systems. You cannot inject "this customer's name is Pat" into the prompt at call time. This kills the most common trick in LLM app design — personalizing the prompt — and forces a cleaner one: the prompt stays static and describes only the protocol; anything per customer rides in the tool results. A prompt written weeks ago handles today's caller because today's caller's details arrive as data, not as instructions.

3. A phone call has no screen. In text chat, "collect the customer's info" can mean rendering a form. On a phone call, there is no form, no button, nothing to tap. Every step of the protocol, therefore, needs a purely conversational equivalent — which turns out to be a forcing function for good design: if a step can't survive being spoken aloud, it was probably leaning on UI to paper over ambiguity.

4. The model cannot be trusted with money. Not as an insult — as an engineering fact. A language model will occasionally fill in a customer's email address with something it saw earlier in the conversation, announce an order as placed before anything has happened, or retry a timed-out submission, creating the order twice. These aren't edge cases you prompt away; they're the model doing what models do. The decisions that cost money — whose contact to use, whether to place the order, whether this submission already happened — have to live where they can be made deterministically and tested.

Part III: How it works

The choreography in text

Here is a first-time guest's checkout, annotated with what actually happens:

Customer: checkout

Model:  "Any special notes for your order?"          ← protocol: always ask first
Customer: no, that's it

Model → shopping_cart(submit, orderDetailsConfirmed=true)
Core:   resolve contact → nothing anywhere → NEED_CUSTOMER_INFO

Model → collect_user_info(show)                       ← a contact form renders in chat
Customer: fills in name, email, phone; taps Confirm   ← saved to their profile

Model → shopping_cart(submit, orderDetailsConfirmed=true)
Core:   contact was entered in THIS conversation → proceed
        → order engine → SUBMITTED, order #1042

Model:  "Order #1042 placed — 2 items, $34.50.
         A confirmation email is on its way."

Two things to notice. The model never asked for the customer's email in free-form chat — a structured surface was available, so the form collected it, validated it, and normalized it. (On a phone call no such surface exists and asking out loud is all there is — we'll come back to what stays structured even then.) And when the second submit went through, it still carried orderDetailsConfirmed=true: settled things stay settled.

Now the same customer, back next week:

Model → shopping_cart(submit, orderDetailsConfirmed=true)
Core:   contact found ON FILE from a previous visit
        → not yet a trusted repeat buyer
        → CONFIRM_CUSTOMER_INFO {name, email, phone}

Model:  "I have you down as Pat Lee, pat@example.com,
         (555) 010-3985 — should I change anything?"
Customer: Actually, use my work email

Model → collect_user_info(show)                       ← form opens, prefilled
Customer: edits the email, confirms

Model → shopping_cart(submit, orderDetailsConfirmed=true,
                       contactConfirmed=true)
Core:   → SUBMITTED

The on-file values traveled inside the tool result — the model read them back rather than remembering them. That detail seems small in text. It's the entire ballgame in voice.
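As a rough illustration of values traveling in the tool result instead of the prompt, the sketch below keeps the prompt as a constant that names only signals and reactions, and puts the customer's details in a per-call result. The JSON shape, the tool_result helper, and the prompt wording are all assumptions made for this example.

```python
import json

# Written once and stored with the agent configuration.
# It describes reactions to signals and never names a customer.
STATIC_PROMPT = """\
When a tool result has signal CONFIRM_CUSTOMER_INFO, read back the name,
email, and phone from the result and ask whether anything should change.
When it has signal NEED_CUSTOMER_INFO, call collect_user_info with action "show".
"""


def tool_result(signal: str, **data: str) -> str:
    """Serialize one tool result; per-customer details live only here."""
    return json.dumps({"signal": signal, "data": data}, sort_keys=True)


result = tool_result(
    "CONFIRM_CUSTOMER_INFO",
    name="Pat Lee",
    email="pat@example.com",
    phone="(555) 010-3985",
)
```

The prompt can be weeks old and still serve today's customer, because nothing about the customer was ever written into it.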

Voice: same sheet, no hands

Nothing about the protocol changes for voice — that is the point. The signals are identical; the difference is how each side performs its role.

When the core answers CONFIRM_CUSTOMER_INFO, the result carries the name, email, and phone number for the agent to speak. When the customer wants a change, there's no keyboard, so the form fills by voice: the agent calls collect_user_info(update) with each field as the customer says it, then confirm to save and close. On our web and app surfaces, the customer actually watches the on-screen form populate as they talk — same form the text flow uses, different hands on it.

That gives us three regimes of one protocol:

Text — the customer types into the form and taps Confirm.
Voice with a screen — the agent fills the form by update calls while the customer watches.
A phone call — no screen exists; show renders nothing, and the whole exchange is spoken read-back and correction.

The third regime deserves a caveat, because it's where "the model never asks in free text" stops being literally true. On a phone call, collection genuinely is free-form — the agent asks for the email out loud, and the customer says it. What stays structured is the capture: each spoken answer is immediately pushed through collect_user_info(update), field by field, validated and normalized server-side, then read back for correction. The channel changes how the question is asked; it never changes where the answer lives. The transcript is not the database.
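A minimal sketch of that structured capture, assuming a hypothetical update helper on the server side: each spoken answer is validated and normalized before it touches the form, and the function returns a line for the agent to read back. The validation rules here (a simple email pattern, a seven-digit minimum for phone numbers) are placeholders, not the platform's real rules.

```python
import re
from typing import Dict, Tuple

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_field(name: str, spoken: str) -> str:
    """Validate and normalize one field; raise ValueError if unusable."""
    value = spoken.strip()
    if name == "email":
        value = value.lower()
        if not EMAIL_RE.match(value):
            raise ValueError("that email address does not look complete")
        return value
    if name == "phone":
        digits = re.sub(r"\D", "", value)
        if len(digits) < 7:  # placeholder threshold
            raise ValueError("that phone number looks too short")
        return digits
    if name == "name":
        if not value:
            raise ValueError("the name is empty")
        return " ".join(value.split())
    raise ValueError(f"unknown field: {name}")


def update(form: Dict[str, str], name: str, spoken: str) -> Tuple[Dict[str, str], str]:
    """Return the updated form plus a line for the agent to speak."""
    try:
        value = normalize_field(name, spoken)
    except ValueError as exc:
        # Nothing is stored; the agent asks again.
        return form, f"Sorry, {exc}. Could you repeat it?"
    return {**form, name: value}, f"I have your {name} as {value}. Is that right?"
```

The agent only ever speaks what the server stored, so a misheard value surfaces in the read-back instead of in the order.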

One choreography sheet, three stagings. And because the voice prompt is static, the supplier-side agent configuration never needs to know who is calling — the caller's identity resolves server-side and rides the tool plumbing.

Inside the core

The decisions the core keeps for itself are where most of the design lives.

Trust is gated by provenance, not presence. When the core resolves contact information, it doesn't just ask Do I have it? — it asks Where did it come from? An explicit argument means the model was told this turn. A conversation session means the customer entered it minutes ago, in this very chat — don't insult them by re-asking. A record on file means it's from a previous visit — read it back and confirm before spending it. Same three fields, three different levels of consent. The presence of data is not permission to use it.
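One way to sketch provenance-gated resolution is shown below. The names (Provenance, ResolvedContact, resolve_contact) are hypothetical, and two choices are assumptions of this example: the precedence order (explicit argument, then session, then on file) and the rule that only on-file details require a read-back.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Provenance(Enum):
    EXPLICIT_ARGUMENT = "explicit_argument"  # passed on this tool call
    SESSION = "session"                      # entered in this conversation
    ON_FILE = "on_file"                      # saved during a previous visit


@dataclass(frozen=True)
class ResolvedContact:
    contact: dict
    provenance: Provenance

    @property
    def needs_confirmation(self) -> bool:
        # Only details from an earlier visit are read back before use.
        return self.provenance is Provenance.ON_FILE


def resolve_contact(
    explicit: Optional[dict],
    session: Optional[dict],
    on_file: Optional[dict],
) -> Optional[ResolvedContact]:
    """Pick the freshest source first; None means nothing usable exists."""
    for contact, provenance in (
        (explicit, Provenance.EXPLICIT_ARGUMENT),
        (session, Provenance.SESSION),
        (on_file, Provenance.ON_FILE),
    ):
        if contact:
            return ResolvedContact(contact, provenance)
    return None
```

The caller never receives bare contact fields: the source tag travels with them, so the question of where the data came from cannot be forgotten downstream.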

Earned trust has a number. Re-confirming contact on every order would punish regulars with friction. So there's an explicit bar: if a customer's three most recent orders carry identical contact details, the confirmation step is skipped. The comparison happens after normalization, because PAT@x.com must match pat@x.com and (555) 010-3985 must match 5550103985. Below the bar, we ask. The threshold isn't clever; the point is that it's explicit, and a unit test can hold it in place.
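The bar described above can be sketched in a few lines. This is an illustrative version with hypothetical names; the email and phone normalization follows the examples in the text, while case-folding the name is an extra assumption.

```python
import re
from typing import Dict, List, Tuple

TRUSTED_ORDER_COUNT = 3  # the explicit bar: three most recent orders


def normalize_contact(contact: Dict[str, str]) -> Tuple[str, str, str]:
    """Reduce a contact to a comparable (name, email, phone) key."""
    name = " ".join(contact.get("name", "").split()).casefold()  # assumption
    email = contact.get("email", "").strip().lower()
    phone = re.sub(r"\D", "", contact.get("phone", ""))
    return name, email, phone


def is_trusted_repeat_buyer(orders_newest_first: List[Dict[str, str]]) -> bool:
    """True when the most recent orders all used the same normalized contact."""
    recent = orders_newest_first[:TRUSTED_ORDER_COUNT]
    if len(recent) < TRUSTED_ORDER_COUNT:
        return False  # below the bar: ask
    return len({normalize_contact(c) for c in recent}) == 1
```

With the threshold as a named constant, changing the policy is a one-line diff that a test will notice.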

Evidence gets consumed. One subtlety earned its scars: the "entered in this conversation" session is an evidence token, vouching that the customer just told us who they are. We shipped a version where the token outlived the order it vouched for, and a second order in the same conversation sailed through with no contact interaction at all, riding the first order's evidence. The fix: a successful order spends the token (a failed one keeps it, so retries don't re-ask). Note that spending it never sends the customer back to a blank form: by then their details are on file, so the next order in the same chat gets the one-line read-back, not re-entry. The fix took ten lines and two tests, and it was findable in one debugging session precisely because there was exactly one place to look.
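A minimal sketch of the spend-on-success rule, assuming a hypothetical Session object that holds the token and a place_order hook that raises on failure:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Session:
    """Per-conversation state; contact_evidence is the single-use token."""
    contact_evidence: Optional[dict] = None


def submit_with_evidence(
    session: Session,
    place_order: Callable[[dict], str],  # hypothetical order-engine hook
) -> Optional[str]:
    """Place an order on session evidence; spend the token only on success."""
    if session.contact_evidence is None:
        return None  # caller falls back to the on-file read-back or the form
    # If this raises, the line below never runs and the token survives,
    # so a retry does not re-ask the customer.
    order_number = place_order(session.contact_evidence)
    session.contact_evidence = None  # spent: it vouched for this order only
    return order_number
```

The ordering of the last two statements is the whole fix: the token is cleared only after the order engine has returned.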

An order is claimed, not just created. Submission begins with an atomic claim on the cart, so a timed-out request that gets retried replays the same order instead of creating a twin. The model can be as retry-happy as it likes; the core is idempotent.
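The claim-then-create idea can be illustrated with an in-memory stand-in. This sketch uses a lock and a dictionary; a production system would more likely rely on a database constraint or a conditional write, and the OrderCore class and create_order hook are hypothetical names.

```python
import threading
from typing import Callable, Dict


class OrderCore:
    """Maps each cart to at most one order, even under retries."""

    def __init__(self, create_order: Callable[[str], str]) -> None:
        self._create_order = create_order
        self._orders_by_cart: Dict[str, str] = {}
        self._lock = threading.Lock()

    def submit(self, cart_id: str) -> str:
        # Holding the lock across creation keeps the sketch simple;
        # it serializes all carts, which a real system would avoid.
        with self._lock:
            existing = self._orders_by_cart.get(cart_id)
            if existing is not None:
                return existing  # retry: replay the same order
            order_number = self._create_order(cart_id)
            self._orders_by_cart[cart_id] = order_number
            return order_number
```

A retried submit for the same cart returns the order number from the first attempt and never reaches the order engine a second time.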

Part IV: Two alternatives, for comparison

Alternative 1: the workflow wizard

The traditional answer: a fixed state machine — cart → details → contact → confirm → pay — with the LLM reduced to narrating each step. It's deterministic and safe, and it's how most checkout flows were built for twenty years.

It dies on contact with conversation. A customer who asks "wait, does that come in a larger size?" mid-checkout is off-script; the wizard has no state for it. "Use my work email instead" is an edit the flow didn't anticipate. And the wizard is coupled to its medium — a form-based state machine has no natural phone-call rendering, so each channel grows its own variant, and you're back to N copies drifting.

Alternative 2: the free-form agent

The fashionable answer: give the model the tools and the conversation, and let it decide. Prompt it with guidelines — "always confirm contact details before ordering" — and trust it.

The conversation is genuinely wonderful. The decisions are not. The model fills in an email it saw two hundred messages ago. It announces "your order is placed!" one tool-call too early. It confirms contact details on Tuesday and forgets to on Thursday, because guidelines in a prompt are sampled, not executed. There's no idempotency, because retries are a concept the model doesn't reliably hold. And you can't write a unit test for behavior that lives in sampling temperature — every prompt tweak becomes a release risk for your revenue path.

Side by side

|  | Workflow wizard | Free-form agent | Protocol over a core (ours) |
| --- | --- | --- | --- |
| Conversation quality | Scripted, brittle | Natural | Natural |
| Decision safety | High | Sampled; varies | High (decisions are never sampled) |
| Channel portability | One wizard per channel | High, but each channel drifts | One core, thin adapters |
| Testability | High | Poor | Core is pure functions under ~2,000 unit tests |

Our approach aims to preserve the wizard's decisions and the agent's conversation. The price is real: someone has to design the protocol — enumerate the outcomes, define the flags, decide what rides in results versus prompts. That design work is exactly what the other two approaches let you skip, and exactly why they fail.

Part V: What generalizes

If you're building an agent that transacts — orders, bookings, payments, anything with consequences — the shape of the argument travels:

Put decisions in code and conversation in the model. The boundary between them is the most important interface in your system; make it explicit.
Typed outcomes are that interface. The model should react to signals, not parse prose or improvise state.
Keep prompts static; let data ride the tool results. You may be forced into this by frozen voice configs, as we were — but it's worth doing anyway. It's what makes one protocol portable across channels.
Gate trust on provenance and recency. That you have a customer's details and that you may use them are different facts.
Give trust thresholds explicit numbers, so tests can pin them down.
Single-use evidence must be consumed. Anything that vouches for "the customer just did X" needs to be spent when it's used.
Write the decision layer as pure functions. Ours sit under a couple of thousand unit tests; the conversational layer changes weekly, and the decision layer doesn't flinch.

The quiet payoff shows up later. The same choreography sheet runs a chat widget and a live phone line, unchanged. And when a real bug appeared in the trust logic, the fix was ten lines — not because we're careful, but because there was one place to be careless in.