Walmart's ChatGPT Checkout vs. Native Site: Why Agent Commerce Converted 3x Worse

Walmart Tried to Sell Inside ChatGPT. It Converted 3x Worse Than Just Sending People to Walmart.com.

Walmart ran a real experiment: let ChatGPT handle the checkout, or redirect shoppers back to Walmart’s own site. The redirect won — by a factor of three. Daniel Danker, who oversees product and design at Walmart, called the instant checkout experience “unsatisfying.” That’s a diplomatic word for a result that probably generated some very undiplomatic conversations internally.

You should care about this even if you’re not Walmart. The experiment is a clean natural test of a question every builder in commerce is going to face: where does the transaction actually belong in an agent-mediated buying journey? The answer has implications for how you architect your product, where you invest in checkout optimization, and whether the “buy inside the AI” pattern is a shortcut or a trap.

The result isn’t a verdict against agent commerce broadly. It’s a specific data point about a specific pattern — instant checkout inside a chat interface — that tells you something important about what agents are actually good at versus what they’re not.

What Walmart Was Actually Testing

The test was simple in structure. A shopper interacts with ChatGPT, expresses intent to buy something from Walmart, and either: (a) completes the purchase inside the ChatGPT interface, or (b) gets handed off to Walmart.com to finish the transaction there.

Option (b) won, decisively.

The framing most people reach for is “the checkout UX was bad” or “people don’t trust buying inside ChatGPT yet.” Both might be true. But the more structural explanation is that the instant checkout pattern misunderstands what a shopping cart actually is.

A cart is not just a list of items. It’s a bundle of context: loyalty points, saved addresses, substitution preferences, delivery windows, bundled promotions, return history, and a relationship with a merchant that the shopper has been building for years. When you move checkout into a chat window, you don’t just move the payment step — you sever the shopper from all of that accumulated context. You’re asking them to transact in a stripped-down environment that has none of the infrastructure they’ve come to rely on.

OpenAI’s own follow-up acknowledged this directly. The company said the initial version of instant checkout “did not offer the flexibility it wanted to provide,” and shifted toward letting merchants use their own checkout experiences while ChatGPT focused on product discovery. That’s not a minor product tweak. That’s a fundamental repositioning of where ChatGPT sits in the commerce stack.

The Structural Problem With Checkout Inside the Chat Window

There’s a pattern in software where you take something that works well in one context and try to embed it in another context because the embedding feels convenient. It usually fails for the same reason: the original thing worked because of its context, not despite it.

Walmart’s checkout works because it sits inside a surface that knows who you are. It knows your membership status, your saved payment methods, your delivery address, your order history, your preferred substitutions when something is out of stock. The checkout page is the last step in a long chain of trust-building and preference-capture.

ChatGPT’s instant checkout was trying to be that last step without any of the preceding chain. The shopper’s context — their loyalty program, their cart, their merchant relationship — lived somewhere else. The payment credential had to be re-entered or re-authorized. The return policy was abstract rather than embedded in a known relationship. The experience felt thin because it was thin.

This is also why Stripe’s Link wallet for agents is architecturally interesting in a different way than instant checkout. With Stripe’s approach, a user grants programmatic access to Link, the agent creates a spend request, the user approves it, and Link returns either a one-time card or a shared payment token, so the agent never sees raw credentials. That’s not trying to replace the merchant’s checkout surface. It’s trying to give the agent a safe way to interact with checkout surfaces that already exist. It is the adapter model, not the replacement model.
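To make the request-approve-token sequence concrete, here is a minimal sketch in Python. It is not Stripe's API: `HypotheticalWallet`, `SpendRequest`, and every method name are invented for illustration, and the token format is an assumption. The only point it demonstrates is the shape of the flow, in which the agent asks, the user approves, and the agent gets back a single-use token rather than the card itself.

```python
from dataclasses import dataclass, field
from enum import Enum
import secrets


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass
class SpendRequest:
    merchant: str
    amount_cents: int
    status: Status = Status.PENDING


@dataclass
class HypotheticalWallet:
    # Hypothetical stand-in for a wallet service (not a real Stripe class).
    # The real card stays inside the wallet and is excluded from repr().
    card_number: str = field(repr=False)

    def create_spend_request(self, merchant: str, amount_cents: int) -> SpendRequest:
        # Called by the agent: describes the purchase, moves no money.
        return SpendRequest(merchant, amount_cents)

    def approve(self, request: SpendRequest) -> None:
        # Called by the human user, outside the agent's control.
        request.status = Status.APPROVED

    def issue_token(self, request: SpendRequest) -> str:
        # The agent receives only a single-use token, never the card number.
        if request.status is not Status.APPROVED:
            raise PermissionError("spend request has not been approved by the user")
        return "tok_" + secrets.token_hex(8)
```

The design choice worth noticing is that approval is a separate call the agent cannot make on its own behalf, which is what keeps the merchant's existing checkout in charge of the actual transaction.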

Discovery vs. Transaction: Two Different Jobs

The Walmart result points at a distinction that’s easy to blur: discovery and transaction are different jobs, and they have different optimal surfaces.

ChatGPT is genuinely good at discovery. A shopper can describe what they want in natural language — “I need a birthday gift for a 7-year-old who’s obsessed with space” — and get a useful, reasoned response that surfaces options they might not have found through keyword search. The agent can translate fuzzy human intent into a structured purchasing brief. That’s real value.

But “I found the thing I want” and “I’m ready to complete this transaction” are different moments with different requirements. The discovery moment benefits from the agent’s ability to reason across options. The transaction moment benefits from the merchant’s ability to leverage accumulated context: saved preferences, loyalty status, delivery infrastructure, return policies, dispute resolution.

When you collapse both into the chat window, you get the worst of both: a discovery interface that’s also trying to be a transaction interface, without the context that makes transactions feel safe and complete. This dynamic shows up in model behavior too — the same reasoning capability that makes an AI useful for comparing options across complex tradeoffs doesn’t automatically translate into being a trustworthy transaction executor.

This is why OpenAI’s pivot makes sense. ChatGPT as a discovery layer that hands off to merchant checkout is a coherent product. ChatGPT as an end-to-end commerce surface is a much harder problem — one that requires rebuilding, inside the chat interface, all the context infrastructure that merchants have spent years accumulating.
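One way to picture the handoff is as a link that carries the agent's structured purchasing brief into the merchant's own cart. The sketch below is an assumption-heavy illustration: the `item` and `src` query parameters, the `sku:quantity` encoding, and the `merchant.example` domain are all hypothetical, and a real merchant would define its own deep-link format.

```python
from urllib.parse import urlencode


def build_handoff_url(checkout_base: str, items: dict[str, int],
                      source: str = "chat-agent") -> str:
    """Encode a purchase brief as a link into the merchant's own checkout.

    `items` maps a SKU to a quantity. The parameter names are illustrative,
    not any merchant's real deep-link scheme.
    """
    params = [("item", f"{sku}:{qty}") for sku, qty in items.items()]
    params.append(("src", source))  # lets the merchant attribute the session
    return f"{checkout_base}?{urlencode(params)}"


url = build_handoff_url("https://merchant.example/cart", {"SKU-1": 2, "SKU-9": 1})
```

Everything the agent learned during discovery travels in the link, while loyalty status, saved addresses, and payment stay on the merchant's side where they already live.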

What This Means for the Broader Agent Commerce Stack

The Walmart result doesn’t mean agent commerce is overhyped. It means the first obvious implementation of agent commerce — “put a buy button in the chat” — was the wrong first move.

The more interesting architecture, which Stripe’s agentic commerce suite is pointing toward, is about making merchant inventory, pricing, policies, and payment readiness legible to agents before the transaction surface ever comes into play. The merchant broadcasts their commercial reality into the surfaces where buyer intent is forming. The agent uses that information to do discovery and comparison. The transaction happens on the surface that’s best equipped to handle it — which, for most established merchants, is their own checkout.
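As a rough illustration of what "legible to agents" could mean in practice, the sketch below models a merchant offer as structured data and shows an agent filtering on it. The `Offer` fields and the `shortlist` function are assumptions made for this example, not a schema from Stripe or any merchant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    # Hypothetical fields an agent would need in order to reason about an offer.
    sku: str
    title: str
    price_cents: int
    in_stock: bool
    delivery_days: int
    return_window_days: int
    checkout_url: str  # where the transaction happens: the merchant's own surface


def shortlist(offers: list[Offer], max_price_cents: int,
              max_delivery_days: int) -> list[Offer]:
    """Keep offers that meet the shopper's constraints, cheapest first."""
    eligible = [
        o for o in offers
        if o.in_stock
        and o.price_cents <= max_price_cents
        and o.delivery_days <= max_delivery_days
    ]
    return sorted(eligible, key=lambda o: o.price_cents)
```

Note that the offer carries a `checkout_url` rather than a payment method: the agent does the comparison, and the purchase still lands on the merchant's checkout.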

Stripe’s machine payments protocol is a new primitive for agent-to-agent payment coordination — not a replacement for merchant checkout, but a way for agents to coordinate the payment step in workflows that don’t map cleanly onto human-facing checkout pages. Stripe Tempo handles stablecoin micropayments for streaming and per-token billing. Stripe Metronome handles precise usage tracking for AI token consumption. These are infrastructure pieces for transaction patterns that didn’t exist before agents — not replacements for the transaction patterns that already work.

The distinction matters for builders. If you’re building a commerce-adjacent agent, the question isn’t “should my agent handle checkout?” The question is “what part of the buying journey does my agent actually improve, and what part should I hand off to surfaces that are already optimized for it?”

If you’re building agents that need to orchestrate complex multi-step workflows — the kind that might involve discovery, comparison, authorization, and eventual purchase across multiple merchants — MindStudio gives you the building blocks to chain those steps visually: 200+ models, 1,000+ integrations, and an agent runtime that lets you define where the handoffs happen without writing the orchestration code from scratch.

The Brand Question the Walmart Test Raises

There’s a subtler implication in the Walmart result that’s worth sitting with. Walmart’s brand — the accumulated trust, the loyalty program, the delivery expectations, the return experience — was an asset that instant checkout couldn’t access. The shopper’s relationship with Walmart lived on Walmart.com, not in the chat window.

This points at something important about how brand works in agent-mediated commerce. In the seller-controlled web, brand does its work at the point of persuasion: the design, the copy, the social proof, the checkout experience. The seller gets to perform the brand every time the shopper arrives.

In agent-mediated commerce, brand increasingly becomes part of the buyer’s memory: their preferences, their trust history, their stated loyalties. The agent doesn’t feel brand loyalty, but it can carry it as a constraint. It can know that a particular shopper prefers a specific merchant, avoids a specific airline, or has had a bad experience with a specific marketplace. That’s not emotional persuasion. It’s a ledger entry.
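The "ledger entry" idea can be sketched in a few lines. This is a toy model under stated assumptions: the ledger is a plain dictionary, the labels `"prefer"` and `"avoid"` are invented for the example, and a real agent would hold far richer context.

```python
def apply_brand_ledger(candidates: list[str], ledger: dict[str, str]) -> list[str]:
    """Drop merchants the buyer avoids and move preferred ones to the front.

    `ledger` maps a merchant name to "prefer" or "avoid" (hypothetical labels).
    Merchants with no entry keep their original relative order.
    """
    allowed = [m for m in candidates if ledger.get(m) != "avoid"]
    # sorted() is stable, so ties keep the order the agent found them in.
    return sorted(allowed, key=lambda m: ledger.get(m) != "prefer")


ranked = apply_brand_ledger(
    ["MerchantA", "MerchantB", "MerchantC"],
    {"MerchantB": "avoid", "MerchantC": "prefer"},
)
```

Nothing here persuades anyone. The merchant's past behavior has already been reduced to an entry that either filters it out or moves it up.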

The implication for merchants is that the work of building brand trust doesn’t go away — it just moves. The question shifts from “how do we make the buyer feel something when they arrive at our site?” to “how do we become the kind of business that ends up as a positive entry in the buyer’s agent’s operating context?” That’s a harder question to answer with a redesigned landing page.

The Fraud Dimension Nobody’s Talking About

One thing the instant checkout debate mostly ignores: the fraud surface that comes with any agent-native payment pattern.

Stripe’s Radar announcement is framed explicitly as a defense against a specific threat: a few thousand humans running millions of agents to steal tokens from AI products. That’s not a hypothetical. It’s already happening at scale.

The economics are different from traditional SaaS fraud. In a conventional SaaS product, one more free trial user cost almost nothing at the margin: they clicked around, maybe exported some data, and left. In an AI product, one more free user burns real compute. A fraudster running agents to steal tokens runs up the company’s inference bill dollar for dollar. The free trial that used to be a cheap customer acquisition tool becomes a direct liability when the “user” is an automated agent running at scale.
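A back-of-the-envelope calculation shows why the marginal cost matters. The figures below are made up for illustration (they do not come from Stripe or any real product), and the function assumes a flat price per million tokens.

```python
def trial_abuse_cost(accounts: int, tokens_per_account: int,
                     cost_per_million_tokens: float) -> float:
    """Compute spend consumed by free-trial accounts, in dollars."""
    total_tokens = accounts * tokens_per_account
    return total_tokens / 1_000_000 * cost_per_million_tokens


# Illustrative numbers only: one million automated sign-ups, each burning
# 50,000 tokens of free-trial quota, at an assumed $2 per million tokens.
loss = trial_abuse_cost(1_000_000, 50_000, 2.0)
```

Under those assumptions the free tier costs $100,000 in compute, for accounts that were never going to convert.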

This is one reason why the “put checkout in the chat” pattern is harder than it looks even if you solve the UX problems. You’re not just building a new transaction surface — you’re building a new fraud surface. Every place where an agent can initiate a payment is a place where a bad actor can try to initiate a fraudulent payment. Stripe Signals extends risk information beyond direct Stripe transactions, and Stripe Project gives merchants access to risk signals across the broader Stripe network. But that infrastructure takes time to mature, and the fraud patterns will evolve alongside the legitimate use cases.

What Builders Should Actually Do With This

The Walmart result is a useful corrective to the “agents will handle everything end-to-end” framing that’s been circulating. But it’s not a reason to ignore agent commerce. It’s a reason to be precise about where agents add value.

Use agents for discovery, not transaction completion. The evidence suggests agents are good at translating fuzzy intent into structured purchasing briefs, surfacing options the shopper wouldn’t have found through keyword search, and doing comparison across alternatives. They’re not yet good at replacing the transaction surface that merchants have spent years optimizing.

Make your commercial reality legible to agents. If your product catalog, pricing, policies, and fulfillment constraints aren’t structured in ways agents can reason about, you’re invisible to the discovery layer. This isn’t SEO for agents — it’s a higher bar. An agent needs to understand what you sell, when you’re relevant, and what the complete commercial path looks like. If you’re building the agent-facing layer of a commerce product, the spec-to-implementation gap matters: Remy takes a direct approach to this problem — you write an annotated markdown spec describing your data model and business rules, and it compiles a complete TypeScript backend, database, auth, and deployment from that spec. The spec is the source of truth; the generated code is derived output. That kind of structured, spec-driven approach is exactly what makes a commerce backend legible to downstream agents.

Design for handoff, not containment. The OpenAI pivot — ChatGPT handles discovery, merchants handle checkout — is the right architecture for now. Build your agent to hand off gracefully to the transaction surface that has the most context, not to contain the entire buying journey inside the agent interface.

Take the fraud surface seriously from day one. If you’re building any agent that touches payments or token consumption, the fraud economics are different from what you’re used to. One fraudulent agent isn’t one bad actor — it’s potentially millions of automated requests burning real compute costs. Design your authorization model accordingly.

The token-based pricing models that underpin most AI products today make this even more acute — every agent interaction has a real cost, which means the architecture of where agents hand off to humans or to other systems isn’t just a UX question, it’s a unit economics question. When a discovery session that should cost a few cents in inference bleeds into a botched transaction attempt that requires retry logic and fraud review, the unit economics deteriorate fast.

The Walmart experiment is one of the cleaner natural tests we’ve gotten of a core agent commerce question, and the answer it returned is specific and useful: agents are good at the part of commerce that happens before the transaction, not the transaction itself. The companies that internalize that distinction early will build better products than the ones still trying to put the buy button in the chat window.

It’s worth noting that the underlying model capabilities driving these agent interactions are advancing rapidly — recent benchmark comparisons across frontier models show meaningful differences in how well different models handle multi-step reasoning tasks, which directly affects how reliably an agent can navigate a discovery-to-handoff workflow without losing context or making errors that erode shopper trust.

That’s the actual lesson from Walmart’s 3x conversion gap. Not that agent commerce doesn’t work. That the first obvious implementation of it was solving the wrong problem.