Walmart's ChatGPT Checkout Test Converted 3x Worse Than Its Own Site — What That Means for Agent Commerce
Walmart's AI checkout pilot flopped. The data reveals why agent-mediated buying requires a completely different commercial architecture.
Walmart ran a real test. ChatGPT offered instant checkout directly inside the chat interface, so shoppers could buy without ever leaving the conversation. By the numbers, it failed: instant checkout converted three times worse than simply sending shoppers back to Walmart’s own website. Daniel Danker, who oversees product and design at Walmart, called the experience “unsatisfying.” That’s a careful word choice for what was probably a more colorful internal conversation.
This isn’t a story about ChatGPT being bad at commerce. It’s a story about a structural mismatch — about what instant checkout was designed to do versus what buyers actually need when they’re shopping.
OpenAI acknowledged it directly. The initial version of instant checkout “did not offer the flexibility wanted,” so the team pivoted: let merchants handle their own checkout experiences, and have ChatGPT focus on product discovery instead. That’s a significant course correction, and it tells you something important about where the real value is in agent-mediated commerce.
What the Test Actually Revealed
The failure wasn’t a UI problem. You can’t A/B test your way out of a structural mismatch.
When a shopper is on Walmart’s own site, they have access to their cart, their loyalty program, their saved addresses, their substitution preferences, their delivery windows, and their purchase history. They can bundle items, apply coupons, and see whether the store brand is in stock. The checkout flow is embedded in a commercial context that Walmart has spent years building.
When that same shopper tries to buy a single item through a ChatGPT chat window, all of that context disappears. They’re not checking out — they’re executing a stripped-down transaction with none of the surrounding infrastructure that makes buying feel complete.
That’s what “unsatisfying” means. It’s not that the button didn’t work. It’s that the experience was missing everything that makes a purchase feel like a purchase rather than a vending machine interaction.
The Stripe Sessions announcements from this year are worth reading alongside this failure, because they point at the same diagnosis from the infrastructure side. Stripe announced a full agentic commerce suite: Link wallet for agents, shared payment tokens, a machine payments protocol, Radar’s token theft defenses, and streaming payments via Metronome and Tempo. That’s not a list of checkout improvements. That’s a list of what has to exist before agent-mediated commerce can work at scale.
Why This Matters If You’re Building Agents
If you’re building AI agents that touch any kind of commercial transaction — purchasing, booking, provisioning, reordering — the Walmart result is a useful calibration point.
The tempting mental model is: agent sees intent, agent executes purchase, done. That model works for commodity reorders. Paper towels. HDMI cables. Domain renewals. Things where the buyer has no preferences beyond “cheapest acceptable version.”
It breaks down immediately when the purchase is embedded in a relationship. Walmart’s shoppers have loyalty programs. They have delivery expectations. They have substitution rules. They have bundled carts. None of that travels through a chat window checkout flow.
The same problem appears in B2B contexts. If you’re building an agent that provisions cloud services, manages vendor relationships, or handles recurring purchases, the agent needs access to the buyer’s full commercial context — not just a payment method. A scoped virtual card (what Stripe calls a one-time use card) is a useful adapter for the existing web, but it doesn’t carry the buyer’s preferences, history, or constraints. It just carries authorization to spend money.
This is why the OpenAI pivot toward product discovery makes more sense than it might look at first. Discovery is where agent value is highest right now. An agent that can translate “authentic Ethiopian coffee” into a precise purchasing brief — origin, roast level, processing method, freshness window, roaster reputation — is genuinely useful. An agent that replaces the checkout page without replacing the surrounding commercial context is just a worse checkout page.
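To make the idea of a purchasing brief concrete, here is a minimal sketch in Python. It assumes a brief is just a set of optional constraints that a product either satisfies or does not. The class names, field names, and the sample catalog are all hypothetical and chosen for illustration; they are not taken from any real agent or merchant API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PurchasingBrief:
    """Structured form of a fuzzy request. Field names are illustrative."""
    query: str
    origin: Optional[str] = None
    roast_level: Optional[str] = None
    processing_method: Optional[str] = None
    max_days_since_roast: Optional[int] = None
    min_roaster_rating: Optional[float] = None


@dataclass(frozen=True)
class Product:
    """A catalog entry with the attributes the brief can constrain."""
    name: str
    origin: str
    roast_level: str
    processing_method: str
    days_since_roast: int
    roaster_rating: float


def matches(brief: PurchasingBrief, product: Product) -> bool:
    """Return True when the product satisfies every constraint the brief sets."""
    if brief.origin is not None and product.origin != brief.origin:
        return False
    if brief.roast_level is not None and product.roast_level != brief.roast_level:
        return False
    if (brief.processing_method is not None
            and product.processing_method != brief.processing_method):
        return False
    if (brief.max_days_since_roast is not None
            and product.days_since_roast > brief.max_days_since_roast):
        return False
    if (brief.min_roaster_rating is not None
            and product.roaster_rating < brief.min_roaster_rating):
        return False
    return True


# Hypothetical translation of "authentic Ethiopian coffee" into explicit constraints.
brief = PurchasingBrief(
    query="authentic Ethiopian coffee",
    origin="Ethiopia",
    roast_level="light",
    processing_method="washed",
    max_days_since_roast=21,
    min_roaster_rating=4.5,
)

# Made-up catalog, used only to exercise the filter.
catalog = [
    Product("Yirgacheffe Washed", "Ethiopia", "light", "washed", 10, 4.8),
    Product("House Blend", "Brazil", "dark", "natural", 90, 4.0),
    Product("Sidamo Natural", "Ethiopia", "light", "natural", 5, 4.7),
]

shortlist = [p.name for p in catalog if matches(brief, p)]
print(shortlist)  # ['Yirgacheffe Washed']
```

The filter is the easy half. Choosing the constraints from a vague phrase is the part that depends on the model, and this sketch does not attempt it.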
For builders working on agentic workflows and orchestration, this distinction matters a lot. The agent’s job in commerce isn’t to own the transaction. It’s to own the intent translation and the discovery, then hand off to the merchant’s existing infrastructure at the right moment.
The Non-Obvious Part: Payment Authority Is Moving, But Context Has to Move With It
Here’s what’s buried in the Walmart story that most coverage missed.
The failure wasn’t about payment. Stripe’s Link wallet for agents, shared payment tokens, and the machine payments protocol are real infrastructure that will work. The payment part of agent commerce is largely a solved problem, or at least a solvable one. Stripe and OpenAI co-developed the Agentic Commerce Protocol specifically to address it. Visa and Mastercard are building agent payment token systems. PayPal is building commerce services around wallet trust and merchant protection. The payment rail is getting built.
What isn’t solved — and what the Walmart test exposed — is commercial context portability.
In the old checkout model, payment authority is extracted inside the seller’s flow. The buyer arrives, browses, builds a cart, and then authorizes payment. The seller’s environment is where intent and payment instrument finally meet. The seller gets to shape the experience, surface substitutions, apply loyalty discounts, and confirm delivery preferences.
In the agent model, payment authority travels with the task. The buyer’s agent arrives with a purchasing brief and a scoped payment credential. But if the seller’s commercial context — the cart, the loyalty program, the delivery preferences — doesn’t travel with it, the transaction is incomplete. You’ve moved the wallet without moving the relationship.
This is why the distinction between one-time use cards and shared payment tokens matters beyond the technical details. A one-time use card is best described as “an adapter for the existing commercial internet”: it lets an agent buy from the web as it exists today. A shared payment token points toward a machine-native world where the seller can accept a scoped payment credential programmatically and, potentially, receive the buyer’s full commercial context alongside it.
Walmart’s test used the adapter. The adapter worked technically. The experience failed commercially because the adapter doesn’t carry the relationship.
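The gap between moving the wallet and moving the relationship can be sketched as two separate data structures. This is a rough Python illustration under stated assumptions: `ScopedCredential`, `CommercialContext`, `AgentOrder`, and every field name are hypothetical, and none of this reflects the actual shape of Stripe's tokens or any published protocol.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class ScopedCredential:
    """Authorization to spend, limited by merchant and amount (hypothetical shape)."""
    merchant_id: str
    max_amount_cents: int


@dataclass(frozen=True)
class CommercialContext:
    """Relationship data that a bare credential does not carry (hypothetical shape)."""
    loyalty_id: Optional[str] = None
    delivery_window: Optional[str] = None
    allow_substitutions: Optional[bool] = None


@dataclass(frozen=True)
class AgentOrder:
    credential: ScopedCredential
    context: Optional[CommercialContext] = None


def authorize(order: AgentOrder, merchant_id: str, total_cents: int) -> bool:
    """Payment check only: right merchant, positive total, within the spending cap."""
    cred = order.credential
    return cred.merchant_id == merchant_id and 0 < total_cents <= cred.max_amount_cents


def missing_context(order: AgentOrder) -> List[str]:
    """List the relationship fields the seller would have to guess or re-ask for."""
    ctx = order.context or CommercialContext()
    return [f.name for f in fields(ctx) if getattr(ctx, f.name) is None]


cred = ScopedCredential(merchant_id="merchant-123", max_amount_cents=5_000)

# "Adapter" style order: the credential arrives alone.
adapter_order = AgentOrder(credential=cred)

# "Native" style order: the same credential plus the buyer's context.
native_order = AgentOrder(cred, CommercialContext("loyalty-42", "Sat 9-11am", True))

print(authorize(adapter_order, "merchant-123", 1_899))  # True
print(missing_context(adapter_order))
# ['loyalty_id', 'delivery_window', 'allow_substitutions']
print(missing_context(native_order))  # []
```

Both orders pass the payment check. Only the second gives the seller enough to apply loyalty pricing, schedule delivery, or handle a substitution without guessing.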
The Fraud Problem Nobody Mentions in the Commerce Conversation
There’s another layer here that’s easy to skip past in the excitement about agent buying.
Stripe’s Radar announcement at Sessions wasn’t just a fraud update. It was a response to a specific new attack pattern: a few thousand humans running millions of agents to register fraudulent accounts and steal tokens. That’s already happening at scale.
In traditional SaaS, a fraudulent free-trial user was nearly free. One more person clicking around the product didn’t generate meaningful costs. In AI SaaS, every fraudulent free-trial user costs real money, because each request they make consumes paid compute. They’re not clicking around, they’re burning tokens. Every fraudulent session is a direct transfer from the company’s compute budget to the attacker.
This changes the economics of agent commerce significantly. If you’re building an agent that can purchase on behalf of users, you’re also building a surface that fraudsters can target. The agent’s payment credentials, the merchant’s inventory, the platform’s free-tier compute — all of it becomes a target.
Stripe Signals extends risk information beyond direct Stripe transactions, which means the network can see payment behavior, business behavior, signup behavior, and agent behavior across a large portion of the internet economy. That’s the right level of defense for a threat that operates at agent scale. But it also means that merchants and platform builders need to think about fraud as an agent-native problem, not just a human-fraud problem with more volume.
For anyone building commerce-adjacent agents, this is a concrete design constraint. The agent needs fraud-resistant payment credentials. The merchant needs to distinguish real purchasing attempts from automated abuse. The platform needs to meter usage tightly enough that fraudulent compute consumption is detectable before it becomes catastrophic.
Platforms like MindStudio handle some of this orchestration at the infrastructure level — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — but the fraud surface is ultimately a function of what your agent is authorized to do and how tightly those authorizations are scoped.
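One simple way to picture tight metering is a hard per-account cap that is checked before compute is spent. The sketch below is a deliberately minimal in-memory version in Python. The `UsageMeter` class, the cap value, and the account name are assumptions made for illustration, and a real system would need persistence, time windows, and fraud signals well beyond a counter.

```python
from collections import defaultdict
from typing import Dict


class UsageMeter:
    """Tracks token spend per account and rejects requests past a hard cap."""

    def __init__(self, free_tier_token_cap: int) -> None:
        self.cap = free_tier_token_cap
        self.used: Dict[str, int] = defaultdict(int)

    def try_consume(self, account_id: str, tokens: int) -> bool:
        """Record usage and return True only if the account stays within its cap."""
        if tokens <= 0:
            raise ValueError("tokens must be positive")
        if self.used[account_id] + tokens > self.cap:
            return False  # reject before the compute is spent
        self.used[account_id] += tokens
        return True


# Hypothetical cap and request sizes.
meter = UsageMeter(free_tier_token_cap=10_000)
results = [meter.try_consume("trial-account", 4_000) for _ in range(3)]

print(results)                      # [True, True, False]
print(meter.used["trial-account"])  # 8000
```

The third request is refused because it would push the account past the cap, so the loss from any single abusive account is bounded by the cap rather than by the attacker's patience.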
What the Brand Shift Actually Means for Builders
The Walmart failure also points at something that’s easy to underrate: brand in an agent-mediated world works completely differently.
In the seller-controlled web, brand does its work at the point of persuasion. The buyer lands on the site, absorbs the design, reads the copy, sees the social proof, and decides whether the company deserves their money. The seller gets to perform the brand every time a buyer arrives.
In the agent-mediated web, brand becomes part of the buyer’s memory. The agent carries the buyer’s preferences, prior purchases, trust history, and stated dislikes as inputs to its decision-making. The agent doesn’t feel brand loyalty, but it can carry brand loyalty as a constraint. It can know that a buyer avoids a particular airline, prefers a specific coffee roaster, or distrusts a marketplace that broke their trust once.
The seller doesn’t get to reset that conversation. There’s no landing page that overcomes a bad prior experience when the agent is making the decision.
This is a real shift for anyone building commerce-facing products. The old brand question was roughly: how do we make the buyer feel something? The new brand question is: how do we become the kind of business the buyer’s agent remembers as a good answer?
That’s not a marketing problem. It’s a data quality problem, a policy clarity problem, a fulfillment consistency problem, and a support reliability problem. The agent needs structured information to reason against. It needs the final price, the delivery window, the return policy, the payment options, the inventory, and the constraints — all of it explicit, all of it accurate.
If you’re building agents that help users shop or procure, the quality of the merchant’s data surface is a direct constraint on the agent’s usefulness. A merchant with clean product catalogs, explicit policies, and consistent fulfillment is a merchant the agent can represent accurately. A merchant with vague copy and buried terms is a merchant the agent will either skip or misrepresent.
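As a rough illustration of what an explicit data surface could mean in practice, the Python sketch below checks a merchant listing for the fields an agent needs to reason against. The field names in `REQUIRED_FIELDS` and both sample listings are hypothetical. They mirror the items named above (price, delivery window, return policy, payment options, inventory) rather than any real catalog schema.

```python
from typing import Any, Dict, List

# Illustrative field names, not a real merchant feed specification.
REQUIRED_FIELDS = (
    "final_price",
    "delivery_window",
    "return_policy",
    "payment_options",
    "inventory",
)


def missing_fields(listing: Dict[str, Any]) -> List[str]:
    """Return required fields that are absent or empty in a merchant listing.

    A value of 0 counts as present, so an explicit "out of stock" is still data.
    """
    return [name for name in REQUIRED_FIELDS
            if listing.get(name) in (None, "", [], {})]


explicit_listing = {
    "final_price": 18.99,
    "delivery_window": "2-3 business days",
    "return_policy": "30 days, free returns",
    "payment_options": ["card"],
    "inventory": 12,
}

vague_listing = {
    "final_price": 18.99,
    "return_policy": "",
    "payment_options": [],
}

print(missing_fields(explicit_listing))  # []
print(missing_fields(vague_listing))
# ['delivery_window', 'return_policy', 'payment_options', 'inventory']
```

An agent facing the second listing has to infer four of the five facts it needs, which is the situation where it either skips the merchant or misrepresents it.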
This is also where the spec-driven approach to building commercial surfaces starts to matter. Tools like Remy take a similar philosophy to what agent-ready merchants need: you write a spec — annotated, precise, explicit about edge cases — and the full-stack application is compiled from it. The spec is the source of truth. For merchants, the equivalent is making their commercial reality explicit enough that an agent can act on it without inference.
The Competitive Picture Is Wider Than Stripe
It’s worth being clear that this isn’t only a Stripe story, even though Stripe’s Sessions announcements are the most architecturally coherent version of it.
Microsoft has pushed shopping inside Copilot. Meta is moving checkout closer to ads. Visa and Mastercard are building agent payment token systems. PayPal is building commerce services around wallet trust and merchant protection. Google’s Universal Commerce Protocol is meant to work across discovery, buying, and post-purchase support, with Merchant Center attributes that go beyond traditional keywords.
The whole market is running toward the same place: commerce that begins inside the buyer’s interface, not the seller’s store.
The Walmart test is useful precisely because it’s a real data point in the middle of all this activity. It’s not a demo. It’s not a prototype. It’s a production test with real conversion data, and the conversion data said that instant checkout inside a chat window performed worse than sending buyers back to the merchant’s own site.
That result should recalibrate how you think about where agent value actually lives in the commerce stack. It’s not in owning the transaction. It’s in owning the intent — translating fuzzy human language into precise purchasing briefs, finding the right merchant, comparing options accurately, and handing off to the right transaction surface at the right moment.
The comparison between AI models for agentic tasks matters here too, because the quality of intent translation is a direct function of the model’s reasoning capability. An agent that can accurately translate “authentic Ethiopian coffee” into a purchasing brief with origin, processing method, and freshness constraints is doing something meaningfully different from an agent that just keyword-matches against a product catalog.
What to Watch and What to Build
The OpenAI pivot is the most concrete signal here. Moving from instant checkout to product discovery with merchant checkout is an acknowledgment that the value is in the discovery layer, not the transaction layer. That’s where to build.
For AI builders specifically, a few concrete watchpoints:
The machine payments protocol and shared payment tokens are early infrastructure. They’re not widely adopted yet, and the transition from one-time use cards (the adapter) to shared payment tokens (the native rail) will be uneven. Most commercial transactions will run on the adapter for a while. Build for that reality.
Streaming payments via Metronome and Tempo are relevant if you’re building AI products with usage-based billing. The timing mismatch between when compute costs are incurred and when customers pay is a real risk. Stripe’s dimensional pricing, hybrid pricing, commits, and real-time metering are answers to that problem. If you’re billing for AI consumption, this is worth understanding in detail.
The fraud surface is real and already active. A few thousand humans running millions of agents is not a future threat — it’s a current one. If your agent can authorize purchases or consume compute on behalf of users, the authorization scoping and fraud detection need to be first-class concerns, not afterthoughts.
And the brand question is worth sitting with. If you’re building a product that agents will recommend, compare, or purchase from, the question is no longer how you convert visitors. It’s whether your commercial reality is explicit enough for an agent to represent you accurately. Clean data, clear policies, consistent fulfillment. That’s the new SEO.
The Walmart result is a useful anchor for all of this. It’s a reminder that moving fast on the obvious version of a thing — checkout inside the chat — can produce worse outcomes than the slower, more contextually complete version. The agent economy rewards completeness, not just speed.
For anyone building in this space, the question of what AI agents actually need to operate reliably — persistent memory, commercial context, trust history — is the right level of analysis. The checkout button was never the hard part.