Google Is Building Two Different Geminis — and You Need to Know Which One You’re Betting On
Gemini 3.2/3.5 is reportedly optimized for speed and efficiency. Gemini Ultra is evolving into a memory-heavy, long-context system for multi-step workflows. These are not the same product with different price tags. They represent two fundamentally different theories about what an AI model is for — and if you’re building on top of Google’s stack, the distinction matters more than any benchmark number.
The leaks surfacing ahead of Google I/O (a few weeks out from the May 6, 2026 publication of the source material) paint a picture of a company that has stopped trying to win with a single flagship model and started segmenting its lineup the way enterprise software companies segment product tiers. Not by capability alone, but by use pattern.
You’ve seen this before. It’s how databases split into OLTP and OLAP. It’s how cloud compute split into spot instances and reserved capacity. The underlying resource is the same; the optimization target is different. Google appears to be doing the same thing with inference.
The Two Dimensions That Actually Separate These Models
Before comparing Gemini 3.2/3.5 against Gemini Ultra’s new direction, it’s worth being precise about what “speed” and “memory” mean in this context — because both terms get used loosely.
Latency and throughput. Speed in model terms usually means time-to-first-token and tokens-per-second (a quick way to measure both is sketched after this list). A model optimized here is one you’d put in a user-facing interface where a 3-second wait feels broken. It’s also the model you’d use in a high-volume pipeline where cost-per-call compounds fast.
Context retention and recall. Memory in the Gemini Ultra sense isn’t RAM. It’s the model’s ability to maintain coherent state across a long session — to remember that three exchanges ago you established a constraint, and to honor it now. The leaked “Team Food” feature is specifically aimed at improving how Gemini uses past chats and long-term context, which suggests this is about persistent memory across sessions, not just a longer context window.
Context window size. Related but distinct. A long context window lets you fit more tokens. Memory features determine whether the model actually uses them well. These are separate engineering problems, and Google appears to be investing in both on the Ultra side.
Workflow depth. Single-turn tasks (generate this, summarize that) reward speed. Multi-step workflows — where the model needs to track a goal across many actions, remember intermediate outputs, and maintain consistency — reward memory and context fidelity. Gemini Ultra’s reported evolution toward “consistent multi-step workflows” is a direct signal about which use case Google is optimizing for.
Cost structure. Speed-optimized models are typically cheaper per call. Memory-heavy models with long context windows cost more — both in compute and in the engineering required to make them reliable. This isn’t a flaw; it’s a feature of the segmentation.
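To make the latency vocabulary above concrete, here is a minimal measurement sketch. The `stream_chunks` callable is a stand-in for whatever streaming client you actually use (it is not a real Gemini SDK call), and token counts are approximated by whitespace splitting; the point is only how time-to-first-token and tokens-per-second fall out of a single timed loop.

```python
import time
from typing import Callable, Iterable


def measure_stream(stream_chunks: Callable[[str, str], Iterable[str]],
                   model: str, prompt: str) -> dict:
    """Time a streaming completion: time-to-first-token and tokens/sec.

    `stream_chunks(model, prompt)` should yield text chunks as they arrive.
    Token counts are approximated by whitespace splitting.
    """
    start = time.perf_counter()
    first_token_at = None
    token_count = 0

    for chunk in stream_chunks(model, prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += len(chunk.split())

    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    gen_time = end - (first_token_at or end)
    return {
        "time_to_first_token_s": round(ttft, 3),
        "tokens_per_second": round(token_count / gen_time, 1) if gen_time > 0 else None,
        "total_tokens_approx": token_count,
    }
```

Run the same prompt set through a speed-tier model and a deep-tier model and the first dimension in the list stops being anecdotal.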
Gemini 3.2/3.5: The Case for Speed
The leaked positioning of Gemini 3.2 and 3.5 as faster and more efficient models fits a clear market logic. Google already has Gemini Flash variants that compete on price-performance. The 3.2/3.5 line, if the leaks hold, appears to be the next iteration of that philosophy applied to the full Gemini generation.
What does “faster and more efficient” mean in practice? Likely a combination of architectural optimizations, smaller parameter counts in the active path (possibly mixture-of-experts style routing), and inference-level improvements. The result is a model you can call frequently, cheaply, and with low latency.
This is the model you’d use for real-time features. Autocomplete. Inline suggestions. Classification pipelines running at scale. Anywhere the user is waiting, or anywhere you’re making thousands of calls per hour and the bill matters.
There’s also a quality consideration worth naming. The source material flags that current Gemini models feel “lazy” — reluctant to produce long outputs, prone to truncating responses. If 3.2/3.5 addresses this while maintaining speed, that’s a meaningful improvement. If it doesn’t, you’re trading quality for throughput in a way that limits the use cases.
The Nano Banana integration already visible in Google AI Studio is a preview of this direction. It generates custom image assets for apps as they’re being built, with a redesigned edit tool for visual component control. It’s fast, it’s integrated, it’s useful in a real-time workflow. The limitation — no native transparency support, unlike Codex’s image generation — is exactly the kind of tradeoff you accept when you optimize for speed and integration over completeness.
For builders thinking about where to deploy a speed-optimized Gemini model, the GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro benchmark comparison is instructive: Gemini 3.1 Pro already showed competitive throughput characteristics, and 3.2/3.5 appears to push further in that direction.
Gemini Ultra: The Case for Memory
The more interesting strategic bet is what Google is reportedly doing with Gemini Ultra.
The leaked description — “memory-heavy long-context system for consistent multi-step workflows” — is a direct response to a real problem. Current LLMs, including Gemini, are stateless by default. Every session starts fresh. Every long document has to be re-fed. Every constraint you established in a previous conversation has to be re-established. This is fine for simple tasks. It’s a serious limitation for anything that looks like a real workflow.
The “Team Food” memory feature is the mechanism. The name is a codename, not a product description, but the function is clear: improve how Gemini uses past chats and long-term context. This is persistent memory — the model knowing, across sessions, what you’ve told it, what you’ve built, what constraints are in play.
Pair that with a long-context window that actually works well (not just technically supports many tokens, but uses them coherently), and you have a model that can serve as a genuine workflow partner rather than a stateless tool.
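To see why persistent memory is a different engineering problem from a big context window, here is a deliberately toy sketch of the pattern the leak describes. Nothing below is Team Food or any real Gemini feature; it simply persists user-established constraints to disk and re-injects them at the start of each new session.

```python
import json
from pathlib import Path


class SessionMemory:
    """Toy persistent memory: constraints survive across sessions on disk."""

    def __init__(self, store: Path = Path("memory.json")):
        self.store = store
        self.constraints = json.loads(store.read_text()) if store.exists() else []

    def remember(self, constraint: str) -> None:
        self.constraints.append(constraint)
        self.store.write_text(json.dumps(self.constraints, indent=2))

    def preamble(self) -> str:
        """Prepended to the first prompt of every new session."""
        if not self.constraints:
            return ""
        bullets = "\n".join(f"- {c}" for c in self.constraints)
        return f"Standing constraints from earlier sessions:\n{bullets}\n\n"


# Session 1: the user establishes a constraint.
memory = SessionMemory()
memory.remember("All SQL examples must target Postgres 16.")

# Session 2 (days later, fresh context): the constraint rides along automatically.
prompt = memory.preamble() + "Write a query for monthly active users."
```

A real memory system would add summarization, retrieval, and relevance ranking on top, but the contrast with a long context window is already visible: memory is about choosing which state to carry forward between sessions, not about how many tokens fit in one.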
This is where the comparison to other long-context approaches becomes relevant. The SubCube architecture mentioned in the same source material claims a 12 million token context window — 12 times the size of current leading models — using a sparse attention architecture that’s reportedly 52 times faster than flash attention. If that holds up under scrutiny, it changes the economics of long-context inference significantly. Google’s Ultra direction is betting on memory and context fidelity; the question is whether architectural innovations from elsewhere make that bet obsolete or more valuable.
The Google DeepMind paper on diffusion latent trade-offs — described as “a map to navigate the trade-off systematically between latent information content and reconstruction quality” — is a separate signal, likely pointing toward Veo 4 and video generation improvements. But the underlying research theme is consistent: Google is investing heavily in the quality end of the quality-speed spectrum, not just the speed end.
For builders working on agentic systems, this matters. Platforms like MindStudio that support 200+ models and 1,000+ integrations let you chain models and tools visually — which means you can route fast tasks to a speed-optimized model and deep, stateful reasoning to a memory-heavy one, without rewriting your orchestration layer every time Google ships a new variant.
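Stripped to its essentials, that routing layer can be very small. The sketch below uses placeholder model IDs (no such Gemini model names exist yet) and a commented-out hypothetical client call; the point is that the speed-versus-memory decision lives in data, not in code scattered across call sites.

```python
from dataclasses import dataclass

# Placeholder model IDs -- swap in the real speed-tier / memory-tier models once they ship.
MODEL_TIERS = {
    "fast": "gemini-speed-tier-placeholder",
    "deep": "gemini-memory-tier-placeholder",
}


@dataclass
class Step:
    name: str
    prompt: str
    stateful: bool = False  # does this step depend on accumulated workflow state?


def route(step: Step) -> str:
    """Stateless, latency-sensitive steps go to the fast tier; stateful ones go deep."""
    return MODEL_TIERS["deep" if step.stateful else "fast"]


workflow = [
    Step("classify_ticket", "Label this support ticket: ...", stateful=False),
    Step("draft_migration_plan", "Given everything above, plan the migration.", stateful=True),
]

for step in workflow:
    model = route(step)
    print(f"{step.name} -> {model}")
    # call_model(model, step.prompt)  # hypothetical client call
```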
What This Means for the Broader Competitive Picture
Google’s two-track strategy is a direct response to the competitive landscape, not just an internal product decision.
OpenAI has GPT-5.5 and Codex running agentic loops that can persist for hours or days (the /goal command in Codex is exactly this — long-running tasks with autonomous tool use). Anthropic has Claude with extended thinking and a strong reputation for following complex instructions across long sessions. Both competitors have effectively staked out the “deep reasoning, long context” territory.
Google’s answer appears to be: we’ll compete on both ends. Gemini 3.2/3.5 for the high-volume, low-latency use cases where OpenAI’s pricing and latency are vulnerabilities. Gemini Ultra with Team Food for the multi-step workflow use cases where Claude currently has an edge.
The Arena blind tests are a useful signal here. Codenames appearing in testing — Ajax, Hercules, Hector, Orpheus — suggest multiple models in evaluation simultaneously. One commenter flagged that Ajax may actually be an Apple model, not a Google one. If true, that’s a reminder that the competitive field isn’t just the named labs. Apple shipping a capable on-device model would change the calculus for speed-optimized inference in particular.
The leaked Omni model — hinted at via a “video UI powered by Omni” reference — adds another dimension. If Google ships a model with native audio input and output (the way GPT-4o was supposed to work before OpenAI locked most of it down), that’s a multimodal capability that neither the speed track nor the memory track fully addresses. It’s a third axis.
xAI’s new voice cloning model, already live in the Grok voice API with no enterprise plan required, is a reminder that the voice modality is moving fast independently of the major model releases. A demo showed a clone so convincing that a public poll was nearly 50/50 on which voice was real. Google dropped their own “very instructable” voice model around the same time. Voice is becoming a commodity capability, which means the differentiation moves to the model layer above it — exactly where Gemini Ultra’s memory features would matter most.
Which Track to Build On
The honest answer is that you probably need both, and the question is which one to default to.
Use Gemini 3.2/3.5 if your primary constraint is latency or cost. Real-time user-facing features, high-volume classification or extraction pipelines, anything where you’re making thousands of calls and the per-call economics matter. Also use it if your tasks are genuinely single-turn — you don’t need memory if you’re not doing multi-step work.
Use Gemini Ultra (with Team Food, when it ships) if your workflow requires state across sessions. Research assistants that remember your prior work. Coding agents that maintain context about your codebase across multiple sessions. Any agent that needs to track a goal over time and honor constraints established earlier. This is also the right choice if you’re building something where consistency matters more than speed — where a wrong answer that contradicts something established three sessions ago is worse than a slow answer.
The harder question is what to do right now, before these models ship. The leaks are credible but unconfirmed. Google I/O is weeks away. Building a production system on leaked model behavior is a bad idea.
What you can do is architect for the split. Design your system so that the model choice is a configuration parameter, not a structural dependency. If you’re building a workflow that has both fast, stateless tasks and slow, stateful ones, make sure those are separate calls to separate models — not a single monolithic prompt that tries to do both.
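A minimal sketch of that split, assuming a generic `generate(model, prompt)` client (a hypothetical stand-in, not a real SDK): the model IDs come from configuration, and the fast stateless pass and the slow stateful pass are two separate calls rather than one monolithic prompt.

```python
import os

# Model choice is configuration, not a structural dependency:
# when new variants ship, only these environment variables change.
FAST_MODEL = os.environ.get("FAST_MODEL", "speed-tier-placeholder")
DEEP_MODEL = os.environ.get("DEEP_MODEL", "memory-tier-placeholder")


def generate(model: str, prompt: str) -> str:
    """Stub standing in for your actual model client."""
    return f"[{model} output for: {prompt[:40]}...]"


def summarize_meeting(transcript: str, project_history: str) -> str:
    # Fast, stateless pass: pull out the decisions. No memory required.
    decisions = generate(FAST_MODEL, f"List the decisions made in this transcript:\n{transcript}")

    # Slow, stateful pass: reconcile those decisions against long-lived project context.
    return generate(
        DEEP_MODEL,
        f"Project history:\n{project_history}\n\n"
        f"New decisions:\n{decisions}\n\n"
        "Flag anything that contradicts constraints established earlier.",
    )
```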
This is where tools like Remy become relevant: when you’re speccing out a full-stack application, the spec itself can encode which tasks require stateful context and which don’t, and the compiled output can route accordingly. The spec is the source of truth; the model routing is derived from it.
For teams evaluating open-weight alternatives while waiting for Google’s announcements, the Gemma 4 vs Qwen 3.5 comparison covers the speed-vs-context tradeoff in the open-weight space — a useful reference point for understanding what Google is likely benchmarking against internally. Similarly, the Anthropic vs OpenAI vs Google agent strategy breakdown covers how each lab’s architectural bets translate into agent behavior, which is directly relevant to the Ultra memory story.
The Strategic Read
Google’s two-track model strategy is the right call. The mistake most AI labs make is trying to build one model that wins everywhere. That’s not how mature markets work. Enterprise software doesn’t have one database. Cloud compute doesn’t have one instance type. The AI model market is following the same pattern, and Google is early in explicitly acknowledging it.
The risk is execution. Speed-optimized models are only valuable if they’re actually fast and good enough. Memory-heavy models are only valuable if the memory actually works — if Team Food genuinely improves long-context coherence rather than just adding a feature flag that technically stores past chats but doesn’t use them well.
Google I/O will tell us how much of this leaked roadmap is real. But the strategic logic is sound regardless of the specific model names. The question for builders isn’t which Gemini model is better. It’s which problem you’re solving — and whether you’ve designed your system to use the right tool for each part of it.