What Is RTX Spark? NVIDIA and Microsoft's AI-First PC Chip Explained

A New Kind of AI Chip for a New Kind of AI Workload

Something meaningful shifted when NVIDIA announced RTX Spark. Not because it’s just another GPU or another mini PC — but because it represents a deliberate bet that serious AI work should happen on your desk, not in a cloud datacenter.

RTX Spark is NVIDIA’s compact AI supercomputer built around the GB10 Grace Blackwell Superchip, a single module that fuses a Blackwell GPU with a Grace ARM-based CPU. The result is a desktop device capable of running models with up to 200 billion parameters locally, with no internet connection required. For anyone building, deploying, or just thinking seriously about AI agents, that’s a notable shift in what’s physically possible outside a data center.

This article explains what RTX Spark actually is, how the architecture works, what NVIDIA and Microsoft are building together, and what it means practically for AI builders and enterprises exploring local inference.

What RTX Spark Actually Is

RTX Spark is a palm-sized desktop AI computer. NVIDIA originally previewed the underlying concept as Project DIGITS at CES 2025, and RTX Spark is the consumer and developer-facing product built around the same GB10 Grace Blackwell Superchip.

The form factor is deliberate. NVIDIA designed RTX Spark to sit on a desk — not in a rack, not in a cloud — and still deliver what the company calls “petaflop-class” AI performance. That’s the kind of compute you’d previously only get from enterprise server hardware.

A few key specs:

GPU: NVIDIA Blackwell architecture (GB10)
CPU: NVIDIA Grace (72-core ARM Neoverse-based processor)
Memory: 128GB unified LPDDR5X memory shared between CPU and GPU
Storage: NVMe SSD, up to 4TB
Connectivity: USB4, DisplayPort, Ethernet
Operating system: DGX OS (Linux-based), with full CUDA support

The unified memory pool is especially significant. In traditional PC setups, the GPU has its own VRAM and the CPU has its own system RAM. Moving data between them creates a bottleneck. RTX Spark eliminates that bottleneck — the Grace CPU and Blackwell GPU share the same 128GB pool, making large model inference dramatically more efficient.

The Grace Blackwell Architecture, Explained

To understand why RTX Spark matters, you need to understand the chip at its core.

What Is the Blackwell GPU?

Blackwell is NVIDIA’s latest GPU architecture, announced in early 2024. It’s designed for AI workloads — specifically transformer-based models, which power most modern large language models (LLMs). Blackwell introduces a new generation of Tensor Cores optimized for FP4 precision, which allows it to run much larger models at lower power.

In NVIDIA’s datacenter lineup, Blackwell shows up in the B100 and B200 GPUs. The GB10 in RTX Spark is a scaled-down but architecturally equivalent version, purpose-built for edge and desktop deployment.

What Is the Grace CPU?

Grace is NVIDIA’s in-house ARM CPU. Unlike x86 CPUs from Intel or AMD, Grace is built on ARM’s Neoverse architecture — the same foundation used in cloud servers from Amazon (Graviton) and Ampere Computing.

The key advantage of pairing Grace with Blackwell is bandwidth. The two chips are connected via NVLink-C2C, NVIDIA’s chip-to-chip interconnect, which provides 900 GB/s of bandwidth between CPU and GPU. That’s roughly 7x faster than what you’d get connecting a GPU to a standard x86 CPU over PCIe.

Why Unified Memory Changes Things

Most AI models — especially LLMs — are constrained by memory, not just compute. A 70-billion parameter model in 4-bit quantization requires around 35GB of memory. Running that on a typical gaming PC with 16GB of VRAM isn’t possible. RTX Spark’s 128GB unified pool makes it feasible to load and run models of that scale locally, without any GPU memory swapping or quantization compromises.

Two RTX Spark units can also be connected to form a 256GB shared memory system, enabling models up to 200 billion parameters — territory previously reserved for multi-GPU server racks.

The Microsoft Partnership: What’s Actually Being Built

Microsoft and NVIDIA’s collaboration on RTX Spark goes beyond a standard hardware announcement. Microsoft is building Windows-native AI integrations designed to take advantage of the chip’s architecture.

Windows AI Foundry

Microsoft has been developing what it calls Windows AI Foundry, a framework that lets developers run and fine-tune models locally on Copilot+ PCs and high-performance AI hardware like RTX Spark. The integration means AI apps on Windows can route inference requests directly to the Grace Blackwell chip without going through cloud APIs.

This has real implications:

Latency drops significantly — local inference can be 10–50x faster than round-tripping to a cloud API
Privacy improves — sensitive data never leaves the device
Cost structure changes — no per-token API charges for on-device inference

Remy doesn't write the code. It manages the agents who do.

AGENTS ASSIGNED TO THIS BUILD

Remy

Product Manager Agent

Leading

Design

Engineer

Deploy

Remy runs the project. The specialists do the work. You work with the PM, not the implementers.

NVIDIA NIM and Local Agents

NVIDIA’s NIM (NVIDIA Inference Microservices) are packaged AI model containers optimized for NVIDIA hardware. RTX Spark ships with support for NIM, meaning developers can pull down a containerized LLM or embedding model and have it running locally in minutes, with performance tuned for the GB10 chip.

This is particularly relevant for AI agent workloads. An agent that needs to call an LLM repeatedly (for planning, reasoning, tool selection) benefits enormously from local inference — each call is faster, cheaper, and doesn’t depend on external uptime.

Microsoft’s Copilot system is also being extended to allow local model routing on devices that meet hardware thresholds, with RTX Spark sitting at the high end of that tier.

Why Local AI Matters for AI Agents

AI agents aren’t just chatbots. They’re systems that reason through problems, call tools, retrieve context, make decisions, and take actions — often across many sequential steps. Each step typically involves an LLM call.

In a cloud-dependent setup, every one of those calls goes over the network. That adds latency, costs money (per token), and creates a dependency on API availability. For agents that need to act quickly or handle sensitive data, those constraints matter.

RTX Spark changes the math:

Speed: Local inference at 1,000+ tokens per second for 70B models means agents can reason and respond in near real-time
Privacy: Medical records, financial data, internal documents — none of it needs to leave the building
Cost: Once you own the hardware, inference is effectively free at the margin
Reliability: No rate limits, no API downtime, no token quotas

For enterprise use cases — compliance-heavy industries, secure government environments, latency-sensitive applications — local AI infrastructure isn’t a nice-to-have. It’s a requirement.

Who RTX Spark Is For

RTX Spark isn’t a consumer gaming PC. It’s not competing with GeForce RTX laptops for gaming performance. The target audience is more specific:

AI Researchers and Developers

Researchers who need to fine-tune, test, and iterate on large models benefit from having dedicated, high-memory AI hardware without cloud bills. RTX Spark lets them run experiments locally at a speed that cloud spot instances often can’t match for interactive work.

Enterprise IT and Secure Environments

Organizations in healthcare, finance, legal, and government often can’t use cloud AI APIs for sensitive workloads. RTX Spark offers a path to deploying capable AI infrastructure that stays entirely on-premises.

AI Application Builders

Developers building agent-based applications — document processors, internal knowledge bases, customer support systems — can host the inference layer locally and build lightweight frontends that call into it. The NIM integration makes this relatively straightforward.

AI-First Workstations

Creative professionals, data scientists, and ML engineers who currently rely on cloud GPU rentals for inference have a compelling alternative. At roughly $3,000 (NVIDIA’s projected price), RTX Spark becomes cost-competitive with cloud inference costs within months for heavy users.

RTX Spark vs. Other Local AI Hardware

RTX Spark doesn’t exist in a vacuum. Several other options target the local AI inference market:

Device	Memory	Architecture	OS	Target
RTX Spark (GB10)	128GB unified	Grace + Blackwell	DGX OS (Linux)	Developers, enterprise
Apple Mac Studio (M4 Ultra)	Up to 192GB unified	ARM + Apple Neural Engine	macOS	Creative pros, developers
Intel Core Ultra AI PCs	32–96GB (system RAM)	x86 + NPU	Windows	General business use
AMD Ryzen AI PCs	32–64GB	x86 + XDNA NPU	Windows	Business, prosumer

REMY IS NOT

✕a coding agent
✕no-code
✕vibe coding
✕a faster Cursor

IT IS

✓a general contractor for software

The one that tells the coding agents what to build.

Apple Silicon remains the primary competitor for developer-friendly local inference. The M4 Ultra Mac Studio has a larger memory ceiling (192GB) and excellent macOS tooling. But RTX Spark has a meaningful advantage in CUDA compatibility — the entire NVIDIA AI software ecosystem, including PyTorch CUDA acceleration, NIM containers, and TensorRT optimization, works natively. That matters for teams already invested in NVIDIA tooling.

How MindStudio Fits Into This Picture

RTX Spark makes local AI inference practical at a new scale. But most teams building AI agents aren’t starting from scratch with raw model inference — they need the orchestration layer on top: workflow logic, integrations with business tools, multi-step reasoning, and interfaces that non-technical users can actually work with.

That’s where MindStudio comes in.

MindStudio is a no-code platform for building AI agents and automated workflows. You can connect to 200+ AI models — including locally-hosted models via Ollama or LMStudio — and build agents that reason across multiple steps, connect to tools like Slack, HubSpot, Google Workspace, and Salesforce, and run on a schedule or in response to triggers.

For teams deploying RTX Spark in an enterprise environment, MindStudio handles the orchestration layer that sits above the inference hardware. Your local model handles the sensitive reasoning; MindStudio handles the workflow logic, integrations, and interfaces. It’s a practical split: keep sensitive inference on-device, use MindStudio for everything else.

MindStudio also supports building autonomous background agents that can run on schedules — useful for document processing, compliance checks, or summarization tasks that benefit from local inference pipelines feeding into automated workflows.

You can try MindStudio free at mindstudio.ai without any setup or API keys required.

Frequently Asked Questions

What is RTX Spark?

RTX Spark is a compact desktop AI computer from NVIDIA, built around the GB10 Grace Blackwell Superchip. It combines a 72-core Grace ARM CPU and a Blackwell-architecture GPU in a single chip, sharing 128GB of unified LPDDR5X memory. It’s designed to run large AI models locally, including LLMs with up to 200 billion parameters when two units are connected.

How is RTX Spark different from a regular gaming GPU?

A gaming GPU (like a GeForce RTX 4090) has its own dedicated VRAM — typically 24GB — separate from system RAM. RTX Spark uses a unified memory pool shared between CPU and GPU, giving it far more usable memory for AI workloads. It also runs on an ARM CPU (Grace) instead of an x86 processor, and the chip-to-chip interconnect delivers substantially higher bandwidth. RTX Spark is purpose-built for AI inference, not rendering or gaming.

Can RTX Spark run AI agents without internet?

Yes. That’s a primary design goal. With models loaded locally via NIM containers or tools like Ollama, RTX Spark can run fully air-gapped AI workloads. An AI agent can reason, retrieve context, and execute tasks entirely on-device, with no API calls to external services.

What models can RTX Spark run?

RTX Spark can run any model that fits within its 128GB memory pool. In practice, that includes:

Llama 3.1 70B (and similar 70B models)
Mistral Large, Qwen 2.5 72B
Gemma 3 27B
Most open-source models up to ~70B at full precision, or larger models at 4-bit quantization
Two units connected can run 200B+ parameter models

NVIDIA provides optimized NIM containers for many of these models, tuned specifically for the Grace Blackwell architecture.

Everyone else built a construction worker.
We built the contractor.

🦺

CODING AGENT

Types the code you tell it to.
One file at a time.

🧠

CONTRACTOR · REMY

Runs the entire build.
UI, API, database, deploy.

What is the Microsoft partnership doing for RTX Spark?

Microsoft and NVIDIA are integrating RTX Spark capabilities into Windows AI Foundry, allowing Windows applications to route inference directly to the GB10 chip. This means Windows developers can build AI features that run locally without cloud API dependencies. Microsoft’s Copilot system is also being extended to support local model routing on high-performance hardware, with RTX Spark sitting at the top tier.

How much does RTX Spark cost?

NVIDIA has positioned RTX Spark (Project DIGITS) at approximately $3,000. That’s significantly less than comparable cloud GPU instances for sustained inference workloads — a single A100 GPU instance can cost $3–8/hour, meaning the hardware pays for itself within months for heavy inference users.

Key Takeaways

RTX Spark fuses a Blackwell GPU and Grace ARM CPU on a single chip (GB10), sharing 128GB of unified memory — enabling local inference at a scale previously only possible in datacenters
The Microsoft partnership through Windows AI Foundry means local AI inference integrates directly into Windows-native applications, without cloud API dependencies
RTX Spark is primarily aimed at AI developers, researchers, and enterprises in regulated industries where data privacy and latency matter
Two units can be linked to run 200B+ parameter models, approaching small datacenter capability on a desktop
Local AI hardware like RTX Spark handles the inference layer; platforms like MindStudio handle the orchestration, workflow, and integration layer on top — the two are complementary, not competing

If you’re building AI agents and want to connect local inference with automated workflows and business tool integrations, MindStudio is worth exploring — it’s free to start, and the average workflow takes less than an hour to build.

What Is RTX Spark? NVIDIA and Microsoft's AI-First PC Chip Explained

A New Kind of AI Chip for a New Kind of AI Workload

What RTX Spark Actually Is