
What Is the Google AI Edge Gallery? How to Run LLMs Offline on Your iPhone

Google AI Edge Gallery is a free app that runs Gemma models fully on-device for offline text generation, image understanding, and other AI tasks. Here's how it works.

MindStudio Team

Running a Full AI Model in Your Pocket — Without Wi-Fi

Your phone is more powerful than the computers that sent humans to the moon. So why does it still need to phone home to a cloud server every time you want AI help?

Google AI Edge Gallery changes that. It’s a free app that runs large language models — specifically Google’s Gemma family — entirely on your iPhone or Android device, with no internet connection required. No data leaves your phone. No server roundtrips. Just local inference, running on the same chip that handles your photos and apps.

This article explains what Google AI Edge Gallery is, how it works, which Gemma models it supports, how to install and use it on iOS, and where it fits in the broader on-device AI picture.


What “On-Device AI” Actually Means

Most AI apps — ChatGPT, Gemini, Claude — work by sending your prompt to a remote server, processing it there, and streaming the response back. That’s cloud inference.

On-device inference flips that arrangement. The AI model weights are downloaded once and stored locally. All computation happens on your device’s chip. No request leaves your phone.

This matters for several reasons:

  • Privacy — Your prompts, documents, and images never touch an external server.
  • Offline use — The model works in airplane mode, in a basement, or anywhere without reliable connectivity.
  • Latency — No round-trip to a server means responses can start generating immediately.
  • Cost — You’re not consuming API credits or paying per token after the initial download.

The tradeoff is model size. The most capable cloud models have hundreds of billions of parameters. On-device models need to fit within a phone’s RAM — typically 4–8 GB available for apps — so they’re smaller and less capable on complex reasoning tasks. But for many everyday tasks, they’re surprisingly good.
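As a rough back-of-envelope check, you can estimate whether a model fits in a phone's RAM from its parameter count and quantization level. The Python sketch below uses an assumed 20% overhead factor for activations and KV cache; the real footprint varies by runtime.

```python
def model_ram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough RAM estimate: parameters x bytes per weight, plus ~20%
    overhead for activations and the KV cache (the 1.2 factor is an
    illustrative assumption, not a measured figure)."""
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8)
    return round(bytes_total * overhead / 1e9, 2)

# A 2B-parameter model quantized to 4 bits per weight:
print(model_ram_gb(2, 4))   # ~1.2 GB
# The same model held at 16-bit precision:
print(model_ram_gb(2, 16))  # ~4.8 GB
```

This is why on-device models ship quantized: the same 2B-parameter model that fits comfortably in a phone's app memory at 4 bits would crowd out everything else at full precision.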


What Google AI Edge Gallery Is

Google AI Edge Gallery is an open-source showcase app built by Google’s AI Edge team. It demonstrates what’s possible when you run Gemma models entirely on a mobile device.

Think of it as a testing ground, not a polished consumer product. It’s designed for developers, researchers, and technically curious users who want to experiment with on-device LLMs. Google uses it to show what their edge AI stack — built on LiteRT (formerly TensorFlow Lite) and MediaPipe — can do on real hardware.

The app is available on both iOS and Android, and the source code is publicly available on GitHub under an open-source license.

What It’s Not

It’s not Google’s main Gemini app. It’s not a replacement for ChatGPT. It’s an experimental tool that prioritizes transparency and control over polish.

There’s no account required. No sign-in. No telemetry by default. You download a model, and you run it locally.


The Gemma Models Behind It

Google AI Edge Gallery is powered by the Gemma family — Google’s series of open-weight models designed for both research and on-device deployment.

Gemma 3n

Gemma 3n is the flagship model for mobile use. Google engineered it specifically for edge deployment, using the nested MatFormer (“Matryoshka Transformer”) architecture, which lets the model adjust its effective size depending on available hardware. It handles text and image inputs, making it multimodal.

Key specs:

  • Available in 2B and 4B parameter configurations
  • Designed to run within 2–4 GB of RAM
  • Supports multilingual tasks across 35+ languages
  • Can process text and images natively

Gemma 2 2B

An earlier generation model, Gemma 2 2B is text-only but well-suited for devices with tighter memory constraints. It’s faster to load and generates tokens more quickly than larger models, making it a good starting point for basic Q&A and text tasks.

How Models Are Delivered

Models are downloaded inside the app using Kaggle or Google’s model hosting. Once downloaded, they’re stored on your device and available offline indefinitely. You only need internet access for the initial download.

Model files are typically 1–3 GB depending on the variant, so make sure you have storage available before downloading.


How to Install and Use It on iPhone

Here’s the step-by-step process for iOS.

Prerequisites

Before you start:

  • iPhone running iOS 16 or later (iOS 17+ recommended for best performance)
  • At least 4 GB of free storage (more if you plan to download multiple models)
  • An iPhone 12 or newer for acceptable performance — the A14 Bionic and later chips handle the matrix operations efficiently
  • A Wi-Fi connection for the initial model download

Step 1: Download the App

Search for “AI Edge Gallery” in the Apple App Store, or find it directly through Google’s AI Edge documentation. The app is free with no in-app purchases.

At time of writing, the iOS version is in active development, so you may need to check whether it’s available as a stable App Store release or as a TestFlight beta. Google has been rolling out iOS support progressively.

Step 2: Open the App and Browse Available Tasks

When you open AI Edge Gallery, you’ll see a list of available tasks and demos. These include:

  • Ask Image — Upload or capture a photo and ask questions about it
  • Prompt Lab — Open-ended text generation and Q&A
  • AI Feature Gallery — A collection of smaller, task-specific demos like summarization and classification

Step 3: Download a Model

Tap on a task, then select a model to download. The app shows you the model size, expected performance, and supported hardware before you commit.

For most iPhones, start with Gemma 3n E2B (an efficiency variant of the 2B model). It downloads faster and loads quickly. Gemma 3n E4B gives better output quality but needs more RAM.

The download runs in the background. Don’t close the app until it completes.

Step 4: Run the Model

Once downloaded, tap the model name to load it. Initial load takes 5–20 seconds depending on model size and device. After that, it stays in memory as long as the app is open.

Type a prompt into the input field and tap send. Tokens stream in real-time — you’ll see output generating word by word directly from your phone’s processor.

Step 5: Go Offline

Once the model is loaded, turn on airplane mode. Everything still works. This is the core value proposition: complete offline AI inference.


What You Can Do With It

Google AI Edge Gallery isn’t a general-purpose AI assistant. It’s a demonstration platform, so the feature set is intentionally focused.

Text Generation and Q&A

The Prompt Lab interface gives you a bare-bones chat experience. You can ask factual questions, summarize text you paste in, brainstorm ideas, or draft short pieces of writing.

Response quality is noticeably different from a cloud model. Gemma 3n E2B is capable but not as deep as GPT-4o or Gemini 1.5 Pro on complex reasoning tasks. For simple queries — “What does this contract clause mean?” or “Summarize these meeting notes” — it works well.

Image Understanding

With Gemma 3n’s multimodal capabilities, you can tap the image button, attach a photo, and ask questions about it. Examples that work well:

  • “What’s in this image?”
  • “Read the text in this sign”
  • “Describe what’s happening here”
  • “Is there anything unusual in this photo?”

This runs entirely on-device. Your photos never leave your phone.

Experimenting With Model Parameters

Unlike most consumer AI apps, AI Edge Gallery exposes model settings like temperature, top-k sampling, and max token length. This makes it useful for developers who want to understand how these parameters affect output before baking them into an application.
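To get a feel for what those knobs do before touching them in the app, here's a minimal Python sketch of temperature plus top-k sampling. The logits and seed are toy values for illustration, not the app's actual implementation.

```python
import math, random

def sample_top_k(logits, temperature=0.8, k=3, rng=random.Random(0)):
    """Sketch of temperature + top-k sampling. Higher temperature
    flattens the distribution (more varied output); smaller k
    restricts choices to the k most likely tokens."""
    # Keep only the indices of the k highest-scoring tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax over the survivors.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index according to those probabilities.
    return rng.choices(top, weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]   # toy scores for a 4-token vocabulary
print(sample_top_k(logits, temperature=0.8, k=3))
```

With k=1 the sampler is greedy (it always picks the top token); raising temperature toward 2.0 makes the lower-ranked survivors progressively more likely.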

Benchmarking Device Performance

Because the app shows tokens-per-second in real-time, it’s also a practical way to benchmark how well your specific iPhone handles local inference workloads. Useful data if you’re evaluating whether on-device AI is viable for an app you’re building.
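The measurement itself is simple: count tokens emitted and divide by wall-clock time. Here's a minimal Python sketch, with a stub standing in for a real decode step (the 2 ms delay is an arbitrary placeholder, not a real model's latency).

```python
import time

def tokens_per_second(generate_token, n_tokens=50):
    """Time a token generator the way throughput readouts do:
    tokens emitted divided by elapsed wall-clock seconds.
    `generate_token` is a hypothetical stand-in for one decode step."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stub decode step (~2 ms sleep) standing in for real on-device inference.
rate = tokens_per_second(lambda: time.sleep(0.002))
print(f"{rate:.0f} tokens/sec")
```

When comparing devices this way, run the same prompt a few times and discard the first result, since the initial run includes model warm-up costs.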


Performance: What to Expect on iPhone

Real-world performance varies by device. Here’s a rough picture based on available reports:

Device          Model          Approx. Tokens/sec
iPhone 16 Pro   Gemma 3n E2B   25–35
iPhone 15 Pro   Gemma 3n E2B   18–25
iPhone 14       Gemma 3n E2B   12–18
iPhone 12       Gemma 2 2B     8–12

These are approximate. Actual numbers depend on background processes, thermal state, and available RAM.

25 tokens per second is roughly 18–20 words per second — faster than most people can read, and more than fast enough for practical use. On newer A17 and A18 chips, performance continues to improve.
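That conversion uses the common rule of thumb of roughly 0.75 English words per token (about 1.3 tokens per word); the exact ratio depends on the tokenizer and the text.

```python
def tokens_to_words_per_sec(tok_per_sec, words_per_token=0.75):
    """Rule-of-thumb conversion (~0.75 English words per token)."""
    return tok_per_sec * words_per_token

print(tokens_to_words_per_sec(25))  # 18.75 words/sec
```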

The Neural Engine in Apple Silicon handles the matrix multiplications efficiently. Google’s LiteRT runtime is optimized to use it where possible, which is why performance on iPhones tends to be strong relative to Android devices with equivalent chip specs.


Limitations Worth Knowing

Google AI Edge Gallery is experimental software. Going in with realistic expectations helps.

Context window is short. On-device models are limited by available RAM. Don’t expect to paste in long documents. Most configurations cap the context at 512–1024 tokens, which is enough for short exchanges but not deep document analysis.
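If you do feed it longer text, the practical workaround is to keep only the most recent tokens that fit, leaving room for the reply. Here's a Python sketch; the token counts and the reserve size are illustrative, and a real app would count tokens with the model's own tokenizer.

```python
def truncate_to_context(prompt_tokens, max_context=1024, reserve_for_output=256):
    """Keep only the most recent tokens that fit the model's context
    window, reserving space for the generated reply."""
    budget = max_context - reserve_for_output
    return prompt_tokens[-budget:] if len(prompt_tokens) > budget else prompt_tokens

tokens = list(range(2000))       # a prompt far too long for a 1024-token window
kept = truncate_to_context(tokens)
print(len(kept))  # 768 — the most recent tokens that fit
```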

The app isn’t polished. Crashes are possible, especially on older devices or when switching between models with large memory footprints. If it crashes, close other apps and try again.

Model updates are manual. Unlike cloud AI where the model improves in the background, on-device models stay static until you manually download an update.

No memory between sessions. Conversations don’t persist by default. Each session starts fresh.

Multimodal requires Gemma 3n. If you download an older text-only model, image features won’t appear. Make sure you select the right model for the task.


Where Cloud AI Still Wins

On-device AI is a genuine capability, not just a technical novelty. But it’s not a full replacement for cloud models yet.

Cloud models like Gemini 1.5 Pro or Claude 3.5 Sonnet have far larger context windows (up to 1 million tokens), stronger reasoning on complex tasks, and access to live information. They also integrate with tools, browse the web, and handle multi-step workflows.

On-device AI is best for:

  • Privacy-sensitive tasks where data can’t leave the device
  • Offline or low-connectivity environments
  • Low-latency use cases where speed matters more than output quality
  • Augmenting an app with simple AI features without API costs

The most practical approach for most teams is hybrid: run lightweight tasks locally, route complex or context-heavy tasks to the cloud.
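A hybrid setup usually comes down to a small routing decision per task. Here's a toy Python sketch of such a router; the 4-characters-per-token estimate and the thresholds are assumptions for illustration, not a production policy.

```python
def route_task(prompt: str, offline: bool = False,
               privacy_sensitive: bool = False,
               local_context_limit: int = 1024) -> str:
    """Toy router for the hybrid approach: lightweight or sensitive
    tasks run locally, context-heavy tasks go to the cloud."""
    approx_tokens = len(prompt) // 4   # rough heuristic, not a real tokenizer
    if privacy_sensitive or offline:
        return "local"                 # data can't leave the device / no network
    if approx_tokens > local_context_limit:
        return "cloud"                 # won't fit the on-device context window
    return "local"                     # small and non-sensitive: avoid API cost

print(route_task("Summarize these meeting notes", offline=True))  # local
print(route_task("x" * 10000))                                    # cloud
```

A real router would also consider task complexity and battery state, but the shape of the decision — privacy and connectivity first, then size — stays the same.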


How MindStudio Connects to This

If Google AI Edge Gallery shows you what on-device AI can do in isolation, MindStudio shows you what cloud AI can do when connected to the rest of your workflow.

MindStudio is a no-code platform for building AI agents — automations that don’t just generate text, but take actions across your tools. It gives you access to over 200 AI models (including the full Gemini family, Claude, GPT-4o, and more) without managing API keys, rate limits, or authentication.

The contrast is worth thinking about: running Gemma locally on your iPhone is great for privacy and offline use. But if you want to build an AI agent that reads emails, summarizes them, writes a response, and logs the output to a spreadsheet — that kind of multi-step, connected workflow is where MindStudio fits.

A few things you can build in MindStudio that complement on-device AI exploration:

  • Document processing agents — upload files and extract structured data using Gemini or Claude, then send output to Google Sheets or Airtable
  • Speech-to-text pipelines — transcribe audio and run downstream analysis with custom prompts
  • Multi-model comparison tools — route the same prompt to Gemma via API, GPT-4o, and Claude 3.5, then compare outputs side by side

MindStudio’s visual builder means you can prototype these agents in under an hour without writing code. You can try it free at mindstudio.ai.

If you’re curious about how Gemini models (the cloud counterparts to Gemma) fit into automated workflows, the MindStudio blog on Gemini integrations has practical coverage on that topic.


Frequently Asked Questions

Does Google AI Edge Gallery work on iPhone?

Yes. Google AI Edge Gallery supports iOS in addition to Android. iOS support has been rolling out progressively, and the app may be available through the App Store or as a TestFlight beta depending on when you check. The experience is similar across platforms, though performance varies based on device hardware.

Which models does the app support?

The app primarily supports Google’s Gemma family of open-weight models — including Gemma 3n (in E2B and E4B efficiency variants) and Gemma 2 2B. Gemma 3n is multimodal (text and images), while Gemma 2 2B handles text only. Model availability may expand as the app is updated.

Does it work completely offline?

Yes, once you’ve downloaded a model over Wi-Fi, the app runs entirely offline. No prompts, images, or responses are sent to any server. All inference happens locally on your device’s processor and Neural Engine.

How much storage do I need?

Plan for 1–3 GB per model download, plus the app itself. If you download multiple models, storage requirements add up quickly. Make sure you have at least 4 GB of free space before starting, and closer to 8 GB if you want flexibility to try multiple models.

Is this the same as the Gemini app?

No. The Gemini app is Google’s consumer AI assistant, which connects to cloud-hosted models. AI Edge Gallery is a separate, experimental tool specifically for on-device inference. It’s aimed at developers and researchers who want to explore local AI capabilities — not a replacement for the Gemini app.

How does Gemma compare to GPT or Claude for on-device use?

Gemma is optimized for on-device deployment in a way that GPT and Claude are not — those models are cloud-only and not available for local inference. Among open-weight models suited for mobile hardware, Gemma 3n is competitive with Meta’s Llama 3.2 1B/3B variants and Microsoft’s Phi-3 mini. Each has tradeoffs in language support, speed, and reasoning depth, but all are significantly less capable than frontier cloud models on complex tasks.


Key Takeaways

  • Google AI Edge Gallery is a free, open-source app that runs Gemma LLMs entirely on your device — no internet, no server, no account required.
  • Gemma 3n is the recommended model for most iPhones — it’s multimodal, efficient, and performs well on A14 Bionic and newer chips.
  • Setup is straightforward: download the app, download a model over Wi-Fi, then use it offline indefinitely.
  • On-device AI is best for privacy-sensitive tasks, offline use, and low-latency applications — not for complex reasoning or long-context work.
  • Cloud AI and on-device AI are complementary, not competitive — tools like MindStudio let you build connected workflows with cloud models when local inference isn’t enough.

If you’re exploring the broader AI model landscape and want to build workflows around Gemini and other frontier models without managing infrastructure, MindStudio is worth trying — it takes most of the plumbing work off your plate so you can focus on what the AI actually does.

Presented by MindStudio
