Grok 2 Vision
A multimodal vision model from xAI that combines strong image understanding with text reasoning, multilingual support, and enhanced instruction-following.
Multimodal image and text reasoning from xAI
Grok 2 Vision (grok-2-vision-1212) is a multimodal language model developed by xAI and released in December 2024. It accepts combined image and text inputs and is designed to understand, analyze, and respond to visual content alongside natural language. The model supports images up to 20MiB in JPG, JPEG, or PNG format and can process inputs in any order. It also includes multilingual support and improved instruction-following compared to earlier Grok vision releases.
Grok 2 Vision is suited for production use cases that require visual comprehension, such as image captioning, visual question answering, chart and document analysis, and building AI assistants that respond to visual inputs. It supports tool calling and structured outputs, making it straightforward to integrate into developer workflows. With a 32,768-token context window, it can handle moderately long conversations that mix text and image content.
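As a concrete illustration of mixing image and text inputs, the sketch below builds a chat completion payload in the OpenAI-compatible format that xAI's API uses. The endpoint, field names, and the use of a base64 data URL are assumptions to verify against the current xAI documentation before sending a real request.

```python
import base64

def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """Build a chat payload that interleaves one image part and one text part."""
    # Images can be sent inline as base64 data URLs (a PNG is assumed here).
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "grok-2-vision-1212",
        "messages": [
            {
                "role": "user",
                # Image and text parts may appear in any order within a request.
                "content": [
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }

payload = build_vision_request(b"\x89PNG...", "What does this chart show?")
print(payload["model"])  # grok-2-vision-1212
```

The payload would then be POSTed to the chat completions endpoint with an API key; only the message structure is shown here.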
What Grok 2 Vision supports
Image Understanding
Analyzes image content including objects, styles, charts, and documents. Accepts JPG, JPEG, or PNG files up to 20MiB per image.
Multimodal Input
Accepts interleaved text and image inputs in any order within a single request, enabling flexible prompt construction.
Multilingual Support
Processes and generates responses in multiple languages, making it usable in applications that serve international audiences.
Instruction Following
Follows complex and nuanced prompts with improved steerability introduced in the December 2024 release.
Tool Calling
Supports function calling so developers can connect the model to external tools and APIs within their pipelines.
Structured Outputs
Returns structured data formats and supports temperature control for predictable, integration-ready responses.
Visual Question Answering
Answers natural language questions about image content, including charts, diagrams, and scanned documents.
Long Context Window
Supports up to 32,768 tokens per request, accommodating extended conversations that mix text and image inputs.
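To show how the tool-calling and structured-output features above fit together, here is a minimal sketch of a request carrying an OpenAI-style function definition, which is the schema Grok's function calling follows. The tool name, its parameters, and the prompt are hypothetical examples, not part of xAI's API.

```python
import json

# Hypothetical tool: look up the exact value behind a data point the
# model spotted in a chart. Name and schema are illustrative only.
get_chart_value = {
    "type": "function",
    "function": {
        "name": "get_chart_value",
        "description": "Look up the exact value behind a chart data point.",
        "parameters": {
            "type": "object",
            "properties": {
                "series": {"type": "string", "description": "Name of the data series."},
                "x": {"type": "string", "description": "X-axis label of the point."},
            },
            "required": ["series", "x"],
        },
    },
}

request = {
    "model": "grok-2-vision-1212",
    "messages": [{"role": "user", "content": "Read the Q3 revenue from this chart."}],
    "tools": [get_chart_value],
    "temperature": 0,  # low temperature for predictable, integration-ready output
}
print(json.dumps(request, indent=2))
```

When the model decides to call the tool, the response would contain a `tool_calls` entry with JSON arguments matching the declared schema, which the calling code executes and feeds back.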
Benchmark scores
Scores represent accuracy — the percentage of questions answered correctly on each test.
| Benchmark | What it tests | Score |
|---|---|---|
| MMLU-Pro | Expert knowledge across 14 academic disciplines | 70.9% |
| GPQA Diamond | PhD-level science questions (biology, physics, chemistry) | 51.0% |
| MATH-500 | Undergraduate and competition-level math problems | 77.8% |
| AIME 2024 | American Invitational Mathematics Examination (olympiad-qualifier) problems | 13.3% |
| LiveCodeBench | Real-world coding tasks from recent competitions | 26.7% |
| HLE | Humanity's Last Exam: questions designed to challenge frontier models across many domains | 3.8% |
| SciCode | Scientific research coding and numerical methods | 28.5% |
Common questions about Grok 2 Vision
What is the context window for Grok 2 Vision?
Grok 2 Vision supports a context window of 32,768 tokens per request.
What image formats does Grok 2 Vision accept?
The model accepts JPG, JPEG, and PNG image formats, with a maximum file size of 20MiB per image.
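A small pre-flight check mirroring these documented limits (JPG/JPEG/PNG, at most 20MiB per image) can catch rejected uploads before a request is sent. The helper below is a hypothetical convenience, not part of any SDK.

```python
from pathlib import Path

MAX_BYTES = 20 * 1024 * 1024  # 20 MiB per image
ALLOWED = {".jpg", ".jpeg", ".png"}

def check_image(path: str, size_bytes: int) -> bool:
    """Return True if the file extension and size fit the stated limits."""
    return Path(path).suffix.lower() in ALLOWED and size_bytes <= MAX_BYTES

print(check_image("chart.png", 5 * 1024 * 1024))   # True
print(check_image("scan.tiff", 1024))              # False: unsupported format
print(check_image("photo.jpg", 21 * 1024 * 1024))  # False: over 20 MiB
```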
When was Grok 2 Vision released and what is its training cutoff?
Grok 2 Vision was released in December 2024, and its listed training data cutoff is also December 2024.
Does Grok 2 Vision support tool calling?
Yes, Grok 2 Vision supports function calling and structured outputs, allowing integration with external tools and APIs.
Who publishes Grok 2 Vision and where can I access it via API?
Grok 2 Vision is published by xAI, Elon Musk's AI company. It is accessible through the xAI API and is also listed on OpenRouter under the model ID grok-2-vision-1212.
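Access through OpenRouter uses the same OpenAI-compatible request shape. The sketch below constructs (but does not send) such a request; the base URL and the OpenRouter model slug `x-ai/grok-2-vision-1212` are assumptions to check against OpenRouter's model page.

```python
import json
import urllib.request

def build_openrouter_call(api_key: str, prompt: str) -> urllib.request.Request:
    """Construct an HTTP request to OpenRouter's chat completions endpoint."""
    url = "https://openrouter.ai/api/v1/chat/completions"  # assumed base URL
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": "x-ai/grok-2-vision-1212",  # assumed OpenRouter slug
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(url, data=json.dumps(body).encode(), headers=headers)

req = build_openrouter_call("sk-or-...", "Describe this image.")
print(req.full_url)
```

Sending the request (e.g. with `urllib.request.urlopen`) requires a valid OpenRouter API key; the `"sk-or-..."` placeholder above is illustrative.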
Start building with Grok 2 Vision
No API keys required. Create AI-powered workflows with Grok 2 Vision in minutes — free.