Grok 2 Vision
A multimodal vision model from xAI that combines strong image understanding with text reasoning, multilingual support, and enhanced instruction-following.
Multimodal image and text reasoning from xAI
Grok 2 Vision (grok-2-vision-1212) is a multimodal language model developed by xAI and released in December 2024. It accepts combined image and text inputs and is designed to understand, analyze, and respond to visual content alongside natural language. The model supports images up to 20MiB in JPG, JPEG, or PNG format and can process inputs in any order. It also includes multilingual support and improved instruction-following compared to earlier Grok vision releases.
Grok 2 Vision is suited for production use cases that require visual comprehension, such as image captioning, visual question answering, chart and document analysis, and building AI assistants that respond to visual inputs. It supports tool calling and structured outputs, making it straightforward to integrate into developer workflows. With a 32,768-token context window, it can handle moderately long conversations that mix text and image content.
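As a concrete illustration of mixing image and text inputs, the sketch below builds a chat completion payload in the OpenAI-compatible format that xAI's API uses. The endpoint, field names, and the use of a base64 data URL are assumptions to verify against the current xAI documentation before sending a real request.

```python
import base64

def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """Build a chat payload that interleaves one image part and one text part."""
    # Images can be sent inline as base64 data URLs (a PNG is assumed here).
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "grok-2-vision-1212",
        "messages": [
            {
                "role": "user",
                # Image and text parts may appear in any order within a request.
                "content": [
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }

payload = build_vision_request(b"\x89PNG...", "What does this chart show?")
print(payload["model"])  # grok-2-vision-1212
```

The payload would then be POSTed to the chat completions endpoint with an API key; only the message structure is shown here.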
What Grok 2 Vision supports
Image Understanding
Analyzes image content including objects, styles, charts, and documents. Accepts JPG, JPEG, or PNG files up to 20MiB per image.
Multimodal Input
Accepts interleaved text and image inputs in any order within a single request, enabling flexible prompt construction.
Multilingual Support
Processes and generates responses in multiple languages, making it usable in applications that serve international audiences.
Instruction Following
Follows complex and nuanced prompts with improved steerability introduced in the December 2024 release.
Tool Calling
Supports function calling so developers can connect the model to external tools and APIs within their pipelines.
Structured Outputs
Returns structured data formats and supports temperature control for predictable, integration-ready responses.
Visual Question Answering
Answers natural language questions about image content, including charts, diagrams, and scanned documents.
Long Context Window
Supports up to 32,768 tokens per request, accommodating extended conversations that mix text and image inputs.
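To show how the tool-calling and structured-output features above fit together, here is a minimal sketch of a request carrying an OpenAI-style function definition, which is the schema Grok's function calling follows. The tool name, its parameters, and the prompt are hypothetical examples, not part of xAI's API.

```python
import json

# Hypothetical tool: look up the exact value behind a data point the
# model spotted in a chart. Name and schema are illustrative only.
get_chart_value = {
    "type": "function",
    "function": {
        "name": "get_chart_value",
        "description": "Look up the exact value behind a chart data point.",
        "parameters": {
            "type": "object",
            "properties": {
                "series": {"type": "string", "description": "Name of the data series."},
                "x": {"type": "string", "description": "X-axis label of the point."},
            },
            "required": ["series", "x"],
        },
    },
}

request = {
    "model": "grok-2-vision-1212",
    "messages": [{"role": "user", "content": "Read the Q3 revenue from this chart."}],
    "tools": [get_chart_value],
    "temperature": 0,  # low temperature for predictable, integration-ready output
}
print(json.dumps(request, indent=2))
```

When the model decides to call the tool, the response would contain a `tool_calls` entry with JSON arguments matching the declared schema, which the calling code executes and feeds back.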
Benchmark scores
Scores represent accuracy — the percentage of questions answered correctly on each test.
| Benchmark | What it tests | Score |
|---|---|---|
| MMLU-Pro | Expert knowledge across 14 academic disciplines | 70.9% |
| GPQA Diamond | PhD-level science questions (biology, physics, chemistry) | 51.0% |
| MATH-500 | Undergraduate and competition-level math problems | 77.8% |
| AIME 2024 | American Invitational Mathematics Examination (olympiad-qualifier) problems | 13.3% |
| LiveCodeBench | Real-world coding tasks from recent competitions | 26.7% |
| HLE | Humanity's Last Exam: questions designed to challenge frontier models across many domains | 3.8% |
| SciCode | Scientific research coding and numerical methods | 28.5% |
Common questions about Grok 2 Vision
What is the context window for Grok 2 Vision?
Grok 2 Vision supports a context window of 32,768 tokens per request.
What image formats does Grok 2 Vision accept?
The model accepts JPG, JPEG, and PNG image formats, with a maximum file size of 20MiB per image.
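A small pre-flight check mirroring these documented limits (JPG/JPEG/PNG, at most 20MiB per image) can catch rejected uploads before a request is sent. The helper below is a hypothetical convenience, not part of any SDK.

```python
from pathlib import Path

MAX_BYTES = 20 * 1024 * 1024  # 20 MiB per image
ALLOWED = {".jpg", ".jpeg", ".png"}

def check_image(path: str, size_bytes: int) -> bool:
    """Return True if the file extension and size fit the stated limits."""
    return Path(path).suffix.lower() in ALLOWED and size_bytes <= MAX_BYTES

print(check_image("chart.png", 5 * 1024 * 1024))   # True
print(check_image("scan.tiff", 1024))              # False: unsupported format
print(check_image("photo.jpg", 21 * 1024 * 1024))  # False: over 20 MiB
```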
When was Grok 2 Vision released and what is its training cutoff?
Grok 2 Vision was released in December 2024, and its listed training data cutoff is also December 2024.
Does Grok 2 Vision support tool calling?
Yes, Grok 2 Vision supports function calling and structured outputs, allowing integration with external tools and APIs.
Who publishes Grok 2 Vision and where can I access it via API?
Grok 2 Vision is published by xAI, Elon Musk's AI company. It is accessible through the xAI API and is also listed on OpenRouter under the model ID grok-2-vision-1212.
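Access through OpenRouter uses the same OpenAI-compatible request shape. The sketch below constructs (but does not send) such a request; the base URL and the OpenRouter model slug `x-ai/grok-2-vision-1212` are assumptions to check against OpenRouter's model page.

```python
import json
import urllib.request

def build_openrouter_call(api_key: str, prompt: str) -> urllib.request.Request:
    """Construct an HTTP request to OpenRouter's chat completions endpoint."""
    url = "https://openrouter.ai/api/v1/chat/completions"  # assumed base URL
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": "x-ai/grok-2-vision-1212",  # assumed OpenRouter slug
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(url, data=json.dumps(body).encode(), headers=headers)

req = build_openrouter_call("sk-or-...", "Describe this image.")
print(req.full_url)
```

Sending the request (e.g. with `urllib.request.urlopen`) requires a valid OpenRouter API key; the `"sk-or-..."` placeholder above is illustrative.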
Start building with Grok 2 Vision
No API keys required. Create AI-powered workflows with Grok 2 Vision in minutes — free.