Skip to main content
MindStudio
Pricing
Blog About
My Workspace
Vision Model

Grok 2 Vision

A multimodal vision model from xAI that combines strong image understanding with text reasoning, multilingual support, and enhanced instruction-following.

Publisher X.ai
Type Vision
Context Window 32,768 tokens
Training Data December 2024
Input $2.00/MTok
Output $10.00/MTok

Multimodal image and text reasoning from xAI

Grok 2 Vision (grok-2-vision-1212) is a multimodal language model developed by xAI and released in December 2024. It accepts combined image and text inputs and is designed to understand, analyze, and respond to visual content alongside natural language. The model supports images up to 20MiB in JPG, JPEG, or PNG format and can process inputs in any order. It also includes multilingual support and improved instruction-following compared to earlier Grok vision releases.

Grok 2 Vision is suited for production use cases that require visual comprehension, such as image captioning, visual question answering, chart and document analysis, and building AI assistants that respond to visual inputs. It supports tool calling and structured outputs, making it straightforward to integrate into developer workflows. With a 32,768-token context window, it can handle moderately long conversations that mix text and image content.

What Grok 2 Vision supports

Image Understanding

Analyzes image content including objects, styles, charts, and documents. Accepts JPG, JPEG, or PNG files up to 20MiB per image.

Multimodal Input

Accepts interleaved text and image inputs in any order within a single request, enabling flexible prompt construction.

Multilingual Support

Processes and generates responses in multiple languages, making it usable for internationally facing applications.

Instruction Following

Follows complex and nuanced prompts with improved steerability introduced in the December 2024 release.

Tool Calling

Supports function calling so developers can connect the model to external tools and APIs within their pipelines.

Structured Outputs

Returns structured data formats and supports temperature control for predictable, integration-ready responses.

Visual Question Answering

Answers natural language questions about image content, including charts, diagrams, and scanned documents.

Long Context Window

Supports up to 32,768 tokens per request, accommodating extended conversations that mix text and image inputs.

Ready to build with Grok 2 Vision?

Get Started Free

Benchmark scores

Scores represent accuracy — the percentage of questions answered correctly on each test.

Benchmark What it tests Score
MMLU-Pro Expert knowledge across 14 academic disciplines 70.9%
GPQA Diamond PhD-level science questions (biology, physics, chemistry) 51.0%
MATH-500 Undergraduate and competition-level math problems 77.8%
AIME 2024 American math olympiad problems 13.3%
LiveCodeBench Real-world coding tasks from recent competitions 26.7%
HLE Questions that challenge frontier models across many domains 3.8%
SciCode Scientific research coding and numerical methods 28.5%

Common questions about Grok 2 Vision

What is the context window for Grok 2 Vision?

Grok 2 Vision supports a context window of 32,768 tokens per request.

What image formats does Grok 2 Vision accept?

The model accepts JPG, JPEG, and PNG image formats, with a maximum file size of 20MiB per image.

When was Grok 2 Vision released and what is its training cutoff?

Grok 2 Vision was released in December 2024, with a training date listed as December 2024.

Does Grok 2 Vision support tool calling?

Yes, Grok 2 Vision supports function calling and structured outputs, allowing integration with external tools and APIs.

Who publishes Grok 2 Vision and where can I access it via API?

Grok 2 Vision is published by xAI (the AI division of X). It is accessible through the xAI API and is also listed on OpenRouter under the model ID grok-2-vision-1212.

Parameters & options

Max Temperature 1

Start building with Grok 2 Vision

No API keys required. Create AI-powered workflows with Grok 2 Vision in minutes — free.