Text Generation Model

GLM 4.6V

A 106B multimodal foundation model with native function calling, 128K context, and state-of-the-art visual understanding for real-world AI agent applications.

Publisher: Z.ai
Type: Text
Context Window: 131,072 tokens
Training Data Cutoff: December 2025
Input: $0.30/MTok
Output: $0.90/MTok
Provider: DeepInfra

106B multimodal model with native visual function calling

GLM-4.6V is a large-scale multimodal foundation model developed by Z.ai, available in two variants: the full 106B parameter version designed for cloud and high-performance cluster deployments, and a lightweight 9B Flash version optimized for local and low-latency use. The model supports a 128K token context window, allowing it to process long documents, multi-page files, and complex mixed-media inputs natively without converting content to plain text first. It was trained with a data cutoff of December 2025.

What distinguishes GLM-4.6V is its native integration of tool-use capabilities within a visual model — it can accept images, screenshots, and document pages directly as inputs to function calls, connecting visual perception to executable actions in agent workflows. The model also supports interleaved image-text generation, frontend replication from UI screenshots, and joint understanding of text, layout, charts, tables, and figures. It is best suited for enterprise and agent-based applications such as document analysis pipelines, multimodal AI assistants, UI automation, and content generation workflows.

What GLM 4.6V supports

Visual Function Calling

Natively passes images, screenshots, and document pages as tool inputs, enabling direct integration of visual perception into agent action loops without intermediate conversion steps.
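
A minimal sketch of what this can look like through an OpenAI-compatible chat completions endpoint. The base URL, model identifier, API key, and the file_expense tool are illustrative assumptions, not the provider's documented catalog entries:

```python
# Sketch: pass a receipt image directly as input to a tool-calling request.
# Endpoint, model identifier, and tool schema are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "file_expense",  # hypothetical tool defined for this example
        "description": "File an expense extracted from a receipt image.",
        "parameters": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "total": {"type": "number"},
                "currency": {"type": "string"},
            },
            "required": ["vendor", "total"],
        },
    },
}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",  # assumed model identifier; check the provider catalog
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the vendor and total from this receipt and file it."},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
        ],
    }],
    tools=tools,
)

print(response.choices[0].message.tool_calls)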

Long Document Understanding

Processes up to 128K tokens of multi-document or long-document input, jointly interpreting text, layout, charts, tables, and figures in a single pass.
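
For long documents, a single request can carry every rendered page image alongside the question. The sketch below assumes pre-rendered PNG pages, an OpenAI-compatible endpoint, and a placeholder model identifier; it is illustrative rather than a documented recipe:

```python
# Sketch: send all pages of a document in one request for single-pass analysis.
# File names, page count, endpoint, and model identifier are placeholder assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="YOUR_API_KEY")

def image_part(path: str) -> dict:
    """Wrap a local page image as an OpenAI-style image_url content part."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{data}"}}

content = [{"type": "text", "text": "Summarize the obligations and deadlines in this contract."}]
content += [image_part(f"contract_page_{i:02d}.png") for i in range(1, 41)]  # assumed pre-rendered pages

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",  # assumed model identifier
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)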

Multimodal Input Processing

Accepts interleaved image and text inputs, supporting complex mixed-media prompts that combine visual and textual content in a single context.
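
Interleaved prompts simply alternate text and image parts inside one message. In this sketch the endpoint, model name, and image paths are assumptions for illustration:

```python
# Sketch: interleave text segments and images in a single user message.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="YOUR_API_KEY")

def as_data_url(path: str) -> str:
    """Encode a local image file as a data URL usable in an image_url content part."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the revenue chart on this page..."},
            {"type": "image_url", "image_url": {"url": as_data_url("page_3.png")}},
            {"type": "text", "text": "...with this summary table, and note any discrepancies."},
            {"type": "image_url", "image_url": {"url": as_data_url("page_12.png")}},
        ],
    }],
)
print(response.choices[0].message.content)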

Interleaved Image-Text Generation

Generates coherent mixed-media outputs from multimodal inputs, actively calling search and retrieval tools during generation to produce visually grounded content.

Frontend Replication

Reconstructs pixel-accurate HTML and CSS from UI screenshots and supports natural-language-driven iterative edits to the generated code.
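
One way this could look in practice, assuming an OpenAI-compatible endpoint and a placeholder model identifier: send the screenshot with a replication request, then describe the revision in plain language on a follow-up turn:

```python
# Sketch: screenshot-to-HTML replication followed by a natural-language edit.
# Endpoint, model identifier, and screenshot URL are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="YOUR_API_KEY")

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Reproduce this page as a single HTML file with inline CSS."},
        {"type": "image_url", "image_url": {"url": "https://example.com/landing_page.png"}},
    ],
}]
first = client.chat.completions.create(model="zai-org/GLM-4.6V", messages=messages)

# Iterative edit: keep the generated code in context and describe the change in plain language.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Make the header sticky and switch the palette to dark mode."})
revised = client.chat.completions.create(model="zai-org/GLM-4.6V", messages=messages)
print(revised.choices[0].message.content)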

Agent Workflow Integration

Designed for agentic pipelines where visual understanding must trigger downstream actions, supporting tool orchestration across document analysis and UI automation tasks.
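
A stripped-down sketch of such a loop: the model sees a screenshot, requests a tool call, the host executes it and returns the result, and the model continues. The click_element tool, endpoint, and model identifier are illustrative assumptions:

```python
# Sketch: minimal agent loop where visual understanding triggers a tool call.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="YOUR_API_KEY")

def click_element(selector: str) -> str:
    """Placeholder for a real UI-automation backend (e.g. a browser driver)."""
    return f"clicked {selector}"

tools = [{"type": "function", "function": {
    "name": "click_element",
    "description": "Click a UI element identified by a CSS selector.",
    "parameters": {"type": "object",
                   "properties": {"selector": {"type": "string"}},
                   "required": ["selector"]},
}}]

messages = [{"role": "user", "content": [
    {"type": "text", "text": "Open the settings panel shown in this screenshot."},
    {"type": "image_url", "image_url": {"url": "https://example.com/app_screenshot.png"}},
]}]

reply = client.chat.completions.create(
    model="zai-org/GLM-4.6V", messages=messages, tools=tools  # assumed model identifier
).choices[0].message

if reply.tool_calls:
    call = reply.tool_calls[0]
    result = click_element(**json.loads(call.function.arguments))
    messages.append(reply)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    followup = client.chat.completions.create(model="zai-org/GLM-4.6V", messages=messages, tools=tools)
    print(followup.choices[0].message.content)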

Ready to build with GLM 4.6V?

Get Started Free

Benchmark scores

Scores represent accuracy — the percentage of questions answered correctly on each test.

Benchmark | What it tests | Score
MMLU-Pro | Expert knowledge across 14 academic disciplines | 78.4%
GPQA Diamond | PhD-level science questions (biology, physics, chemistry) | 63.2%
LiveCodeBench | Real-world coding tasks from recent competitions | 56.1%
HLE | Questions that challenge frontier models across many domains | 5.2%
SciCode | Scientific research coding and numerical methods | 33.1%

Common questions about GLM 4.6V

What is the context window for GLM-4.6V?

GLM-4.6V supports a context window of 128K tokens (131,072 tokens), allowing it to process long documents, multi-page files, and extended multimodal inputs in a single request.

What are the two available versions of GLM-4.6V?

GLM-4.6V is available as a 106B parameter model intended for cloud and high-performance cluster deployments, and as GLM-4.6V-Flash, a 9B parameter variant optimized for local inference and low-latency use cases.

What is the training data cutoff for GLM-4.6V?

According to the model metadata, GLM-4.6V has a training data cutoff of December 2025.

What types of inputs does GLM-4.6V accept?

GLM-4.6V accepts interleaved image and text inputs. It can process images, screenshots, document pages, charts, tables, and figures alongside text within its 128K token context window.

Who developed GLM-4.6V and where can I access the model weights?

GLM-4.6V was developed by Z.ai (also referred to as Zai). Model weights and a model card are available on Hugging Face at huggingface.co/zai-org/GLM-4.6V, and the code repository is hosted on GitHub at github.com/zai-org/GLM-V.

What people think about GLM 4.6V

Community reception on r/LocalLLaMA was positive at launch, with both the 106B and 9B Flash variants receiving several hundred upvotes and active discussion threads. Users highlighted the native visual function calling capability and the availability of a locally runnable 9B version as notable aspects of the release.

Some community members discussed hardware requirements for running the larger 106B model, with a later thread specifically covering deployment on dual RTX Pro 6000 setups with 192 GB VRAM using vLLM. The Flash variant drew particular interest from users focused on local inference and lower-resource deployments.


Parameters & options

Max Temperature: 1
Max Response Size: 16,384 tokens
Reasoning Effort: configurable, default medium
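
These settings map onto an OpenAI-compatible request roughly as sketched below; the model identifier is a placeholder, and reasoning_effort is passed via extra_body because named-parameter support varies by client version and provider:

```python
# Sketch: applying the listed parameters to a request.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarize the attached report in five bullet points."}],
    temperature=0.7,            # any value up to the listed maximum of 1
    max_tokens=16384,           # listed maximum response size
    extra_body={"reasoning_effort": "medium"},  # listed default; exact accepted values may vary by provider
)
print(response.choices[0].message.content)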

Start building with GLM 4.6V

No API keys required. Create AI-powered workflows with GLM 4.6V in minutes — free.