Text Generation Model

GLM 4.6V

A 106B multimodal foundation model with native function calling, 128K context, and state-of-the-art visual understanding for real-world AI agent applications.

Publisher: Z.ai
Type: Text
Context Window: 131,072 tokens
Training Data Cutoff: December 2025
Input: $0.30/MTok
Output: $0.90/MTok
Provider: DeepInfra

106B multimodal model with native visual function calling

GLM-4.6V is a large-scale multimodal foundation model developed by Z.ai, available in two variants: the full 106B parameter version designed for cloud and high-performance cluster deployments, and a lightweight 9B Flash version optimized for local and low-latency use. The model supports a 128K token context window, allowing it to process long documents, multi-page files, and complex mixed-media inputs natively without converting content to plain text first. It was trained with a data cutoff of December 2025.

What distinguishes GLM-4.6V is its native integration of tool-use capabilities within a visual model — it can accept images, screenshots, and document pages directly as inputs to function calls, connecting visual perception to executable actions in agent workflows. The model also supports interleaved image-text generation, frontend replication from UI screenshots, and joint understanding of text, layout, charts, tables, and figures. It is best suited for enterprise and agent-based applications such as document analysis pipelines, multimodal AI assistants, UI automation, and content generation workflows.

What GLM 4.6V supports

Visual Function Calling

Natively passes images, screenshots, and document pages as tool inputs, enabling direct integration of visual perception into agent action loops without intermediate conversion steps.
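
A minimal sketch of what this can look like through an OpenAI-compatible chat completions endpoint. The base URL, model identifier, API key, and the file_expense tool are illustrative assumptions, not the provider's documented catalog entries:

```python
# Sketch: pass a receipt image directly as input to a tool-calling request.
# Endpoint, model identifier, and tool schema are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "file_expense",  # hypothetical tool defined for this example
        "description": "File an expense extracted from a receipt image.",
        "parameters": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "total": {"type": "number"},
                "currency": {"type": "string"},
            },
            "required": ["vendor", "total"],
        },
    },
}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",  # assumed model identifier; check the provider catalog
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the vendor and total from this receipt and file it."},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
        ],
    }],
    tools=tools,
)

print(response.choices[0].message.tool_calls)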

Long Document Understanding

Processes up to 128K tokens of multi-document or long-document input, jointly interpreting text, layout, charts, tables, and figures in a single pass.
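
For long documents, a single request can carry every rendered page image alongside the question. The sketch below assumes pre-rendered PNG pages, an OpenAI-compatible endpoint, and a placeholder model identifier; it is illustrative rather than a documented recipe:

```python
# Sketch: send all pages of a document in one request for single-pass analysis.
# File names, page count, endpoint, and model identifier are placeholder assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="YOUR_API_KEY")

def image_part(path: str) -> dict:
    """Wrap a local page image as an OpenAI-style image_url content part."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{data}"}}

content = [{"type": "text", "text": "Summarize the obligations and deadlines in this contract."}]
content += [image_part(f"contract_page_{i:02d}.png") for i in range(1, 41)]  # assumed pre-rendered pages

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",  # assumed model identifier
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)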

Multimodal Input Processing

Accepts interleaved image and text inputs, supporting complex mixed-media prompts that combine visual and textual content in a single context.
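
Interleaved prompts simply alternate text and image parts inside one message. In this sketch the endpoint, model name, and image paths are assumptions for illustration:

```python
# Sketch: interleave text segments and images in a single user message.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="YOUR_API_KEY")

def as_data_url(path: str) -> str:
    """Encode a local image file as a data URL usable in an image_url content part."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the revenue chart on this page..."},
            {"type": "image_url", "image_url": {"url": as_data_url("page_3.png")}},
            {"type": "text", "text": "...with this summary table, and note any discrepancies."},
            {"type": "image_url", "image_url": {"url": as_data_url("page_12.png")}},
        ],
    }],
)
print(response.choices[0].message.content)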

Interleaved Image-Text Generation

Generates coherent mixed-media outputs from multimodal inputs, actively calling search and retrieval tools during generation to produce visually grounded content.

Frontend Replication

Reconstructs pixel-accurate HTML and CSS from UI screenshots and supports natural-language-driven iterative edits to the generated code.
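
One way this could look in practice, assuming an OpenAI-compatible endpoint and a placeholder model identifier: send the screenshot with a replication request, then describe the revision in plain language on a follow-up turn:

```python
# Sketch: screenshot-to-HTML replication followed by a natural-language edit.
# Endpoint, model identifier, and screenshot URL are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="YOUR_API_KEY")

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Reproduce this page as a single HTML file with inline CSS."},
        {"type": "image_url", "image_url": {"url": "https://example.com/landing_page.png"}},
    ],
}]
first = client.chat.completions.create(model="zai-org/GLM-4.6V", messages=messages)

# Iterative edit: keep the generated code in context and describe the change in plain language.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Make the header sticky and switch the palette to dark mode."})
revised = client.chat.completions.create(model="zai-org/GLM-4.6V", messages=messages)
print(revised.choices[0].message.content)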

Agent Workflow Integration

Designed for agentic pipelines where visual understanding must trigger downstream actions, supporting tool orchestration across document analysis and UI automation tasks.
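
A stripped-down sketch of such a loop: the model sees a screenshot, requests a tool call, the host executes it and returns the result, and the model continues. The click_element tool, endpoint, and model identifier are illustrative assumptions:

```python
# Sketch: minimal agent loop where visual understanding triggers a tool call.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="YOUR_API_KEY")

def click_element(selector: str) -> str:
    """Placeholder for a real UI-automation backend (e.g. a browser driver)."""
    return f"clicked {selector}"

tools = [{"type": "function", "function": {
    "name": "click_element",
    "description": "Click a UI element identified by a CSS selector.",
    "parameters": {"type": "object",
                   "properties": {"selector": {"type": "string"}},
                   "required": ["selector"]},
}}]

messages = [{"role": "user", "content": [
    {"type": "text", "text": "Open the settings panel shown in this screenshot."},
    {"type": "image_url", "image_url": {"url": "https://example.com/app_screenshot.png"}},
]}]

reply = client.chat.completions.create(
    model="zai-org/GLM-4.6V", messages=messages, tools=tools  # assumed model identifier
).choices[0].message

if reply.tool_calls:
    call = reply.tool_calls[0]
    result = click_element(**json.loads(call.function.arguments))
    messages.append(reply)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    followup = client.chat.completions.create(model="zai-org/GLM-4.6V", messages=messages, tools=tools)
    print(followup.choices[0].message.content)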

Ready to build with GLM 4.6V?

Get Started Free

Benchmark scores

Scores represent accuracy — the percentage of questions answered correctly on each test.

Benchmark | What it tests | Score
MMLU-Pro | Expert knowledge across 14 academic disciplines | 78.4%
GPQA Diamond | PhD-level science questions (biology, physics, chemistry) | 63.2%
LiveCodeBench | Real-world coding tasks from recent competitions | 56.1%
HLE | Questions that challenge frontier models across many domains | 5.2%
SciCode | Scientific research coding and numerical methods | 33.1%

Common questions about GLM 4.6V

What is the context window for GLM-4.6V?

GLM-4.6V supports a context window of 128K tokens (131,072 tokens), allowing it to process long documents, multi-page files, and extended multimodal inputs in a single request.

What are the two available versions of GLM-4.6V?

GLM-4.6V is available as a 106B parameter model intended for cloud and high-performance cluster deployments, and as GLM-4.6V-Flash, a 9B parameter variant optimized for local inference and low-latency use cases.

What is the training data cutoff for GLM-4.6V?

According to the model metadata, GLM-4.6V has a training data cutoff of December 2025.

What types of inputs does GLM-4.6V accept?

GLM-4.6V accepts interleaved image and text inputs. It can process images, screenshots, document pages, charts, tables, and figures alongside text within its 128K token context window.

Who developed GLM-4.6V and where can I access the model weights?

GLM-4.6V was developed by Z.ai (also referred to as Zai). Model weights and a model card are available on Hugging Face at huggingface.co/zai-org/GLM-4.6V, and the code repository is hosted on GitHub at github.com/zai-org/GLM-V.

What people think about GLM 4.6V

Community reception on r/LocalLLaMA was positive at launch, with both the 106B and 9B Flash variants receiving several hundred upvotes and active discussion threads. Users highlighted the native visual function calling capability and the availability of a locally runnable 9B version as notable aspects of the release.

Some community members discussed hardware requirements for running the larger 106B model, with a later thread specifically covering deployment on dual RTX Pro 6000 setups with 192 GB VRAM using vLLM. The Flash variant drew particular interest from users focused on local inference and lower-resource deployments.


Parameters & options

Max Temperature: 1
Max Response Size: 16,384 tokens
Reasoning Effort: configurable, default medium
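
These settings map onto an OpenAI-compatible request roughly as sketched below; the model identifier is a placeholder, and reasoning_effort is passed via extra_body because named-parameter support varies by client version and provider:

```python
# Sketch: applying the listed parameters to a request.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarize the attached report in five bullet points."}],
    temperature=0.7,            # any value up to the listed maximum of 1
    max_tokens=16384,           # listed maximum response size
    extra_body={"reasoning_effort": "medium"},  # listed default; exact accepted values may vary by provider
)
print(response.choices[0].message.content)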

Start building with GLM 4.6V

No API keys required. Create AI-powered workflows with GLM 4.6V in minutes — free.