GLM 4.6V
A 106B multimodal foundation model with native function calling, 128K context, and state-of-the-art visual understanding for real-world AI agent applications.
GLM-4.6V is a large-scale multimodal foundation model developed by Z.ai, available in two variants: the full 106B parameter version designed for cloud and high-performance cluster deployments, and a lightweight 9B Flash version optimized for local and low-latency use. The model supports a 128K token context window, allowing it to process long documents, multi-page files, and complex mixed-media inputs natively without converting content to plain text first. It was trained with a data cutoff of December 2025.
What distinguishes GLM-4.6V is its native integration of tool-use capabilities within a visual model — it can accept images, screenshots, and document pages directly as inputs to function calls, connecting visual perception to executable actions in agent workflows. The model also supports interleaved image-text generation, frontend replication from UI screenshots, and joint understanding of text, layout, charts, tables, and figures. It is best suited for enterprise and agent-based applications such as document analysis pipelines, multimodal AI assistants, UI automation, and content generation workflows.
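To make the visual function-calling idea concrete, here is a minimal sketch of a request that pairs a screenshot with a tool the model may invoke. Everything specific is an assumption: the model name `glm-4.6v`, the `click_element` tool, and the image URL are placeholders, and the payload shape follows the common OpenAI-style chat-completions convention rather than a documented GLM-4.6V API.

```python
# Hypothetical tool definition: a UI-automation "click" action the model can
# invoke after inspecting a screenshot. The schema follows the OpenAI-style
# chat-completions tool format; the tool itself is illustrative, not part of
# the GLM-4.6V release.
click_tool = {
    "type": "function",
    "function": {
        "name": "click_element",
        "description": "Click a UI element located in the screenshot.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer", "description": "Pixel x coordinate"},
                "y": {"type": "integer", "description": "Pixel y coordinate"},
            },
            "required": ["x", "y"],
        },
    },
}

# Request payload pairing the screenshot with an instruction.
# Model name and image URL are placeholders.
payload = {
    "model": "glm-4.6v",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/screenshot.png"}},
                {"type": "text",
                 "text": "Find the Submit button and click it."},
            ],
        }
    ],
    "tools": [click_tool],
}
```

The key point the page makes is that the image goes into the same message that carries the tool definitions, so the model can ground its tool arguments (here, pixel coordinates) directly in what it sees.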
What GLM 4.6V supports
Visual Function Calling
Natively passes images, screenshots, and document pages as tool inputs, enabling direct integration of visual perception into agent action loops without intermediate conversion steps.
Long Document Understanding
Processes up to 128K tokens of multi-document or long-document input, jointly interpreting text, layout, charts, tables, and figures in a single pass.
Multimodal Input Processing
Accepts interleaved image and text inputs, supporting complex mixed-media prompts that combine visual and textual content in a single context.
Interleaved Image-Text Generation
Generates coherent mixed-media outputs from multimodal inputs, actively calling search and retrieval tools during generation to produce visually grounded content.
Frontend Replication
Reconstructs pixel-accurate HTML and CSS from UI screenshots and supports natural-language-driven iterative edits to the generated code.
Agent Workflow Integration
Designed for agentic pipelines where visual understanding must trigger downstream actions, supporting tool orchestration across document analysis and UI automation tasks.
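The feature list above culminates in agent workflows, where a tool call emitted by the model must be routed to real code. Below is a minimal dispatch-loop sketch under stated assumptions: the response shape mirrors the OpenAI-style chat-completions convention, and the `click_element` handler and the stubbed response are illustrative, not part of any documented GLM-4.6V API.

```python
import json

def click_element(x: int, y: int) -> str:
    # Stand-in for a real UI-automation action.
    return f"clicked at ({x}, {y})"

# Registry mapping tool names to local handlers.
HANDLERS = {"click_element": click_element}

# Stubbed model response, standing in for a real API reply.
response = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "id": "call_0",
                "function": {
                    "name": "click_element",
                    "arguments": json.dumps({"x": 120, "y": 340}),
                },
            }]
        }
    }]
}

results = []
for call in response["choices"][0]["message"]["tool_calls"]:
    handler = HANDLERS[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])
    # In a full loop, each result is appended as a "tool" message and sent
    # back to the model for the next turn.
    results.append({
        "role": "tool",
        "tool_call_id": call["id"],
        "content": handler(**args),
    })
```

In a real pipeline the loop would repeat: send the tool results back, let the model inspect the next screenshot, and continue until it stops requesting tools.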
Benchmark scores
Scores represent accuracy — the percentage of questions answered correctly on each test.
| Benchmark | What it tests | Score |
|---|---|---|
| MMLU-Pro | Expert knowledge across 14 academic disciplines | 78.4% |
| GPQA Diamond | PhD-level science questions (biology, physics, chemistry) | 63.2% |
| LiveCodeBench | Real-world coding tasks from recent competitions | 56.1% |
| HLE (Humanity's Last Exam) | Questions that challenge frontier models across many domains | 5.2% |
| SciCode | Scientific research coding and numerical methods | 33.1% |
Common questions about GLM 4.6V
What is the context window for GLM-4.6V?
GLM-4.6V supports a context window of 128K tokens (131,072 tokens), allowing it to process long documents, multi-page files, and extended multimodal inputs in a single request.
What are the two available versions of GLM-4.6V?
GLM-4.6V is available as a 106B parameter model intended for cloud and high-performance cluster deployments, and as GLM-4.6V-Flash, a 9B parameter variant optimized for local inference and low-latency use cases.
What is the training data cutoff for GLM-4.6V?
According to the model metadata, GLM-4.6V has a training data cutoff of December 2025.
What types of inputs does GLM-4.6V accept?
GLM-4.6V accepts interleaved image and text inputs. It can process images, screenshots, document pages, charts, tables, and figures alongside text within its 128K token context window.
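When the image is a local file rather than a URL, a common pattern is to embed it as a base64 data URL inside an interleaved content list. The content-part shape below follows the OpenAI-style chat-completions convention; whether a given GLM-4.6V endpoint accepts data URLs is an assumption to verify against its documentation.

```python
import base64

def image_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    # Encode raw image bytes as a base64 data URL content part.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{b64}"},
    }

# Interleaved image + text message; the bytes here are placeholder
# PNG-header bytes, not a real image.
message = {
    "role": "user",
    "content": [
        image_part(b"\x89PNG\r\n\x1a\n"),
        {"type": "text", "text": "Summarize the chart in this image."},
    ],
}
```

Multiple `image_part(...)` entries can be interleaved with text parts in the same `content` list, which is how multi-page documents fit into a single 128K-token request.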
Who developed GLM-4.6V and where can I access the model weights?
GLM-4.6V was developed by Z.ai (also referred to as Zai). Model weights and a model card are available on Hugging Face at huggingface.co/zai-org/GLM-4.6V, and the code repository is hosted on GitHub at github.com/zai-org/GLM-V.
What people think about GLM 4.6V
Community reception on r/LocalLLaMA was positive at launch, with both the 106B and 9B Flash variants receiving several hundred upvotes and active discussion threads. Users highlighted the native visual function calling capability and the availability of a locally runnable 9B version as notable aspects of the release.
Some community members discussed hardware requirements for running the larger 106B model, with a later thread specifically covering deployment on dual RTX Pro 6000 setups with 192 GB VRAM using vLLM. The Flash variant drew particular interest from users focused on local inference and lower-resource deployments.
- zai-org/GLM-4.6V-Flash (9B) is here
- GLM-4.6V (108B) has been released
- HOWTO: Running the best models on a dual RTX Pro 6000 rig with vLLM (192 GB VRAM)
Start building with GLM 4.6V
No API keys required. Create AI-powered workflows with GLM 4.6V in minutes, for free.