About

Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-precision understanding and reasoning across text, images, and video. With 32 billion parameters, it combines deep visual perception with advanced text comprehension, enabling fine-grained spatial reasoning, document and scene analysis, and long-horizon video understanding. It also offers robust OCR in 32 languages and enhanced multimodal fusion through its Interleaved-MRoPE and DeepStack architectures. Optimized for agentic interaction and visual tool use, Qwen3-VL-32B delivers state-of-the-art performance for complex real-world multimodal tasks.

Model Family

Qwen3.5 9B (Reasoning): 2026-03-02
Qwen3.5 9B (Non-reasoning): 2026-03-02
Qwen3.5 4B (Reasoning): 2026-03-02
Qwen3.5 4B (Non-reasoning): 2026-03-02
Qwen3.5 2B (Reasoning): 2026-03-02
Qwen3.5 2B (Non-reasoning): 2026-03-02
Qwen3.5 0.8B (Reasoning): 2026-03-02
Qwen3.5 0.8B (Non-reasoning): 2026-03-02

Benchmarks

MMLU-Pro

79.1%

GPQA Diamond

67.1%

HLE

6.3%

LiveCodeBench

51.4%

SciCode

30.1%

TerminalBench Hard

8.3%

MATH-500

Not evaluated

AIME

Not evaluated

AIME 2025

68.3%

IFBench

39.2%

Long Context Recall

31.3%

Tau2

29.2%

Open Source

HuggingFace

apache-2.0

32B

GGUF / GPTQ / AWQ

Downloads (30d)

1.0M

Likes

190

VRAM (FP16)

24-48 GB

GPU

A6000 / M3 Ultra
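
As a rough cross-check on the VRAM figure above, the memory needed for the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes exactly 32 billion parameters and decimal gigabytes. It ignores activations, the KV cache, and the per-block metadata that real GGUF / GPTQ / AWQ files carry, so treat the results as lower bounds. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in decimal GB.

    Assumption: every parameter is stored at the same bit width and
    no runtime overhead (activations, KV cache) is counted.
    """
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 32e9  # 32 billion parameters, as listed on this page

# Compare unquantized FP16 with two common quantized bit widths.
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS, bits):.0f} GB")
```

Under these assumptions the weights come to about 64 GB at FP16, 32 GB at 8-bit, and 16 GB at 4-bit. The listed 24-48 GB range therefore looks closer to a quantized deployment plus runtime overhead than to unquantized FP16, though the page does not say how that range was derived.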
