Qwen3-VL-30B-A3B-Instruct is a multimodal model that combines strong text generation with visual understanding of images and video. The Instruct variant is tuned for instruction following on general multimodal tasks. It is strong at recognizing real-world and synthetic categories, 2D/3D spatial grounding, and long-form visual comprehension, and it posts competitive results on multimodal benchmarks. For agentic use, it handles multi-image, multi-turn instructions, video timeline alignment, GUI automation, and visual coding, from sketches to debugged UI. Its text performance is reported to match flagship Qwen3 models, which makes it suitable for document AI, OCR, UI assistance, spatial tasks, and agent research.
Quality Index: 16.1 (242nd of 442, top 55%)
Coding Index: 14.3 (203rd of 352, top 58%)
Math Index: 72.3 (89th of 268, top 34%)
Price/1M: $0.35 (346th cheapest, 13% above median, top 52%)
Speed: 123 tok/s (top 22%)
TTFT: 1.12s
Context Window: 131K (145th largest, top 63%)
Input: $0.20 per 1M tokens
Output: $0.80 per 1M tokens
Blended: $0.35 per 1M tokens
Cheaper than 48% of models. Median price is $0.31/1M tokens.
Daily: $0.35
Monthly: $10.50
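The pricing figures above can be reproduced with simple arithmetic. The sketch below assumes the blended price weights input and output tokens 3:1, and that the monthly figure is 30 times the daily one. Neither assumption is stated on the page; both are inferred because they reproduce the listed numbers. The function names are illustrative.

```python
# Assumption (not stated in the source): blended price uses a 3:1
# input-to-output token weighting.
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens."""
    total_weight = input_weight + output_weight
    return (input_price * input_weight + output_price * output_weight) / total_weight


# Assumption (not stated in the source): a month is counted as 30 days.
def monthly_cost(daily_cost, days=30):
    """Scale a daily cost to a monthly cost."""
    return daily_cost * days


blended = blended_price(0.20, 0.80)
print(f"Blended: ${blended:.2f} per 1M tokens")
print(f"Monthly: ${monthly_cost(0.35):.2f}")
```

With a different input-to-output mix, the blended price shifts toward whichever side dominates, so output-heavy workloads will cost more per 1M tokens than the listed blend.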
123 tokens/sec (faster than 78% of models)
1.12 seconds (faster than 23% of models)
1.12 seconds (faster than 34% of models)
Market median speed: 46 tok/s (this model is 170% faster)
Median TTFT: 0.42s (this model is 169% slower)
Throughput/Dollar: 352 tok/s per $/1M
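The comparison figures can be approximated from the rounded values shown on the page. The sketch below assumes "X% faster" and "X% slower" mean the relative difference from the median, and that throughput per dollar is speed divided by the blended price. The page does not define either calculation, so these are assumptions and the function names are illustrative. Results from the rounded inputs land close to the listed 170%, 169%, and 352, but not exactly on them, presumably because the page computes from unrounded measurements.

```python
def percent_difference(value, baseline):
    """Relative difference of value from baseline, in percent."""
    return (value / baseline - 1) * 100


def throughput_per_dollar(tokens_per_sec, price_per_million):
    """Tokens per second per dollar of blended price per 1M tokens."""
    return tokens_per_sec / price_per_million


# Inputs are the rounded figures listed on the page.
speed_gain = percent_difference(123, 46)      # speed vs. market median
ttft_penalty = percent_difference(1.12, 0.42)  # TTFT vs. median TTFT
value_score = throughput_per_dollar(123, 0.35)

print(f"Speed vs. median: {speed_gain:.1f}% faster")
print(f"TTFT vs. median: {ttft_penalty:.1f}% slower")
print(f"Throughput/Dollar: {value_score:.1f} tok/s per $/1M")
```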
Context Window: 131K tokens (larger than 37% of models)
Max Output: 33K tokens (25% of context)
3.7M
552
24-48 GB
A6000 / M3 Ultra