Following yesterday’s release of a new llm-scaler-omni beta, there is now a new beta feature release of llm-scaler-vllm, which provides the Intel-optimized version of vLLM within a Docker container that is ready to go for AI on modern Arc Graphics hardware. Today’s llm-scaler-vllm 1.2 beta release adds support for a variety of additional large language models (LLMs) along with other improvements.
llm-scaler-vllm continues to be Intel’s preferred route for customers looking to leverage vLLM for AI workloads on its discrete graphics hardware. This new llm-scaler-vllm 1.2 beta release brings support for new models and other enhancements to the Intel vLLM experience (a minimal usage sketch follows the changelog below):
– Fix 72-hour hang issue
– MoE-Int4 support for Qwen3-30B-A3B
– Bpe-Qwen tokenizer support
– Enable Qwen3-VL Dense/MoE models
– Enable Qwen3-Omni models
– MinerU 2.5 Support
– Enable whisper transcription models
– Fix minicpmv4.5 OOM issue and output error
– Enable ERNIE-4.5-vl models
– Enable Glyph-based GLM-4.1V-9B-Base
– Attention kernel optimizations for the decoding phase of all workloads (>10% e2e throughput on 10+ models across all input/output sequence lengths)
– GPT-OSS 20B and 120B support in MXFP4 with optimized performance
– MoE model optimizations for output throughput: Qwen3-30B-A3B 2.6x e2e improvement; DeepSeek-V2-Lite 1.5x improvement
– New models: added 8 multi-modality models with image/video support
– vLLM 0.10.2 with new features: P/D disaggregation (experimental), tooling, reasoning output, structured output
– FP16/BF16 GEMM optimizations for batch sizes 1-128, with a noticeable improvement for small batch sizes
– Bug fixes
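For those wanting to try the container against one of the newly supported models, vLLM as shipped in llm-scaler-vllm exposes its standard OpenAI-compatible HTTP API once a model is being served. Below is a minimal client sketch assuming the container is already running and serving Qwen/Qwen3-30B-A3B on the default port 8000; the endpoint, port, and served model name are assumptions for illustration rather than values documented in the 1.2 beta release notes.

    # Minimal sketch: querying a model served from the llm-scaler-vllm container
    # via vLLM's OpenAI-compatible API. Endpoint, port, and model name are
    # assumptions for illustration, not documented 1.2 beta values.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # default vLLM serving endpoint (assumed)
        api_key="EMPTY",                      # vLLM does not require a real API key by default
    )

    response = client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B",  # one of the MoE models optimized in this release (assumed to be served)
        messages=[{"role": "user", "content": "Give a one-paragraph overview of mixture-of-experts models."}],
        max_tokens=256,
    )
    print(response.choices[0].message.content)

Swapping in any of the other chat-capable models from the list above should only require changing the model name, assuming that model is what the container has been launched to serve.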
This work will be especially important for next year’s Crescent Island hardware release.
More details on the new beta release are available via GitHub, while the llm-scaler-vllm Docker container is available from the Docker Hub container image library.
