Following yesterday’s release of a new llm-scaler-omni beta, there is now a new beta feature release of llm-scaler-vllm, which provides the Intel-optimized version of vLLM within a Docker container that is ready to go for AI on modern Arc Graphics hardware. Today’s llm-scaler-vllm 1.2 beta release adds support for a variety of additional large language models (LLMs) along with other improvements.
llm-scaler-vllm continues to be Intel’s preferred route for customers looking to leverage vLLM for AI workloads on its discrete graphics hardware. This new llm-scaler-vllm 1.2 beta release brings support for new models and other enhancements to the Intel vLLM experience (a minimal usage sketch follows the changelog below):
– Fix 72-hour hang issue
– MoE-Int4 support for Qwen3-30B-A3B
– Bpe-Qwen tokenizer support
– Enable Qwen3-VL Dense/MoE models
– Enable Qwen3-Omni models
– MinerU 2.5 Support
– Enable whisper transcription models
– Fix minicpmv4.5 OOM issue and output error
– Enable ERNIE-4.5-vl models
– Enable Glyph-based GLM-4.1V-9B-Base
– Attention kernel optimizations for the decoding phase of all workloads (>10% e2e throughput on 10+ models across all input/output sequence lengths)
– GPT-OSS 20B and 120B support in MXFP4 with optimized performance
– MoE model optimizations for output throughput: Qwen3-30B-A3B 2.6x e2e improvement; DeepSeek-V2-Lite 1.5x improvement
– New models: added 8 multi-modality models with image/video support
– vLLM 0.10.2 with new features: P/D disaggregation (experimental), tooling, reasoning output, structured output
– FP16/BF16 GEMM optimizations for batch sizes 1-128, with a noticeable improvement for small batch sizes
– Bug fixes
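For those wanting to try the container against one of the newly supported models, vLLM as shipped in llm-scaler-vllm exposes its standard OpenAI-compatible HTTP API once a model is being served. Below is a minimal client sketch assuming the container is already running and serving Qwen/Qwen3-30B-A3B on the default port 8000; the endpoint, port, and served model name are assumptions for illustration rather than values documented in the 1.2 beta release notes.

    # Minimal sketch: querying a model served from the llm-scaler-vllm container
    # via vLLM's OpenAI-compatible API. Endpoint, port, and model name are
    # assumptions for illustration, not documented 1.2 beta values.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # default vLLM serving endpoint (assumed)
        api_key="EMPTY",                      # vLLM does not require a real API key by default
    )

    response = client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B",  # one of the MoE models optimized in this release (assumed to be served)
        messages=[{"role": "user", "content": "Give a one-paragraph overview of mixture-of-experts models."}],
        max_tokens=256,
    )
    print(response.choices[0].message.content)

Swapping in any of the other chat-capable models from the list above should only require changing the model name, assuming that model is what the container has been launched to serve.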
This work will be especially important for next year’s Crescent Island hardware release.
More details on the new beta release are available via GitHub, while the llm-scaler-vllm Docker container is available from the Docker Hub container image library.
