Intel kicked off the new month by releasing the latest version of LLM Scaler vLLM (llm-scaler-vllm), their Docker-based solution for running vLLM on Intel Battlemage GPUs for AI inference.
Intel llm-scaler-vllm v0.14.0-b8 is out today as the newest version of this solution for vLLM on Intel graphics hardware. This new version is rebased against vLLM 0.14 upstream while also upgrading PyTorch to 2.10 and pulling in the latest oneAPI components. Thanks to Intel oneDNN optimizations, INT4 performance is seeing up to a 25% throughput improvement compared to the prior release.
There is also new LLM coverage with this llm-scaler-vllm update, which now officially supports Qwen3-VL-Reranker-2B/8B, Qwen3-VL-Embedding-2B/8B, GLM-4.7-Flash, Ministral models, DeepSeek-OCR-2, and Qwen3-Coder-Next.
There is also validated support now for the BMG-G31 “Big Battlemage” GPU. The Intel BMG-G31 has remained elusive with no official announcement yet and rumors of its cancellation, but the open-source software enablement around it continues. This llm-scaler-vllm update seems to confirm it’s still coming, given the newly validated support. The announcement even mentions some word on its performance uplift:
“G31 validation has been added in this release and all models are functional. The key models’ performance is measured on a non-golden setup B70 system (limited perf for allreduce with small message size), compare with G21: 1.49x geomean under SLA constraints and 1.13x geomean at fixed batch size. The throughput should be better on system with golden BKC setup.”
This also seemingly confirms that the much-talked-about Arc Pro B70 is indeed BMG-G31. Whether BMG-G31 will appear in any consumer Intel Arc Graphics card remains to be seen, but a 1.49x geomean performance uplift under SLA constraints is quite exciting.
See the GitHub release announcement for more details on the Intel llm-scaler-vllm update.
