PyTorch 2.8 was released today as the newest feature update to this widely-used machine learning library that has become a crucial piece of the deep learning and broader AI ecosystem. There are a few interesting changes worth highlighting with the new PyTorch 2.8 release.
Piquing my interest with PyTorch 2.8 is the improved Intel CPU performance. In particular, there is a focus on high-performance quantized large language model (LLM) inference on Intel CPUs using native PyTorch. The change outlines the LLM quantization work done by Intel engineers to enhance x86_64 CPU performance with native PyTorch, with A16W8, DA8W8, and A16W4 among the supported modes. That issue ticket noted:
“With this feature, the performance with PyTorch native stack can reach the same level or even better in some cases as comparing with popular LLM serving frameworks like vLLM when running offline mode on a single x86_64 CPU device, which enables PyTorch users to run LLM quantization with native experience and good performance.”
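For a rough sense of what quantized LLM inference with native PyTorch looks like, the sketch below applies the built-in dynamic INT8 quantization to the Linear layers of a toy model, which is the same general idea as the DA8W8 recipe (dynamic INT8 activations, INT8 weights). This is only a generic illustration, not the Intel-optimized code path the issue ticket describes:

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM block: the real work targets Linear layers,
# which dominate transformer inference cost.
model = nn.Sequential(
    nn.Linear(4096, 11008),
    nn.GELU(),
    nn.Linear(11008, 4096),
).eval()

# Built-in dynamic quantization: INT8 weights with activations quantized
# on the fly (roughly the "DA8W8" idea). The Intel-optimized LLM recipes
# in PyTorch 2.8 go further, but this shows the general shape of the API.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    out = qmodel(torch.randn(1, 4096))
print(out.shape)
```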
There have been a lot of Intel CPU commits this cycle such as for FP8 QCONV, FP8 QLINEAR, and using AMX-based micro-kernels in more instances. The AMX micro-kernel improvement can be quite beneficial:
“GEMM templates for INT4 weights are used for lowering `aten._weight_int4pack_mm_for_cpu` with Inductor when max-autotune is on. Currently, AMX-based microkernels are used only when M >= 16 if input tensor has shape [M, K]. However, we find that AMX kernel brings performance benefit when 4 < M < 16. For example, on a 6th gen of Intel(R) Xeon(R) platform, E2E latency can be improved by up to > 20% when running Llama-3.1-8B on 32 cores for M = 8. So, this PR changes the threshold so that AMX is used when M > 4.”
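The M being discussed is simply the number of rows in the activation matrix fed into the GEMM, i.e. how many tokens are being processed at once. A minimal way to see where such a threshold matters is to compile a Linear layer with Inductor's max-autotune mode and sweep small M values. Note this generic float32 sketch only illustrates the shape of the experiment; it does not exercise the INT4 AMX path itself:

```python
import time
import torch
import torch.nn as nn

# A single Linear is a GEMM: input [M, K] x weight [K, N].
layer = nn.Linear(4096, 4096).eval()
compiled = torch.compile(layer, mode="max-autotune")  # Inductor autotuning

with torch.inference_mode():
    for m in (4, 8, 16, 32):  # sweep the "M" dimension the PR talks about
        x = torch.randn(m, 4096)
        compiled(x)  # warm-up / trigger compilation for this shape
        t0 = time.perf_counter()
        for _ in range(50):
            compiled(x)
        dt = (time.perf_counter() - t0) / 50
        print(f"M={m:2d}: {dt * 1e3:.3f} ms per forward")
```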
Too bad though that my AvenueCity reference server remains non-operational, leaving me unable to test the newest PyTorch release (and other Intel open-source improvements in recent months) on the flagship Xeon 6980P Granite Rapids processors… So, unfortunately, no new Xeon 6900P benchmarks at this time on Phoronix.
Also on the Intel side for PyTorch 2.8 is experimental support for the Intel XCCL distributed back-end, which enables various distributed training paradigms on Intel discrete GPUs.
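Distributed back-ends are selected through the usual torch.distributed process-group API. The sketch below assumes the new back-end is exposed under the "xccl" name and that an Intel GPU shows up as an "xpu" device, so treat it as illustrative rather than a verified recipe:

```python
import os
import torch
import torch.distributed as dist

# Illustrative only: assumes PyTorch 2.8 exposes the Intel GPU back-end
# under the "xccl" name and that Intel GPUs are visible as "xpu" devices.
def init_xccl():
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    dist.init_process_group(backend="xccl", rank=rank, world_size=world_size)
    device = torch.device("xpu", rank % torch.xpu.device_count())
    torch.xpu.set_device(device)
    return device

if __name__ == "__main__":
    device = init_xccl()
    x = torch.ones(4, device=device)
    dist.all_reduce(x)  # simple collective to sanity-check the back-end
    print(x)
    dist.destroy_process_group()
```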
PyTorch 2.8 also brings SYCL support to the PyTorch CPP Extension API, A16W4 support for XPU devices, experimental wheel variant support, and other enhancements.
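For the SYCL side of the CPP Extension API, a build script would presumably mirror how CUDA extensions are declared today. The class and file names below (SyclExtension, the source files, the package name) are assumptions for illustration only; check the PyTorch 2.8 cpp_extension documentation for the exact spelling:

```python
# setup.py sketch for building a SYCL-backed extension. SyclExtension here
# mirrors the existing CUDAExtension helper and is an assumed name; the
# extension and source file names are likewise hypothetical.
from setuptools import setup
from torch.utils.cpp_extension import SyclExtension, BuildExtension

setup(
    name="my_sycl_op",  # hypothetical package name
    ext_modules=[
        SyclExtension(
            name="my_sycl_op",
            sources=["my_sycl_op.cpp", "my_sycl_kernel.sycl"],  # hypothetical sources
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```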
Downloads and more details on the PyTorch 2.8 release via the PyTorch.org blog and GitHub.