PyTorch 2.10 is out today as the latest feature update to this widely-used deep learning library. The new release continues improving support for Intel GPUs and the AMD ROCm compute stack while also delivering further enhancements for NVIDIA CUDA.
PyTorch 2.10 for AMD ROCm now enables grouped GEMM, both via a regular GEMM fallback and via CK (Composable Kernel). There is also better ROCm support for PyTorch on Microsoft Windows, torch.cuda._compile_kernel and load_inline support, the addition of GFX1150/GFX1151 RDNA 3.5 GPUs to the hipBLASLt-supported GEMM lists, scaled_mm v2 support, AOTriton-based scaled_dot_product_attention, improved heuristics for pointwise kernels on ROCm, code generation support for fast_tanhf on ROCm, and other improvements.
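For those unfamiliar with the inline-extension path that now works on ROCm, below is a minimal load_inline sketch. The kernel and module names are illustrative; it assumes a ROCm (or CUDA) PyTorch build with a working compiler toolchain, with PyTorch hipifying the "CUDA" source automatically on ROCm.

```python
import torch
from torch.utils.cpp_extension import load_inline

# Kernel source compiled at import time; on ROCm builds this gets hipified.
cuda_source = r"""
__global__ void add_one_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

torch::Tensor add_one(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_one_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

module = load_inline(
    name="rocm_add_one",          # illustrative module name
    cpp_sources="torch::Tensor add_one(torch::Tensor x);",
    cuda_sources=cuda_source,
    functions=["add_one"],
)

x = torch.ones(1024, device="cuda")  # "cuda" maps to the ROCm device on AMD
print(module.add_one(x)[:4])
```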
Intel GPU support is also enjoying a number of improvements with PyTorch 2.10. Additional Torch XPU APIs are now in place for Intel GPUs, the ATen operators scaled_mm and scaled_mm_v2 are now supported, _weight_int8pack_mm support has been added, and the SYCL support in the PyTorch C++ Extension API now allows implementing new custom operators on Windows. There are also some Intel performance optimizations and other improvements.
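For context, the torch.xpu namespace mirrors the familiar torch.cuda surface. A minimal sketch, assuming a PyTorch build with Intel GPU (XPU) support and an Intel GPU present; these particular calls predate 2.10 but are the API surface the new XPU additions extend:

```python
import torch

if torch.xpu.is_available():
    print(torch.xpu.device_count(), torch.xpu.get_device_name(0))
    x = torch.randn(4, 4, device="xpu")
    y = x @ x                   # matmul runs on the Intel GPU
    torch.xpu.synchronize()     # wait for queued kernels to finish
    print(y.cpu())
```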
The NVIDIA CUDA support in PyTorch 2.10 also boasts more features: templated kernels, pre-compiled kernel support, automatic inclusion of CUDA headers, support for the cuda-python CUDA stream protocol, CUDA 13 compatibility improvements, support for nested memory pools, CUTLASS matmuls on NVIDIA Thor, and other additions.
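To illustrate the nested memory pool idea, here is a hedged sketch using torch.cuda.MemPool with the torch.cuda.use_mem_pool context manager to route allocations to a chosen pool; that nesting these contexts works as shown is an assumption based on the 2.10 notes, and it requires a CUDA build:

```python
import torch

outer = torch.cuda.MemPool()
inner = torch.cuda.MemPool()

with torch.cuda.use_mem_pool(outer):
    a = torch.empty(1 << 20, device="cuda")      # served from "outer"
    with torch.cuda.use_mem_pool(inner):         # nested pool context
        b = torch.empty(1 << 20, device="cuda")  # served from "inner"
    c = torch.empty(1 << 20, device="cuda")      # back to "outer"
```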
PyTorch 2.10 also brings Python 3.14 support for torch.compile() as well as experimental support for the Python 3.14 free-threaded build. There is also lower kernel launch overhead thanks to combo-kernel horizontal fusion in TorchInductor, debugging improvements, and various quantization enhancements.
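As a rough sketch of what combo-kernel horizontal fusion targets: independent pointwise operations get batched into fewer kernel launches. The combo_kernels flag below is an internal Inductor knob, and that it remains the switch for this feature in 2.10 is an assumption:

```python
import torch
import torch._inductor.config as inductor_config

# Internal Inductor flag (assumption that this is the 2.10 switch).
inductor_config.combo_kernels = True

def f(x, y):
    # Two independent pointwise chains: candidates for horizontal fusion
    # into a single combined kernel launch.
    return x.sin() + 1.0, y.cos() * 2.0

compiled = torch.compile(f)
out1, out2 = compiled(torch.randn(1024), torch.randn(1024))
```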
Downloads and more details on PyTorch 2.10 via GitHub.
