Back during the Linux 6.17 merge window, an mremap() optimization geared toward ARM64 was merged that promised up to a "16x reduction" in the number of per-PTE function calls by batching page table entries. Unfortunately, that commit ended up causing a rather significant regression on some systems, and it has now been addressed.
Last week Intel's kernel test robot began reporting a 37% regression in one of the stress-ng kernel micro-benchmarks. Oracle engineer Lorenzo Stoakes was able to reproduce the issue and, on an Intel Raptor Lake system, observed a 43% regression with the Linux 6.17 Git kernel.
Stoakes tracked down the issue and landed a fix, merged today into the Linux Git tree, that avoids an expensive folio lookup during mremap() folio PTE batching. Stoakes summarizes:
“It was discovered in the attached report that commit f822a9a81a31 (“mm: optimize mremap() by PTE batching”) introduced a significant performance regression on a number of metrics on x86-64, most notably stress-ng.bigheap.realloc_calls_per_sec – indicating a 37.3% regression in number of mremap() calls per second.
I was able to reproduce this locally on an intel x86-64 raptor lake system, noting an average of 143,857 realloc calls/sec (with a stddev of 4,531 or 3.1%) prior to this patch being applied, and 81,503 afterwards (stddev of 2,131 or 2.6%) – a 43.3% regression.
During testing I was able to determine that there was no meaningful difference in efforts to optimise the folio_pte_batch() operation, nor checking folio_test_large().
This is within expectation, as a regression this large is likely to indicate we are accessing memory that is not yet in a cache line (and perhaps may even cause a main memory fetch).
The expectation by those discussing this from the start was that vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the culprit due to having to retrieve memory from the vmemmap (which mremap() page table moves does not otherwise do, meaning this is inevitably cold memory).
I was able to definitively determine that this theory is indeed correct and the cause of the issue.
The solution is to restore part of an approach previously discarded on review, that is to invoke pte_batch_hint() which explicitly determines, through reference to the PTE alone (thus no vmemmap lookup), what the PTE batch size may be.
On platforms other than arm64 this is currently hardcoded to return 1, so this naturally resolves the issue for x86-64, and for arm64 introduces little to no overhead as the pte cache line will be hot.
With this patch applied, we move from 81,503 realloc calls/sec to 138,701 (stddev of 496.1 or 0.4%), which is a -3.6% regression, however accounting for the variance in the original result, this is broadly restoring performance to its prior state.”
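For those curious what the fix looks like in practice, below is a minimal sketch of the approach the commit message describes: consult pte_batch_hint() before paying for the folio lookup. This is a simplified illustration, not the verbatim upstream patch; pte_batch_hint(), vm_normal_folio(), folio_test_large(), and folio_pte_batch() are real kernel helpers, but the exact function body and signature here are approximations of the mm/mremap.c code.

    /* Kernel-internal code, simplified for illustration (mm/mremap.c context). */
    #include <linux/mm.h>
    #include <linux/pgtable.h>

    static int mremap_folio_pte_batch(struct vm_area_struct *vma,
                                      unsigned long addr, pte_t *ptep,
                                      pte_t pte, int max_nr)
    {
            struct folio *folio;

            if (max_nr == 1)
                    return 1;

            /*
             * The gist of the two-line fix: pte_batch_hint() determines the
             * possible batch size from the PTE alone, on a cache line that
             * is already hot, with no vmemmap lookup. Architectures other
             * than arm64 hardcode it to return 1, so on x86-64 the
             * expensive vm_normal_folio() call below is skipped entirely.
             */
            if (pte_batch_hint(ptep, pte) == 1)
                    return 1;

            /* Only now pay for the cold vmemmap access to find the folio. */
            folio = vm_normal_folio(vma, addr, pte);
            if (!folio || !folio_test_large(folio))
                    return 1;

            return folio_pte_batch(folio, ptep, pte, max_nr);
    }

On arm64, pte_batch_hint() inspects the contiguous bit in the PTE to report the contpte block size, which is why the original batching optimization keeps its benefit there while x86-64 now sidesteps the cold folio lookup altogether.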
So a big regression is now under control just days after the Linux 6.17-rc1 release, and all it took was a two-line patch to avoid the expensive folio lookup when it's unlikely to provide any benefit.
I will see whether this fix ends up helping any of the mixed benchmark results I saw during early Linux 6.17 testing, with more benchmarking getting underway in the days ahead.