A patch series for the Linux kernel scheduler has been queued for expected inclusion in Linux 6.18 that defers throttling until a task returns to user-space. The changes switch the scheduler to a task-based throttle model with task-based throttle time accounting, which can provide a latency win and also addresses a possible deadlock on real-time (PREEMPT_RT) kernels.
Queued up today in the tip/tip.git “sched/core” Git branch are the patches reworking the scheduler's throttling code. The existing problem is described in the cover letter of the “defer throttle when task exits to user” patch series:
“CFS tasks can end up throttled while holding locks that other, non-throttled tasks are blocking on.
For !PREEMPT_RT, this can be a source of latency due to the throttling causing a resource acquisition denial.
For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)
…
To fix this issue for PREEMPT_RT and improve latency situation for !PREEMPT_RT, change the throttle model to task based, i.e. when a cfs_rq is throttled, mark its throttled status but do not remove it from cpu’s rq. Instead, for tasks that belong to this cfs_rq, when they get picked, add a task work to them so that when they return to user, they can be dequeued. In this way, tasks throttled will not hold any kernel resources. When cfs_rq gets unthrottled, enqueue back those throttled tasks.”
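As a rough illustration of the model described in the cover letter, here is a toy Python simulation. It is not kernel code: the `Task` and `CfsRq` classes, the `limbo` list, and all method names are hypothetical stand-ins chosen for this sketch, and real details such as hierarchies, locking, and time accounting are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    in_kernel: bool = True                # may still hold kernel locks
    throttle_work_pending: bool = False   # stand-in for the queued task work


@dataclass
class CfsRq:
    throttled: bool = False
    runnable: list = field(default_factory=list)  # tasks still on the run queue
    limbo: list = field(default_factory=list)     # tasks dequeued at return-to-user

    def throttle(self):
        # Only mark the status; tasks are NOT removed from the run queue.
        self.throttled = True

    def pick(self):
        # A picked task on a throttled queue still runs, but gets work queued
        # that will fire when it returns to user space.
        task = self.runnable[0] if self.runnable else None
        if task is not None and self.throttled:
            task.throttle_work_pending = True
        return task

    def return_to_user(self, task):
        # At this point the task holds no kernel resources, so it is safe
        # to dequeue it if the queue is still throttled.
        task.in_kernel = False
        pending = task.throttle_work_pending
        task.throttle_work_pending = False
        if pending and self.throttled:
            self.runnable.remove(task)
            self.limbo.append(task)

    def unthrottle(self):
        # Enqueue back every task that was parked while throttled.
        self.throttled = False
        self.runnable.extend(self.limbo)
        self.limbo.clear()


if __name__ == "__main__":
    rq = CfsRq()
    p0 = Task("p0")
    rq.runnable.append(p0)

    rq.throttle()
    rq.pick()                 # p0 keeps running in the kernel
    print("still runnable while in kernel:", p0 in rq.runnable)

    rq.return_to_user(p0)     # only now is p0 actually dequeued
    print("parked after return to user:", p0 in rq.limbo)

    rq.unthrottle()
    print("runnable again after unthrottle:", p0 in rq.runnable)
```

The point of the sketch is the ordering: throttling only sets a flag, and the dequeue is postponed to the return-to-user boundary, so a throttled task can finish its kernel work and release any locks first.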
With these patches now queued in the sched/core TIP branch, the task-based throttle model should be merged during the upcoming Linux 6.18 merge window, barring any objections from Linus Torvalds or other code issues coming to light.