admission: observe transient elastic CPU waiters via sticky bit #171300

Open
dt wants to merge 1 commit into cockroachdb:master from dt:dt/elastic-cpu-sticky-waiters

dt commented Jun 1, 2026

Contributor

The elastic CPU controller's scheduler-latency listener point-samples the
WorkQueue's hasWaitingRequests at ~1Hz. The granter's tryGrant loop
drains the queue to empty as soon as tokens refill, so the queue spends
most of its time empty even under sustained throttling. The listener's
poll frequently lands in those empty windows and takes the inactive-decay
branch, pulling the utilization limit down toward inactive_point (~12%)
even when scheduler latency is well under target and there is clearly
demand for more elastic CPU.

This PR adds a sticky "had recent waiters" atomic bool on WorkQueue,
set on every Admit enqueue and cleared by an atomic Swap from the
listener each tick. A new elasticCPULimiter.hasOrHadRecentWaitingRequests
ORs this with the instantaneous hasWaitingRequests signal so any
enqueue between two ticks is durably visible to the controller, even if
the queue subsequently drained.
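
A minimal sketch of the sticky-bit idea described above, not the actual patch. The `workQueue` type, its `enqueue`/`drain` helpers, and the field name `hadRecentWaiters` are hypothetical stand-ins, and the combined check is placed directly on the queue for brevity rather than on `elasticCPULimiter`. It assumes Go 1.19+ for `atomic.Bool`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// workQueue is a hypothetical stand-in for the admission WorkQueue.
type workQueue struct {
	mu      sync.Mutex
	waiting int // number of requests currently queued

	// hadRecentWaiters is the sticky bit: set on every enqueue and
	// cleared only by the listener's once-per-tick Swap.
	hadRecentWaiters atomic.Bool
}

// enqueue models the Admit path adding a waiter to the queue.
func (q *workQueue) enqueue() {
	q.hadRecentWaiters.Store(true)
	q.mu.Lock()
	q.waiting++
	q.mu.Unlock()
}

// drain models the granter emptying the queue once tokens refill.
func (q *workQueue) drain() {
	q.mu.Lock()
	q.waiting = 0
	q.mu.Unlock()
}

// hasWaitingRequests is the instantaneous (point-sampled) signal.
func (q *workQueue) hasWaitingRequests() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.waiting > 0
}

// hasOrHadRecentWaitingRequests is what the listener would call each tick.
// The Swap runs unconditionally so the bit is always reset for the next tick.
func (q *workQueue) hasOrHadRecentWaitingRequests() bool {
	hadRecent := q.hadRecentWaiters.Swap(false)
	return hadRecent || q.hasWaitingRequests()
}

func main() {
	q := &workQueue{}

	// A request is queued and drained entirely between two listener ticks.
	q.enqueue()
	q.drain()

	fmt.Println(q.hasWaitingRequests())            // point sample misses it
	fmt.Println(q.hasOrHadRecentWaitingRequests()) // sticky bit catches it
	fmt.Println(q.hasOrHadRecentWaitingRequests()) // cleared by the previous tick
}
```

Evaluating the `Swap` before the OR, rather than relying on short-circuit order, keeps the bit from staying set across a tick in which the queue happened to be non-empty at sample time.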

Fixes #170400

Epic: none

Release note (bug fix): Fix the elastic CPU admission controller holding
the elastic-work CPU utilization limit at its inactive floor (~12%) even
when there was sustained demand under the scheduling-latency target. The
controller's 1Hz poll could miss queued work that the granter drained
between ticks, causing it to incorrectly conclude there was no demand
and decay the limit.

The elastic CPU controller's scheduler-latency listener point-sampled the
WorkQueue's hasWaitingRequests at ~1Hz to decide whether to raise or
decay the utilization limit. The granter's tryGrant loop drains the
queue to empty as soon as tokens refill, so the queue spends most of
its time empty even under sustained throttling. The listener's poll
frequently landed in those empty windows and took the inactive-decay
branch, pulling the limit down toward inactive_point (~12%) even when
sched latency was well under target and there was clearly demand for
more elastic CPU.

Add a sticky "had recent waiters" atomic bool on WorkQueue, set on
every Admit enqueue and cleared by an atomic Swap from the listener
each tick. The new elasticCPULimiter.hasOrHadRecentWaitingRequests
ORs this with the instantaneous hasWaitingRequests signal so any
enqueue between two ticks is durably visible to the controller, even
if the queue subsequently drained.

Fixes cockroachdb#170400

Release note (bug fix): Fix the elastic CPU admission controller
holding the elastic-work CPU utilization limit at its inactive
floor (~12%) even when there was sustained demand under the
scheduling-latency target. The controller's 1Hz poll could miss
queued work that the granter drained between ticks, causing it to
incorrectly conclude there was no demand and decay the limit.

trunk-io Bot commented Jun 1, 2026

Contributor

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@cockroach-teamcity

Member

This change is Reviewable

Development

Successfully merging this pull request may close these issues.

admission: elastic CPU controller misses demand due to point-sample of hasWaitingRequests

2 participants