admission: observe transient elastic CPU waiters via sticky bit#171300
Open
dt wants to merge 1 commit into
Open
admission: observe transient elastic CPU waiters via sticky bit#171300dt wants to merge 1 commit into
dt wants to merge 1 commit into
Conversation
The elastic CPU controller's scheduler-latency listener point-sampled the WorkQueue's hasWaitingRequests at ~1Hz to decide whether to raise or decay the utilization limit. The granter's tryGrant loop drains the queue to empty as soon as tokens refill, so the queue spends most of its time empty even under sustained throttling. The listener's poll frequently landed in those empty windows and took the inactive-decay branch, pulling the limit down toward inactive_point (~12%) even when sched latency was well under target and there was clearly demand for more elastic CPU. Add a sticky "had recent waiters" atomic bool on WorkQueue, set on every Admit enqueue and cleared by an atomic Swap from the listener each tick. The new elasticCPULimiter.hasOrHadRecentWaitingRequests ORs this with the instantaneous hasWaitingRequests signal so any enqueue between two ticks is durably visible to the controller, even if the queue subsequently drained. Fixes cockroachdb#170400 Release note (bug fix): Fix the elastic CPU admission controller holding the elastic-work CPU utilization limit at its inactive floor (~12%) even when there was sustained demand under the scheduling-latency target. The controller's 1Hz poll could miss queued work that the granter drained between ticks, causing it to incorrectly conclude there was no demand and decay the limit.
Contributor
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
Member
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The elastic CPU controller's scheduler-latency listener point-samples the
WorkQueue's
hasWaitingRequestsat ~1Hz. The granter'stryGrantloopdrains the queue to empty as soon as tokens refill, so the queue spends
most of its time empty even under sustained throttling. The listener's
poll frequently lands in those empty windows and takes the inactive-decay
branch, pulling the utilization limit down toward
inactive_point(~12%)even when scheduler latency is well under target and there is clearly
demand for more elastic CPU.
This PR adds a sticky "had recent waiters" atomic bool on
WorkQueue,set on every
Admitenqueue and cleared by an atomic Swap from thelistener each tick. A new
elasticCPULimiter.hasOrHadRecentWaitingRequestsORs this with the instantaneous
hasWaitingRequestssignal so anyenqueue between two ticks is durably visible to the controller, even if
the queue subsequently drained.
Fixes #170400
Epic: none
Release note (bug fix): Fix the elastic CPU admission controller holding
the elastic-work CPU utilization limit at its inactive floor (~12%) even
when there was sustained demand under the scheduling-latency target. The
controller's 1Hz poll could miss queued work that the granter drained
between ticks, causing it to incorrectly conclude there was no demand
and decay the limit.