docs: add V3 app hang tracking design spec #8025
itaybre
left a comment
LGTM, some comments here and there
- Soften overhead claims from "none/minimal" to "low" with acknowledgment that benchmarking is needed
- Clarify deprecated option removal targets v10
- Add SentryCrash decoupling plan for stack trace capture
- Document that `mechanism.data` fields pass through relay and render as pills in the frontend with no backend changes needed
- Fix watchOS fatal duration recovery: semaphore timeouts still update `last_sample_time` even without stack trace capture
- Specify atomic writes for persistence to avoid partial files
- Add note on mid-RunLoop suspension handling
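As a rough illustration of the "atomic writes for persistence" item above, one common pattern is to write to a temporary file in the same directory and then rename it over the target. The sketch below is in Python for brevity and is not taken from the spec; the function name `write_atomically` and the payload keys are hypothetical.

```python
import json
import os
import tempfile


def write_atomically(path: str, payload: dict) -> None:
    """Write payload as JSON so a reader never observes a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        # os.replace swaps the file in one step: readers see the old or the new content.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process is killed mid-write, the previous file stays intact and only a stray `.tmp` file may be left behind, which matters when the file is overwritten once per sample.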
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Reviewed by Cursor Bugbot for commit 1e1396e. Configure here.
- Later samples are more representative of the blocking call that sustained the hang
- Early samples near `samplingThreshold` may capture setup/transition code rather than the actual blocker
- The `samples_total` counter still reflects all captured samples (including evicted ones), while `representative_count` and `sample_groups` only reflect retained samples. This allows the confidence metric to indicate when significant data was lost.
Confidence ignores evicted samples
Medium Severity
The doc says ring-buffer eviction lets confidence show significant sample loss, but confidence is defined as `representative_count` divided by retained samples only. After eviction, `samples_total` can be much larger while confidence stays near 1.0, so readers may misinterpret attribution quality.
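A small simulation may make the concern concrete. It is a sketch in Python, not the spec's implementation: `summarize`, `confidence_retained`, and `confidence_total` are hypothetical names, and dividing by `samples_total` is only one possible way to surface the loss.

```python
from collections import Counter, deque


def summarize(sample_keys, capacity):
    """Compare confidence over retained samples with confidence over all captured samples."""
    retained = deque(maxlen=capacity)  # ring buffer: the oldest sample is evicted first
    samples_total = 0
    for key in sample_keys:
        retained.append(key)
        samples_total += 1
    groups = Counter(retained)  # stand-in for sample_groups (retained samples only)
    representative_count = max(groups.values()) if groups else 0
    return {
        "samples_total": samples_total,
        "representative_count": representative_count,
        # Definition flagged in the review: the denominator is the retained samples.
        "confidence_retained": representative_count / len(retained) if retained else 0.0,
        # Hypothetical alternative: the denominator is every captured sample.
        "confidence_total": representative_count / samples_total if samples_total else 0.0,
    }


# 30 early "setup" samples followed by 10 "blocker" samples, buffer capacity 10.
result = summarize(["setup"] * 30 + ["blocker"] * 10, capacity=10)
```

Here `confidence_retained` is 1.0 even though three quarters of the captured samples were evicted, while `confidence_total` is 0.25.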
Just butting in here because the spec read was amazing. If you're interested in saving yourselves some cycles, you might want to check out the way KSCrash has it set up, and either participate in improving it or just use it outright. I think this is going out in 2.6, which is in beta now. But regardless, the spec looks great to me :)
Compare V3's semaphore-based approach to KSCrash's CFRunLoopTimer-based watchdog monitor. Document the timer vs semaphore decision, stack trace capture differences, and lock-then-suspend ordering constraint for the wrapper.
| -------------------- | ----------------------------------------------------------- | ------------------------------------------------------------- |
| Escalation mechanism | `CFRunLoopTimer` on a watchdog `CFRunLoop` | `DispatchSemaphore` timeout loop on a serial background queue |
| Threshold | Single (250ms) | Tiered (25ms hitch → 250ms sampling → 2s hang) |
| Stack trace capture | Once at detection, suspends all threads | Sampled every 250ms, main thread only |
fwiw, the TimeProfiler does the sampling; the watchdog only signals that the issue exists and reports it. We chose this method because the two are distinct from each other and can easily be joined later on remotely as needed.
See KSCrash+Hang.h
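For reference, the tiered V3 thresholds quoted in the table row above (25ms hitch → 250ms sampling → 2s hang) can be read as a simple classification of how long the main thread has been blocked. This is a hedged sketch in Python: the function name, the tier labels, and the inclusive boundaries are assumptions, not taken from the spec.

```python
# Thresholds in seconds, taken from the comparison table (25ms, 250ms, 2s).
HITCH_THRESHOLD = 0.025
SAMPLING_THRESHOLD = 0.250
HANG_THRESHOLD = 2.0


def classify_block(duration: float) -> str:
    """Map a main-thread block duration to the highest tier it has reached."""
    # Inclusive comparisons are an assumption; the spec excerpt does not state them.
    if duration >= HANG_THRESHOLD:
        return "hang"
    if duration >= SAMPLING_THRESHOLD:
        return "sampling"
    if duration >= HITCH_THRESHOLD:
        return "hitch"
    return "none"
```

A single-threshold design such as the 250ms one listed for KSCrash would collapse this to one comparison.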
| Suspension detection | Not implemented | Wall-clock delta check (`actualWait > expectedWait * 3.0`) |
| Timer lifecycle | Created per `afterWaiting`, invalidated per `beforeWaiting` | Semaphore created per run loop iteration |

### Timer vs Semaphore Decision
Using a semaphore was considered for KSCrash. We chose the timer instead because it lets the OS have a much larger say in how things should work, rather than forcing it into our way with a semaphore. Also, semaphores don't play well with a lot of the concurrency systems (priority inversion, etc.).
@noahsmartin I believe you advocated for using a semaphore in our case. Did we consider challenges in concurrency systems?
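To make the mechanism under discussion concrete, here is a minimal model of a semaphore timeout loop, written in Python with `threading.Semaphore` standing in for `DispatchSemaphore`. It only illustrates the escalation idea (each timeout is one more interval during which the main thread did not signal); `count_timeouts` and its parameters are hypothetical, and the sketch does not address the priority-inversion concern raised above.

```python
import threading


def count_timeouts(done: threading.Semaphore, interval: float, max_intervals: int) -> int:
    """Wait for `done` in fixed intervals and count how many intervals elapsed without a signal."""
    timeouts = 0
    while timeouts < max_intervals:
        # acquire() returns True if the semaphore was signaled before the timeout.
        if done.acquire(timeout=interval):
            break
        timeouts += 1  # one more interval in which the run loop iteration did not finish
    return timeouts


# A semaphore that is never signaled models a main thread that stays blocked.
blocked = threading.Semaphore(0)
elapsed_intervals = count_timeouts(blocked, interval=0.01, max_intervals=3)
```

In a real watchdog the main thread would signal the semaphore when its run loop iteration completes, and each timeout would trigger the next escalation step.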
| Stack trace capture | Once at detection, suspends all threads | Sampled every 250ms, main thread only |
| Persistence | Full JSON report once + 24-byte mmap'd binary sidecar | Full JSON overwrite per sample |
| Thread safety | `os_unfair_lock` + `_Atomic uint64_t` for `enterTime` | Observers lock + serial background queue |
| Suspension detection | Not implemented | Wall-clock delta check (`actualWait > expectedWait * 3.0`) |
KSCrash prefers to leave this up to the user of the framework as they may want to deal with this kind of suspension in a different way.
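The wall-clock delta check quoted in the table row above is small enough to sketch directly. The comparison and the factor 3.0 come from that row; the function and constant names are illustrative only, and Python is used just to keep the example runnable.

```python
SUSPENSION_FACTOR = 3.0  # multiplier from the table's `actualWait > expectedWait * 3.0`


def looks_like_suspension(actual_wait: float, expected_wait: float) -> bool:
    """Return True when a wait took so much longer than requested that the process was likely suspended."""
    return actual_wait > expected_wait * SUSPENSION_FACTOR
```

For a 250ms expected wait, an observed 0.9s wait would be treated as a suspension, while 0.3s would still count as ordinary scheduling delay.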
### Stack Trace Capture: All Threads vs Main Thread Only

KSCrash suspends **all threads** via `ksmc_suspendEnvironment()` (`task_threads()` + `thread_suspend()` on each) and captures stack traces for every thread. V3 suspends **only the main thread** via a single `thread_suspend()` call and captures only the main thread's stack trace.
We could definitely document this better. The goal was to have a regular "termination" report for when a hang starts. This gives us a few things:
- A report ready if this turns into an actual watchdog timeout termination.
- All threads captured make it much easier to understand a hang from a diagnostics POV.
- If chosen, a TimeProfile can also be run to get only the main thread at intervals in order to view the hang as some sort of graph. See this profile for example.
Thank you @naftaly for your feedback! We'll be in touch to see how we can collaborate on this.


Summary
- `develop-docs/APP-HANG-TRACKING.md`: comprehensive design spec for V3 app hang tracking based on `CFRunLoopObserver`
- … (`options.experimental.appHangs.*`), mutual exclusivity with V1/V2, platform behavior (all platforms supported, watchOS timing-only), and ANR Tracker V3 wrapper architecture

Motivation
V1 and V2 have fundamental limitations: polling overhead when idle, coarse detection granularity, single-snapshot stack traces that may miss the root cause, and no fatal hang duration recovery. V3 replaces both with an event-driven CFRunLoopObserver approach that provides precise timing, sampled stack traces with confidence metrics, and incremental persistence for fatal hang recovery.
This is a living design document for discussion and iteration before implementation begins.
#skip-changelog