docs: add V3 app hang tracking design spec by philprime · Pull Request #8025 · getsentry/sentry-cocoa

philprime · 2026-06-09T11:09:12Z

Summary

- Adds develop-docs/APP-HANG-TRACKING.md, a comprehensive design spec for V3 app hang tracking based on CFRunLoopObserver
- Covers: core mechanism, tiered detection with stack trace sampling, signal categories, client-side sample aggregation, fatal hang persistence with duration recovery, event sending mechanics, option structure (options.experimental.appHangs.*), mutual exclusivity with V1/V2, platform behavior (all platforms supported, watchOS timing-only), and ANR Tracker V3 wrapper architecture
- Includes comparison matrix (V1 vs V2 vs V3), annotated open GitHub issues, and documented V1/V2 for reference

Motivation

V1 and V2 have fundamental limitations: polling overhead when idle, coarse detection granularity, single-snapshot stack traces that may miss the root cause, and no fatal hang duration recovery. V3 replaces both with an event-driven CFRunLoopObserver approach that provides precise timing, sampled stack traces with confidence metrics, and incremental persistence for fatal hang recovery.

This is a living design document for discussion and iteration before implementation begins.

#skip-changelog

itaybre

LGTM, some comments here and there

- Soften overhead claims from "none/minimal" to "low" with acknowledgment that benchmarking is needed
- Clarify deprecated option removal targets v10
- Add SentryCrash decoupling plan for stack trace capture
- Document that mechanism.data fields pass through relay and render as pills in the frontend with no backend changes needed
- Fix watchOS fatal duration recovery: semaphore timeouts still update last_sample_time even without stack trace capture
- Specify atomic writes for persistence to avoid partial files
- Add note on mid-RunLoop suspension handling
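For readers unfamiliar with the "atomic writes" item, one common way to avoid partial files is to write a sibling temporary file and rename it over the target. The sketch below assumes that approach on a POSIX filesystem; the spec excerpted in this thread does not say which technique it uses, and `write_atomically` and the `.tmp` suffix are hypothetical names.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, not taken from the spec: replaces `path` with `len`
 * bytes of `data` so that a reader sees either the old or the new content,
 * never a half-written file. Returns 0 on success, -1 on failure. */
static int write_atomically(const char *path, const void *data, size_t len) {
    assert(path != NULL && data != NULL);

    /* Write to a sibling temp file on the same filesystem as the target. */
    char tmp[1024];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;

    const char *p = data;
    size_t left = len;
    int ok = 1;
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w <= 0) { ok = 0; break; }  /* sketch: any short-circuit is a failure */
        p += w;
        left -= (size_t)w;
    }
    if (ok && fsync(fd) != 0) ok = 0;   /* flush data before it becomes visible */
    if (close(fd) != 0) ok = 0;

    /* rename() swaps the directory entry in one step on POSIX filesystems. */
    if (!ok || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

If the process is killed mid-write, only the temporary file is left incomplete, so the previously persisted hang state stays readable on the next launch.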

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.


^{Reviewed by Cursor Bugbot for commit 1e1396e. Configure here.}

cursor · 2026-06-10T09:04:03Z

+
+- Later samples are more representative of the blocking call that sustained the hang
+- Early samples near `samplingThreshold` may capture setup/transition code rather than the actual blocker
+- The `samples_total` counter still reflects all captured samples (including evicted ones), while `representative_count` and `sample_groups` only reflect retained samples. This allows the confidence metric to indicate when significant data was lost.
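The bookkeeping described in the quoted lines can be sketched as a fixed-size ring buffer that overwrites the oldest sample while a separate counter keeps the lifetime total. This is a minimal illustration only: the capacity, the `sample_ring` type, and the use of a stack hash per sample are assumptions, not details from the spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_CAPACITY 4 /* hypothetical; the spec's real limit is not shown in this thread */

typedef struct {
    uint64_t stack_hash[SAMPLE_CAPACITY]; /* retained samples; the oldest slot is overwritten first */
    size_t   next;          /* slot the next sample is written to */
    size_t   retained;      /* samples currently held, at most SAMPLE_CAPACITY */
    uint64_t samples_total; /* every sample ever captured, including evicted ones */
} sample_ring;

/* Stores a sample, evicting the oldest one once the buffer is full. */
static void sample_ring_push(sample_ring *ring, uint64_t stack_hash) {
    assert(ring != NULL);
    ring->stack_hash[ring->next] = stack_hash;
    ring->next = (ring->next + 1) % SAMPLE_CAPACITY;
    if (ring->retained < SAMPLE_CAPACITY) ring->retained++;
    ring->samples_total++;
}

/* Number of samples that were captured but are no longer retained. */
static uint64_t sample_ring_evicted(const sample_ring *ring) {
    assert(ring != NULL);
    return ring->samples_total - ring->retained;
}
```

With a capacity of 4, pushing six samples leaves `samples_total` at 6 and `retained` at 4, so two early samples have been evicted and only the later ones remain available for grouping.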


Confidence ignores evicted samples

Medium Severity

The doc says ring-buffer eviction lets confidence show significant sample loss, but confidence is defined as `representative_count` divided by retained samples only. After eviction, `samples_total` can be much larger while confidence stays near 1.0, so readers can misinterpret attribution quality.
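A small numeric sketch makes the gap concrete. It computes confidence over retained samples, as the comment describes, next to a ratio over `samples_total`. The second ratio is an illustrative assumption rather than something the spec defines: because the stacks of evicted samples are unknown, it is only a lower bound on the share of all samples that match the representative stack. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    double retained_confidence; /* representative_count / retained samples */
    double lower_bound;         /* representative_count / samples_total (assumed metric) */
} hang_confidence;

/* Computes both ratios; returns zeros when no samples are retained. */
static hang_confidence hang_confidence_compute(uint64_t representative_count,
                                               uint64_t retained,
                                               uint64_t samples_total) {
    assert(representative_count <= retained && retained <= samples_total);
    hang_confidence c = {0.0, 0.0};
    if (retained == 0) return c;
    c.retained_confidence = (double)representative_count / (double)retained;
    c.lower_bound = (double)representative_count / (double)samples_total;
    return c;
}
```

For example, with 200 samples captured, 20 retained, and all 20 retained samples in the representative group, the retained-only confidence is 1.0 while the lower bound over all samples is 0.1. Without evictions the two values coincide.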

Additional Locations (2)

develop-docs/APP-HANG-TRACKING.md#L309-L312

develop-docs/APP-HANG-TRACKING.md#L347-L350


naftaly · 2026-06-10T20:23:26Z

Just butting in here because the spec was an amazing read. If you're interested in saving yourselves some cycles, you might want to check out the way KSCrash has it set up, and either participate in improving it or just use it outright. I think this is going out in 2.6, which is in beta now. But regardless, the spec looks great to me :)

https://github.com/kstenerud/KSCrash/blob/develop/Sources/KSCrashRecording/Monitors/KSCrashMonitor_Watchdog.c


naftaly · 2026-06-11T14:20:56Z

+| -------------------- | ----------------------------------------------------------- | ------------------------------------------------------------- |
+| Escalation mechanism | `CFRunLoopTimer` on a watchdog `CFRunLoop`                  | `DispatchSemaphore` timeout loop on a serial background queue |
+| Threshold            | Single (250ms)                                              | Tiered (25ms hitch → 250ms sampling → 2s hang)                |
+| Stack trace capture  | Once at detection, suspends all threads                     | Sampled every 250ms, main thread only                         |


FWIW, the TimeProfiler does the sampling; the watchdog only detects that the issue exists and reports it. We chose this method because the two are distinct from each other and can easily be joined later on remotely as needed.

See KSCrash+Hang.h
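For reference, the tiered thresholds in the quoted V3 column (25ms hitch → 250ms sampling → 2s hang) can be read as a simple classification of how long the main thread has been blocked. The sketch below assumes each boundary is inclusive, which the quoted table does not state; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Thresholds taken from the quoted table row, in milliseconds. */
enum { HITCH_MS = 25, SAMPLING_MS = 250, HANG_MS = 2000 };
static_assert(HITCH_MS < SAMPLING_MS && SAMPLING_MS < HANG_MS,
              "tiers must be strictly ordered");

typedef enum { STALL_NONE, STALL_HITCH, STALL_SAMPLING, STALL_HANG } stall_tier;

/* Maps a blocked duration to the highest tier it has reached.
 * Assumption: a duration exactly at a threshold counts as reaching it. */
static stall_tier stall_tier_for(uint64_t blocked_ms) {
    if (blocked_ms >= HANG_MS) return STALL_HANG;
    if (blocked_ms >= SAMPLING_MS) return STALL_SAMPLING;
    if (blocked_ms >= HITCH_MS) return STALL_HITCH;
    return STALL_NONE;
}
```

A single-threshold design such as the quoted KSCrash column would collapse this to one comparison at 250 ms, with sampling handled by a separate component.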

naftaly · 2026-06-11T14:24:27Z

+| Suspension detection | Not implemented                                             | Wall-clock delta check (`actualWait > expectedWait * 3.0`)    |
+| Timer lifecycle      | Created per `afterWaiting`, invalidated per `beforeWaiting` | Semaphore created per run loop iteration                      |
+
+### Timer vs Semaphore Decision


Using a semaphore was considered for KSCrash. We chose the timer instead because it gives the OS much more say in how things should work, rather than forcing it into our way with a semaphore. Also, semaphores don't play well with many of the concurrency systems (priority inversion, etc.).

@noahsmartin I believe you advocated for using a semaphore in our case. Did we consider challenges in concurrency systems?

naftaly · 2026-06-11T14:25:09Z

+| Stack trace capture  | Once at detection, suspends all threads                     | Sampled every 250ms, main thread only                         |
+| Persistence          | Full JSON report once + 24-byte mmap'd binary sidecar       | Full JSON overwrite per sample                                |
+| Thread safety        | `os_unfair_lock` + `_Atomic uint64_t` for `enterTime`       | Observers lock + serial background queue                      |
+| Suspension detection | Not implemented                                             | Wall-clock delta check (`actualWait > expectedWait * 3.0`)    |


KSCrash prefers to leave this up to the user of the framework as they may want to deal with this kind of suspension in a different way.
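The V3 check quoted in the table row (`actualWait > expectedWait * 3.0`) amounts to comparing the wall-clock time a wait actually took with the timeout that was requested. A minimal sketch follows, with a hypothetical function name and durations in seconds; how V3 measures the two values is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define SUSPENSION_FACTOR 3.0 /* multiplier from the quoted check */

/* Returns true when a timed wait took so much longer than requested that the
 * process was most likely suspended rather than the main thread being hung. */
static bool wait_looks_suspended(double expected_wait_s, double actual_wait_s) {
    assert(expected_wait_s > 0.0 && actual_wait_s >= 0.0);
    return actual_wait_s > expected_wait_s * SUSPENSION_FACTOR;
}
```

Under this check, a 250 ms wait that returns after 260 ms is treated as a normal timeout, while the same wait returning after 30 s of wall-clock time is flagged as a likely suspension. A wait of exactly three times the expected duration is not flagged because the comparison is strict.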

naftaly · 2026-06-11T14:30:23Z

+
+### Stack Trace Capture: All Threads vs Main Thread Only
+
+KSCrash suspends **all threads** via `ksmc_suspendEnvironment()` (`task_threads()` + `thread_suspend()` on each) and captures stack traces for every thread. V3 suspends **only the main thread** via a single `thread_suspend()` call and captures only the main thread's stack trace.


We could definitely document this better. The goal was to have a regular "termination" report for when a hang starts. This gives us a few things:

- A report ready if this turns into an actual watchdog timeout termination.
- All threads captured make it much easier to understand a hang from a diagnostics POV.
- If chosen, a TimeProfile can also be run to get only the main thread at intervals in order to view the hang as some sort of graph. See this profile for example.

naftaly

Amazing, thanks so much for doing the comparison. Please reach out if we can be of any help understanding some choices we may not have documented well, or if you just want to talk about it.

philprime · 2026-06-11T14:54:22Z

Thank you @naftaly for your feedback! We'll be in touch to see how we can collaborate on this.

docs: add V3 app hang tracking design spec

8adf6e5

philprime requested review from NinjaLikesCheez, itaybre, noahsmartin and philipphofmann as code owners June 9, 2026 11:09

cursor Bot reviewed Jun 9, 2026


philprime self-assigned this Jun 9, 2026

docs: add TL;DR summary to app hang tracking spec

1144e23

itaybre approved these changes Jun 9, 2026


philprime added 2 commits June 10, 2026 10:48

clarify overhead details

b5aa3c0

cursor Bot reviewed Jun 10, 2026


philprime added the ready-to-merge label Jun 10, 2026

docs: add KSCrash watchdog comparison to hang tracking spec

2cde2bd

Compare V3's semaphore-based approach to KSCrash's CFRunLoopTimer-based watchdog monitor. Document the timer vs semaphore decision, stack trace capture differences, and lock-then-suspend ordering constraint for the wrapper.

naftaly reviewed Jun 11, 2026


philprime marked this pull request as draft June 15, 2026 14:18

