
docs: add V3 app hang tracking design spec #8025

Draft
philprime wants to merge 5 commits into main from docs/app-hang-tracking-v3-spec

Conversation

@philprime

Member

Summary

  • Adds develop-docs/APP-HANG-TRACKING.md — comprehensive design spec for V3 app hang tracking based on CFRunLoopObserver
  • Covers: core mechanism, tiered detection with stack trace sampling, signal categories, client-side sample aggregation, fatal hang persistence with duration recovery, event sending mechanics, option structure (options.experimental.appHangs.*), mutual exclusivity with V1/V2, platform behavior (all platforms supported, watchOS timing-only), and ANR Tracker V3 wrapper architecture
  • Includes comparison matrix (V1 vs V2 vs V3), annotated open GitHub issues, and documented V1/V2 for reference

Motivation

V1 and V2 have fundamental limitations: polling overhead when idle, coarse detection granularity, single-snapshot stack traces that may miss the root cause, and no fatal hang duration recovery. V3 replaces both with an event-driven CFRunLoopObserver approach that provides precise timing, sampled stack traces with confidence metrics, and incremental persistence for fatal hang recovery.

This is a living design document for discussion and iteration before implementation begins.

#skip-changelog

Comment thread develop-docs/APP-HANG-TRACKING.md
@philprime self-assigned this Jun 9, 2026

@itaybre itaybre left a comment

Contributor

LGTM, some comments here and there

Comment thread develop-docs/APP-HANG-TRACKING.md Outdated
Comment thread develop-docs/APP-HANG-TRACKING.md
Comment thread develop-docs/APP-HANG-TRACKING.md Outdated
Comment thread develop-docs/APP-HANG-TRACKING.md
Comment thread develop-docs/APP-HANG-TRACKING.md Outdated
Comment thread develop-docs/APP-HANG-TRACKING.md Outdated
- Soften overhead claims from "none/minimal" to "low" with
  acknowledgment that benchmarking is needed
- Clarify deprecated option removal targets v10
- Add SentryCrash decoupling plan for stack trace capture
- Document that mechanism.data fields pass through relay and
  render as pills in the frontend with no backend changes needed
- Fix watchOS fatal duration recovery: semaphore timeouts still
  update last_sample_time even without stack trace capture
- Specify atomic writes for persistence to avoid partial files
- Add note on mid-RunLoop suspension handling

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

Reviewed by Cursor Bugbot for commit 1e1396e.

Comment thread develop-docs/APP-HANG-TRACKING.md

- Later samples are more representative of the blocking call that sustained the hang
- Early samples near `samplingThreshold` may capture setup/transition code rather than the actual blocker
- The `samples_total` counter still reflects all captured samples (including evicted ones), while `representative_count` and `sample_groups` only reflect retained samples. This allows the confidence metric to indicate when significant data was lost.


Confidence ignores evicted samples

Medium Severity

The doc says that keeping `samples_total` across ring-buffer eviction lets the confidence metric reveal significant sample loss, but confidence is defined as `representative_count` divided by the number of retained samples only. After eviction, `samples_total` can be much larger than the retained count while confidence stays near 1.0, so readers may overestimate attribution quality.
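
To make the concern concrete, here is a minimal sketch with made-up numbers. The struct, the `samples_retained` field, and both function names are hypothetical, and `representative_count` is assumed to count retained samples matching the representative stack. The second function is just one possible eviction-aware variant, not something the spec defines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters; samples_total and representative_count mirror the
 * names quoted from the spec, samples_retained is an assumed name. */
typedef struct {
    size_t samples_total;        /* all captured samples, including evicted ones */
    size_t samples_retained;     /* samples still held in the ring buffer */
    size_t representative_count; /* retained samples matching the representative stack */
} hang_sample_stats;

/* Confidence as described in this review: representative / retained. */
static double confidence_retained_only(const hang_sample_stats *s) {
    assert(s->representative_count <= s->samples_retained);
    if (s->samples_retained == 0) return 0.0;
    return (double)s->representative_count / (double)s->samples_retained;
}

/* One possible eviction-aware variant (an assumption, not from the spec):
 * divide by every captured sample, so evicted samples lower the value. */
static double confidence_against_total(const hang_sample_stats *s) {
    assert(s->representative_count <= s->samples_total);
    if (s->samples_total == 0) return 0.0;
    return (double)s->representative_count / (double)s->samples_total;
}
```

For example, with 80 samples captured, 20 retained, and all 20 retained samples in the same group, the retained-only value is 1.0 while the total-based value is 0.25.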


Comment thread develop-docs/APP-HANG-TRACKING.md
@philprime added the ready-to-merge label ("Use this label to trigger all PR workflows") Jun 10, 2026
@naftaly

naftaly commented Jun 10, 2026

Contributor

Just butting in here because the spec was an amazing read. If you're interested in saving yourselves some cycles, you might want to check out how KSCrash has it set up, and either participate in improving it or just use it outright. I think this is going out in 2.6, which is in beta now. But regardless, the spec looks great to me :)

https://github.com/kstenerud/KSCrash/blob/develop/Sources/KSCrashRecording/Monitors/KSCrashMonitor_Watchdog.c

Compare V3's semaphore-based approach to KSCrash's
CFRunLoopTimer-based watchdog monitor. Document the timer
vs semaphore decision, stack trace capture differences, and
lock-then-suspend ordering constraint for the wrapper.
| -------------------- | ----------------------------------------------------------- | ------------------------------------------------------------- |
| Escalation mechanism | `CFRunLoopTimer` on a watchdog `CFRunLoop` | `DispatchSemaphore` timeout loop on a serial background queue |
| Threshold | Single (250ms) | Tiered (25ms hitch → 250ms sampling → 2s hang) |
| Stack trace capture | Once at detection, suspends all threads | Sampled every 250ms, main thread only |

@naftaly naftaly Jun 11, 2026

Contributor

FWIW, the TimeProfiler does the sampling; the watchdog only detects that the issue exists and reports it. We chose this split because the two are distinct from each other and can easily be joined later on remotely as needed.

See KSCrash+Hang.h
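
For reference, the tiered thresholds in the V3 column of the comparison table (25ms hitch → 250ms sampling → 2s hang) could be sketched as a simple classifier. The enum, macro, and function names are hypothetical, and treating each threshold as inclusive is an assumption; the thread does not say how boundary values are handled.

```c
#include <assert.h>
#include <stdint.h>

/* Threshold values taken from the comparison table; names are hypothetical. */
#define HITCH_THRESHOLD_MS    25
#define SAMPLING_THRESHOLD_MS 250
#define HANG_THRESHOLD_MS     2000

static_assert(HITCH_THRESHOLD_MS < SAMPLING_THRESHOLD_MS, "tiers must be ordered");
static_assert(SAMPLING_THRESHOLD_MS < HANG_THRESHOLD_MS, "tiers must be ordered");

typedef enum {
    HANG_TIER_NONE,     /* below the hitch threshold */
    HANG_TIER_HITCH,    /* blocked long enough to count as a hitch */
    HANG_TIER_SAMPLING, /* blocked long enough that stack sampling starts */
    HANG_TIER_HANG      /* blocked long enough to be reported as a hang */
} hang_tier;

/* Map how long the main thread has been blocked to the highest tier reached.
 * Inclusive comparisons are an assumption. */
static hang_tier classify_blocked_duration(uint64_t blocked_ms) {
    if (blocked_ms >= HANG_THRESHOLD_MS) return HANG_TIER_HANG;
    if (blocked_ms >= SAMPLING_THRESHOLD_MS) return HANG_TIER_SAMPLING;
    if (blocked_ms >= HITCH_THRESHOLD_MS) return HANG_TIER_HITCH;
    return HANG_TIER_NONE;
}
```

Under these assumptions, a 300ms block reaches the sampling tier without being reported as a hang, which a single 250ms threshold would not distinguish.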

| Suspension detection | Not implemented | Wall-clock delta check (`actualWait > expectedWait * 3.0`) |
| Timer lifecycle | Created per `afterWaiting`, invalidated per `beforeWaiting` | Semaphore created per run loop iteration |

### Timer vs Semaphore Decision

Contributor

Using a semaphore was considered for KSCrash. We chose the timer instead because it lets the OS have much more say in how things should work, rather than forcing it into our way with a semaphore. Also, semaphores don't play well with many of the concurrency systems (priority inversion, etc.).

@philprime philprime Jun 11, 2026

Member Author

@noahsmartin I believe you advocated for using a semaphore in our case. Did we consider challenges in concurrency systems?

| Stack trace capture | Once at detection, suspends all threads | Sampled every 250ms, main thread only |
| Persistence | Full JSON report once + 24-byte mmap'd binary sidecar | Full JSON overwrite per sample |
| Thread safety | `os_unfair_lock` + `_Atomic uint64_t` for `enterTime` | Observers lock + serial background queue |
| Suspension detection | Not implemented | Wall-clock delta check (`actualWait > expectedWait * 3.0`) |

Contributor

KSCrash prefers to leave this up to the user of the framework as they may want to deal with this kind of suspension in a different way.
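
For readers following along, the wall-clock delta check quoted in the V3 column (`actualWait > expectedWait * 3.0`) can be sketched as below. The function and macro names are hypothetical, and using nanoseconds as the unit is an assumption. What V3 does once a suspension is suspected is not covered in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Factor taken from the quoted check; the macro name is hypothetical. */
#define SUSPENSION_FACTOR 3.0

/* Returns true when the observed wall-clock wait is so much longer than the
 * requested timeout that the process was probably suspended during the wait,
 * rather than the main thread simply being blocked. */
static bool looks_like_suspension(uint64_t expected_wait_ns, uint64_t actual_wait_ns) {
    assert(expected_wait_ns > 0);
    return (double)actual_wait_ns > (double)expected_wait_ns * SUSPENSION_FACTOR;
}
```

With a 250ms timeout, a wait that took 260ms passes as normal scheduling jitter, while a wait that took 5s is flagged.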


### Stack Trace Capture: All Threads vs Main Thread Only

KSCrash suspends **all threads** via `ksmc_suspendEnvironment()` (`task_threads()` + `thread_suspend()` on each) and captures stack traces for every thread. V3 suspends **only the main thread** via a single `thread_suspend()` call and captures only the main thread's stack trace.

Contributor

We could definitely document this better. The goal was to have a regular "termination" report for when a hang starts. This gives us a few things:

  • A report ready if this turns into an actual watchdog timeout termination.
  • All threads captured make it much easier to understand a hang from a diagnostics POV.
  • If chosen, a TimeProfile can also be run to get only the main thread at intervals in order to view the hang as some sort of graph. See this profile for example.

@naftaly naftaly left a comment

Contributor

Amazing, thanks so much for doing the comparison. Please reach out if we can be of any help in understanding choices we may not have documented well, or if you just want to talk about it.

@philprime

Member Author

Thank you @naftaly for your feedback! We'll be in touch to see how we can collaborate on this.

@philprime philprime marked this pull request as draft June 15, 2026 14:18