fix: windows shutdown#695
Conversation
|
Here is more information. 1. Problem statementDetailsConfiguration:
Observed behavior:
Expected behavior:
2. Root cause analysisDetailsThe hang is not caused by a single bug. It is the composition of six Layer 1 -- pydevd single-steps every line on StopWhen PyCharm initiates a stop in Consequence: any Python code that runs after Stop has been pressed can be Additionally, pydevd injects a Layer 2 -- Windows Job Objects trap subprocess childrenPyCharm places the debugged process in a Windows Job Object with Consequence: when PyCharm tears the debug session down, the OS closes the job The standard escape ( Layer 3 -- pydevd patches
|
| Alternative | Fails because |
|---|---|
taskkill /F /T /PID from a job-escaped child |
Layer 5: parent may already be dead |
manager.terminate() in KI handler |
Layer 1: pydevd suspends the handler mid-execution |
daemon=True on uvicorn workers |
Hard constraint: workers must be non-daemon |
daemon=True on the SyncManager |
mp.Manager() does not expose this knob |
sys.settrace tricks to evade pydevd |
Fragile, pydevd-version-specific |
subprocess.Popen sentinel |
Layer 3: pydevd patches Popen |
CREATE_BREAKAWAY_FROM_JOB only |
Layer 2: not all jobs permit breakaway |
The accepted fix defeats every layer with the minimum set of mechanisms:
- ctypes
CreateProcessW-- defeats Layer 3 (Popen patch). - WMI relay under
WmiPrvSE-- defeats Layer 2 (Job Object). - External sentinel running before Stop -- defeats Layer 1 (line-stepping).
- Filesystem heartbeat -- defeats Layer 1 (no IPC for pydevd to intercept).
CreateToolhelp32Snapshot+ BFS -- defeats Layers 4 and 5 (orphans with
dead parents).- Single timeout, no
last_livecounter -- defeats Layer 6 (double delay).
No normal-path code is touched. The sentinel is only spawned when
_is_pydevd_active() returns true at startup, so production runs pay zero
overhead.
6. Things to keep in mind when modifying this code
Details
- The sentinel is a real installed file.
_child.pylives in the package
and is located at runtime viaos.path.dirname(__file__). Do not move it
without updatingstart_heartbeat_sentinel. - Do not spawn the sentinel via
subprocess.Popen. Use
_create_process_no_window.Popengoes through pydevd's patch. - Do not switch the sentinel kill back to
taskkill. It cannot find
orphans whose parent has exited. - Heartbeat path must be unique per PID (
litserve_hb_{pid}.tmp). A fixed
path means a stale sentinel from a previous run will kill the new server. kill_delayis a budget, not a guarantee. It bounds how long after the
heartbeat goes stale the kill fires. The heartbeat updates every 0.5s, so
there is ~6x safety margin at 3.0s. Do not lower it below ~1.5s._spawn_external_killermust stay deleted. Callingsubprocess.Popen
from inside the KI handler gives pydevd more lines to step-trace, making
Scenario B more likely. Any future "fast-path killer" from inside Python must
use ctypes only and should be kept as short as possible.- Thread-based uvicorn workers (Windows). On Windows,
api_server_worker_type
is forced to"thread".threading.Threadhas no.terminate()or.kill().
_perform_graceful_shutdownalready branches onisinstance(uw, threading.Thread);
keep this branch intact.
3ea7af3 to
e17b980
Compare
46518e6 to
fc91606
Compare
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #695 +/- ##
====================================
Coverage 85% 85%
====================================
Files 39 41 +2
Lines 3282 3387 +105
====================================
+ Hits 2778 2872 +94
- Misses 504 515 +11 🚀 New features to boost your workflow:
|
What does this PR do?
Fixes #684.
Before submitting
Details of the issue and fix.
This was quite tricky to fix, where should I add the documentation and findings?
When a LitServe server is run under PyCharm's debugger with
--multiprocess, clicking Stop can hang the run configuration indefinitely. Two scenarios occur non-deterministically:KeyboardInterrupthandler run to completion._perform_graceful_shutdownfinishes, process exits in ~1s. That's desired. This happens quite rarely.KeyboardInterrupt) KI handler may never complete.Meanwhile
multiprocessing.Manager()'sSyncManagerchild has pydevd's non-daemonCheckAliveThreadinjected into it, so it refuses to exit.PyCharm's session bookkeeping waits on every debugged PID, so the orphaned
SyncManagerholds the session open indefinitely.Layered constraints make the obvious fixes fail: PyCharm puts the debugged process in a Job Object (
JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE), so anysubprocess.Popen-spawned killer is co-managed by pydevd and dies with the job. Andtaskkill /F /T /PID <p>fails with rc=128 once<p>has already exited, because it walks the active process list -- orphaned children whoseth32ParentProcessIDstill points at the dead parent are invisible to it.Fix: a pre-spawned external sentinel started at server startup (only when
_is_pydevd_active()is true) viactypes.CreateProcessW->powershell.exe -File->Invoke-WmiMethod Win32_Process.Create. The final Python process is a child ofWmiPrvSE, outside PyCharm's Job Object, with no pydevd injected. It monitors a filesystem heartbeat (os.utimein the main loop). When the heartbeat goes stale by >kill_delayseconds, or when the main PID is no longer alive, it callsCreateToolhelp32Snapshotto BFS-enumerate all descendants of the main PID and callsTerminateProcesson each -- this finds orphaned children even after their parent has exited, becauseth32ParentProcessIDis fixed at creation time.New files:
src/litserve/_win_shutdown_fix/__init__.py--_create_process_no_windowandstart_heartbeat_sentinel(the launcher, runs in the main process)src/litserve/_win_shutdown_fix/_child.py-- the sentinel child process (a real installed file, invoked directly by path)Other changes in
server.py:_is_pydevd_active()helper (checks"pydevd" in sys.modulesat startup)self._shutdown_event.wait()under pydevd, to keep the heartbeat file alive while waiting for shutdown_perform_graceful_shutdownnow branches onisinstance(uw, threading.Thread)to avoid calling.terminate()on thread-based uvicorn workers (Windows forces thread mode), which do not have that methodThe previous
_spawn_external_killer(called from inside the KI handler viasubprocess.Popen) was making Scenario B worse: pydevd's patchedPopenadded more trace points for pydevd to step through, increasing the probability of the handler suspending mid-cleanup. It has been removed.PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Kinda, this was frustrating to hunt down and fix.