fix: windows shutdown #695

Open
racinmat wants to merge 6 commits into Lightning-AI:main from racinmat:racinsky/fix-win-worker

Conversation

@racinmat racinmat commented Jun 6, 2026

What does this PR do?

Fixes #684.

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
    • No, nobody responded to the issue. I would be glad to discuss it if anyone wants to.
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
    • There are no docs in this repo. Where should I put them?
  • Did you write any new necessary tests?
    • This is very hard to test.

Details of the issue and fix.

This was quite tricky to fix. Where should I add the documentation and findings?

When a LitServe server is run under PyCharm's debugger with --multiprocess, clicking Stop can hang the run configuration indefinitely. Two scenarios occur non-deterministically:

  • Scenario A -- pydevd lets the KeyboardInterrupt (KI) handler run to completion. _perform_graceful_shutdown finishes and the process exits in ~1s. This is the desired outcome, but it happens only rarely.
  • Scenario B -- pydevd flips the main thread into single-step mode on Stop and suspends it between bytecode lines, so the KI handler may never complete.
    Meanwhile multiprocessing.Manager()'s SyncManager child has pydevd's non-daemon CheckAliveThread injected into it, so it refuses to exit.
    PyCharm's session bookkeeping waits on every debugged PID, so the orphaned SyncManager holds the session open indefinitely.

Layered constraints make the obvious fixes fail: PyCharm puts the debugged process in a Job Object (JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE), so any subprocess.Popen-spawned killer is co-managed by pydevd and dies with the job. And taskkill /F /T /PID <p> fails with rc=128 once <p> has already exited, because it walks the active process list -- orphaned children whose th32ParentProcessID still points at the dead parent are invisible to it.

Fix: a pre-spawned external sentinel started at server startup (only when _is_pydevd_active() is true) via ctypes.CreateProcessW -> powershell.exe -File -> Invoke-WmiMethod Win32_Process.Create. The final Python process is a child of WmiPrvSE, outside PyCharm's Job Object, with no pydevd injected. It monitors a filesystem heartbeat (os.utime in the main loop). When the heartbeat goes stale by > kill_delay seconds, or when the main PID is no longer alive, it calls CreateToolhelp32Snapshot to BFS-enumerate all descendants of the main PID and calls TerminateProcess on each -- this finds orphaned children even after their parent has exited, because th32ParentProcessID is fixed at creation time.

New files:

  • src/litserve/_win_shutdown_fix/__init__.py -- _create_process_no_window and start_heartbeat_sentinel (the launcher, runs in the main process)
  • src/litserve/_win_shutdown_fix/_child.py -- the sentinel child process (a real installed file, invoked directly by path)

Other changes in server.py:

  • _is_pydevd_active() helper (checks "pydevd" in sys.modules at startup)
  • Polling loop replaces self._shutdown_event.wait() under pydevd, to keep the heartbeat file alive while waiting for shutdown
  • _perform_graceful_shutdown now branches on isinstance(uw, threading.Thread) to avoid calling .terminate() on thread-based uvicorn workers (Windows forces thread mode), which do not have that method
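
The thread-versus-process branch in the last bullet could look roughly like the sketch below. Only the isinstance(uw, threading.Thread) check comes from the description above; the helper name stop_worker, its timeout parameter, and the returned labels are hypothetical and are not taken from the PR.

```python
import threading


def stop_worker(uw, timeout=1.0):
    """Stop one uvicorn worker without assuming it is a process.

    Hypothetical helper: only the isinstance check mirrors the PR text.
    Returns a short label describing what was done.
    """
    if isinstance(uw, threading.Thread):
        # Threads have no .terminate(); all we can do is wait for them.
        uw.join(timeout=timeout)
        return "joined"
    # Process-like workers (e.g. multiprocessing.Process) can be terminated.
    uw.terminate()
    uw.join(timeout=timeout)
    return "terminated"
```

A thread that is still running after the timeout is simply left alone here; how the real shutdown path deals with that case is not shown in this description.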

The previous _spawn_external_killer (called from inside the KI handler via subprocess.Popen) was making Scenario B worse: pydevd's patched Popen added more trace points for pydevd to step through, increasing the probability of the handler suspending mid-cleanup. It has been removed.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Kinda, this was frustrating to hunt down and fix.

racinmat commented Jun 6, 2026

Author

Here is more information.

1. Problem statement

Configuration:

  • PyCharm Professional, Windows 10/11
  • Python LitServe server launched via Run/Debug configuration
  • Debugger flag: --multiprocess (pydevd attaches to child processes)
  • Server uses multiprocessing.Manager() (creates a SyncManager child) and
    uvicorn workers

Observed behavior:

  • User clicks the red Stop button in PyCharm.
  • PyCharm reports the run configuration as "running" indefinitely.
  • The main Python process exits within a few seconds, but at least one child
    (the SyncManager) lingers, holding the PyCharm session open.
  • Total hang: 5+ minutes (until PyCharm gives up).

Expected behavior:

  • Clean shutdown within ~10 seconds.

2. Root cause analysis

The hang is not caused by a single bug. It is the composition of six
independent layers: five Windows / pydevd behaviors plus one bug in the
first sentinel implementation. Each layer must be defeated.

Layer 1 -- pydevd single-steps every line on Stop

When PyCharm initiates a stop in --multiprocess mode, pydevd flips the main
thread into single-step mode via sys.settrace to suspend the interpreter
between every Python bytecode line.

Consequence: any Python code that runs after Stop has been pressed can be
suspended between two consecutive statements. Cleanup handlers (signal
handlers, try/except KeyboardInterrupt blocks, finally blocks) cannot be
relied upon to complete in bounded time.

Additionally, pydevd injects a CheckAliveThread (with daemon=False) into
every spawned subprocess. Non-daemon threads block interpreter exit, so any
process pydevd has attached to will refuse to die naturally even after its
main thread returns.
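
To see why a single non-daemon thread keeps a process alive, the sketch below lists the threads the interpreter would wait for at exit. It uses only the standard threading module; the helper name threads_blocking_exit and the "blocker" thread are illustrative and stand in for pydevd's CheckAliveThread, whose internals are not reproduced here.

```python
import threading


def threads_blocking_exit():
    """Return live non-daemon threads other than the main thread.

    The interpreter waits for every thread in this list before it exits.
    """
    main = threading.main_thread()
    return [
        t for t in threading.enumerate()
        if t is not main and not t.daemon
    ]


if __name__ == "__main__":
    release = threading.Event()
    # daemon=False mirrors how the text above describes CheckAliveThread.
    blocker = threading.Thread(target=release.wait, name="blocker", daemon=False)
    blocker.start()
    print([t.name for t in threads_blocking_exit()])
    release.set()  # without this, the process would never exit
    blocker.join()
```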

Layer 2 -- Windows Job Objects trap subprocess children

PyCharm places the debugged process in a Windows Job Object with
JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE. Any process spawned via the normal
subprocess.Popen path inherits the same job.

Consequence: when PyCharm tears the debug session down, the OS closes the job
handle and the kernel kills every process in the job -- including any
"external killer" we might try to spawn from inside Python. An escape
mechanism is required.

The standard escape (CREATE_BREAKAWAY_FROM_JOB flag) only works if the job
allows breakaway. PyCharm's job does not always allow it. The reliable escape
on Windows is to ask another, unrelated process (e.g. WMI's
WmiPrvSE.exe) to do the spawning on our behalf via Win32_Process.Create.

Layer 3 -- pydevd patches subprocess.Popen

In --multiprocess mode pydevd monkey-patches subprocess.Popen so that any
new Python child process is automatically debugged. The patch inspects the
argv, injects pydevd's bootstrapper, and rewrites the command line.

Consequence: anything spawned with subprocess.Popen goes through pydevd.
pydevd can intercept, modify, or silently fail the launch. A "sentinel"
spawned via Popen is not really external -- it is co-managed with pydevd.

To bypass this we have to spawn at a lower level than subprocess. The
lowest practical level is ctypes.windll.kernel32.CreateProcessW, which
pydevd does not patch.

Layer 4 -- The SyncManager orphan

multiprocessing.Manager() starts a separate SyncManager server process. In
--multiprocess debug mode pydevd attaches to it and installs its
CheckAliveThread.

When the main Python process exits (Scenario A, clean shutdown), the
SyncManager does not die with it -- pydevd's non-daemon CheckAliveThread
keeps it alive.

PyCharm's session bookkeeping tracks all debugged processes. It will not
declare the session finished until every PID it ever attached to has
disconnected. The orphaned SyncManager therefore holds the session open
indefinitely.

Layer 5 -- taskkill /T cannot find a dead parent

taskkill /F /T /PID <p> works by snapshotting the active process list and
walking children from <p> downward. If <p> is no longer in the active
process list, taskkill prints

ERROR: The process "<p>" not found.

and exits with rc=128. It does not scan for orphans whose
th32ParentProcessID still points at the (now-dead) PID.

This is exactly the situation we end up in: the main process exits but the
SyncManager orphan is still running with the dead PID as its parent.
taskkill /T is the wrong tool. We have to enumerate the snapshot ourselves.

Layer 6 -- double-delay bug in the sentinel

The first sentinel implementation looked like:

if age < delay:
    last_live = time.time()
if time.time() - last_live > delay:
    fire_kill()

The intent was "kill if the heartbeat has been stale for delay seconds". The
actual behavior is "wait delay seconds for the heartbeat to go stale, then
start a delay-second countdown" -- i.e. 2 * delay. With delay=20s the
effective wait was 40s.
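
The doubling can be reproduced with a small simulation on a fake clock. This is a sketch rather than the sentinel's code: first_kill_time, the step parameter, and the assumption that the heartbeat freezes at t=0 and is polled at fixed intervals are all illustrative.

```python
def first_kill_time(delay, step, buggy, horizon=1000.0):
    """Return the simulated time at which the kill first fires.

    The heartbeat's mtime is frozen at t=0 and the sentinel polls every
    `step` seconds. `buggy=True` replays the last_live logic shown above;
    `buggy=False` uses a direct age check.
    """
    last_live = 0.0
    t = 0.0
    while t <= horizon:
        age = t  # now - mtime, with mtime fixed at 0
        if buggy:
            if age < delay:
                last_live = t  # keeps refreshing until the file is stale
            if t - last_live > delay:
                return t
        elif age > delay:
            return t
        t += step
    return None


if __name__ == "__main__":
    print(first_kill_time(20, 1.0, buggy=True))   # 40.0
    print(first_kill_time(20, 1.0, buggy=False))  # 21.0
```

With a 1s poll the buggy variant refreshes last_live until t=19 and only then starts its own 20s countdown, which is where the second delay comes from.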

3. Approaches tried

Attempt 1 -- In-process killer fired from the KeyboardInterrupt handler

Inside except KeyboardInterrupt, fire three killer mechanisms in parallel:

  1. subprocess.Popen(["powershell.exe", "-Command", "Invoke-WmiMethod ..."])
  2. Write a .bat and launch it via os.startfile(...)
  3. subprocess.Popen([...], creationflags=CREATE_BREAKAWAY_FROM_JOB)

A breadcrumb file was opened before each call to prove the handler reached
that line.

Result: the breadcrumb file was created but never written to. The open()
call returned, then pydevd suspended the thread before the f.write(...) on
the very next line (Layer 1). Even if the writes had completed, all three
mechanisms inherit PyCharm's Job Object (Layer 2), and the Popen-based
variants go through pydevd's patched Popen (Layer 3). Additionally, this
function added many lines of Python code to the KI handler, making pydevd
suspension in Scenario B more likely and more severe.

Verdict: doing anything from inside Python after Stop is pressed is
unreliable. The killer must already be running when Stop is pressed.

Attempt 2 -- Capture _pydevd_active inside the KI handler

except KeyboardInterrupt:
    if _is_pydevd_active():
        ...

Result: _is_pydevd_active() returned False. pydevd tears itself down
before raising KeyboardInterrupt, so the detector saw a clean interpreter.

Fix: capture _pydevd_active = _is_pydevd_active() once during server startup
(before the main loop). This works, but does not solve the actual hang.
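
A minimal sketch of the startup capture, assuming the check is just "pydevd" in sys.modules on win32 as described in this PR. The modules and platform parameters and the PYDEVD_ACTIVE_AT_STARTUP name are additions for illustration so the check can be exercised in isolation.

```python
import sys


def is_pydevd_active(modules=None, platform=None):
    """Return True when pydevd is loaded on Windows.

    `modules` and `platform` default to sys.modules and sys.platform; they
    are parameters only so the check can be tested without a debugger.
    """
    modules = sys.modules if modules is None else modules
    platform = sys.platform if platform is None else platform
    return platform == "win32" and "pydevd" in modules


# Capture once at startup, before the main loop. Re-checking inside the
# KeyboardInterrupt handler is what returned False in Attempt 2.
PYDEVD_ACTIVE_AT_STARTUP = is_pydevd_active()
```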

Attempt 3 -- WMI sentinel spawned via subprocess.Popen at startup

Spawn a Python sentinel at startup (before Stop is pressed) using
subprocess.Popen(["powershell.exe", ..., "Invoke-WmiMethod Win32_Process.Create ..."]).

Result: the sentinel never appeared. pydevd's subprocess.Popen patch
(Layer 3) intercepted the call. We could not confirm whether pydevd modified
the args, ate the launch, or killed the PowerShell child during teardown.

Attempt 4 -- ctypes.CreateProcessW + WMI relay + embedded sentinel script

Replace subprocess.Popen with a raw call to
ctypes.windll.kernel32.CreateProcessW. pydevd does not patch ctypes calls.
The chain:

ctypes.CreateProcessW("powershell.exe -File spawn_sentinel.ps1")
    -> PowerShell: Invoke-WmiMethod -Class Win32_Process -Name Create
        -> python.exe (sentinel script)  (parent: WmiPrvSE.exe)

The final Python process is a child of WmiPrvSE, outside the Job Object,
with no pydevd injected.

The sentinel was written at runtime to a temp file from an embedded Python
string in the parent process.
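
The PowerShell relay script in the middle of that chain might be generated along these lines. This is a sketch under assumptions: build_spawn_script and write_spawn_script are hypothetical names, the exact script used by the PR is not shown in this description, and the quoting strategy (subprocess.list2cmdline plus a single-quoted PowerShell string) is one possible choice.

```python
import subprocess


def build_spawn_script(child_argv):
    """Return PowerShell source asking WMI to start `child_argv`.

    Hypothetical helper. The command line is quoted with
    subprocess.list2cmdline and embedded in a single-quoted PowerShell
    string, where a literal apostrophe is written as two apostrophes.
    """
    command_line = subprocess.list2cmdline(child_argv)
    escaped = command_line.replace("'", "''")
    return (
        "Invoke-WmiMethod -Class Win32_Process -Name Create "
        f"-ArgumentList '{escaped}'\n"
    )


def write_spawn_script(path, child_argv):
    # Explicit encoding so the bytes written do not depend on the locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(build_spawn_script(child_argv))
```

The script itself would still have to be launched through the raw CreateProcessW call, since a Popen launch goes through pydevd's patch.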

Result, part A -- silent failure due to encoding. The sentinel still did not
run. Root cause: the sentinel .py file was written via open(path, "w")
with no explicit encoding, so Python used the locale's ANSI code page
(cp1252 here) rather than UTF-8. The sentinel source contained an em-dash in
a comment, which cp1252 encodes as the single byte \x97. The child Python
process read the file as UTF-8 (the default for source files), hit the
invalid byte, and aborted with a SyntaxError before executing any code,
including the breadcrumb writes.
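
The byte-level mismatch is easy to reproduce. The sketch below shows the cp1252 encoding of an em-dash failing to decode as UTF-8, followed by a writer that pins the encoding; write_source is a hypothetical helper name, not a function from the PR.

```python
EM_DASH = "\u2014"

# cp1252 stores the em-dash as the single byte 0x97 ...
cp1252_bytes = EM_DASH.encode("cp1252")

# ... which is not valid UTF-8, so a UTF-8 reader rejects it.
try:
    cp1252_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False


def write_source(path, source):
    """Write Python source with an explicit, locale-independent encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(source)
```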

Result, part B -- taskkill /T failed because the parent was dead. After
the encoding fix the sentinel started, the heartbeat went stale, and the
sentinel fired taskkill /F /T /PID {pid}. It returned rc=128, "process not
found": by the time the sentinel acted, the main process had already exited.
Layer 5 applies: taskkill /T cannot walk the tree from a dead PID.

Result, part C -- double-delay. Once the kill mechanism was fixed (see
section 4), the kill was firing at 2 * delay due to Layer 6.

Attempt 5 -- _spawn_external_killer removal

Analysis revealed that the in-handler killer was not only ineffective but
actively harmful: it added subprocess.Popen calls (pydevd-patched) to the
hot path of the KI handler, giving pydevd more lines to single-step through
and increasing the probability of Scenario B. Removing it made Scenario A
(fast clean shutdown) more common and reliable.

4. The fix

4.1 Snapshot-based subtree kill

Inside _child.py, replace taskkill /T with:

def _kill_subtree(root_pid):
    snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0)
    # build parent -> [children] map from snapshot
    ...
    # BFS from root_pid, collect all descendants
    for cpid in descendants + [root_pid]:
        h = OpenProcess(PROCESS_TERMINATE, False, cpid)
        if h:
            TerminateProcess(h, 1)
            CloseHandle(h)

th32ParentProcessID is fixed at process-creation time. Windows does not
clear it when the parent exits, so orphaned children are found even after
their parent PID has been freed.
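
The BFS step elided in the snippet above can be sketched in pure Python, independent of the Win32 calls. Here collect_descendants is a hypothetical name, and parent_of is assumed to be a pid -> parent pid mapping built from the th32ProcessID and th32ParentProcessID fields of each snapshot entry.

```python
from collections import deque


def collect_descendants(parent_of, root_pid):
    """Return all descendants of root_pid in breadth-first order.

    root_pid itself does not need to be a key of `parent_of`, which is
    what lets orphans of an already-exited parent be found.
    """
    children = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)

    found, seen = [], {root_pid}
    queue = deque([root_pid])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:  # guards against cycles in the map
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found
```

For example, with parent_of = {200: 100, 201: 100, 300: 200} and a dead root 100 that no longer appears as a key, the result is [200, 201, 300].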

The sentinel triggers _kill_subtree(main_pid) in three cases:

  1. Heartbeat stale (age > kill_delay): main thread is suspended by
    pydevd. Kill the whole tree including the suspended main process.
  2. Main PID dead (OpenProcess fails): main process already exited cleanly.
    Kill any orphaned children.
  3. Heartbeat file missing (os.path.getmtime raises OSError): assume a
    fatal error and tree-kill.

A single mechanism covers both shutdown scenarios.
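
The three triggers can be folded into one decision function. This is a sketch: kill_reason, its return labels, and the order of the checks are assumptions, and the liveness probe (OpenProcess in the PR) is passed in as a plain bool so the logic stays platform-independent.

```python
import os
import time


def kill_reason(heartbeat_path, kill_delay, main_alive, now=None):
    """Return why the sentinel should tree-kill, or None to keep waiting.

    `main_alive` is the result of a liveness probe done by the caller.
    """
    if not main_alive:
        return "main-dead"
    try:
        mtime = os.path.getmtime(heartbeat_path)
    except OSError:
        return "heartbeat-missing"
    now = time.time() if now is None else now
    if now - mtime > kill_delay:
        return "heartbeat-stale"
    return None
```

A polling loop would call this at a fixed interval and run the subtree kill on the first non-None result.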

4.2 Fix the double-delay bug

Replace the last_live counter with a direct age check:

if age > kill_delay:
    _kill_subtree(main_pid)

kill_delay is set to 3.0s. Kill fires ~3s after the heartbeat stops
(~5s total: 2s startup grace + 3s staleness threshold).

4.3 Startup wiring

Package layout:
    src/litserve/_win_shutdown_fix/
        __init__.py    -- _create_process_no_window, start_heartbeat_sentinel
        _child.py      -- sentinel process (real installed file)

Server startup (server.py):
    _pydevd_active = _is_pydevd_active()          # "pydevd" in sys.modules, win32 only
    if _pydevd_active:
        _heartbeat_path = %TEMP%/litserve_hb_{pid}.tmp   (created empty)
        _win_shutdown_fix.start_heartbeat_sentinel(pid, heartbeat_path, kill_delay=3.0)
            -> writes %TEMP%/litserve_spawn_sentinel_{pid}.ps1
            -> _create_process_no_window(           # ctypes.CreateProcessW
                   "powershell.exe -File litserve_spawn_sentinel_{pid}.ps1"
               )
                   -> PowerShell: Invoke-WmiMethod Win32_Process.Create
                       -> python.exe _child.py {pid} {hb_path} 3.0
                              parent = WmiPrvSE, no pydevd, no Job Object

Main loop (replaces _shutdown_event.wait() under pydevd):
    while not self._shutdown_event.is_set():
        os.utime(heartbeat_path)                   # sentinel watches this
        time.sleep(0.5)

On Stop (Scenario B -- pydevd suspends KI handler):
    heartbeat stops updating
    sentinel: age > 3s -> _kill_subtree(main_pid)
        snapshot BFS -> find SyncManager (orphan of main_pid)
        TerminateProcess -> SyncManager + CheckAliveThread die
        PyCharm session ends

On Stop (Scenario A -- clean shutdown):
    KI handler completes, _perform_graceful_shutdown runs
    manager.shutdown() cleans up SyncManager
    process exits in ~1s
    sentinel detects _alive(pid)==False, kills any remaining orphans, exits
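
The main-loop part of this wiring, written out as a standalone sketch. heartbeat_loop and its parameters are illustrative names; in the PR the loop lives inside the server and uses self._shutdown_event.

```python
import os
import tempfile
import threading
import time


def heartbeat_loop(shutdown_event, heartbeat_path, interval=0.5):
    """Touch heartbeat_path until shutdown_event is set."""
    while not shutdown_event.is_set():
        os.utime(heartbeat_path)  # bump mtime to "now" for the sentinel
        time.sleep(interval)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".tmp")
    os.close(fd)
    stop = threading.Event()
    worker = threading.Thread(target=heartbeat_loop, args=(stop, path, 0.05))
    worker.start()
    time.sleep(0.2)
    stop.set()
    worker.join()
    os.remove(path)
```

If pydevd suspends the thread running this loop, the mtime stops advancing, which is the signal the sentinel reacts to.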

5. Why this is the simplest sufficient fix

Alternative                                   | Fails because
taskkill /F /T /PID from a job-escaped child  | Layer 5: parent may already be dead
manager.terminate() in KI handler             | Layer 1: pydevd suspends the handler mid-execution
daemon=True on uvicorn workers                | Hard constraint: workers must be non-daemon
daemon=True on the SyncManager                | mp.Manager() does not expose this knob
sys.settrace tricks to evade pydevd           | Fragile, pydevd-version-specific
subprocess.Popen sentinel                     | Layer 3: pydevd patches Popen
CREATE_BREAKAWAY_FROM_JOB only                | Layer 2: not all jobs permit breakaway

The accepted fix defeats every layer with the minimum set of mechanisms:

  • ctypes CreateProcessW -- defeats Layer 3 (Popen patch).
  • WMI relay under WmiPrvSE -- defeats Layer 2 (Job Object).
  • External sentinel running before Stop -- defeats Layer 1 (line-stepping).
  • Filesystem heartbeat -- defeats Layer 1 (no IPC for pydevd to intercept).
  • CreateToolhelp32Snapshot + BFS -- defeats Layers 4 and 5 (orphans with
    dead parents).
  • Single timeout, no last_live counter -- defeats Layer 6 (double delay).

No normal-path code is touched. The sentinel is only spawned when
_is_pydevd_active() returns true at startup, so production runs pay zero
overhead.

6. Things to keep in mind when modifying this code

  • The sentinel is a real installed file. _child.py lives in the package
    and is located at runtime via os.path.dirname(__file__). Do not move it
    without updating start_heartbeat_sentinel.
  • Do not spawn the sentinel via subprocess.Popen. Use
    _create_process_no_window. Popen goes through pydevd's patch.
  • Do not switch the sentinel kill back to taskkill. It cannot find
    orphans whose parent has exited.
  • Heartbeat path must be unique per PID (litserve_hb_{pid}.tmp). A fixed
    path means a stale sentinel from a previous run will kill the new server.
  • kill_delay is a budget, not a guarantee. It bounds how long after the
    heartbeat goes stale the kill fires. The heartbeat updates every 0.5s, so
    there is ~6x safety margin at 3.0s. Do not lower it below ~1.5s.
  • _spawn_external_killer must stay deleted. Calling subprocess.Popen
    from inside the KI handler gives pydevd more lines to step-trace, making
    Scenario B more likely. Any future "fast-path killer" from inside Python must
    use ctypes only and should be kept as short as possible.
  • Thread-based uvicorn workers (Windows). On Windows, api_server_worker_type
    is forced to "thread". threading.Thread has no .terminate() or .kill().
    _perform_graceful_shutdown already branches on isinstance(uw, threading.Thread);
    keep this branch intact.
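
For the per-PID heartbeat path mentioned in the list above, a sketch of how the name could be derived. heartbeat_path_for is a hypothetical helper; only the litserve_hb_{pid}.tmp pattern and the use of the temp directory come from this description.

```python
import os
import tempfile


def heartbeat_path_for(pid=None, directory=None):
    """Return the per-process heartbeat path, litserve_hb_{pid}.tmp."""
    pid = os.getpid() if pid is None else pid
    directory = tempfile.gettempdir() if directory is None else directory
    return os.path.join(directory, f"litserve_hb_{pid}.tmp")
```

Because the PID is part of the file name, a sentinel left over from an earlier run watches a different file than the one the new server touches.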

@racinmat racinmat force-pushed the racinsky/fix-win-worker branch from 3ea7af3 to e17b980 Compare June 6, 2026 15:17
@racinmat racinmat force-pushed the racinsky/fix-win-worker branch from 46518e6 to fc91606 Compare June 7, 2026 13:34
codecov-commenter commented Jun 19, 2026

Codecov Report

❌ Patch coverage is 88.59649% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 85%. Comparing base (aaed44c) to head (a33c327).

Additional details and impacted files
@@         Coverage Diff          @@
##           main   #695    +/-   ##
====================================
  Coverage    85%    85%            
====================================
  Files        39     41     +2     
  Lines      3282   3387   +105     
====================================
+ Hits       2778   2872    +94     
- Misses      504    515    +11     
Development

Successfully merging this pull request may close these issues.

Threads not properly killed on windows

3 participants