
Bound HTTP transaction lifetime#3484

Merged
kenhuuu merged 2 commits into master from tx-idle-to
Jun 30, 2026

@kenhuuu kenhuuu commented Jun 26, 2026

Bound HTTP transaction lifetime: suspend idle timer while busy + add maxTransactionLifetime

What changed

Two related changes to how Gremlin Server bounds the lifetime of an HTTP transaction (each open transaction owns a dedicated worker thread and a maxConcurrentTransactions slot, so an unbounded transaction is a real resource
leak).

Idle timer suspends while busy. The transactionTimeout idle timer was armed on request arrival, so a single operation running longer than the timeout would trip it mid-execution and roll back a perfectly healthy
transaction. It now arms only when the transaction goes genuinely idle (no operation running or queued) and is suspended while work is in flight — a long operation is bounded by evaluationTimeout, not the idle timer. The
setting is renamed from transactionTimeout to idleTransactionTimeout to reflect what it actually means, and 0 now correctly disables it (the docs always claimed this; the code never honored it).
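
For reviewers who want the suspend-while-busy idea in isolation, here is a minimal sketch. It is not the code in this PR: the class name `IdleAwareExecutor`, its fields, and the `onIdleTimeout` callback are hypothetical, and it leaves out the `accepting` re-check discussed in the review guide. It only assumes the standard `ThreadPoolExecutor` hooks.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the actual UnmanagedTransaction executor.
class IdleAwareExecutor extends ThreadPoolExecutor {
    private final ScheduledExecutorService scheduler;
    private final long idleTimeoutMs;      // 0 disables the idle timer
    private final Runnable onIdleTimeout;  // e.g. roll back and close the transaction
    private ScheduledFuture<?> idleTimer;  // guarded by "this"

    IdleAwareExecutor(ScheduledExecutorService scheduler, long idleTimeoutMs, Runnable onIdleTimeout) {
        super(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
        this.scheduler = scheduler;
        this.idleTimeoutMs = idleTimeoutMs;
        this.onIdleTimeout = onIdleTimeout;
    }

    @Override
    protected void beforeExecute(Thread t, Runnable r) {
        cancelIdleTimer();                 // work is starting: suspend the timer
        super.beforeExecute(t, r);
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        super.afterExecute(r, t);
        if (getQueue().isEmpty()) {        // nothing queued: genuinely idle
            armIdleTimer();
        }
    }

    synchronized void armIdleTimer() {
        if (idleTimeoutMs <= 0 || isShutdown()) {
            return;                        // disabled, or the executor is closing
        }
        cancelIdleTimer();
        idleTimer = scheduler.schedule(onIdleTimeout, idleTimeoutMs, TimeUnit.MILLISECONDS);
    }

    synchronized void cancelIdleTimer() {
        if (idleTimer != null) {
            idleTimer.cancel(false);
            idleTimer = null;
        }
    }

    synchronized boolean idleTimerArmed() {
        return idleTimer != null && !idleTimer.isDone();
    }
}
```

Because there is exactly one worker thread, `afterExecute` for one task always runs before `beforeExecute` for the next, so a task queued in the gap simply cancels the timer again when it starts.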

New maxTransactionLifetime absolute cap. A new setting that bounds total transaction age regardless of activity, closing the gap where a client holds a transaction open indefinitely via one long operation or
keep-alive drips. When it fires it interrupts the running operation and rolls the transaction back, and the in-flight client receives a transaction-timeout (504) rather than a misleading "increase evaluationTimeout" error.
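
The cap's "flag first, then interrupt" ordering can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `LifetimeCapSketch`, its nested `Context`, `fireCap`, `statusFor`, and `runWithCap` are hypothetical names, and the real server maps the cap to a TransactionException rather than a bare status code.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the lifetime cap; all names here are illustrative.
class LifetimeCapSketch {
    // Stand-in for the per-request context that carries the cap flag.
    static final class Context {
        final AtomicBoolean closedByLifetimeCap = new AtomicBoolean(false);
    }

    // Set the flag BEFORE interrupting, so whichever thread writes the
    // response first already sees why the operation was cancelled.
    static void fireCap(Context ctx, Future<?> running) {
        ctx.closedByLifetimeCap.set(true);
        running.cancel(true);
    }

    // Cap kill maps to 504; an ordinary evaluation timeout stays 500.
    static int statusFor(Context ctx) {
        return ctx.closedByLifetimeCap.get() ? 504 : 500;
    }

    // Runs one long operation under a cap of capMs and returns the status
    // the in-flight client would see once the cap fires.
    static int runWithCap(long capMs) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        final Context ctx = new Context();
        try {
            final Future<?> op = worker.submit(() -> {
                Thread.sleep(60_000L);     // simulated long-running operation
                return null;
            });
            scheduler.schedule(() -> fireCap(ctx, op), capMs, TimeUnit.MILLISECONDS);
            try {
                op.get();
            } catch (CancellationException e) {
                // expected: the cap cancelled the operation
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
            return statusFor(ctx);
        } finally {
            worker.shutdownNow();
            scheduler.shutdownNow();
        }
    }
}
```

In this sketch `LifetimeCapSketch.runWithCap(50)` cancels the sleeping operation after roughly 50 ms and reports 504, while a fresh `Context` that was never flagged reports 500.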

Why

The committed idle-timer behavior acted more like a per-request timeout than the documented idle timeout, and there was no ceiling on transaction lifetime at all. Together these changes give three composable bounds:
per-operation (evaluationTimeout), between-operations (idleTransactionTimeout), and whole-transaction (maxTransactionLifetime). They mirror PostgreSQL's statement_timeout / idle_in_transaction_session_timeout /
transaction_timeout.

Notably, the server does not validate these settings or reject begins when bounds are disabled. Instead it ships sane defaults (idle 1 min, lifetime 10 min): a transaction is bounded out of the box, disabling the bounds
is a deliberate operator choice, and a client's per-request timeoutMs is always honored as sent rather than second-guessed.

Review guide

  • UnmanagedTransaction — the executor swap. The single-thread executor is now a ThreadPoolExecutor(1,1) subclass purely to expose beforeExecute/afterExecute + the queue, which drive the suspend-while-busy logic.
    The key invariant: submitted tasks must not be wrapped, because submit() returns the same FutureTask, so the eval-timeout / cap cancel(true) interrupts the real work.
  • Concurrency on the idle/cap timers. maybeScheduleIdleTimer re-checks accepting after arming (so a concurrent close() can't be raced into re-arming a dying transaction); the in-flight op is tracked as a single
    immutable Running(future, context) pair so the cap never flags one operation's Context while interrupting another's future.
  • Ownership split (intentional asymmetry). The idle timer lives in UnmanagedTransaction (it must see the executor hooks); the lifetime cap is scheduled/cancelled by TransactionManager (a fixed schedule tied to
    registry membership). The cap is armed after putIfAbsent so it can never fire into an unregistered transaction and leak a thread; destroy() cancels it on every close path.
  • close() ordering is still load-bearing: manager.destroy() before executor.shutdown(), graceful shutdown() (not shutdownNow()). The cap path reuses this exact path; verify it's unchanged.
  • Error mapping — cap-kill → 504 TransactionException via a Context.closedByLifetimeCap flag set before the interrupt; ordinary eval timeout still → 500. Both the eval-timeout writer and formErrorResponseMessage are
    cap-aware so the code is correct regardless of which thread writes the response first.
  • Tests — deterministic timer behavior is unit-tested via a virtual-clock ManualScheduledExecutorService (no Thread.sleep flakiness); integration tests assert the guarantee actually made (transaction reclaimed /
    subsequent 404), since timing can't reliably catch the mid-op interrupt.
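
To make the "must not be wrapped" invariant from the first bullet concrete, the sketch below checks that the `Runnable` handed to `beforeExecute` is the very `Future` that `submit()` returned. It relies only on standard `java.util.concurrent` behavior; `SameFutureTaskSketch` and `HookedExecutor` are hypothetical names, not classes from this PR.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: shows that the executor hook sees the same object
// that submit() hands back to the caller.
class SameFutureTaskSketch {
    static final class HookedExecutor extends ThreadPoolExecutor {
        final AtomicReference<Runnable> seenByHook = new AtomicReference<Runnable>();

        HookedExecutor() {
            super(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
        }

        @Override
        protected void beforeExecute(Thread t, Runnable r) {
            seenByHook.set(r);             // remember what the worker is about to run
            super.beforeExecute(t, r);
        }
    }

    // Returns true when the hook saw the exact Future returned by submit().
    static boolean hookSeesReturnedFuture() {
        HookedExecutor ex = new HookedExecutor();
        try {
            Future<String> f = ex.submit(() -> "done");
            f.get();                       // the task has run, so the hook has fired
            return ex.seenByHook.get() == f;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            ex.shutdown();                 // graceful, matching the close() ordering above
        }
    }
}
```

If the task were wrapped in another Runnable before being handed to the executor, the hook and any later `cancel(true)` would be dealing with two different objects, which is the failure mode the invariant guards against.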

VOTE +1

xiazcy commented Jun 27, 2026

VOTE +1

@Cole-Greer

VOTE +1

kenhuuu added 2 commits June 30, 2026 12:21
The idle timeout was armed on request arrival rather than when the
transaction went idle, so a single operation running longer than the timeout
tripped it mid-execution, contradicting the documented promise that active
transactions are unaffected. A long operation should be bounded by
evaluationTimeout; the idle timer should only reclaim abandoned transactions.

The per-transaction executor is now a ThreadPoolExecutor(1,1) whose
before/afterExecute hooks suspend the idle timer while work runs and re-arm it
only once the worker parks with an empty queue. This gives a reliable
running-vs-idle signal without wrapping submitted tasks, which would break the
evaluation-timeout interrupt that relies on cancelling the real FutureTask.

transactionTimeout is renamed to idleTransactionTimeout to reflect its actual
meaning (renamed outright as the feature is unreleased), and now honors 0 as
"disabled" to match its documentation.

Assisted-by: Claude Code:claude-opus-4-8
The idle timeout only reclaims transactions that go quiet; a client could
still hold a transaction (and its dedicated worker thread and concurrency
slot) open indefinitely with a single long operation or a keep-alive drip.
maxTransactionLifetime bounds total transaction age regardless of activity:
when it fires it interrupts the running operation and rolls the transaction
back, so the in-flight client gets a transaction-timeout (504) rather than a
misleading evaluation-timeout error.

Rather than validate timeout configuration and fail begins (or silently
override a client's timeoutMs) when bounds are disabled, the server ships
sane defaults instead: idle reclamation at 1 minute and a lifetime cap at 10
minutes. A transaction is bounded out of the box, disabling the bounds is a
deliberate operator choice, and a per-request timeoutMs is always honored as
sent rather than second-guessed on the client's behalf.

Assisted-by: Claude Code:claude-opus-4-8
@kenhuuu kenhuuu merged commit 0a149b8 into master Jun 30, 2026
47 of 48 checks passed
@kenhuuu kenhuuu deleted the tx-idle-to branch June 30, 2026 20:23