UN-3648: Mark API deployment execution ERROR on synchronous staging failure #2120

Open
athul-rs wants to merge 3 commits into main from fix/un-3647-stuck-pending

@athul-rs

Contributor

What

  • Fixes API-deployment executions getting stuck in PENDING when staging fails synchronously (before async dispatch). The execution row is now marked ERROR with the failure reason.

Why

  • When add_input_file_to_api_storage ("Staging files in API storage") fails synchronously, the API returns HTTP 500 but the WorkflowExecution row created earlier was left PENDING and never marked ERROR. The UI showed the run as stuck/running forever and the real error wasn't visible in the execution logs.
  • Surfaced via the Moody's bou-unstract-ci S3/MinIO 403 incident (repro File Execution cd43e697-7e43-41c5-8979-dc1c291431cc). The storage/chart root cause is tracked separately in UN-3646.

How

  • api_v2/deployment_helper.py -> execute_workflow():
    • Moved the add_input_file_to_api_storage staging call inside the try/except so synchronous failures are handled instead of escaping the method.
    • Added an explicit WorkflowExecutionServiceHelper.update_execution_err(execution_id, error) call in the except. The existing handler only released the rate-limit slot, cleaned up storage, and built an error response — it never marked the DB row. The async path only gets marked because execute_workflow_async calls update_execution_err internally, which a staging failure never reaches.
  • Note: at staging time no FileExecution rows exist yet (they're created later in the async run), so marking the parent execution is the complete fix.
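
The failure mode and the fix described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in deployment_helper.py: `FakeExecutionStore`, `failing_staging`, `execute_before`, and `execute_after` are hypothetical stand-ins, and the real method also handles the rate-limit slot and storage cleanup.

```python
class FakeExecutionStore:
    """Hypothetical in-memory stand-in for the WorkflowExecution table."""

    def __init__(self):
        self.status = {}
        self.error = {}

    def create(self, execution_id):
        self.status[execution_id] = "PENDING"

    def update_execution_err(self, execution_id, message):
        self.status[execution_id] = "ERROR"
        self.error[execution_id] = message


def failing_staging():
    """Simulates a synchronous staging failure such as a storage 403."""
    raise PermissionError("403 Forbidden")


def execute_before(store, execution_id, stage, dispatch):
    """Pre-fix shape: staging runs outside the try, so its exception escapes."""
    store.create(execution_id)
    stage()  # raises here; nothing marks the row, so it stays PENDING
    try:
        return dispatch()
    except Exception as error:
        return {"status": "ERROR", "error": str(error)}


def execute_after(store, execution_id, stage, dispatch):
    """Post-fix shape: staging is inside the try and the row is marked ERROR."""
    store.create(execution_id)
    try:
        stage()
        return dispatch()
    except Exception as error:
        store.update_execution_err(execution_id, str(error))
        return {"status": "ERROR", "error": str(error)}


store = FakeExecutionStore()

# Pre-fix: the exception propagates and the row is left PENDING.
try:
    execute_before(store, "run-1", failing_staging, dispatch=lambda: {})
except PermissionError:
    pass
before_status = store.status["run-1"]

# Post-fix: the exception is handled and the row is marked ERROR.
after_response = execute_after(store, "run-2", failing_staging, dispatch=lambda: {})
after_status = store.status["run-2"]
```

In this sketch the first run ends with the row still PENDING, while the second ends with the row marked ERROR and the failure reason stored alongside it.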

Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)

  • Low risk. The success path is unchanged (staging runs first inside the try, exactly as before). On failure, behavior is strictly improved: the row is marked ERROR in addition to the existing slot-release + storage cleanup. update_execution_err is idempotent and is the same helper the async-dispatch path already uses, so double-marking on the async path (if ever hit) is harmless.

Database Migrations

  • None.

Env Config

  • None.

Relevant Docs

Related Issues or PRs

  • Fixes UN-3648 / UN-3647 (API deployment stuck PENDING on synchronous staging failure)
  • Related: UN-3646 (S3/MinIO storage/chart root cause, tracked separately)

Dependencies Versions

  • None.

Notes on Testing

  • Added backend/api_v2/tests/test_deployment_helper.py — a regression test in the repo's sys.modules-stub style (usage_v2/tests/test_helper.py), so it runs with bare python3 (no Django/DB). It asserts a staging exception marks the execution ERROR, releases the rate-limit slot, cleans up storage, and never reaches async dispatch.
  • Verified the test fails on the pre-fix code (staging exception propagates) and passes on the fix. ruff check / ruff format clean; all pre-commit hooks pass.
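
For readers unfamiliar with the sys.modules-stub style mentioned above, the general idea can be sketched as follows. This is an assumption-laden illustration rather than the test in this PR: `install_stub` and the module path `hypothetical_backend.execution` are invented for the example, and the real test stubs whatever the helper actually imports.

```python
import sys
import types
from unittest import mock


def install_stub(dotted_name, **attrs):
    """Register a fake module (and any missing parent packages) in sys.modules."""
    parts = dotted_name.split(".")
    for depth in range(1, len(parts) + 1):
        name = ".".join(parts[:depth])
        sys.modules.setdefault(name, types.ModuleType(name))
    module = sys.modules[dotted_name]
    for key, value in attrs.items():
        setattr(module, key, value)
    return module


# Hypothetical module path, used only for this illustration.
fake_helper = mock.MagicMock()
install_stub(
    "hypothetical_backend.execution",
    WorkflowExecutionServiceHelper=fake_helper,
)

# The import resolves to the stub, so no Django settings or database are needed.
from hypothetical_backend.execution import WorkflowExecutionServiceHelper

WorkflowExecutionServiceHelper.update_execution_err("exec-1", "staging failed")
```

Because the import system consults `sys.modules` before searching the filesystem, code under test that imports the stubbed name receives the mock, and the test can then assert on the calls it recorded.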

Screenshots

Checklist

I have read and understood the Contribution Guidelines.

🤖 Generated with Claude Code

…ailure

When an API-deployment run failed synchronously at the "Staging files in
API storage" step (add_input_file_to_api_storage, before async dispatch),
the PENDING WorkflowExecution row was never marked ERROR, so the UI showed
the run as stuck/running forever and the real error wasn't surfaced.

Root cause: in api_v2/deployment_helper.py -> execute_workflow(), the
staging call sat outside the try/except. Only execute_workflow_async
failures were handled, and the DB row is marked ERROR by
execute_workflow_async internally -- a path a staging failure never reaches.

Fix:
- Move staging inside the try so synchronous failures are handled.
- In the except, explicitly call update_execution_err() to mark the row
  ERROR with the surfaced reason (the existing handler only did cleanup +
  built an error response, it never marked the DB row).

Add a regression test (sys.modules-stub style, no Django/DB) asserting a
staging exception marks the execution ERROR, releases the rate-limit slot,
and never reaches async dispatch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c76ded99-1a34-44d6-81bf-3667bebd2f5a

📥 Commits

Reviewing files that changed from the base of the PR and between fbb6d5a and 4918164.

📒 Files selected for processing (2)
  • backend/api_v2/deployment_helper.py
  • backend/api_v2/tests/test_deployment_helper.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/api_v2/deployment_helper.py

Summary by CodeRabbit

  • Bug Fixes
    • Workflow executions now reliably switch to Error when staging fails before async dispatch, with the execution explicitly marked using the failure reason.
    • Rate-limiter release and destination storage cleanup are consistently performed for both staging and later failures, without overwriting an execution that’s already running.
  • Tests
    • Added regression tests covering synchronous staging failures (including verifying async dispatch is not triggered) and cleanup behavior even if marking the execution as Error fails.

Walkthrough

DeploymentHelper.execute_workflow now marks staging failures as ERROR before cleanup, and the regression tests cover both the staging failure path and the case where error marking fails.

Changes

Workflow execution staging failure handling

| Layer / File(s) | Summary |
| --- | --- |
| Pre-dispatch staging error handling (backend/api_v2/deployment_helper.py) | execute_workflow stages input before async dispatch, persists ERROR on staging failure, then releases the rate limiter, removes API storage, and returns an error response. |
| Test import stubs and helper loading (backend/api_v2/tests/test_deployment_helper.py) | The regression module installs lazy import stubs, loads the real helper under those stubs, and prepares a mocked API helper with a staged failure. |
| Staging failure regression cases (backend/api_v2/tests/test_deployment_helper.py) | The tests assert the staging-failure update, cleanup, skipped async dispatch, and the nested failure case where marking ERROR itself raises. |

Sequence Diagram(s)

sequenceDiagram
  participant DeploymentHelper
  participant SourceConnector
  participant WorkflowExecutionServiceHelper
  participant APIDeploymentRateLimiter
  participant DestinationConnector

  DeploymentHelper->>SourceConnector: add_input_file_to_api_storage()
  SourceConnector--xDeploymentHelper: staging error
  DeploymentHelper->>WorkflowExecutionServiceHelper: update_execution_err(execution_id, error)
  DeploymentHelper->>APIDeploymentRateLimiter: release_slot()
  DeploymentHelper->>DestinationConnector: delete_api_storage_dir()

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly matches the main change: marking API deployment executions ERROR on synchronous staging failure. |
| Description check | ✅ Passed | The description follows the template and covers the required sections with clear What/Why/How, risks, testing, and related issues. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Contributor

Greptile Summary

This PR fixes API-deployment executions getting stuck in PENDING when add_input_file_to_api_storage fails synchronously before async dispatch. Previously the staging call sat outside the try/except, so exceptions propagated unhandled and the WorkflowExecution row was never transitioned out of PENDING.

  • Core fix (deployment_helper.py): Moves add_input_file_to_api_storage inside a dedicated try/except. On failure it calls update_execution_err (itself guarded against transient DB errors), releases the rate-limit slot, cleans up storage, and returns a structured ERROR response — exactly mirroring what the async dispatch path already did internally.
  • Tests (test_deployment_helper.py): Two regression tests using a sys.modules-stub approach verify (1) a staging exception marks the execution ERROR and never reaches async dispatch, and (2) cleanup still runs even when the update_execution_err DB write itself raises.

Confidence Score: 5/5

Safe to merge — the success path is unchanged and the new staging failure branch is a strict improvement over the previous unhandled exception.

The change is narrow and well-scoped: it adds a dedicated try/except around the staging call, guards the DB error-marking call so transient DB errors cannot block cleanup, and returns a structured error response that mirrors the existing dispatch-failure pattern. The two regression tests verify both the normal staging-failure path and the DB-error-during-marking edge case. No migrations, no schema changes, no new external dependencies.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/api_v2/deployment_helper.py | Staging call moved inside its own try/except; update_execution_err is guarded; cleanup runs on staging failure. Success path is unchanged. |
| backend/api_v2/tests/test_deployment_helper.py | New regression tests using sys.modules stubbing; covers staging failure marking ERROR and cleanup surviving a DB error during error-marking. |
| backend/api_v2/tests/__init__.py | Empty __init__.py file to make tests/ a proper Python package. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[execute_workflow called] --> B[create_workflow_execution\nrow = PENDING]
    B --> C[cache API hub headers]
    C --> D{add_input_file_to_api_storage\nstaging}
    D -- success --> E[execute_workflow_async\ndispatch to Celery]
    D -- raises --> F[update_execution_err\nrow = ERROR\nguarded try/except]
    F --> G[release_slot]
    G --> H[delete_api_storage_dir]
    H --> I[return ERROR response]
    E -- success --> J[enrich result / return]
    E -- raises --> K[release_slot\ndelete_api_storage_dir\nreturn ERROR response]

Reviews (3): Last reviewed commit: "Drop ticket/incident references from in-..." | Re-trigger Greptile

Comment thread backend/api_v2/deployment_helper.py Outdated

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/api_v2/deployment_helper.py (1)

291-311: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

update_execution_err in this broad except can mark an already-dispatched execution ERROR and corrupt its state.

The try block includes both the synchronous staging call and execute_workflow_async plus all downstream processing (lines 264–290, e.g., _enrich_result_with_workflow_metadata, Configuration lookups, usage enrichment). execute_workflow_async dispatches the Celery job immediately and returns; if any step after dispatch raises (e.g., metadata enrichment or config retrieval), this except block will:

  1. Call update_execution_err, overwriting the execution status to ERROR (conflicting with the running worker).
  2. Release the rate-limit slot, which may be invalid since the job successfully dispatched.
  3. Delete the storage directory, removing data the in-flight worker needs.

This cleanup assumes the job was never dispatched. Scope the error handling to pre-dispatch failures only—e.g., nest the staging call in its own try or track a dispatched flag and skip update_execution_err, release_slot, and delete_api_storage_dir once execute_workflow_async succeeds.
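
The `dispatched` flag option suggested above could look roughly like the sketch below. Every name here (`run_with_dispatch_flag`, `boom`, and the callables passed in) is hypothetical, and the error response returned for post-dispatch failures is an assumption, since the comment does not specify it.

```python
def run_with_dispatch_flag(stage, dispatch, enrich, mark_error, cleanup):
    """Hypothetical handler: only pre-dispatch failures mark ERROR and clean up."""
    dispatched = False
    try:
        stage()
        result = dispatch()
        dispatched = True  # from here on, a worker may own the execution
        return enrich(result)
    except Exception as error:
        if not dispatched:
            mark_error(str(error))
            cleanup()
        return {"status": "ERROR", "error": str(error)}


def boom(message):
    """Build a callable that always raises, accepting any positional arguments."""
    def _raise(*_args):
        raise RuntimeError(message)
    return _raise


calls = []
record_error = lambda message: calls.append(("mark_error", message))
record_cleanup = lambda: calls.append(("cleanup",))

# Staging fails before dispatch: the row is marked and cleanup runs.
pre = run_with_dispatch_flag(
    boom("staging failed"), lambda: {"status": "PENDING"}, lambda r: r,
    record_error, record_cleanup,
)
pre_calls = list(calls)
calls.clear()

# Enrichment fails after dispatch: the running execution is left untouched.
post = run_with_dispatch_flag(
    lambda: None, lambda: {"status": "PENDING"}, boom("enrichment failed"),
    record_error, record_cleanup,
)
post_calls = list(calls)
```

The alternative named in the same comment, giving the staging call its own try/except, achieves the same separation without a flag.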

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/api_v2/deployment_helper.py` around lines 291 - 311, The broad
exception handler in `deployment_helper.py` is treating post-dispatch failures
the same as pre-dispatch failures, which can overwrite an already running
execution. Narrow the `try/except` around the synchronous staging path or use a
`dispatched` flag around `execute_workflow_async` so
`WorkflowExecutionServiceHelper.update_execution_err`,
`APIDeploymentRateLimiter.release_slot`, and
`DestinationConnector.delete_api_storage_dir` only run when dispatch has not
succeeded. Keep the cleanup logic in the failure path before dispatch, and skip
it for errors raised during later enrichment or config lookup steps.
🧹 Nitpick comments (1)
backend/api_v2/deployment_helper.py (1)

295-305: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Cleanup is skipped if update_execution_err raises.

update_execution_err performs a DB fetch/save and can raise on transient DB errors (the snippet only guards WorkflowExecution.DoesNotExist). Since it now runs first, a failure here would skip release_slot and delete_api_storage_dir and propagate, leaking the rate-limit slot and staged files. Consider isolating the error-marking so cleanup still runs.

♻️ Isolate error-marking from cleanup
-            WorkflowExecutionServiceHelper.update_execution_err(
-                str(execution_id), str(error)
-            )
-
-            # Release rate limit slot (workflow setup/dispatch failed, async job not started)
-            APIDeploymentRateLimiter.release_slot(api.organization, str(execution_id))
-
-            # Clean up storage
-            DestinationConnector.delete_api_storage_dir(
-                workflow_id=workflow_id, execution_id=execution_id
-            )
+            try:
+                WorkflowExecutionServiceHelper.update_execution_err(
+                    str(execution_id), str(error)
+                )
+            except Exception:
+                logger.exception(
+                    f"Failed to mark execution {execution_id} as ERROR"
+                )
+
+            # Release rate limit slot (workflow setup/dispatch failed, async job not started)
+            APIDeploymentRateLimiter.release_slot(api.organization, str(execution_id))
+
+            # Clean up storage
+            DestinationConnector.delete_api_storage_dir(
+                workflow_id=workflow_id, execution_id=execution_id
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/api_v2/deployment_helper.py` around lines 295 - 305, The cleanup path
in the deployment failure handler is currently blocked by
WorkflowExecutionServiceHelper.update_execution_err, so a transient DB error can
prevent APIDeploymentRateLimiter.release_slot and
DestinationConnector.delete_api_storage_dir from running. In the deployment
helper’s failure handling block, isolate the error-marking call from the cleanup
steps by catching/logging failures from update_execution_err separately, then
always execute the rate-limit release and storage deletion afterward.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d01d40f6-44ce-40db-99fa-55cff80d3f5c

📥 Commits

Reviewing files that changed from the base of the PR and between 24d0872 and 28aa394.

📒 Files selected for processing (3)
  • backend/api_v2/deployment_helper.py
  • backend/api_v2/tests/__init__.py
  • backend/api_v2/tests/test_deployment_helper.py

Address Greptile P1 + CodeRabbit Major on PR #2120:

- Give the synchronous staging call its own try/except instead of sharing the
  try with execute_workflow_async + post-dispatch processing. Error-marking
  (update_execution_err) now applies only to genuine pre-dispatch failures and
  can no longer overwrite an already-dispatched/completed execution's status.
- Isolate the update_execution_err call in its own try/except so a transient DB
  error while marking ERROR no longer skips slot release and storage cleanup
  (which were unconditional before this PR).
- Restore the dispatch/post-processing except to its original behaviour (slot
  release + storage cleanup only); dispatch failures are already marked ERROR
  internally by execute_workflow_async.

Add a regression test asserting cleanup still runs when update_execution_err
itself raises.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@athul-rs

Contributor Author

@coderabbitai Both points addressed in fbb6d5a:

  • Major (broad try could mark an already-dispatched execution ERROR / delete its storage): staging now has its own try/except, so update_execution_err + cleanup only run for pre-dispatch failures. The dispatch/post-processing except no longer calls update_execution_err — dispatch failures are already marked internally by execute_workflow_async, and post-dispatch enrichment/config failures must not overwrite a running execution.
  • Nitpick (cleanup skipped if update_execution_err raises): the error-marking is now isolated in its own try/except + logger.exception, so release_slot and delete_api_storage_dir always run.

Added a regression test for the isolated-error-marking case.
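
The isolated-error-marking behaviour and the kind of assertion such a regression test makes can be sketched with `unittest.mock`. This is a simplified illustration, not the test added in the PR: `handle_staging_failure` is a hypothetical function, and the argument lists of `release_slot` and `delete_api_storage_dir` are reduced compared with the real calls.

```python
import logging
from unittest import mock

logger = logging.getLogger(__name__)


def handle_staging_failure(error, execution_id, marker, limiter, storage):
    """Hypothetical handler: mark ERROR on a best-effort basis, then always clean up."""
    try:
        marker.update_execution_err(execution_id, str(error))
    except Exception:
        # A failed DB write is logged but must not block the cleanup below.
        logger.exception("Failed to mark execution %s as ERROR", execution_id)
    limiter.release_slot(execution_id)
    storage.delete_api_storage_dir(execution_id=execution_id)
    return {"status": "ERROR", "error": str(error)}


marker, limiter, storage = mock.Mock(), mock.Mock(), mock.Mock()
# Simulate a transient database error while marking the row.
marker.update_execution_err.side_effect = RuntimeError("db unavailable")

response = handle_staging_failure(
    PermissionError("403 Forbidden"), "exec-1", marker, limiter, storage
)
```

Even though the marking call raises, the mocks record that the slot release and storage deletion still ran, and the caller receives the original staging error rather than the database error.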

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Contributor

@athul-rs Thanks for the update — that addresses both concerns cleanly. Isolating the pre-dispatch failure handling and making cleanup unconditional is the right shape here, and the regression test helps lock it in.


Keep code comments focused on explaining the code; ticket and incident context
lives in the commit/PR history instead.
@github-actions

Contributor

Unstract test results

Per-group results

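| Status | Group | Tier | Passed | Failed | Errors | Skipped | Duration (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | unit-connectors | unit | 64 | 12 | 0 | 3 | 16.7 |
| | unit-core | unit | 0 | 0 | 4 | 0 | 1.2 |
| | unit-platform-service | unit | 9 | 0 | 1 | 0 | 1.3 |
| | unit-prompt-service | unit | 15 | 0 | 0 | 0 | 19.8 |
| | unit-rig | unit | 53 | 0 | 0 | 0 | 3.3 |
| | unit-runner | unit | 11 | 0 | 0 | 0 | 3.1 |
| | unit-sdk1 | unit | 390 | 0 | 0 | 0 | 21.0 |
| | unit-tool-registry | unit | 0 | 0 | 1 | 0 | 1.2 |
| | unit-workers | unit | 0 | 0 | 0 | 0 | 17.3 |
| | TOTAL | | 542 | 12 | 6 | 3 | 84.9 |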

Critical paths

⚠️ Critical paths not yet covered

  • auth-login — User can log in and obtain a session cookie. (entry: POST /api/v1/auth/login; declared coverage: no groups declared)
  • adapter-register-llm — Register and validate an LLM adapter. (entry: POST /api/v1/adapter/; declared coverage: no groups declared)
  • workflow-create-execute — Create a workflow, configure source+destination, execute, poll, fetch result. (entry: POST /api/v1/workflow/{id}/execute/; declared coverage: e2e-workflow)
  • api-deployment-run — Deploy a workflow as an API, POST a document, receive structured JSON. (entry: POST /deployment/api/{org}/{name}/; declared coverage: e2e-api-deployment)
  • prompt-studio-fetch-response — Prompt Studio: create project, add prompt, run single-pass, get response. (entry: POST /api/v1/prompt-studio/prompt-studio-tool/{id}/fetch_response/; declared coverage: e2e-prompt-studio)
  • pipeline-etl-execute — Run an ETL pipeline from source connector to destination. (entry: POST /api/v1/pipeline/{id}/execute/; declared coverage: no groups declared)
  • usage-token-tracking — Per-execution token usage is recorded and retrievable. (entry: GET /api/v1/usage/get_token_usage/; declared coverage: no groups declared)
  • workflow-execution-fan-out — Multi-file workflow execution fans out to file-processing workers and rejoins. (entry: internal: backend → rabbitmq → workers/file_processing; declared coverage: no groups declared)
  • callback-result-delivery — Async results are posted back via the callback worker. (entry: internal: workers/callback → backend /internal endpoints; declared coverage: no groups declared)
✅ Covered critical paths
  • tool-sandbox-exec — covered by unit-runner
