UN-3648: Mark API deployment execution ERROR on synchronous staging failure#2120
UN-3648: Mark API deployment execution ERROR on synchronous staging failure#2120athul-rs wants to merge 3 commits into
Conversation
…ailure When an API-deployment run failed synchronously at the "Staging files in API storage" step (add_input_file_to_api_storage, before async dispatch), the PENDING WorkflowExecution row was never marked ERROR, so the UI showed the run as stuck/running forever and the real error wasn't surfaced. Root cause: in api_v2/deployment_helper.py -> execute_workflow(), the staging call sat outside the try/except. Only execute_workflow_async failures were handled, and the DB row is marked ERROR by execute_workflow_async internally -- a path a staging failure never reaches. Fix: - Move staging inside the try so synchronous failures are handled. - In the except, explicitly call update_execution_err() to mark the row ERROR with the surfaced reason (the existing handler only did cleanup + built an error response, it never marked the DB row). Add a regression test (sys.modules-stub style, no Django/DB) asserting a staging exception marks the execution ERROR, releases the rate-limit slot, and never reaches async dispatch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
Summary by CodeRabbit
Walkthrough
ChangesWorkflow execution staging failure handling
Sequence Diagram(s)sequenceDiagram
participant DeploymentHelper
participant SourceConnector
participant WorkflowExecutionServiceHelper
participant APIDeploymentRateLimiter
participant DestinationConnector
DeploymentHelper->>SourceConnector: add_input_file_to_api_storage()
SourceConnector--xDeploymentHelper: staging error
DeploymentHelper->>WorkflowExecutionServiceHelper: update_execution_err(execution_id, error)
DeploymentHelper->>APIDeploymentRateLimiter: release_slot()
DeploymentHelper->>DestinationConnector: delete_api_storage_dir()
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
| Filename | Overview |
|---|---|
| backend/api_v2/deployment_helper.py | Staging call moved inside its own try/except; update_execution_err is guarded; cleanup runs on staging failure. Success path is unchanged. |
| backend/api_v2/tests/test_deployment_helper.py | New regression tests using sys.modules stubbing; covers staging failure marking ERROR and cleanup surviving a DB error during error-marking. |
| backend/api_v2/tests/init.py | Empty init file to make tests/ a proper Python package. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[execute_workflow called] --> B[create_workflow_execution\nrow = PENDING]
B --> C[cache API hub headers]
C --> D{add_input_file_to_api_storage\nstaging}
D -- success --> E[execute_workflow_async\ndispatch to Celery]
D -- raises --> F[update_execution_err\nrow = ERROR\nguarded try/except]
F --> G[release_slot]
G --> H[delete_api_storage_dir]
H --> I[return ERROR response]
E -- success --> J[enrich result / return]
E -- raises --> K[release_slot\ndelete_api_storage_dir\nreturn ERROR response]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[execute_workflow called] --> B[create_workflow_execution\nrow = PENDING]
B --> C[cache API hub headers]
C --> D{add_input_file_to_api_storage\nstaging}
D -- success --> E[execute_workflow_async\ndispatch to Celery]
D -- raises --> F[update_execution_err\nrow = ERROR\nguarded try/except]
F --> G[release_slot]
G --> H[delete_api_storage_dir]
H --> I[return ERROR response]
E -- success --> J[enrich result / return]
E -- raises --> K[release_slot\ndelete_api_storage_dir\nreturn ERROR response]
Reviews (3): Last reviewed commit: "Drop ticket/incident references from in-..." | Re-trigger Greptile
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/api_v2/deployment_helper.py (1)
291-311: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift
update_execution_errin this broadexceptcan mark an already-dispatched execution ERROR and corrupt its state.The
tryblock includes both the synchronous staging call andexecute_workflow_asyncplus all downstream processing (lines 264–290, e.g.,_enrich_result_with_workflow_metadata,Configurationlookups, usage enrichment).execute_workflow_asyncdispatches the Celery job immediately and returns; if any step after dispatch raises (e.g., metadata enrichment or config retrieval), thisexceptblock will:
- Call
update_execution_err, overwriting the execution status toERROR(conflicting with the running worker).- Release the rate-limit slot, which may be invalid since the job successfully dispatched.
- Delete the storage directory, removing data the in-flight worker needs.
This cleanup assumes the job was never dispatched. Scope the error handling to pre-dispatch failures only—e.g., nest the staging call in its own
tryor track adispatchedflag and skipupdate_execution_err,release_slot, anddelete_api_storage_dironceexecute_workflow_asyncsucceeds.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/api_v2/deployment_helper.py` around lines 291 - 311, The broad exception handler in `deployment_helper.py` is treating post-dispatch failures the same as pre-dispatch failures, which can overwrite an already running execution. Narrow the `try/except` around the synchronous staging path or use a `dispatched` flag around `execute_workflow_async` so `WorkflowExecutionServiceHelper.update_execution_err`, `APIDeploymentRateLimiter.release_slot`, and `DestinationConnector.delete_api_storage_dir` only run when dispatch has not succeeded. Keep the cleanup logic in the failure path before dispatch, and skip it for errors raised during later enrichment or config lookup steps.
🧹 Nitpick comments (1)
backend/api_v2/deployment_helper.py (1)
295-305: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winCleanup is skipped if
update_execution_errraises.
update_execution_errperforms a DB fetch/save and can raise on transient DB errors (the snippet only guardsWorkflowExecution.DoesNotExist). Since it now runs first, a failure here would skiprelease_slotanddelete_api_storage_dirand propagate, leaking the rate-limit slot and staged files. Consider isolating the error-marking so cleanup still runs.♻️ Isolate error-marking from cleanup
- WorkflowExecutionServiceHelper.update_execution_err( - str(execution_id), str(error) - ) - - # Release rate limit slot (workflow setup/dispatch failed, async job not started) - APIDeploymentRateLimiter.release_slot(api.organization, str(execution_id)) - - # Clean up storage - DestinationConnector.delete_api_storage_dir( - workflow_id=workflow_id, execution_id=execution_id - ) + try: + WorkflowExecutionServiceHelper.update_execution_err( + str(execution_id), str(error) + ) + except Exception: + logger.exception( + f"Failed to mark execution {execution_id} as ERROR" + ) + + # Release rate limit slot (workflow setup/dispatch failed, async job not started) + APIDeploymentRateLimiter.release_slot(api.organization, str(execution_id)) + + # Clean up storage + DestinationConnector.delete_api_storage_dir( + workflow_id=workflow_id, execution_id=execution_id + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/api_v2/deployment_helper.py` around lines 295 - 305, The cleanup path in the deployment failure handler is currently blocked by WorkflowExecutionServiceHelper.update_execution_err, so a transient DB error can prevent APIDeploymentRateLimiter.release_slot and DestinationConnector.delete_api_storage_dir from running. In the deployment helper’s failure handling block, isolate the error-marking call from the cleanup steps by catching/logging failures from update_execution_err separately, then always execute the rate-limit release and storage deletion afterward.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@backend/api_v2/deployment_helper.py`:
- Around line 291-311: The broad exception handler in `deployment_helper.py` is
treating post-dispatch failures the same as pre-dispatch failures, which can
overwrite an already running execution. Narrow the `try/except` around the
synchronous staging path or use a `dispatched` flag around
`execute_workflow_async` so
`WorkflowExecutionServiceHelper.update_execution_err`,
`APIDeploymentRateLimiter.release_slot`, and
`DestinationConnector.delete_api_storage_dir` only run when dispatch has not
succeeded. Keep the cleanup logic in the failure path before dispatch, and skip
it for errors raised during later enrichment or config lookup steps.
---
Nitpick comments:
In `@backend/api_v2/deployment_helper.py`:
- Around line 295-305: The cleanup path in the deployment failure handler is
currently blocked by WorkflowExecutionServiceHelper.update_execution_err, so a
transient DB error can prevent APIDeploymentRateLimiter.release_slot and
DestinationConnector.delete_api_storage_dir from running. In the deployment
helper’s failure handling block, isolate the error-marking call from the cleanup
steps by catching/logging failures from update_execution_err separately, then
always execute the rate-limit release and storage deletion afterward.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d01d40f6-44ce-40db-99fa-55cff80d3f5c
📒 Files selected for processing (3)
backend/api_v2/deployment_helper.pybackend/api_v2/tests/__init__.pybackend/api_v2/tests/test_deployment_helper.py
Address Greptile P1 + CodeRabbit Major on PR #2120: - Give the synchronous staging call its own try/except instead of sharing the try with execute_workflow_async + post-dispatch processing. Error-marking (update_execution_err) now applies only to genuine pre-dispatch failures and can no longer overwrite an already-dispatched/completed execution's status. - Isolate the update_execution_err call in its own try/except so a transient DB error while marking ERROR no longer skips slot release and storage cleanup (which were unconditional before this PR). - Restore the dispatch/post-processing except to its original behaviour (slot release + storage cleanup only); dispatch failures are already marked ERROR internally by execute_workflow_async. Add a regression test asserting cleanup still runs when update_execution_err itself raises. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@coderabbitai Both points addressed in fbb6d5a:
Added a regression test for the isolated-error-marking case. |
|
Tip For best results, initiate chat on the files or code changes.
If you found this review helpful, would you consider giving us a shout-out on X? Thank you for using CodeRabbit! |
Keep code comments focused on explaining the code; ticket and incident context lives in the commit/PR history instead.
|
Unstract test resultsPer-group results
Critical paths
|



What
Why
add_input_file_to_api_storage("Staging files in API storage") fails synchronously, the API returns HTTP 500 but theWorkflowExecutionrow created earlier was left PENDING and never marked ERROR. The UI showed the run as stuck/running forever and the real error wasn't visible in the execution logs.bou-unstract-ciS3/MinIO 403 incident (repro File Executioncd43e697-7e43-41c5-8979-dc1c291431cc). The storage/chart root cause is tracked separately in UN-3646.How
api_v2/deployment_helper.py→execute_workflow():add_input_file_to_api_storagestaging call inside thetry/exceptso synchronous failures are handled instead of escaping the method.WorkflowExecutionServiceHelper.update_execution_err(execution_id, error)call in theexcept. The existing handler only released the rate-limit slot, cleaned up storage, and built an error response — it never marked the DB row. The async path only gets marked becauseexecute_workflow_asynccallsupdate_execution_errinternally, which a staging failure never reaches.FileExecutionrows exist yet (they're created later in the async run), so marking the parent execution is the complete fix.Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
try, exactly as before). On failure, behavior is strictly improved: the row is marked ERROR in addition to the existing slot-release + storage cleanup.update_execution_erris idempotent and is the same helper the async-dispatch path already uses, so double-marking on the async path (if ever hit) is harmless.Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
backend/api_v2/tests/test_deployment_helper.py— a regression test in the repo'ssys.modules-stub style (usage_v2/tests/test_helper.py), so it runs with barepython3(no Django/DB). It asserts a staging exception marks the execution ERROR, releases the rate-limit slot, cleans up storage, and never reaches async dispatch.ruff check/ruff formatclean; all pre-commit hooks pass.Screenshots
Checklist
I have read and understood the Contribution Guidelines.
🤖 Generated with Claude Code