fix: ensure navigation sidebar serves fresh data after course publish#38785
wgu-taylor-payne wants to merge 1 commit
After a course publish in Studio, the CourseNavigationBlocksView can cache stale block structure data for up to 1 hour. This happens because the block structure rebuild task runs with a 30-second delay, but the navigation view may be hit during that window, read the old block structure from its cache, and store the stale result under the new course_version key. The fix adds an update_collected_if_needed() call on cache miss, ensuring the block structure is fresh before we build and cache the navigation tree. This only runs on cache misses and adds negligible overhead for the common case (block structure already up-to-date).
5103ff8 to 4374146
ormsbee left a comment
I don't think this is operationally feasible for large courses and high traffic. Let's talk more about other possible mitigations.
    if not course_blocks:
        # Ensure the block structure cache is up-to-date before reading.
        get_block_structure_manager(course_key).update_collected_if_needed()
Going through the collection phase of a large course can be extremely expensive, which is why it's done asynchronously in celery tasks or management commands (it can often exceed the 30s timeout that many sites use for giving up on web worker requests). Placing it in the GET here also risks causing a stampede if it is a popular course that many concurrent users are trying to access, as parallel workers try to recompute the same collection phase data.
Thank you, I appreciate this feedback. I'll look into another way of preventing the stale cache.
For instance, I think a course's navigation being incorrect for a minute after a deletion is a bad, but not necessarily release-blocking, bug (FYI @crathbun428 and @jmakowski1123, who can weigh in here). If the wrong navigation is getting cached for an hour, then maybe that's the part that we should focus on for this fix.
I agree with Dave; I would not classify a 30-60 second cache as a blocker. But an hour is a bigger problem.
Summary
After a course publish, there is a ~30-second window where the navigation sidebar endpoint serves stale block structure data. Worse, the stale response is cached for 1 hour, extending the staleness far beyond the initial window.
This PR adds a synchronous staleness check before reading the block structure on a navigation cache miss.
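As a rough sketch of what that check amounts to on the cache-miss path: the classes, dict layouts, and version strings below are simplified, hypothetical stand-ins, not the real block structure manager. Only the method names `update_collected_if_needed()` and `is_up_to_date` are taken from this PR.

```python
# Simplified stand-ins; everything except the update_collected_if_needed()
# and is_up_to_date names is hypothetical and does not mirror edx-platform.
class FakeBlockStructureManager:
    def __init__(self, modulestore, collected):
        self.modulestore = modulestore   # {"version": ..., "units": [...]}
        self.collected = collected       # cached (collected) block structure
        self.rebuild_count = 0

    def is_up_to_date(self):
        """Compare the cached structure's version with the modulestore's."""
        return self.collected["version"] == self.modulestore["version"]

    def update_collected_if_needed(self):
        """Synchronously rebuild the cached structure only when it is stale."""
        if not self.is_up_to_date():
            self.collected = {
                "version": self.modulestore["version"],
                "units": list(self.modulestore["units"]),
            }
            self.rebuild_count += 1


def get_navigation(manager, nav_cache):
    """Navigation lookup keyed by course version, refreshing on a miss."""
    key = f"nav:{manager.modulestore['version']}"
    course_blocks = nav_cache.get(key)
    if not course_blocks:
        # Cache miss: refresh the block structure before reading from it.
        manager.update_collected_if_needed()
        course_blocks = list(manager.collected["units"])
        nav_cache[key] = course_blocks
    return course_blocks


# A publish has already bumped the modulestore to v2 and removed unit-b,
# but the collected block structure is still at v1.
modulestore = {"version": "v2", "units": ["unit-a"]}
stale = {"version": "v1", "units": ["unit-a", "unit-b"]}
manager = FakeBlockStructureManager(modulestore, stale)
nav_cache = {}

first = get_navigation(manager, nav_cache)    # miss: rebuilds, then caches
second = get_navigation(manager, nav_cache)   # hit: no rebuild
```

In this toy version the rebuild is a list copy. In a real course it is the full collection phase, which is where the cost of doing it inside a request would lie.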
Problem
This issue came to light while testing an internal Open edX instance: a unit was deleted in Studio, but refreshing the course in the learning MFE still showed the unit in the course outline.
In a Verawood sandbox, I deleted a unit in Studio, waited 30+ seconds, and then refreshed; the unit was removed from the outline in the learning MFE. I then deleted another unit in Studio and refreshed the course page within a few seconds, and the deleted unit still showed. Refreshing again before an hour had passed, it still showed. Refreshing after an hour, it was no longer present in the outline.
Three components interact to create a race condition:
1. Block structure rebuild is delayed 30s after publish
The `course_published` signal handler queues the rebuild task with a 30s countdown.
2. Navigation cache key includes course version, causing a miss on publish
The cache key uses `course_version`, so after a publish the version changes and the previous cached response no longer matches. This is a cache miss.
3. On cache miss, stale block structure data is read and cached for 1 hour
The `if not course_blocks` branch calls `get_course_outline_block_tree()`, which reads from the (still stale) block structure cache, then stores the result for 1 hour.
Timeline:
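The ordering of events behind this race can be reproduced with a small simulation. Everything in it is a hypothetical stand-in (a `FakeCourse` class, plain lists and dicts instead of the real caches, made-up version strings and unit names), not the platform's code. It only models the 30s rebuild delay, the version-keyed navigation cache, and the 1-hour TTL described above.

```python
# Toy model of the race; all names are illustrative, not edx-platform APIs.
REBUILD_DELAY = 30        # seconds before the rebuild task runs
NAV_CACHE_TTL = 60 * 60   # navigation response is cached for 1 hour


class FakeCourse:
    def __init__(self):
        self.version = "v1"                           # modulestore version
        self.modulestore_units = ["unit-a", "unit-b"]
        self.block_structure = ["unit-a", "unit-b"]   # collected cache
        self.rebuild_due_at = None
        self.nav_cache = {}                           # key -> (expires_at, units)

    def publish(self, now, units, version):
        """Studio publish: content changes now, the rebuild is only scheduled."""
        self.modulestore_units = list(units)
        self.version = version
        self.rebuild_due_at = now + REBUILD_DELAY

    def run_due_tasks(self, now):
        """Run the delayed rebuild task once its countdown has elapsed."""
        if self.rebuild_due_at is not None and now >= self.rebuild_due_at:
            self.block_structure = list(self.modulestore_units)
            self.rebuild_due_at = None

    def get_navigation(self, now):
        """Navigation view: cache keyed by course version, 1-hour TTL."""
        self.run_due_tasks(now)
        key = f"nav:{self.version}"
        cached = self.nav_cache.get(key)
        if cached and cached[0] > now:
            return cached[1]
        units = list(self.block_structure)            # may still be stale
        self.nav_cache[key] = (now + NAV_CACHE_TTL, units)
        return units


course = FakeCourse()
course.publish(now=0, units=["unit-a"], version="v2")         # delete unit-b
early = course.get_navigation(now=5)                          # inside the 30s window
later = course.get_navigation(now=120)                        # rebuild done, cache hit
after_ttl = course.get_navigation(now=5 + NAV_CACHE_TTL + 1)  # cache expired
print(early, later, after_ttl)
```

The request at 5s misses the navigation cache (new version key), reads the not-yet-rebuilt structure, and stores it; the request at 120s is served that stored result even though the rebuild has since run. Only after the TTL expires does the deleted unit disappear.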
Fix
Before reading the block structure on a cache miss, call `update_collected_if_needed()`. This compares the cached block structure version against the modulestore version and synchronously rebuilds only if stale.
Performance impact
`is_up_to_date` check (DB read + version comparison).
Testing
To run the automated test:
Manual steps:
AI Usage
Used Kiro with model set to auto to aid in the discovery of the root cause of the error, talk through potential fixes, come up with a test that addresses the issue being fixed, and write this PR summary.