fix: Helm doesn't wait for Custom Resources on new releases using `upgrade...#32268
fix: Helm doesn't wait for Custom Resources on new releases using `upgrade...#32268pratheeknathani wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes Helm’s --wait behavior for newly created custom resources by preventing kstatus’s “empty status ⇒ Current” fallback from prematurely ending the wait during fresh installs (notably helm upgrade --install --wait).
Changes:
- Adjusts status aggregation and timeout reporting to treat certain “fresh” custom resources as
InProgresseven if kstatus reportsCurrent. - Adds
customResourceStatusPendingheuristic to detect the “fresh CR with empty status” window. - Adds test coverage for the pending→ready transition using a Knative-style CR fixture.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/kube/statuswait.go | Adds heuristic + aggregation/error-reporting adjustments to avoid early success on fresh custom resources with empty status. |
| pkg/kube/statuswait_test.go | Adds fixtures and a new test to validate waiting behavior for a fresh custom resource until status is populated. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if _, found, _ := unstructured.NestedInt64(u.Object, "status", "observedGeneration"); found { | ||
| return false | ||
| } | ||
| if conditions, found, _ := unstructured.NestedSlice(u.Object, "status", "conditions"); found && len(conditions) > 0 { | ||
| return false | ||
| } | ||
| return true |
| // - The resource is not one of the built-in types kstatus already understands. | ||
| // - metadata.generation is set, which the API server only maintains for | ||
| // resources whose CRD enables the status subresource. | ||
| // - The controller has not yet reported status.observedGeneration or any | ||
| // status.conditions. |
| // Simulate the controller reconciling the resource after a short delay. | ||
| go func() { | ||
| time.Sleep(time.Millisecond * 500) | ||
| ready := getRuntimeObjFromManifests(t, []string{customResourceReadyManifest})[0].(*unstructured.Unstructured) | ||
| err := fakeClient.Tracker().Update(gvr, ready, ready.GetNamespace()) | ||
| assert.NoError(t, err) | ||
| }() | ||
| err = statusWaiter.Wait(resourceListFor(objs), time.Second*10) | ||
| assert.NoError(t, err) |
…grade -- Fixes helm#32066 Signed-off-by: Pratheek Nathani <181894361+pratheeknathani@users.noreply.github.com>
ef11bd0 to
166c56b
Compare
|
Thanks for the review! I've addressed all three comments in 166c56b:
Verified locally: |
Summary
The behavior difference between
helm install --waitandhelm upgrade --install --waitis not caused by wait options being dropped on the way into the install code.Root cause
The behavior difference between
helm install --waitandhelm upgrade --install --waitis not caused by wait options being dropped on the way into the install code. I verified
that
pkg/cmd/upgrade.gocopiesWaitStrategy,WaitForJobsandTimeoutonto theinstall client, and both commands funnel through the same
runInstalland install action,so the wait configuration is identical.
The real cause is a race in the kstatus based status watcher (
pkg/kube/statuswait.go)when a brand new custom resource is created:
serving.knative.dev/v1Service) for a release that does not exist yet.with an empty
.statusbecause the controller has not reconciled it yet.status.Computeingithub.com/fluxcd/cli-utils) has no type specific rulefor the custom kind, finds no
status.observedGenerationmismatch and no conditions,and falls back to "the absence of any known conditions means the resource is current",
returning
Current.Currentand stops waiting, reportingsuccess before the resource is actually ready.
This matches the symptom in the issue (a CR briefly shown as
Unknown, then Helm reporting"all resources achieved desired status" before the underlying Pods are ready). It is timing
dependent, which is why it reproduces reliably in low latency environments (such as CI close
to the API server) and on first installs where the CR has no prior
.status. When therelease already exists the CR already has a populated
.statusfrom a previousreconciliation, so kstatus correctly keeps waiting and the bug does not appear.
What changed
pkg/kube/statuswait.gocustomResourceStatusPending, which detects a freshly created custom resource whosecontroller has not yet observed it. It returns true only when there is positive evidence
that a controller is expected to populate status, to keep the change safe:
(
status.GetLegacyConditionsFnreturns nil), so Services, Deployments, Jobs, Pods,StatefulSets, CRDs, etc. are never affected;
metadata.generationis set. The API server only maintainsmetadata.generationforresources whose CRD enables the
statussubresource, which implies a controller isexpected to report status. Config style CRDs without a status subresource never get a
generation, so they are skipped and never block;
status.observedGenerationor anystatus.conditions. As soon as either is present we defer to the normal kstatuscomputation, so resources that legitimately omit one of these fields are not blocked
forever.
statusObserver, when waiting forCurrent, a resource that is reportedCurrentbutis still a pending fresh custom resource is treated as
InProgressfor aggregation, soHelm keeps waiting until the controller reconciles it.
wait, the timeout error reporting uses the same adjustment so a stuck custom resourceis reported as
not ready. status: InProgressinstead of being silently treated as ready.pkg/kube/statuswait_test.gocustomResourceNoStatusManifestandcustomResourceReadyManifestfixtures(a Knative style
serving.knative.dev/v1Service).TestStatusWaitCustomResourcePendingwith two cases:times out with
resource Service/ns/hello-knative not ready. status: InProgress;observedGenerationand aReadycondition, and the wait then succeeds.Issue
Fixes #32066
Issue: #32066
Diffstat
Testing
Ran the relevant tests and linter for the changed files while developing.
Kept the change minimal and focused on this one issue.
AI assistance
I used GitHub Copilot to help write parts of this change. I've reviewed and tested it myself, I understand what it does, and I'll follow up on any review feedback.