fix(statefulset): don't skip Delete when scale-to-0 Update fails #1993

Merged: sunsingerus merged 1 commit into Altinity:0.27.1 from dashashutosh80:fix/sts-recreate-cascade-on-update-conflict on May 29, 2026

@dashashutosh80

Contributor

doDeleteStatefulSet returned early when the courtesy scale-to-0 Update failed (e.g. 409 Conflict from concurrent kube-controller-manager writes to .status), never reaching the actual Delete. recreateStatefulSet then silently discarded the error and called createStatefulSet, which observed the still-existing STS and failed with AlreadyExists. The operator finalised the reconcile as Completed while the host stayed at Replicas=0 with no further self-correction.

doDeleteStatefulSet:

  • Treat scale-to-0 as best-effort. On Update failure, log a warning and proceed to Delete, which terminates the pods anyway.
  • Treat IsNotFound on Get as success (contract: STS is gone).
  • Skip scale-down when Replicas is already 0.
  • Propagate Delete errors instead of swallowing them.
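
The four rules above can be sketched roughly as follows. This is a minimal illustration and not the operator's actual code: the `Client` interface, the `StatefulSet` struct, the `ErrNotFound` and `ErrConflict` sentinels, the `fakeClient` helper and the function name `deleteStatefulSet` are all hypothetical stand-ins for the real Kubernetes client types and API errors.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Hypothetical stand-ins for API server errors (NotFound, 409 Conflict).
var (
	ErrNotFound = errors.New("not found")
	ErrConflict = errors.New("conflict")
)

// StatefulSet is a simplified, hypothetical model with only the fields the sketch needs.
type StatefulSet struct {
	Name     string
	Replicas int32
}

// Client is a hypothetical minimal API surface, not the real Kubernetes client.
type Client interface {
	Get(name string) (*StatefulSet, error)
	Update(sts *StatefulSet) error
	Delete(name string) error
}

// deleteStatefulSet illustrates the behaviour described in the list above.
func deleteStatefulSet(c Client, name string) error {
	sts, err := c.Get(name)
	if errors.Is(err, ErrNotFound) {
		return nil // contract: the STS is gone, so there is nothing to delete
	}
	if err != nil {
		return fmt.Errorf("get StatefulSet %s: %w", name, err)
	}
	if sts.Replicas > 0 { // skip the scale-down when Replicas is already 0
		sts.Replicas = 0
		if err := c.Update(sts); err != nil {
			// Best-effort only: warn and fall through to Delete.
			log.Printf("warning: scale-to-0 of %s failed, deleting anyway: %v", name, err)
		}
	}
	if err := c.Delete(name); err != nil {
		return fmt.Errorf("delete StatefulSet %s: %w", name, err) // propagate, do not swallow
	}
	return nil
}

// fakeClient is an in-memory Client used only to exercise the sketch.
type fakeClient struct {
	sets      map[string]*StatefulSet
	updateErr error // returned by every Update when non-nil
	deleteErr error // returned by every Delete when non-nil
}

func (f *fakeClient) Get(name string) (*StatefulSet, error) {
	sts, ok := f.sets[name]
	if !ok {
		return nil, ErrNotFound
	}
	copied := *sts
	return &copied, nil
}

func (f *fakeClient) Update(sts *StatefulSet) error {
	if f.updateErr != nil {
		return f.updateErr
	}
	if _, ok := f.sets[sts.Name]; !ok {
		return ErrNotFound
	}
	copied := *sts
	f.sets[sts.Name] = &copied
	return nil
}

func (f *fakeClient) Delete(name string) error {
	if f.deleteErr != nil {
		return f.deleteErr
	}
	if _, ok := f.sets[name]; !ok {
		return ErrNotFound
	}
	delete(f.sets, name)
	return nil
}
```

With a `fakeClient` whose `updateErr` is `ErrConflict`, `deleteStatefulSet` still removes the object and returns nil. That is the case the previous early return missed.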

recreateStatefulSet:

  • Stop discarding the error from doDeleteStatefulSet. Return a plain error so the next reconcile retries; ErrCRUDAbort would mark the whole CHI as ReconcileAbort and require manual intervention, which is too aggressive for a transient conflict.
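
A rough sketch of the error handling described above, under stated assumptions: `recreate`, `isAbort`, and the sentinel values declared here are hypothetical illustrations, and the delete and create steps are passed in as plain functions rather than calling the operator's real methods.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels. ErrCRUDAbort here only models the operator's
// abort signal; ErrConflict models a transient 409 from the API server.
var (
	ErrCRUDAbort = errors.New("crud abort")
	ErrConflict  = errors.New("conflict")
)

// recreate deletes and then creates a StatefulSet. A failed delete is
// returned as a plain (non-abort) error and create is not attempted, so
// the caller can simply retry on the next reconcile.
func recreate(name string, del, create func(name string) error) error {
	if err := del(name); err != nil {
		return fmt.Errorf("recreate %s: delete failed, retry on next reconcile: %w", name, err)
	}
	return create(name)
}

// isAbort reports whether err would be treated as an abort in this sketch.
func isAbort(err error) bool {
	return errors.Is(err, ErrCRUDAbort)
}
```

Because the returned error does not wrap `ErrCRUDAbort`, a caller that checks `isAbort(err)` sees a retryable failure rather than a reason to abort the whole reconcile.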

Important items to consider before making a Pull Request

Please check the items this PR complies with:

  • All commits in the PR are squashed.
  • The PR is made into a dedicated next-release branch, not into the master branch.
  • The PR is signed.

This PR fixes #1990

Signed-off-by: dashashutosh80 <dashashutosh80@gmail.com>
@sunsingerus added the "planned for review" label on May 27, 2026
@sunsingerus merged commit 9cfc325 into Altinity:0.27.1 on May 29, 2026
2 checks passed