Add per-node secret rotation tracking with drift detection #1781
lmiccini wants to merge 1 commit into openstack-k8s-operators:main
Conversation
Build failed (check pipeline). https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/56ac80bd0e7547ad88350eb0206886b5
✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 18m 47s
Or even simpler... we already have the Secret and ConfigMap hashes saved in the Deployment statuses. If the rabbitmq user rotation sees that those hashes are out of date, the rotation, or at least the old-user-deletion part of the rotation, is blocked.
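A minimal sketch of the gating described above, under assumed names and shapes (`oldUserDeletionAllowed` and the hash maps are illustrative, not the operators' actual types):

```go
package main

import "fmt"

// deploymentStatusHashes: secret name -> hash recorded in the Deployment
// status at deploy time.
// clusterSecretHashes:    secret name -> hash of the Secret as it is now.
// If any node set was last deployed with stale credentials, deleting the old
// RabbitMQ user would break it, so deletion is blocked.
func oldUserDeletionAllowed(deploymentStatusHashes, clusterSecretHashes map[string]string) bool {
	for name, want := range clusterSecretHashes {
		if deploymentStatusHashes[name] != want {
			return false // hashes out of date: block the old-user deletion
		}
	}
	return true
}

func main() {
	current := map[string]string{"nova-cell1-compute-config": "n656h"}
	stale := map[string]string{"nova-cell1-compute-config": "n5b4h"}
	fmt.Println(oldUserDeletionAllowed(stale, current))   // blocked
	fmt.Println(oldUserDeletionAllowed(current, current)) // allowed
}
```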
Thanks @slagle, appreciate you taking the time. The additional stuff "on top" is required because we could have different rabbitmq users for the nova_compute, neutron and ironic agents running on the dataplane, so I try to track which node in a nodeset ran a deployment for the aforementioned services and store that in a configmap that we update until all nodes have reconciled to the hashes you mention in the last comment. Here is how it could look: If I understand correctly, you would like to flip this around and have infra-operator track each nodeset's rabbitmq usage instead? I'm not sure having infra-operator introspect dataplane objects is my preferred approach, especially because we have no way of knowing whether one additional service that could use rabbitmq will be added tomorrow, so we would have to play catch-up with the dataplane. That said, I can try to prototype something and see how ugly it gets.
We cannot do that. That would introduce a circular dependency, because infra would add a dependency on the openstack-operator.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: lmiccini. It still needs approval from an approver in each of the relevant OWNERS files.
### Stage 1: Initial State - All Nodes on Old Credentials

**Secret (cluster state)**

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: nova-cell1-compute-config
  resourceVersion: "12345"
data:
  transport_url: "rabbit://user7:pass@rabbitmq:5672/"  # Old credentials
```

**ConfigMap (tracking state)**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n5b4h...",
      "expectedHash": "n5b4h...",
      "nodes": {
        "compute-0": {
          "secretHash": "n5b4h...",
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        },
        "compute-1": {
          "secretHash": "n5b4h...",
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        }
      }
    }
```

**NodeSet Status**

```yaml
status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 2        # ✓ All nodes on current version
    allNodesUpdated: true  # ✓ Safe to delete old credentials (if they existed)
    lastUpdateTime: "2026-02-20T10:00:00Z"
```

State: all nodes are running with user7, there is no drift, and the system is stable.

### Stage 2: Credential Rotation - Cluster Secret Changes

The administrator rotates the RabbitMQ credentials by updating the OpenStackControlPlane, switching cell1 to use a different user.
**Secret (cluster state) - CHANGED**

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: nova-cell1-compute-config
  resourceVersion: "67890"  # ← Changed
data:
  transport_url: "rabbit://user8:pass@rabbitmq:5672/"  # ← New credentials
```
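Drift detection compares a content hash of the cluster Secret against the hashes recorded per node. A sketch of how such a hash could be derived (an assumed implementation, not necessarily the operator's actual hash function):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// secretDataHash hashes a Secret's data keys and values in a deterministic
// order, so any change to transport_url changes the resulting hash.
func secretDataHash(data map[string][]byte) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic ordering regardless of map iteration
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(data[k])
	}
	return fmt.Sprintf("%x", h.Sum(nil))[:8] // shortened for readability
}

func main() {
	oldSecret := map[string][]byte{"transport_url": []byte("rabbit://user7:pass@rabbitmq:5672/")}
	newSecret := map[string][]byte{"transport_url": []byte("rabbit://user8:pass@rabbitmq:5672/")}
	fmt.Println(secretDataHash(oldSecret) != secretDataHash(newSecret)) // rotation changes the hash
}
```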
**ConfigMap (tracking state) - UNCHANGED**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n5b4h...",   # Still old hash
      "expectedHash": "n5b4h...",  # Still old hash
      "nodes": {
        "compute-0": {
          "secretHash": "n5b4h...",  # Still old hash
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        },
        "compute-1": {
          "secretHash": "n5b4h...",  # Still old hash
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        }
      }
    }
```

**NodeSet Status - DRIFT DETECTED**

```yaml
status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 0                         # ← Changed: drift detected, reset to 0
    allNodesUpdated: false                  # ← Changed: drift exists
    lastUpdateTime: "2026-02-23T11:36:46Z"  # ← Updated by drift detection
```

State: drift detected! The nodes still have user7, but the cluster expects user8.

### Stage 3: Partial Deployment - Update compute-0 Only

The administrator creates a deployment with an `ansibleLimit` targeting only compute-0.
**Deployment**

```yaml
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment-c0-limit
spec:
  nodeSets:
    - openstack-edpm-ipam
  ansibleLimit: compute-0  # Only this node
```

After the deployment completes:
**Secret (cluster state) - UNCHANGED**

```yaml
data:
  transport_url: "rabbit://user8:pass@rabbitmq:5672/"  # Still user8
```
**ConfigMap (tracking state) - PARTIALLY UPDATED**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n5b4h...",   # ← NOT updated (compute-1 still on n5b4h)
      "expectedHash": "n656h...",  # ← Updated to cluster hash
      "nodes": {
        "compute-0": {
          "secretHash": "n656h...",  # ← Updated to user8
          "deploymentName": "edpm-deployment-c0-limit",
          "lastUpdated": "2026-02-23T12:00:00Z"
        },
        "compute-1": {
          "secretHash": "n5b4h...",  # ← Still on user7
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        }
      }
    }
```
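The `updatedNodes` count can be derived from such a tracking entry by comparing each node's recorded hash with `expectedHash`. A sketch using hypothetical Go types that mirror the JSON shown in the walkthrough:

```go
package main

import "fmt"

// NodeInfo and SecretTracking are illustrative mirrors of one tracking entry
// from the ConfigMap; field names follow the JSON in the walkthrough.
type NodeInfo struct {
	SecretHash     string `json:"secretHash"`
	DeploymentName string `json:"deploymentName"`
	LastUpdated    string `json:"lastUpdated"`
}

type SecretTracking struct {
	CurrentHash  string              `json:"currentHash"`
	ExpectedHash string              `json:"expectedHash"`
	Nodes        map[string]NodeInfo `json:"nodes"`
}

// countUpdated returns how many nodes already carry the expected hash.
func countUpdated(t SecretTracking) (updated, total int) {
	for _, n := range t.Nodes {
		if n.SecretHash == t.ExpectedHash {
			updated++
		}
	}
	return updated, len(t.Nodes)
}

func main() {
	t := SecretTracking{
		ExpectedHash: "n656h",
		Nodes: map[string]NodeInfo{
			"compute-0": {SecretHash: "n656h"}, // redeployed with user8
			"compute-1": {SecretHash: "n5b4h"}, // still on user7
		},
	}
	u, total := countUpdated(t)
	fmt.Printf("updatedNodes: %d/%d\n", u, total)
}
```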
**NodeSet Status - PARTIAL UPDATE**

```yaml
status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 1         # Only 1 of 2 nodes updated
    allNodesUpdated: false  # ← Still false
    lastUpdateTime: "2026-02-23T12:00:00Z"
```

State: compute-0 now has user8; compute-1 still has user7.

### Stage 4: Full Deployment - Update All Remaining Nodes

The administrator deploys to all nodes (or the remaining nodes).
**Deployment**

```yaml
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment-full
spec:
  nodeSets:
    - openstack-edpm-ipam
  # No ansibleLimit - all nodes
```

After the deployment completes:
**Secret (cluster state) - UNCHANGED**

```yaml
data:
  transport_url: "rabbit://user8:pass@rabbitmq:5672/"  # Still user8
```
**ConfigMap (tracking state) - FULLY UPDATED**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n656h...",   # ← Updated: all nodes on n656h
      "expectedHash": "n656h...",  # Matches cluster
      "nodes": {
        "compute-0": {
          "secretHash": "n656h...",  # user8
          "deploymentName": "edpm-deployment-full",
          "lastUpdated": "2026-02-23T13:00:00Z"
        },
        "compute-1": {
          "secretHash": "n656h...",  # ← Updated to user8
          "deploymentName": "edpm-deployment-full",
          "lastUpdated": "2026-02-23T13:00:00Z"
        }
      }
    }
```
**NodeSet Status - ALL UPDATED**

```yaml
status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 2        # ← All nodes updated
    allNodesUpdated: true  # ← Safe to proceed!
    lastUpdateTime: "2026-02-23T13:00:00Z"
```

State: all nodes now have user8; no drift.

### Stage 5: Multiple NodeSets Scenario

What if multiple NodeSets share the same credentials?

**Setup**: two NodeSets, compute and storage, with two nodes each, both consuming the same RabbitMQ credentials.
**After Partial Deployment (compute NodeSet only)**

Compute NodeSet Status:

```yaml
status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 2
    allNodesUpdated: true  # ✓ Compute NodeSet is done
```

Storage NodeSet Status:

```yaml
status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 0
    allNodesUpdated: false  # ✗ Storage NodeSet not updated
```

Credential Status: ✗ NOT SAFE - the storage NodeSet still runs on the old credentials.

**After Deploying Both NodeSets**

Compute NodeSet Status:

```yaml
status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 2
    allNodesUpdated: true  # ✓
```

Storage NodeSet Status:

```yaml
status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 2
    allNodesUpdated: true  # ✓
```

Credential Status: ✓ SAFE - all 4 nodes across both NodeSets are updated. Now it is safe to delete user7!

### Key Observations

**currentHash vs expectedHash**: `expectedHash` follows the hash of the Secret as it exists in the cluster; `currentHash` only advances once every node in the NodeSet has been deployed with that hash, so the two differ exactly while a rollout is in flight.

**Drift Detection Logic**: drift is flagged whenever the cluster Secret's hash stops matching the per-node hashes recorded in the tracking ConfigMap; the NodeSet status then resets `updatedNodes` and flips `allNodesUpdated` to false.

**Credential Deletion Safety**: old credentials can ONLY be deleted when every NodeSet that references them reports `allNodesUpdated: true`.

**Stale Deployment Handling**: if a deployment started before the latest Secret change, its results are treated as stale and do not mark nodes as updated, which prevents flip-flop during a rotation.
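The drift and deletion-safety rules above can be sketched in Go (illustrative types and names, not the actual controller code):

```go
package main

import "fmt"

// nodeSetStatus is a stand-in for the secretDeployment status summary shown
// in the walkthrough.
type nodeSetStatus struct {
	TotalNodes      int
	UpdatedNodes    int
	AllNodesUpdated bool
}

// hasDrift: the cluster Secret's hash no longer matches what the nodes last ran.
func hasDrift(expectedHash, clusterHash string) bool {
	return expectedHash != clusterHash
}

// safeToDeleteOldUser: every NodeSet sharing the credential must report
// allNodesUpdated before the old RabbitMQ user may be removed.
func safeToDeleteOldUser(statuses []nodeSetStatus) bool {
	for _, s := range statuses {
		if !s.AllNodesUpdated {
			return false
		}
	}
	return len(statuses) > 0
}

func main() {
	fmt.Println(hasDrift("n5b4h", "n656h")) // rotation happened, nodes lag behind
	partial := []nodeSetStatus{{2, 2, true}, {2, 0, false}} // compute done, storage not
	full := []nodeSetStatus{{2, 2, true}, {2, 2, true}}
	fmt.Println(safeToDeleteOldUser(partial), safeToDeleteOldUser(full))
}
```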
This new approach can be used directly by openstack-operator to set and remove finalizers on rabbitmq users, or by infra-operator to read the nodeset status and do the finalizer management (openstack-k8s-operators/infra-operator@main...lmiccini:infra-operator:track_dataplaneusers).
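A sketch of the finalizer bookkeeping this would entail. controller-runtime's `controllerutil.AddFinalizer`/`RemoveFinalizer` do this on real objects; this self-contained version only shows the underlying slice manipulation, and the finalizer name is made up:

```go
package main

import "fmt"

// Hypothetical finalizer an operator could pin on a RabbitMQ user while any
// NodeSet still references its credentials, dropping it once allNodesUpdated
// holds everywhere.
const userFinalizer = "dataplane.openstack.org/rabbitmq-user-in-use" // illustrative name

// addFinalizer appends f if it is not already present (idempotent).
func addFinalizer(finalizers []string, f string) []string {
	for _, x := range finalizers {
		if x == f {
			return finalizers
		}
	}
	return append(finalizers, f)
}

// removeFinalizer returns the list without f.
func removeFinalizer(finalizers []string, f string) []string {
	var out []string
	for _, x := range finalizers {
		if x != f {
			out = append(out, x)
		}
	}
	return out
}

func main() {
	fs := addFinalizer(nil, userFinalizer)
	fs = addFinalizer(fs, userFinalizer) // no duplicate added
	fmt.Println(len(fs))
	fs = removeFinalizer(fs, userFinalizer)
	fmt.Println(len(fs))
}
```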
I think this change actually encompasses several pieces of functionality that could be expressed as different features, each of which probably needs a good design and a discussion of how we want them to work. A couple I can see right off:
I feel like we'd short-circuit some of the design/acceptance/testing cycle if we merge this. This is a big change to the NodeSet type and controller, which is an area edpm is largely responsible for, and I don't feel like that team has a good handle on taking ownership of this functionality (testing, docs, etc.). Can't we just tell users to do a deployment across all nodesets without an ansible limit before deleting old credentials? I'm not sure that's something we need to solve in code right away.
Absolutely. Right now we are documenting how to check and eventually remove the rotated users, so we are not in a rush to merge this as is.
Track secret deployment progress across nodeset nodes using a ConfigMap to store detailed per-secret version info and a summary in the CR status. This enables credential rotation safety by blocking deletion of old credentials until all nodes have been updated with the new version.

Key capabilities:
- Per-node tracking of which secret versions are deployed
- Drift detection comparing cluster secrets vs deployed versions
- Gradual rollout support via AnsibleLimit-aware node tracking
- Stale deployment detection to prevent flip-flop during rotation
- Fail-safe defaults (assume drift when tracking is unavailable)

Test coverage:
- Unit tests (credentials_test.go): 25 test cases covering pure logic including hash comparison, node accumulation, drift detection, stale deployment handling, gradual rollout, multi-secret rotation, and JSON serialization
- Ginkgo functional tests (openstackdataplanenodeset_credentials_test.go): 4 integration tests exercising the full reconciliation loop with envtest, covering tracking ConfigMap lifecycle, SecretDeploymentStatus population, drift detection after secret rotation, and recovery after redeployment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
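The "fail-safe defaults" bullet can be illustrated with a small sketch: any failure to read or parse the tracking data reports "not safe", so drift is assumed. Types and names here are hypothetical, not the operator's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tracking is an illustrative subset of one per-secret tracking entry.
type tracking struct {
	CurrentHash  string `json:"currentHash"`
	ExpectedHash string `json:"expectedHash"`
}

// safeState reports whether all nodes are known to be updated. Every error
// path returns false, so old credentials are never deleted when the tracking
// ConfigMap is missing or corrupt.
func safeState(raw []byte) bool {
	if len(raw) == 0 {
		return false // tracking ConfigMap absent: assume drift
	}
	var t tracking
	if err := json.Unmarshal(raw, &t); err != nil {
		return false // unparsable tracking data: assume drift
	}
	return t.CurrentHash != "" && t.CurrentHash == t.ExpectedHash
}

func main() {
	fmt.Println(safeState(nil))                                                     // missing: not safe
	fmt.Println(safeState([]byte(`{bad`)))                                          // corrupt: not safe
	fmt.Println(safeState([]byte(`{"currentHash":"n656h","expectedHash":"n656h"}`))) // consistent: safe
}
```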
5b509f3 to
5880a65
Compare
@lmiccini: The following tests failed.