Release reserved storage resources on VM deployment failure#13048
winterhazel wants to merge 1 commit into apache:4.22
Conversation
@sureshanaparti can we include this one on 4.22.1?

@blueorangutan package

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## 4.22 #13048 +/- ##
============================================
- Coverage 17.68% 17.68% -0.01%
+ Complexity 15793 15792 -1
============================================
Files 5922 5922
Lines 533096 533094 -2
Branches 65209 65205 -4
============================================
- Hits 94275 94270 -5
- Misses 428181 428184 +3
Partials 10640 10640
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17554
Pull request overview
Fixes a resource leak in VM deployment where partially-created storage resource reservations (volume / primary storage) could remain in `resource_reservation` after a deployment failure, preventing users from fully consuming their configured limits until manual or automatic cleanup.
Changes:
- Refactors `UserVmManagerImpl.reserveStorageResourcesForVm` to populate a caller-owned reservation list so already-created `CheckedReservation`s are reliably closed in the caller's `finally` block even when later reservations fail.
- Adds a DB upgrade cleanup step to purge stale `resource_reservation` rows during the 4.22.0 → 4.22.1 upgrade.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| server/src/main/java/com/cloud/vm/UserVmManagerImpl.java | Ensures partially-created storage reservations aren’t lost on exceptions, enabling deterministic release on deployment failure. |
| engine/schema/src/main/resources/META-INF/db/schema-42200to42210-cleanup.sql | Normalizes affected environments by removing stale resource_reservation entries during upgrade. |
-- Entries remaining on `cloud`.`resource_reservation` during the upgrade process are stale, so delete them.
-- This script was added to normalize volume/primary storage reservations that got stuck due to a bug on VM deployment,
-- but it is more interesting to introduce a smarter logic to clean these stale reservations in the future without the need
-- for upgrades (for instance, by having a heartbeat_time column for the reservations and automatically cleaning old entries).
DELETE FROM `cloud`.`resource_reservation`;
The upgrade cleanup deletes all rows from cloud.resource_reservation, even though the described bug affects stale volume/primary_storage reservations. This broad delete could also wipe unrelated in-flight reservations (e.g., public IP/template/etc.) if an operator mistakenly runs the upgrade while other management servers are still operating. Consider restricting the DELETE to the affected resource types (and optionally to rows older than a conservative threshold) to reduce blast radius while still fixing the reported issue.
Suggested change:

-- Entries remaining on `cloud`.`resource_reservation` during the upgrade process for the affected resource types are stale, so delete them.
-- This script was added to normalize volume/primary storage reservations that got stuck due to a bug on VM deployment,
-- so scope the cleanup to those reservation types to avoid removing unrelated in-flight reservations.
-- It would still be more interesting to introduce a smarter logic to clean these stale reservations in the future without the need
-- for upgrades (for instance, by having a heartbeat_time column for the reservations and automatically cleaning old entries).
DELETE FROM `cloud`.`resource_reservation`
WHERE `resource_type` IN ('volume', 'primary_storage');
List<CheckedReservation> checkedReservations = new ArrayList<>();

try {
-    checkedReservations = reserveStorageResourcesForVm(owner, diskOfferingId, diskSize, dataDiskInfoList, rootDiskOfferingId, offering, volumesSize);
+    reserveStorageResourcesForVm(checkedReservations, owner, diskOfferingId, diskSize, dataDiskInfoList, rootDiskOfferingId, offering, volumesSize);
This change fixes a subtle leak path where partially-created CheckedReservations could be lost if an exception is thrown mid-reservation. There are existing UserVmManagerImplTest tests that use MockedConstruction<CheckedReservation>; adding a focused unit test that forces CheckedReservation construction to throw after the first/second reservation and then asserts that previously-created reservations are still closed (i.e., close() invoked) would help prevent regressions.
Description
PR #10140 changed how volume and primary storage resources are reserved in the deployment process. However, the new method introduced an issue: if reserving part of the storage resources fails (e.g., the volume resource for the root disk is reserved, but primary storage for it cannot be), the reservations that were previously created are never released. As a result, users are not able to fully utilize their configured limits.
This PR fixes this issue and, additionally, adds a query to the upgrade script to clean up the stale entries.
It would be better to introduce smarter logic to clean these stale reservations in the future without the need for upgrades (for instance, by having a heartbeat_time column for the reservations and automatically cleaning entries older than a threshold); however, as we are very close to the 4.22.1 release, there is not enough time to implement and test a more complex mechanism, so I opted instead to include a simple script to normalize environments that are affected.
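The fix described above can be sketched with a minimal, self-contained example. The class and method names below are illustrative stand-ins (not CloudStack's actual `CheckedReservation` or `reserveStorageResourcesForVm` signatures): the key idea is that the caller owns the list, so reservations created before a mid-sequence failure remain visible to the caller's `finally` block and can still be closed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the caller-owned reservation list pattern.
// "Reservation" is a hypothetical stand-in for CloudStack's CheckedReservation.
public class ReservationSketch {
    static List<String> closed = new ArrayList<>();

    static class Reservation implements AutoCloseable {
        final String name;
        Reservation(String name) { this.name = name; }
        @Override public void close() { closed.add(name); }
    }

    // Populates the caller-owned list instead of returning a new one.
    // Throws partway through to simulate a failed later reservation.
    static void reserveStorageResources(List<Reservation> out, boolean failOnSecond) {
        out.add(new Reservation("volume"));       // first reservation succeeds
        if (failOnSecond) {
            throw new RuntimeException("primary storage limit exceeded");
        }
        out.add(new Reservation("primary_storage"));
    }

    public static void main(String[] args) {
        List<Reservation> reservations = new ArrayList<>();
        try {
            reserveStorageResources(reservations, true);
        } catch (RuntimeException e) {
            // deployment failed mid-reservation
        } finally {
            // The caller still sees the partially-created reservations and releases them.
            for (Reservation r : reservations) {
                r.close();
            }
        }
        System.out.println(closed); // prints "[volume]"
    }
}
```

With the pre-fix shape (the callee builds and returns its own list), an exception thrown mid-sequence discards the list before the caller ever receives it, so the earlier reservations can never be closed; passing the list in from the caller removes that window.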
Types of changes
Feature/Enhancement Scale or Bug Severity
Bug Severity
How Has This Been Tested?
Storage resource reservation release on VM deployment failure
I configured volume and primary storage limits for an account to 1 and 2 GB, respectively. Then, I attempted to deploy a VM with a 50 MB root disk and a 5 GB data disk. This process ended in failure, as there were not enough volume resources available.
Before the changes, stale volume and primary storage reservations for the root disk would remain in the database. Because of this, I was not able to deploy any more VMs for that account under these limits, even though it had only a single volume.
Failure on VM deployment due to insufficient volume resources
Performing the same procedure after the changes did not result in stale reservations.
Resource reservation on database upgrade
I upgraded an environment on 4.22.0 with stale volume and primary storage reservations to 4.22.1 and validated, after the upgrade finished, that there were no more stale entries.