Release reserved storage resources on VM deployment failure #13048

Open
winterhazel wants to merge 1 commit into apache:4.22 from scclouds:fix-storage-resource-allocation-on-vm-deploy-4.22

Conversation

@winterhazel
Member

Description

PR #10140 changed how volume and primary storage resources are reserved during deployment. However, the new method has an issue: if reserving part of the storage resources fails (e.g. a volume resource is reserved for the root disk, but primary storage cannot be reserved for it), the resources reserved up to that point are never released. As a result, users cannot fully utilize their configured limits.

This PR fixes this issue and, additionally, adds a query to the upgrade script that removes the stale entries.

In the future, it would be preferable to introduce a smarter mechanism that cleans up these stale reservations without requiring an upgrade (for instance, a heartbeat_time column for the reservations, with entries older than a given age cleaned automatically). However, as we are very close to the release of 4.22.1, there is not enough time to implement and test a more complex mechanism, so I opted to include a simple script that normalizes the environments already affected.
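To make the heartbeat idea more concrete, here is a minimal in-memory sketch of age-based cleanup. It is not part of this PR and not CloudStack code: the `heartbeatTime` field mirrors the hypothetical heartbeat_time column mentioned above, and the class name, the `removeStale` helper, and the five-minute threshold are assumptions made only for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of heartbeat-based cleanup; none of these names exist in CloudStack.
class HeartbeatCleanupSketch {
    static final class Reservation {
        final long id;
        final Instant heartbeatTime; // stands in for the hypothetical heartbeat_time column

        Reservation(long id, Instant heartbeatTime) {
            this.id = id;
            this.heartbeatTime = heartbeatTime;
        }
    }

    // Removes every reservation whose last heartbeat is older than maxAge
    // and returns the ids of the removed entries.
    static List<Long> removeStale(List<Reservation> reservations, Instant now, Duration maxAge) {
        Instant cutoff = now.minus(maxAge);
        List<Long> removed = new ArrayList<>();
        reservations.removeIf(r -> {
            if (r.heartbeatTime.isBefore(cutoff)) {
                removed.add(r.id);
                return true;
            }
            return false;
        });
        return removed;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2026-04-20T17:00:00Z");
        List<Reservation> table = new ArrayList<>(List.of(
                new Reservation(104, Instant.parse("2026-04-20T16:50:51Z")),   // no heartbeat for ~9 minutes
                new Reservation(200, Instant.parse("2026-04-20T16:59:30Z")))); // still being refreshed
        System.out.println("removed ids: " + removeStale(table, now, Duration.ofMinutes(5)));
    }
}
```

A real mechanism would also need the owning management server to refresh the heartbeat while a deployment is in flight, which this sketch does not model.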

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

Storage resource reservation release on VM deployment failure

I set an account's volume limit to 1 and its primary storage limit to 2 GB. Then, I attempted to deploy a VM with a 50 MB root disk and a 5 GB data disk. The deployment failed, as there were not enough volume resources available.

Before the changes, stale volume and primary storage reservations for the root disk would remain in the database. Because of this, I could no longer deploy any VM for that account under these limits, even one with a single volume.

[Screenshot: failure on VM deployment due to insufficient volume resources]
MariaDB [cloud]> select * from resource_reservation;
+-----+------------+-----------+-----------------+----------+------+-------------+-----------------+---------------------+
| id  | account_id | domain_id | resource_type   | amount   | tag  | resource_id | mgmt_server_id  | created             |
+-----+------------+-----------+-----------------+----------+------+-------------+-----------------+---------------------+
| 104 |          4 |         2 | volume          |        1 | NULL |        NULL | 236056202620519 | 2026-04-20 16:50:51 |
| 105 |          4 |         2 | primary_storage | 52428800 | NULL |        NULL | 236056202620519 | 2026-04-20 16:50:51 |
+-----+------------+-----------+-----------------+----------+------+-------------+-----------------+---------------------+
2 rows in set (0.001 sec)
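The two rows above can be reproduced with a small model of the scenario. This is only a sketch, not CloudStack code: it assumes that, for each disk, one volume is reserved first and then the disk size in primary storage, root disk first (an order inferred from the rows, not confirmed by the source), and that sizes use binary units, which is consistent with 52428800 = 50 × 1024 × 1024. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the test scenario; not CloudStack code.
class StaleReservationScenario {
    static final long MIB = 1024L * 1024L;
    static final long GIB = 1024L * MIB;

    static final class Reservation {
        final String resourceType;
        final long amount;

        Reservation(String resourceType, long amount) {
            this.resourceType = resourceType;
            this.amount = amount;
        }
    }

    // Assumed order: per disk, reserve one volume, then the disk size in primary storage.
    // Stops at the first limit violation and returns what was reserved up to that point.
    static List<Reservation> reserveUntilFailure(long volumeLimit, long primaryStorageLimit, long[] diskSizes) {
        List<Reservation> reserved = new ArrayList<>();
        long volumesUsed = 0;
        long bytesUsed = 0;
        for (long size : diskSizes) {
            if (volumesUsed + 1 > volumeLimit) {
                break; // volume limit exceeded: the deployment fails here
            }
            volumesUsed++;
            reserved.add(new Reservation("volume", 1));
            if (bytesUsed + size > primaryStorageLimit) {
                break; // primary storage limit exceeded: the deployment fails here
            }
            bytesUsed += size;
            reserved.add(new Reservation("primary_storage", size));
        }
        return reserved;
    }

    public static void main(String[] args) {
        long[] disks = {50 * MIB, 5 * GIB}; // root disk, data disk
        for (Reservation r : reserveUntilFailure(1, 2 * GIB, disks)) {
            System.out.println(r.resourceType + " " + r.amount);
        }
    }
}
```

Under these assumptions the program prints `volume 1` and `primary_storage 52428800`: the root disk's reservations succeed, the data disk's volume reservation exceeds the limit of 1, and the two earlier reservations are what the bug left behind.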

Performing the same procedure after the changes did not result in stale reservations.

MariaDB [cloud]> select * from resource_reservation;
Empty set (0.001 sec)

Resource reservation on database upgrade

I upgraded an environment on 4.22.0 with stale volume and primary storage reservations to 4.22.1 and validated, after the upgrade finished, that there were no more stale entries.

@winterhazel
Member Author

@sureshanaparti can we include this one on 4.22.1?

@winterhazel
Member Author

@blueorangutan package

@blueorangutan

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov

codecov bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.68%. Comparing base (be89e6f) to head (afcbf2f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../src/main/java/com/cloud/vm/UserVmManagerImpl.java | 0.00% | 5 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               4.22   #13048      +/-   ##
============================================
- Coverage     17.68%   17.68%   -0.01%     
+ Complexity    15793    15792       -1     
============================================
  Files          5922     5922              
  Lines        533096   533094       -2     
  Branches      65209    65205       -4     
============================================
- Hits          94275    94270       -5     
- Misses       428181   428184       +3     
  Partials      10640    10640              
| Flag | Coverage Δ |
|---|---|
| uitests | 3.69% <ø> (ø) |
| unittests | 18.76% <0.00%> (-0.01%) ⬇️ |

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17554

Contributor

Copilot AI left a comment

Pull request overview

Fixes a resource leak in VM deployment where partially created storage resource reservations (volume / primary storage) could remain in resource_reservation after a deployment failure, preventing users from consuming their configured limits until the stale rows were cleaned up manually or automatically.

Changes:

  • Refactors UserVmManagerImpl.reserveStorageResourcesForVm to populate a caller-owned reservation list so already-created CheckedReservations are reliably closed in the caller’s finally block even when later reservations fail.
  • Adds a DB upgrade cleanup step to purge stale resource_reservation rows during 4.22.0 → 4.22.1 upgrade.
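The difference between returning the reservation list and populating a caller-owned one can be sketched as follows. This is a simplified illustration, not the PR's code: `FakeReservation` is a hypothetical stand-in for `CheckedReservation` (whose real constructor takes different arguments), and the method names, the count of three reservations, and the failure position are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for illustration only; not the actual UserVmManagerImpl code.
class CallerOwnedReservations {
    static int open = 0; // number of reservations currently held

    static class FakeReservation implements AutoCloseable {
        FakeReservation(boolean limitExceeded) {
            if (limitExceeded) {
                throw new IllegalStateException("resource limit exceeded");
            }
            open++;
        }

        @Override
        public void close() {
            open--;
        }
    }

    // Old shape: the list only reaches the caller if every reservation succeeds.
    static List<FakeReservation> reserveReturningList(int count, int failAt) {
        List<FakeReservation> created = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            created.add(new FakeReservation(i == failAt));
        }
        return created;
    }

    // New shape: each reservation lands in the caller's list as soon as it exists.
    static void reserveInto(List<FakeReservation> target, int count, int failAt) {
        for (int i = 0; i < count; i++) {
            target.add(new FakeReservation(i == failAt));
        }
    }

    // Simulates a deployment whose third reservation fails and
    // returns how many reservations are still open afterwards.
    static int deploy(boolean callerOwned) {
        open = 0;
        List<FakeReservation> reservations = new ArrayList<>();
        try {
            if (callerOwned) {
                reserveInto(reservations, 3, 2);
            } else {
                reservations = reserveReturningList(3, 2);
            }
        } catch (IllegalStateException e) {
            // deployment failed; cleanup happens in finally
        } finally {
            reservations.forEach(FakeReservation::close);
        }
        return open;
    }

    public static void main(String[] args) {
        System.out.println("returned list, still open: " + deploy(false));
        System.out.println("caller-owned list, still open: " + deploy(true));
    }
}
```

In the first variant the exception prevents the assignment, so the `finally` block iterates over an empty list and two reservations stay open; in the second, both are already in the caller's list and get closed.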

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| server/src/main/java/com/cloud/vm/UserVmManagerImpl.java | Ensures partially-created storage reservations aren’t lost on exceptions, enabling deterministic release on deployment failure. |
| engine/schema/src/main/resources/META-INF/db/schema-42200to42210-cleanup.sql | Normalizes affected environments by removing stale resource_reservation entries during upgrade. |

Comment on lines +22 to +26
-- Entries remaining on `cloud`.`resource_reservation` during the upgrade process are stale, so delete them.
-- This script was added to normalize volume/primary storage reservations that got stuck due to a bug on VM deployment,
-- but it is more interesting to introduce a smarter logic to clean these stale reservations in the future without the need
-- for upgrades (for instance, by having a heartbeat_time column for the reservations and automatically cleaning old entries).
DELETE FROM `cloud`.`resource_reservation`;

Copilot AI Apr 21, 2026

The upgrade cleanup deletes all rows from cloud.resource_reservation, even though the described bug affects stale volume/primary_storage reservations. This broad delete could also wipe unrelated in-flight reservations (e.g., public IP/template/etc.) if an operator mistakenly runs the upgrade while other management servers are still operating. Consider restricting the DELETE to the affected resource types (and optionally to rows older than a conservative threshold) to reduce blast radius while still fixing the reported issue.

Suggested change
- -- Entries remaining on `cloud`.`resource_reservation` during the upgrade process are stale, so delete them.
- -- This script was added to normalize volume/primary storage reservations that got stuck due to a bug on VM deployment,
- -- but it is more interesting to introduce a smarter logic to clean these stale reservations in the future without the need
- -- for upgrades (for instance, by having a heartbeat_time column for the reservations and automatically cleaning old entries).
- DELETE FROM `cloud`.`resource_reservation`;
+ -- Entries remaining on `cloud`.`resource_reservation` during the upgrade process for the affected resource types are stale, so delete them.
+ -- This script was added to normalize volume/primary storage reservations that got stuck due to a bug on VM deployment,
+ -- so scope the cleanup to those reservation types to avoid removing unrelated in-flight reservations.
+ -- It would still be more interesting to introduce a smarter logic to clean these stale reservations in the future without the need
+ -- for upgrades (for instance, by having a heartbeat_time column for the reservations and automatically cleaning old entries).
+ DELETE FROM `cloud`.`resource_reservation`
+ WHERE `resource_type` IN ('volume', 'primary_storage');

Comment on lines 4344 to +4347
  List<CheckedReservation> checkedReservations = new ArrayList<>();

  try {
-     checkedReservations = reserveStorageResourcesForVm(owner, diskOfferingId, diskSize, dataDiskInfoList, rootDiskOfferingId, offering, volumesSize);
+     reserveStorageResourcesForVm(checkedReservations, owner, diskOfferingId, diskSize, dataDiskInfoList, rootDiskOfferingId, offering, volumesSize);

Copilot AI Apr 21, 2026

This change fixes a subtle leak path where partially-created CheckedReservations could be lost if an exception is thrown mid-reservation. There are existing UserVmManagerImplTest tests that use MockedConstruction<CheckedReservation>; adding a focused unit test that forces CheckedReservation construction to throw after the first/second reservation and then asserts that previously-created reservations are still closed (i.e., close() invoked) would help prevent regressions.
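The shape of such a regression check can be sketched without Mockito. The example below is only an approximation of what the comment proposes: `TrackedReservation` is a hypothetical hand-rolled fake rather than a `MockedConstruction<CheckedReservation>`, `deploy` stands in for the code under test, and the fixed count of four reservations is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical test sketch with a hand-rolled fake; the real test would use Mockito and UserVmManagerImpl.
class ReservationCloseCheck {
    static final List<TrackedReservation> CREATED = new ArrayList<>();
    static int attempts = 0;

    static class TrackedReservation implements AutoCloseable {
        boolean closed = false;

        // Throws on the configured construction attempt, otherwise records itself.
        TrackedReservation(int failOnAttempt) {
            attempts++;
            if (attempts == failOnAttempt) {
                throw new IllegalStateException("resource limit exceeded");
            }
            CREATED.add(this);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Stand-in for the code under test: caller-owned list, closed in finally.
    static void deploy(int failOnAttempt) {
        List<TrackedReservation> reservations = new ArrayList<>();
        try {
            for (int i = 0; i < 4; i++) {
                reservations.add(new TrackedReservation(failOnAttempt));
            }
        } catch (IllegalStateException e) {
            // deployment failed; cleanup happens in finally
        } finally {
            reservations.forEach(TrackedReservation::close);
        }
    }

    // True if every reservation created before the failing attempt was closed.
    static boolean allClosedWhenFailingAt(int failOnAttempt) {
        CREATED.clear();
        attempts = 0;
        deploy(failOnAttempt);
        return CREATED.size() == failOnAttempt - 1
                && CREATED.stream().allMatch(r -> r.closed);
    }

    public static void main(String[] args) {
        for (int k = 1; k <= 4; k++) {
            System.out.println("fail on attempt " + k + ": all closed = " + allClosedWhenFailingAt(k));
        }
    }
}
```

Sweeping the failure position over every attempt covers the "after the first/second reservation" cases from the comment as well as the edge case where the very first reservation fails.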

Labels

None yet

Projects

Status: In Progress

4 participants