Release reserved storage resources on VM deployment failure#13048
winterhazel wants to merge 1 commit into apache:4.22
Conversation
@sureshanaparti can we include this one on 4.22.1?

@blueorangutan package

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## 4.22 #13048 +/- ##
============================================
- Coverage 17.68% 17.68% -0.01%
+ Complexity 15793 15792 -1
============================================
Files 5922 5922
Lines 533096 533094 -2
Branches 65209 65205 -4
============================================
- Hits 94275 94270 -5
- Misses 428181 428184 +3
Partials 10640 10640
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17554
Pull request overview
Fixes a resource leak in VM deployment where partially-created storage resource reservations (volume / primary storage) could remain in `resource_reservation` after a deployment failure, preventing users from fully consuming their configured limits until manual or automatic cleanup.
Changes:
- Refactors `UserVmManagerImpl.reserveStorageResourcesForVm` to populate a caller-owned reservation list so already-created `CheckedReservation`s are reliably closed in the caller's `finally` block even when later reservations fail.
- Adds a DB upgrade cleanup step to purge stale `resource_reservation` rows during the 4.22.0 → 4.22.1 upgrade.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| server/src/main/java/com/cloud/vm/UserVmManagerImpl.java | Ensures partially-created storage reservations aren’t lost on exceptions, enabling deterministic release on deployment failure. |
| engine/schema/src/main/resources/META-INF/db/schema-42200to42210-cleanup.sql | Normalizes affected environments by removing stale resource_reservation entries during upgrade. |
-- Entries remaining on `cloud`.`resource_reservation` during the upgrade process are stale, so delete them.
-- This script was added to normalize volume/primary storage reservations that got stuck due to a bug on VM deployment,
-- but it is more interesting to introduce a smarter logic to clean these stale reservations in the future without the need
-- for upgrades (for instance, by having a heartbeat_time column for the reservations and automatically cleaning old entries).
DELETE FROM `cloud`.`resource_reservation`;
The upgrade cleanup deletes all rows from cloud.resource_reservation, even though the described bug affects stale volume/primary_storage reservations. This broad delete could also wipe unrelated in-flight reservations (e.g., public IP/template/etc.) if an operator mistakenly runs the upgrade while other management servers are still operating. Consider restricting the DELETE to the affected resource types (and optionally to rows older than a conservative threshold) to reduce blast radius while still fixing the reported issue.
Suggested change:

-- Entries remaining on `cloud`.`resource_reservation` during the upgrade process for the affected resource types are stale, so delete them.
-- This script was added to normalize volume/primary storage reservations that got stuck due to a bug on VM deployment,
-- so scope the cleanup to those reservation types to avoid removing unrelated in-flight reservations.
-- It would still be more interesting to introduce a smarter logic to clean these stale reservations in the future without the need
-- for upgrades (for instance, by having a heartbeat_time column for the reservations and automatically cleaning old entries).
DELETE FROM `cloud`.`resource_reservation`
WHERE `resource_type` IN ('volume', 'primary_storage');
List<CheckedReservation> checkedReservations = new ArrayList<>();

try {
-    checkedReservations = reserveStorageResourcesForVm(owner, diskOfferingId, diskSize, dataDiskInfoList, rootDiskOfferingId, offering, volumesSize);
+    reserveStorageResourcesForVm(checkedReservations, owner, diskOfferingId, diskSize, dataDiskInfoList, rootDiskOfferingId, offering, volumesSize);
This change fixes a subtle leak path where partially-created CheckedReservations could be lost if an exception is thrown mid-reservation. There are existing UserVmManagerImplTest tests that use MockedConstruction<CheckedReservation>; adding a focused unit test that forces CheckedReservation construction to throw after the first/second reservation and then asserts that previously-created reservations are still closed (i.e., close() invoked) would help prevent regressions.
Description
PR #10140 changed how volume and primary storage resources are reserved in the deployment process. However, the new method introduced an issue: if reserving part of the storage resources fails (e.g., the volume resource for the root disk is reserved, but primary storage for it cannot be), the reservations that were previously created are never released. As a result, users are not able to fully utilize their configured limits.
This PR fixes this issue and, additionally, adds a query to the upgrade script to clean up the stale entries.
It would be better to introduce smarter logic to clean these stale reservations in the future without the need for upgrades (for instance, by having a heartbeat_time column for the reservations and automatically cleaning entries older than a threshold); however, as we are very close to the 4.22.1 release, there is not enough time to implement and test a more complex mechanism, so I opted instead to include a simple script to normalize environments that are affected.
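The fix described above can be sketched with a minimal, self-contained example. The class and method names below are illustrative stand-ins (not CloudStack's actual `CheckedReservation` or `reserveStorageResourcesForVm` signatures): the key idea is that the caller owns the list, so reservations created before a mid-sequence failure remain visible to the caller's `finally` block and can still be closed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the caller-owned reservation list pattern.
// "Reservation" is a hypothetical stand-in for CloudStack's CheckedReservation.
public class ReservationSketch {
    static List<String> closed = new ArrayList<>();

    static class Reservation implements AutoCloseable {
        final String name;
        Reservation(String name) { this.name = name; }
        @Override public void close() { closed.add(name); }
    }

    // Populates the caller-owned list instead of returning a new one.
    // Throws partway through to simulate a failed later reservation.
    static void reserveStorageResources(List<Reservation> out, boolean failOnSecond) {
        out.add(new Reservation("volume"));       // first reservation succeeds
        if (failOnSecond) {
            throw new RuntimeException("primary storage limit exceeded");
        }
        out.add(new Reservation("primary_storage"));
    }

    public static void main(String[] args) {
        List<Reservation> reservations = new ArrayList<>();
        try {
            reserveStorageResources(reservations, true);
        } catch (RuntimeException e) {
            // deployment failed mid-reservation
        } finally {
            // The caller still sees the partially-created reservations and releases them.
            for (Reservation r : reservations) {
                r.close();
            }
        }
        System.out.println(closed); // prints "[volume]"
    }
}
```

With the pre-fix shape (the callee builds and returns its own list), an exception thrown mid-sequence discards the list before the caller ever receives it, so the earlier reservations can never be closed; passing the list in from the caller removes that window.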
Types of changes
Feature/Enhancement Scale or Bug Severity
Bug Severity
How Has This Been Tested?
Storage resource reservation release on VM deployment failure
I configured volume and primary storage limits for an account to 1 and 2 GB, respectively. Then, I attempted to deploy a VM with a 50 MB root disk and a 5 GB data disk. This process ended in failure, as there were not enough volume resources available.
Before the changes, stale volume and primary storage reservations for the root disk would remain in the database. Because of this, I was not able to deploy any more VMs for that account under these limits, even though it had only a single volume.
Failure on VM deployment due to insufficient volume resources
Performing the same procedure after the changes did not result in stale reservations.
Resource reservation on database upgrade
I upgraded an environment on 4.22.0 with stale volume and primary storage reservations to 4.22.1 and validated, after the upgrade finished, that there were no more stale entries.