Hot-reload procedure completed-cleaner & disk space warning threshold (+ fix evict-TTL unit) #18091

Open
CRZbulabula wants to merge 3 commits into master from confignode-ops-config-hidden-hotreload
@CRZbulabula CRZbulabula commented Jul 2, 2026

Contributor

Motivation

While tidying the ConfigNode cluster-operations tunables, several parameters were documented as restart-only despite being safely applicable at runtime. This PR:

  1. Makes hot-reloadable (effectiveMode: restart -> hot_reload, with the runtime plumbing to actually apply the new value):
    • procedure_completed_evict_ttl
    • procedure_completed_clean_interval
    • disk_space_warning_threshold
  2. Fixes a pre-existing unit bug in the completed-procedure cleaner (see below).

Changes

Procedure completed-cleaner hot reload

procedure_completed_evict_ttl and procedure_completed_clean_interval are captured by CompletedProcedureRecycler at construction, so applying new values requires rescheduling it:

  • ProcedureExecutor keeps a reference to the recycler and adds restartCompletedCleaner() (remove old + schedule fresh). All accesses to the reference field occur under the executor's monitor (startCompletedCleaner / restartCompletedCleaner are synchronized), so the monitor alone provides both mutual exclusion and visibility for the field.
  • ProcedureManager.updateCompletedProcedureCleaner() re-applies on the running (leader) executor; on followers the call is a no-op, and they pick up the value on the next leader switch.
  • ConfigManager.setConfiguration captures the previous values and re-applies via handleProcedureCleanerHotReload, mirroring the existing handleHeartbeatIntervalHotReload / handleTopologyProbingHotReload pattern. Values are validated (> 0), consistent with existing hot-reload loaders.
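
The reschedule step described above can be sketched as follows. This is a minimal illustration, not the IoTDB implementation: the class CleanerRestartSketch, its nested Recycler, and the use of a ScheduledExecutorService are assumptions made for the example; only the method names startCompletedCleaner / restartCompletedCleaner and the "> 0" validation come from the description.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical executor; names echo the PR text but this is not the IoTDB class. */
final class CleanerRestartSketch {

  /** Task that captures its settings at construction, like the recycler described above. */
  static final class Recycler implements Runnable {
    final long evictTtlSeconds;
    final long cleanIntervalSeconds;

    Recycler(long evictTtlSeconds, long cleanIntervalSeconds) {
      // Assumed validation rule: both values must be strictly positive.
      if (evictTtlSeconds <= 0 || cleanIntervalSeconds <= 0) {
        throw new IllegalArgumentException("TTL and interval must be > 0");
      }
      this.evictTtlSeconds = evictTtlSeconds;
      this.cleanIntervalSeconds = cleanIntervalSeconds;
    }

    @Override
    public void run() {
      // An eviction pass over completed procedures would go here.
    }
  }

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // Both fields are guarded by this instance's monitor: every access is synchronized,
  // so no volatile modifier is needed for visibility.
  private Recycler recycler;
  private ScheduledFuture<?> handle;

  synchronized void startCompletedCleaner(long evictTtlSeconds, long cleanIntervalSeconds) {
    schedule(new Recycler(evictTtlSeconds, cleanIntervalSeconds));
  }

  synchronized void restartCompletedCleaner(long evictTtlSeconds, long cleanIntervalSeconds) {
    // Build (and thereby validate) the replacement before touching the running task.
    Recycler fresh = new Recycler(evictTtlSeconds, cleanIntervalSeconds);
    if (handle != null) {
      handle.cancel(false); // let an in-flight pass finish, but stop future runs
    }
    schedule(fresh);
  }

  // Called only from synchronized methods, so the monitor is already held.
  private void schedule(Recycler task) {
    recycler = task;
    handle =
        scheduler.scheduleWithFixedDelay(
            task, task.cleanIntervalSeconds, task.cleanIntervalSeconds, TimeUnit.SECONDS);
  }

  synchronized Recycler getRecycler() {
    return recycler;
  }

  void shutdown() {
    scheduler.shutdownNow();
  }
}
```

Constructing the replacement before cancelling the old task means an invalid value leaves the running cleaner untouched, which is one way to get the "reject without mutating" behaviour; whether the real code orders the steps this way is not stated in the description.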

disk_space_warning_threshold hot reload

  • New shared helper CommonDescriptor.loadHotModifiedDiskSpaceWarningThreshold() parses, validates ([0, 1)) and applies the value.
  • Used by the ConfigNode path so SHOW VARIABLES / cluster-parameter consistency checks reflect the change, and by the DataNode path, which additionally refreshes the JVMCommonUtils static copy that the ReadOnly disk guard actually consumes.
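
A parse, validate, apply helper of the kind described above might look like the sketch below. The class DiskThresholdSketch and its method loadHotModified are hypothetical stand-ins; in particular, signalling an invalid value by throwing IllegalArgumentException is an assumption, since the description only says that out-of-range values are rejected without mutating the config.

```java
import java.util.Properties;

/** Hypothetical holder for the threshold; the real value lives in IoTDB's common config. */
final class DiskThresholdSketch {
  private static final String KEY = "disk_space_warning_threshold";

  private double diskSpaceWarningThreshold;

  DiskThresholdSketch(double initial) {
    this.diskSpaceWarningThreshold = initial;
  }

  double get() {
    return diskSpaceWarningThreshold;
  }

  /**
   * Applies the new value if the key is present and valid. An absent key keeps the current
   * value; an invalid value throws before any state is changed.
   */
  void loadHotModified(Properties props) {
    String raw = props.getProperty(KEY);
    if (raw == null) {
      return; // key absent: keep the current value
    }
    // Double.parseDouble throws NumberFormatException (an IllegalArgumentException) on garbage.
    double parsed = Double.parseDouble(raw.trim());
    // Accept only the half-open range [0, 1); NaN fails every comparison, so check it explicitly.
    if (Double.isNaN(parsed) || parsed < 0.0 || parsed >= 1.0) {
      throw new IllegalArgumentException(KEY + " must be in [0, 1), got " + raw);
    }
    diskSpaceWarningThreshold = parsed; // mutate only after validation succeeded
  }
}
```

On the DataNode path, per the description, a caller would additionally have to refresh the static copy that the ReadOnly disk guard reads; that step is omitted here because its API is not shown in the PR text.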

Bug fix: completed-procedure evict TTL unit

CompletedProcedureRecycler converted the clean interval from seconds to milliseconds but stored the evict TTL as raw seconds, while CompletedProcedureContainer.isExpired compares it against a System.currentTimeMillis() delta. Completed procedures were therefore evicted after ~evictTTL milliseconds instead of seconds (≈60 ms for the default 60 s), making completed results effectively un-queryable. The TTL is now converted to milliseconds at construction (evictTTLInMs).
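
To make the unit mismatch concrete, here is a small self-contained sketch. EvictTtlSketch is a hypothetical class, and the exact comparison operator used by the real isExpired is an assumption; only the field name evictTTLInMs and the seconds-versus-milliseconds mix-up come from the description.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the recycler's TTL handling; not the IoTDB class. */
final class EvictTtlSketch {
  private final long evictTTLInMs;

  /** The TTL is configured in seconds and converted exactly once, at construction. */
  EvictTtlSketch(long evictTtlSeconds) {
    this.evictTTLInMs = TimeUnit.SECONDS.toMillis(evictTtlSeconds);
  }

  /** Fixed check: the age and the TTL are both in milliseconds. */
  boolean isExpired(long completedAtMs, long nowMs) {
    return nowMs - completedAtMs >= evictTTLInMs;
  }

  /** Pre-fix behaviour for comparison: raw seconds compared against a millisecond delta. */
  static boolean isExpiredBuggy(long evictTtlSeconds, long completedAtMs, long nowMs) {
    return nowMs - completedAtMs >= evictTtlSeconds;
  }
}
```

With a 60 s TTL, a procedure that completed 100 ms ago is kept by isExpired but already counts as expired under isExpiredBuggy, which matches the ≈60 ms eviction described above.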

effectiveMode / template

iotdb-system.properties.template updated accordingly (the three keys above flipped to hot_reload).

Testing

  • mvn compile passes for node-commons, confignode, datanode; spotless:check clean.
  • Unit tests (all green):
    • CompletedProcedureRecyclerTest — evict TTL is interpreted in seconds; a freshly completed procedure survives a large TTL while a stale one is evicted (regression guard for the seconds/millis fix).
    • TestProcedureExecutor#testRestartCompletedCleanerAppliesNewEvictTtl — hot reload swaps the recycler to one carrying the new TTL.
    • CommonDescriptorDiskSpaceWarningThresholdTest — disk threshold hot reload applies a valid value, keeps the current value when absent, and rejects out-of-range values without mutating the config.

…eshold

Config template (iotdb-system.properties.template):
- Hide (remove from template, still parsed with config-class defaults):
  enable_auto_leader_balance_for_ratis_consensus, enable_topology_probing,
  topology_probing_base_interval_in_ms, topology_probing_timeout_ratio.
- Change effectiveMode restart -> hot_reload for procedure_completed_evict_ttl,
  procedure_completed_clean_interval and disk_space_warning_threshold.

procedure_completed_evict_ttl / procedure_completed_clean_interval hot reload:
ProcedureExecutor keeps a reference to the CompletedProcedureRecycler and adds
restartCompletedCleaner() to re-schedule it with the new interval/TTL (both are
captured at construction). ProcedureManager.updateCompletedProcedureCleaner()
re-applies on the running (leader) executor; ConfigManager.setConfiguration
captures the previous values and re-applies via handleProcedureCleanerHotReload.
The recycler reference is volatile and both writers are synchronized.

disk_space_warning_threshold hot reload:
CommonDescriptor.loadHotModifiedDiskSpaceWarningThreshold() parses, validates
([0,1)) and applies the value; it is shared by the ConfigNode path (so
SHOW VARIABLES / cluster-parameter consistency reflect the change) and the
DataNode path, which additionally refreshes the JVMCommonUtils static copy that
the ReadOnly disk guard consumes.

codecov Bot commented Jul 2, 2026


Codecov Report

❌ Patch coverage is 39.28571% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.66%. Comparing base (383458f) to head (7ec2ac4).
⚠️ Report is 6 commits behind head on master.

Files with missing lines                                 Patch %   Lines
...he/iotdb/confignode/conf/ConfigNodeDescriptor.java    0.00%     21 Missing ⚠️
...apache/iotdb/confignode/manager/ConfigManager.java    0.00%     8 Missing ⚠️
...che/iotdb/confignode/manager/ProcedureManager.java    0.00%     5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18091      +/-   ##
============================================
+ Coverage     41.57%   41.66%   +0.09%     
  Complexity      318      318              
============================================
  Files          5294     5296       +2     
  Lines        371424   371732     +308     
  Branches      48061    48094      +33     
============================================
+ Hits         154410   154881     +471     
+ Misses       217014   216851     -163     


CompletedProcedureRecycler converted the clean interval from seconds to
milliseconds but stored the evict TTL as raw seconds, while
CompletedProcedureContainer.isExpired compares it against a
System.currentTimeMillis() delta. As a result completed procedures were
evicted after ~evictTTL milliseconds instead of seconds (e.g. ~60 ms for the
default 60 s), making the results effectively un-queryable. Convert the evict
TTL to milliseconds at construction (renamed to evictTTLInMs for clarity).

Tests:
- CompletedProcedureRecyclerTest: the evict TTL is interpreted in seconds; a
  freshly completed procedure survives a large TTL while a stale one is evicted
  (regression guard for the seconds/millis bug).
- TestProcedureExecutor.testRestartCompletedCleanerAppliesNewEvictTtl: hot
  reload replaces the recycler with one carrying the new evict TTL.
- CommonDescriptorDiskSpaceWarningThresholdTest: disk_space_warning_threshold
  hot reload applies a valid value, keeps the current value when absent, and
  rejects out-of-range values without mutating the config.
…leaner ref

The topology-probing tunables (enable_topology_probing,
topology_probing_base_interval_in_ms, topology_probing_timeout_ratio) and
enable_auto_leader_balance_for_ratis_consensus were removed from the config
template, but PR #17933 (PingCode V2-995) deliberately exposed the three
topology keys so 'set configuration' can toggle them at runtime, guarded by
IoTDBSetConfigurationIT.testSetTopologyProbingConfiguration. Removing them
reverted that fix and broke the test with '301: ignored config items ...
immutable or undefined'. Restore all four keys in the template; this PR now
only flips procedure_completed_evict_ttl / procedure_completed_clean_interval /
disk_space_warning_threshold to hot_reload plus the hot-reload plumbing.

Also address SonarCloud java:S3077: the completedProcedureRecycler field was
'volatile', but Sonar flags volatile on a mutable object reference as
insufficient. All accesses already occur under this instance's monitor
(startCompletedCleaner / restartCompletedCleaner are synchronized), so drop
volatile and synchronize the @TestOnly getter, guarding the field purely by
the monitor.
@CRZbulabula CRZbulabula changed the title from "Hide cluster-ops tunables and hot-reload procedure cleaner & disk threshold" to "Hot-reload procedure completed-cleaner & disk space warning threshold (+ fix evict-TTL unit)" Jul 2, 2026

sonarqubecloud Bot commented Jul 2, 2026
