[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire#62805
[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire#62805liaoxin01 wants to merge 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR fixes two bugs in the cloud warm-up cancellation/expiry flow that could leave BE with a stale warm-up job cache entry, causing repeated FE RPCs and heavy-work-pool stalls.
Changes:
- BE: Fixes the cancelled-status check in
CloudWarmUpManagerby usingErrorCode::CANCELLEDso cancelled jobs are correctly purged from_tablet_replica_cache. - FE: Fixes an NPE in
FrontendServiceImpl.getTabletReplicaInfos()when the warm-up job has already been removed, returningTStatusCode.CANCELLEDsafely. - Tests: Adds a FE unit test for the null-job path and a docker regression test covering cancel+expire behavior and performance/metrics signals.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
be/src/cloud/cloud_warm_up_manager.cpp |
Corrects CANCELLED detection so stale warm-up job cache entries get erased. |
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java |
Prevents NPE when warm-up job is missing/expired; returns CANCELLED. |
fe/fe-core/src/test/java/org/apache/doris/service/GetTabletReplicaInfosTest.java |
Unit test ensuring null warm-up job does not throw and returns CANCELLED. |
regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/cluster/test_warm_up_cluster_event_cancel_expired.groovy |
New docker regression validating cache purge + avoiding slow commits after FE job removal. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Two related bugs on the event-driven warm-up path caused heavy_work_pool to stall when a warm-up job was cancelled or expired while BE still held its id in _tablet_replica_cache. 1. BE side: CloudWarmUpManager::get_replica_info used `st.is<CANCELED>()`. `CANCELED` (one L) is not defined in ErrorCode; ADL resolved it to PCacheStatus::CANCELED = 9 from a proto enum, so the check was never true. When FE returned TStatus CANCELLED to tell BE a job was done, BE never pushed the job id into `cancelled_jobs`, leaving a zombie entry in _tablet_replica_cache that every subsequent commit_rowset re-queried. 2. FE side: FrontendServiceImpl.getTabletReplicaInfos, on the `job == null || job.isDone()` branch, still called `job.getJobId()` inside the log message. Once the job had been removed from cloudWarmUpJobs (past history_cloud_warm_up_job_keep_max_second), that call NPE'd. BE then took the thrift exception path and slept 2 seconds inside CloudWarmUpManager::_mtx, serialising every commit_rowset and backing up the heavy work pool. Fix BE to use `ErrorCode::CANCELLED` (matching the rest of the file), and log `request.getWarmUpJobId()` on the null-job branch. Tests: - GetTabletReplicaInfosTest: pure-Mockito unit test asserting that getTabletReplicaInfos does not NPE and returns CANCELLED when the job has been removed from CacheHotspotManager. - test_warm_up_cluster_event_cancel_expired.groovy: docker cloud regression that shortens history_cloud_warm_up_job_keep_max_second, cancels an event-driven LOAD warm-up job, waits for FE to remove it, and verifies post-removal inserts stay fast and BE reports skipped rowsets (indicating the cache entry was cleared).
877b351 to
c882b89
Compare
|
/review |
|
OpenCode automated review failed and did not complete. Error: Review step was failure (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
Proposed changes
Fix two related bugs on the event-driven warm-up path. Together they
stall BE's heavy work pool when a warm-up job is cancelled or expired
while BE still has its id in
_tablet_replica_cache.1. BE —
CANCELEDvsCANCELLEDtypobe/src/cloud/cloud_warm_up_manager.cppusedst.is<CANCELED>()(one L).
CANCELEDis not inErrorCode; ADL resolved it toPCacheStatus::CANCELED = 9from a proto enum, so the check comparedagainst 9 and was always false. When FE returned
TStatusCode.CANCELLED(value 1) to tell BE a job was done, BE never pushed the
job_idintocancelled_jobs, leaving a zombie entry in_tablet_replica_cachethat every subsequent
commit_rowsetre-queried.Fix: use
st.is<ErrorCode::CANCELLED>(), matching the samenamespace-qualified form used elsewhere in the file.
2. FE — NPE in
getTabletReplicaInfosafter job removalOnce a cancelled job was removed from
cloudWarmUpJobspasthistory_cloud_warm_up_job_keep_max_second,jobis null and the logcall NPE'd. With bug #1 keeping the stale id in BE cache, BE kept
hitting this RPC forever; each failure took the
apache::thrift::TExceptionbranch inthrift_rpc_helper.cpp, whichsleeps 2s inside
CloudWarmUpManager::_mtx. That serialisedbthread_fork_join(commit_rowset), blocked heavy-pool threads inCloudTabletsChannel::close, and backed up the heavy-pool queue —leading to load timeouts and query
Fragment RPC Phase1latency inthe 10s range.
Fix: log
request.getWarmUpJobId()instead; it is guaranteed set bythe enclosing
request.isSetWarmUpJobId()check.Tests
fe/fe-core/src/test/java/org/apache/doris/service/GetTabletReplicaInfosTest.java— pure-Mockito unit test. Forces
CacheHotspotManager.getCloudWarmUpJobto return
nulland assertsgetTabletReplicaInfosdoes not throwand returns
TStatusCode.CANCELLED.regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/cluster/test_warm_up_cluster_event_cancel_expired.groovy— docker cloud regression. Shortens
history_cloud_warm_up_job_keep_max_second=30, creates anevent-driven LOAD warm-up job, cancels it, waits for FE to remove
it, then verifies (a) a follow-up batch of inserts completes in
<30s (buggy path would need ≥40s from 2s per-commit sleeps) and
(b) BE's
file_cache_event_driven_warm_up_skipped_rowset_nummetric grows, indicating the cache entry was cleared.
Checklist