[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire by liaoxin01 · Pull Request #62805 · apache/doris

liaoxin01 · 2026-04-24T08:21:39Z

Proposed changes

Fix two related bugs on the event-driven warm-up path. Together they
stall BE's heavy work pool when a warm-up job is cancelled or expired
while BE still has its id in _tablet_replica_cache.

1. BE — `CANCELED` vs `CANCELLED` typo

be/src/cloud/cloud_warm_up_manager.cpp used st.is<CANCELED>()
(one L). CANCELED is not in ErrorCode; ADL resolved it to
PCacheStatus::CANCELED = 9 from a proto enum, so the check compared
against 9 and was always false. When FE returned TStatusCode.CANCELLED
(value 1) to tell BE a job was done, BE never pushed the job_id into
cancelled_jobs, leaving a zombie entry in _tablet_replica_cache
that every subsequent commit_rowset re-queried.

Fix: use st.is<ErrorCode::CANCELLED>(), matching the same
namespace-qualified form used elsewhere in the file.

2. FE — NPE in `getTabletReplicaInfos` after job removal

if (job == null || job.isDone()) {
    LOG.info("warmup job {} is not running, notify caller BE {} to cancel job",
            job.getJobId(), clientAddr);   // NPE when job == null
    ...
}

Once a cancelled job was removed from cloudWarmUpJobs past
history_cloud_warm_up_job_keep_max_second, job is null and the log
call NPE'd. With bug #1 keeping the stale id in BE cache, BE kept
hitting this RPC forever; each failure took the
apache::thrift::TException branch in thrift_rpc_helper.cpp, which
sleeps 2s inside CloudWarmUpManager::_mtx. That serialised
bthread_fork_join(commit_rowset), blocked heavy-pool threads in
CloudTabletsChannel::close, and backed up the heavy-pool queue —
leading to load timeouts and query Fragment RPC Phase1 latency in
the 10s range.

Fix: log request.getWarmUpJobId() instead; it is guaranteed set by
the enclosing request.isSetWarmUpJobId() check.

Tests

fe/fe-core/src/test/java/org/apache/doris/service/GetTabletReplicaInfosTest.java
— pure-Mockito unit test. Forces CacheHotspotManager.getCloudWarmUpJob
to return null and asserts getTabletReplicaInfos does not throw
and returns TStatusCode.CANCELLED.
regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/cluster/test_warm_up_cluster_event_cancel_expired.groovy
— docker cloud regression. Shortens
history_cloud_warm_up_job_keep_max_second=30, creates an
event-driven LOAD warm-up job, cancels it, waits for FE to remove
it, then verifies (a) a follow-up batch of inserts completes in
<30s (buggy path would need ≥40s from 2s per-commit sleeps) and
(b) BE's file_cache_event_driven_warm_up_skipped_rowset_num
metric grows, indicating the cache entry was cleared.

Checklist

Bug fix
Unit test
Regression test

Thearas · 2026-04-24T08:21:52Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

Copilot

Pull request overview

This PR fixes two bugs in the cloud warm-up cancellation/expiry flow that could leave BE with a stale warm-up job cache entry, causing repeated FE RPCs and heavy-work-pool stalls.

Changes:

BE: Fixes the cancelled-status check in CloudWarmUpManager by using ErrorCode::CANCELLED so cancelled jobs are correctly purged from _tablet_replica_cache.
FE: Fixes an NPE in FrontendServiceImpl.getTabletReplicaInfos() when the warm-up job has already been removed, returning TStatusCode.CANCELLED safely.
Tests: Adds a FE unit test for the null-job path and a docker regression test covering cancel+expire behavior and performance/metrics signals.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
`be/src/cloud/cloud_warm_up_manager.cpp`	Corrects CANCELLED detection so stale warm-up job cache entries get erased.
`fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java`	Prevents NPE when warm-up job is missing/expired; returns CANCELLED.
`fe/fe-core/src/test/java/org/apache/doris/service/GetTabletReplicaInfosTest.java`	Unit test ensuring null warm-up job does not throw and returns CANCELLED.
`regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/cluster/test_warm_up_cluster_event_cancel_expired.groovy`	New docker regression validating cache purge + avoiding slow commits after FE job removal.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two related bugs on the event-driven warm-up path caused heavy_work_pool to stall when a warm-up job was cancelled or expired while BE still held its id in _tablet_replica_cache. 1. BE side: CloudWarmUpManager::get_replica_info used `st.is<CANCELED>()`. `CANCELED` (one L) is not defined in ErrorCode; ADL resolved it to PCacheStatus::CANCELED = 9 from a proto enum, so the check was never true. When FE returned TStatus CANCELLED to tell BE a job was done, BE never pushed the job id into `cancelled_jobs`, leaving a zombie entry in _tablet_replica_cache that every subsequent commit_rowset re-queried. 2. FE side: FrontendServiceImpl.getTabletReplicaInfos, on the `job == null || job.isDone()` branch, still called `job.getJobId()` inside the log message. Once the job had been removed from cloudWarmUpJobs (past history_cloud_warm_up_job_keep_max_second), that call NPE'd. BE then took the thrift exception path and slept 2 seconds inside CloudWarmUpManager::_mtx, serialising every commit_rowset and backing up the heavy work pool. Fix BE to use `ErrorCode::CANCELLED` (matching the rest of the file), and log `request.getWarmUpJobId()` on the null-job branch. Tests: - GetTabletReplicaInfosTest: pure-Mockito unit test asserting that getTabletReplicaInfos does not NPE and returns CANCELLED when the job has been removed from CacheHotspotManager. - test_warm_up_cluster_event_cancel_expired.groovy: docker cloud regression that shortens history_cloud_warm_up_job_keep_max_second, cancels an event-driven LOAD warm-up job, waits for FE to remove it, and verifies post-removal inserts stay fast and BE reports skipped rowsets (indicating the cache entry was cleared).

liaoxin01 · 2026-04-24T12:47:34Z

/review

github-actions · 2026-04-24T14:47:42Z

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24890284152

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

Copilot AI review requested due to automatic review settings April 24, 2026 08:21

liaoxin01 requested review from dataroaring and gavinchou as code owners April 24, 2026 08:21

Copilot started reviewing on behalf of liaoxin01 April 24, 2026 08:22 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Comment thread ...loud_p0/cache/multi_cluster/warm_up/cluster/test_warm_up_cluster_event_cancel_expired.groovy

Comment thread ...loud_p0/cache/multi_cluster/warm_up/cluster/test_warm_up_cluster_event_cancel_expired.groovy

liaoxin01 force-pushed the fix/warmup-job-null-and-cancelled-typo branch from 877b351 to c882b89 Compare April 24, 2026 12:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire#62805

[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire#62805
liaoxin01 wants to merge 1 commit intoapache:masterfrom
liaoxin01:fix/warmup-job-null-and-cancelled-typo

liaoxin01 commented Apr 24, 2026 •

edited

Loading

Uh oh!

Thearas commented Apr 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

liaoxin01 commented Apr 24, 2026

Uh oh!

github-actions Bot commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

liaoxin01 commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed changes

1. BE — CANCELED vs CANCELLED typo

2. FE — NPE in getTabletReplicaInfos after job removal

Tests

Checklist

Uh oh!

Thearas commented Apr 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

liaoxin01 commented Apr 24, 2026

Uh oh!

github-actions Bot commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

liaoxin01 commented Apr 24, 2026 •

edited

Loading

1. BE — `CANCELED` vs `CANCELLED` typo

2. FE — NPE in `getTabletReplicaInfos` after job removal