Skip to content

[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire#62805

Open
liaoxin01 wants to merge 1 commit intoapache:masterfrom
liaoxin01:fix/warmup-job-null-and-cancelled-typo
Open

[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire#62805
liaoxin01 wants to merge 1 commit intoapache:masterfrom
liaoxin01:fix/warmup-job-null-and-cancelled-typo

Conversation

@liaoxin01
Copy link
Copy Markdown
Contributor

@liaoxin01 liaoxin01 commented Apr 24, 2026

Proposed changes

Fix two related bugs on the event-driven warm-up path. Together they
stall BE's heavy work pool when a warm-up job is cancelled or expired
while BE still has its id in _tablet_replica_cache.

1. BE — CANCELED vs CANCELLED typo

be/src/cloud/cloud_warm_up_manager.cpp used st.is<CANCELED>()
(one L). CANCELED is not in ErrorCode; ADL resolved it to
PCacheStatus::CANCELED = 9 from a proto enum, so the check compared
against 9 and was always false. When FE returned TStatusCode.CANCELLED
(value 1) to tell BE a job was done, BE never pushed the job_id into
cancelled_jobs, leaving a zombie entry in _tablet_replica_cache
that every subsequent commit_rowset re-queried.

Fix: use st.is<ErrorCode::CANCELLED>(), matching the same
namespace-qualified form used elsewhere in the file.

2. FE — NPE in getTabletReplicaInfos after job removal

if (job == null || job.isDone()) {
    LOG.info("warmup job {} is not running, notify caller BE {} to cancel job",
            job.getJobId(), clientAddr);   // NPE when job == null
    ...
}

Once a cancelled job was removed from cloudWarmUpJobs past
history_cloud_warm_up_job_keep_max_second, job is null and the log
call NPE'd. With bug #1 keeping the stale id in BE cache, BE kept
hitting this RPC forever; each failure took the
apache::thrift::TException branch in thrift_rpc_helper.cpp, which
sleeps 2s inside CloudWarmUpManager::_mtx. That serialised
bthread_fork_join(commit_rowset), blocked heavy-pool threads in
CloudTabletsChannel::close, and backed up the heavy-pool queue —
leading to load timeouts and query Fragment RPC Phase1 latency in
the 10s range.

Fix: log request.getWarmUpJobId() instead; it is guaranteed set by
the enclosing request.isSetWarmUpJobId() check.

Tests

  • fe/fe-core/src/test/java/org/apache/doris/service/GetTabletReplicaInfosTest.java
    — pure-Mockito unit test. Forces CacheHotspotManager.getCloudWarmUpJob
    to return null and asserts getTabletReplicaInfos does not throw
    and returns TStatusCode.CANCELLED.

  • regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/cluster/test_warm_up_cluster_event_cancel_expired.groovy
    — docker cloud regression. Shortens
    history_cloud_warm_up_job_keep_max_second=30, creates an
    event-driven LOAD warm-up job, cancels it, waits for FE to remove
    it, then verifies (a) a follow-up batch of inserts completes in
    <30s (buggy path would need ≥40s from 2s per-commit sleeps) and
    (b) BE's file_cache_event_driven_warm_up_skipped_rowset_num
    metric grows, indicating the cache entry was cleared.

Checklist

  • Bug fix
  • Unit test
  • Regression test

Copilot AI review requested due to automatic review settings April 24, 2026 08:21
@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 24, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes two bugs in the cloud warm-up cancellation/expiry flow that could leave BE with a stale warm-up job cache entry, causing repeated FE RPCs and heavy-work-pool stalls.

Changes:

  • BE: Fixes the cancelled-status check in CloudWarmUpManager by using ErrorCode::CANCELLED so cancelled jobs are correctly purged from _tablet_replica_cache.
  • FE: Fixes an NPE in FrontendServiceImpl.getTabletReplicaInfos() when the warm-up job has already been removed, returning TStatusCode.CANCELLED safely.
  • Tests: Adds a FE unit test for the null-job path and a docker regression test covering cancel+expire behavior and performance/metrics signals.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
be/src/cloud/cloud_warm_up_manager.cpp Corrects CANCELLED detection so stale warm-up job cache entries get erased.
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java Prevents NPE when warm-up job is missing/expired; returns CANCELLED.
fe/fe-core/src/test/java/org/apache/doris/service/GetTabletReplicaInfosTest.java Unit test ensuring null warm-up job does not throw and returns CANCELLED.
regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/cluster/test_warm_up_cluster_event_cancel_expired.groovy New docker regression validating cache purge + avoiding slow commits after FE job removal.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two related bugs on the event-driven warm-up path caused heavy_work_pool
to stall when a warm-up job was cancelled or expired while BE still held
its id in _tablet_replica_cache.

1. BE side: CloudWarmUpManager::get_replica_info used
   `st.is<CANCELED>()`. `CANCELED` (one L) is not defined in ErrorCode;
   ADL resolved it to PCacheStatus::CANCELED = 9 from a proto enum,
   so the check was never true. When FE returned TStatus CANCELLED to
   tell BE a job was done, BE never pushed the job id into
   `cancelled_jobs`, leaving a zombie entry in _tablet_replica_cache
   that every subsequent commit_rowset re-queried.

2. FE side: FrontendServiceImpl.getTabletReplicaInfos, on the
   `job == null || job.isDone()` branch, still called `job.getJobId()`
   inside the log message. Once the job had been removed from
   cloudWarmUpJobs (past history_cloud_warm_up_job_keep_max_second),
   that call NPE'd. BE then took the thrift exception path and slept
   2 seconds inside CloudWarmUpManager::_mtx, serialising every
   commit_rowset and backing up the heavy work pool.

Fix BE to use `ErrorCode::CANCELLED` (matching the rest of the file),
and log `request.getWarmUpJobId()` on the null-job branch.

Tests:
- GetTabletReplicaInfosTest: pure-Mockito unit test asserting that
  getTabletReplicaInfos does not NPE and returns CANCELLED when the
  job has been removed from CacheHotspotManager.
- test_warm_up_cluster_event_cancel_expired.groovy: docker cloud
  regression that shortens history_cloud_warm_up_job_keep_max_second,
  cancels an event-driven LOAD warm-up job, waits for FE to remove
  it, and verifies post-removal inserts stay fast and BE reports
  skipped rowsets (indicating the cache entry was cleared).
@liaoxin01 liaoxin01 force-pushed the fix/warmup-job-null-and-cancelled-typo branch from 877b351 to c882b89 Compare April 24, 2026 12:25
@liaoxin01
Copy link
Copy Markdown
Contributor Author

/review

@github-actions
Copy link
Copy Markdown
Contributor

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24890284152

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants