[fix](pipeline) guard CountedFinishDependency::sub() against counter underflow#62815
[fix](pipeline) guard CountedFinishDependency::sub() against counter underflow#62815BiteTheDDDDt wants to merge 1 commit intoapache:masterfrom
Conversation
…underflow
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: N/A
Problem Summary:
`CountedFinishDependency::sub()` decrements an unsigned `uint32_t _counter`
without checking that the counter is greater than zero. The contract is that
each `sub()` is balanced by a prior `add()`, but if the protocol is ever
violated (e.g. a duplicated `sync_filter_size` RPC from FE, double callback
on the BE side, or any future bug) the `0--` silently underflows to
`UINT32_MAX`. The condition `if (!_counter) set_ready();` is then never true
and the dependency stays blocked forever, hanging the entire query with no
log line and no crash, making the bug very hard to diagnose.
This is especially relevant because there are multiple callers of `sub()`:
- `SyncSizeCallback::call()` on RPC failure
- `SyncSizeCallback::call()` on RPC success but error response status
- `RuntimeFilterProducer::set_synced_size()` triggered by FE
Any imbalance between these and the single `add()` in
`RuntimeFilterProducer::set_dependency()` would produce the silent hang.
Add an explicit check at the top of `sub()` that throws an
`Exception(ErrorCode::INTERNAL_ERROR, ...)` carrying `debug_string()` when
`_counter == 0`. This matches the project convention of asserting
correctness and failing fast rather than letting the process limp along in
a corrupt state.
### Release note
Fix a potential silent query hang caused by an unbalanced
`CountedFinishDependency::sub()` call underflowing the counter. Such cases
now throw an explicit `INTERNAL_ERROR` instead.
### Check List (For Author)
- Test:
- [x] Unit Test: added `CountedFinishDependencyTest` in
`be/test/exec/runtime_filter/sync_size_callback_test.cpp` covering
sub() on a fresh counter, sub() after balanced add/sub, and the happy
path of multiple add() / sub() reaching ready. All pass.
- [x] Existing `SyncSizeCallbackTest.*` re-run, no regression.
- Behavior changed: No (the only observable change is converting an existing
silent failure into a loud error when the contract is violated)
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR hardens the pipeline runtime-filter dependency mechanism by preventing CountedFinishDependency::sub() from silently underflowing its unsigned counter (which can otherwise leave the dependency permanently blocked and hang a query).
Changes:
- Add an explicit underflow guard in
CountedFinishDependency::sub()that throws anINTERNAL_ERRORwhen_counter == 0. - Add unit tests covering the new underflow behavior and verifying the normal balanced add/sub readiness behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
be/src/exec/pipeline/dependency.h |
Adds an underflow check in CountedFinishDependency::sub() to avoid silent hangs. |
be/test/exec/runtime_filter/sync_size_callback_test.cpp |
Adds gtest cases validating underflow throws and balanced counter behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| void sub() { | ||
| std::unique_lock<std::mutex> l(_mtx); | ||
| // _counter is unsigned: a stray sub() when counter is already 0 would | ||
| // underflow to UINT32_MAX and the dependency would never become ready, | ||
| // hanging the query forever. Fail loudly instead. | ||
| if (_counter == 0) [[unlikely]] { | ||
| throw Exception(ErrorCode::INTERNAL_ERROR, | ||
| "CountedFinishDependency::sub() underflow on {}", debug_string()); | ||
| } |
There was a problem hiding this comment.
CountedFinishDependency::sub() now throws doris::Exception / uses ErrorCode::INTERNAL_ERROR in an inline method, but dependency.h doesn’t include the header that defines them. Please add the appropriate include (e.g. common/exception.h) here so this header compiles standalone and doesn’t rely on transitive includes.
|
/review |
|
OpenCode automated review failed and did not complete. Error: Review step was failure (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
Related PR: N/A
Problem Summary:
CountedFinishDependency::sub()decrements an unsigneduint32_t _counterwithout checking that the counter is greater than zero. The contract is that
each
sub()is balanced by a prioradd(), but if the protocol is everviolated (e.g. a duplicated
sync_filter_sizeRPC from FE, double callbackon the BE side, or any future bug) the
0--silently underflows toUINT32_MAX. The conditionif (!_counter) set_ready();is then never trueand the dependency stays blocked forever, hanging the entire query with no
log line and no crash, making the bug very hard to diagnose.
This is especially relevant because there are multiple callers of
sub():SyncSizeCallback::call()on RPC failureSyncSizeCallback::call()on RPC success but error response statusRuntimeFilterProducer::set_synced_size()triggered by FEAny imbalance between these and the single
add()inRuntimeFilterProducer::set_dependency()would produce the silent hang.Add an explicit check at the top of
sub()that throws anException(ErrorCode::INTERNAL_ERROR, ...)carryingdebug_string()when_counter == 0. This matches the project convention of assertingcorrectness and failing fast rather than letting the process limp along in
a corrupt state.
Release note
Fix a potential silent query hang caused by an unbalanced
CountedFinishDependency::sub()call underflowing the counter. Such casesnow throw an explicit
INTERNAL_ERRORinstead.Check List (For Author)
Test:
CountedFinishDependencyTestinbe/test/exec/runtime_filter/sync_size_callback_test.cppcoveringsub() on a fresh counter, sub() after balanced add/sub, and the happy
path of multiple add() / sub() reaching ready. All pass.
SyncSizeCallbackTest.*re-run, no regression.Behavior changed: No (the only observable change is converting an existing
silent failure into a loud error when the contract is violated)
Does this need documentation: No