Skip to content

Support TBG inside Pipeline#783

Open
caiomcbr wants to merge 6 commits into
mainfrom
caiorocha/support_tbg_pipeline
Open

Support TBG inside Pipeline#783
caiomcbr wants to merge 6 commits into
mainfrom
caiorocha/support_tbg_pipeline

Conversation

@caiomcbr

@caiomcbr caiomcbr commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

This PR adds support for linked thread block groups in pipelines, allowing multiple thread blocks to cooperate on processing each pipeline slice and increasing the available parallelism within pipeline iterations.

image

As shown in the image, each pipeline iteration processes a different slice of data, with each thread block within the threadblock group managing a portion of that slice. The final slice in the pipeline may differ in size from the others, and in such cases, the thread block groups will adjust accordingly to handle it.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for thread-block-group (TBG) slicing within PIPELINE execution by carrying TBG metadata (tbId, tbgSize) in each Operation and applying per-slice offsets/sizes inside the device execution handlers.

Changes:

  • Extend Operation with tbId / tbgSize and populate them from tbg_info in the JSON execution plan.
  • Add device-side calcOffset/calcSize and apply TBG slicing across multiple kernel handlers (GET/PUT/COPY/REDUCE/packet ops).
  • Remove host-side per-TBG pre-slicing in ExecutionPlan::Impl::setupOperation (offsets/sizes remain “full”, slicing moves to device).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
src/core/include/execution_kernel.hpp Implements device-side TBG slicing and applies it in multiple operation handlers, including within PIPELINE iterations.
src/core/include/execution_common.hpp Adds tbId / tbgSize fields to Operation so the device kernel can apply TBG slicing.
src/core/executor/execution_plan.cc Parses tbg_info from JSON and stores it into Operation instead of pre-slicing offsets/sizes on the host.

Comment on lines +69 to +80
MSCCLPP_DEVICE_INLINE uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
constexpr uint32_t alignment = 16;
uint32_t nelems = size / alignment;
uint32_t minNelems = nelems / slices;
uint32_t remainder = nelems % slices;
uint32_t off = index * minNelems + (index < remainder ? index : remainder);
return off * alignment;
}

MSCCLPP_DEVICE_INLINE uint32_t calcSize(uint32_t size, uint32_t index, uint32_t slices) {
return calcOffset(size, index + 1, slices) - calcOffset(size, index, slices);
}

Copilot AI Apr 11, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calcOffset/calcSize compute nelems = size / 16, which drops any remaining bytes (<16). In pipeline mode this can silently skip data whenever unitSize (or the remaining bytes in the last iteration) isn’t a multiple of 16, and can even make calcSize() return 0 for size < 16.

Consider either (a) enforcing/validating that unitSize and all buffer sizes processed here are multiples of 16, or (b) updating the slicing logic to distribute the tail bytes so the full size is covered across slices.

Copilot uses AI. Check for mistakes.
Comment thread src/core/include/execution_kernel.hpp
Comment thread src/core/include/execution_kernel.hpp
Comment thread src/core/include/execution_kernel.hpp
Comment on lines +69 to +75
MSCCLPP_DEVICE_INLINE uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
constexpr uint32_t alignment = 16;
uint32_t nelems = size / alignment;
uint32_t minNelems = nelems / slices;
uint32_t remainder = nelems % slices;
uint32_t off = index * minNelems + (index < remainder ? index : remainder);
return off * alignment;

Copilot AI Apr 11, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calcOffset hardcodes alignment = 16, but ExecutionPlan supports a configurable buffer_alignment (default 16). If a plan sets buffer_alignment != 16, host-side offsets/sizes are computed on that alignment while device-side TBG slicing is computed on 16B units, which can produce inconsistent slicing and misaligned accesses.

Consider plumbing the plan’s bufferAlignment to device code (e.g., store it in DeviceExecutionPlan/Operation) or explicitly enforcing buffer_alignment == 16 when tbg_info is used.

Copilot uses AI. Check for mistakes.
@caiomcbr caiomcbr requested a review from a team April 23, 2026 21:39
Comment on lines +540 to +541
inputOffset += calcOffset(inputBufferSize, 0, 1);
inputBufferSize = calcSize(inputBufferSize, 0, 1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the meaning of 0 and 1 here. Maybe give a meaningful name?

Comment on lines +577 to +578
outputOffset += calcOffset(outputBufferSize, 0, 1);
outputBufferSize = calcSize(outputBufferSize, 0, 1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

}

MSCCLPP_DEVICE_INLINE uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
constexpr uint32_t alignment = 16;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need alignment here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants