Support TBG inside Pipeline by caiomcbr · Pull Request #783 · microsoft/mscclpp

caiomcbr · 2026-04-11T21:23:48Z

This PR adds support for linked thread block groups in pipelines, allowing multiple thread blocks to cooperate on processing each pipeline slice and increasing the available parallelism within pipeline iterations.

As shown in the image, each pipeline iteration processes a different slice of data, and each thread block in the thread block group handles a portion of that slice. The final slice in the pipeline may differ in size from the others; in that case, the thread block group adjusts the per-block portions to fit it.
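The slicing described above can be sketched on the host as follows. This is a minimal illustration, not the kernel code from the PR: `calcOffset`/`calcSize` are copied from the diff with the device qualifier removed, while `Chunk`, `planPipeline`, and the loop structure are hypothetical names introduced here, and `unitSize` is assumed to be the per-iteration slice size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side copy of the PR's helper (MSCCLPP_DEVICE_INLINE dropped for illustration).
inline uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
  assert(slices > 0);
  constexpr uint32_t alignment = 16;
  uint32_t nelems = size / alignment;
  uint32_t minNelems = nelems / slices;
  uint32_t remainder = nelems % slices;
  uint32_t off = index * minNelems + (index < remainder ? index : remainder);
  return off * alignment;
}

inline uint32_t calcSize(uint32_t size, uint32_t index, uint32_t slices) {
  return calcOffset(size, index + 1, slices) - calcOffset(size, index, slices);
}

// Hypothetical record of what one thread block handles in one iteration.
struct Chunk {
  uint32_t iteration;
  uint32_t tbId;
  uint32_t offset;  // byte offset from the start of the buffer
  uint32_t size;    // bytes handled by this thread block
};

// Hypothetical pipeline loop: each iteration takes one slice of at most
// unitSize bytes, and the slice is split across tbgSize thread blocks.
inline std::vector<Chunk> planPipeline(uint32_t totalSize, uint32_t unitSize, uint32_t tbgSize) {
  assert(unitSize > 0);
  std::vector<Chunk> chunks;
  uint32_t iteration = 0;
  for (uint32_t base = 0; base < totalSize; base += unitSize, ++iteration) {
    // The final slice may be smaller than unitSize.
    uint32_t sliceSize = std::min(unitSize, totalSize - base);
    for (uint32_t tbId = 0; tbId < tbgSize; ++tbId) {
      chunks.push_back({iteration, tbId, base + calcOffset(sliceSize, tbId, tbgSize),
                        calcSize(sliceSize, tbId, tbgSize)});
    }
  }
  return chunks;
}
```

For example, `planPipeline(160, 64, 2)` yields six chunks: two 32-byte halves for each of the two full 64-byte slices, and two 16-byte halves for the final 32-byte slice.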

Copilot

Pull request overview

This PR adds support for thread-block-group (TBG) slicing within PIPELINE execution by carrying TBG metadata (tbId, tbgSize) in each Operation and applying per-slice offsets/sizes inside the device execution handlers.

Changes:

- Extend `Operation` with `tbId` / `tbgSize` and populate them from `tbg_info` in the JSON execution plan.
- Add device-side `calcOffset`/`calcSize` and apply TBG slicing across multiple kernel handlers (GET/PUT/COPY/REDUCE/packet ops).
- Remove host-side per-TBG pre-slicing in `ExecutionPlan::Impl::setupOperation` (offsets/sizes remain “full”, slicing moves to device).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `src/core/include/execution_kernel.hpp` | Implements device-side TBG slicing and applies it in multiple operation handlers, including within `PIPELINE` iterations. |
| `src/core/include/execution_common.hpp` | Adds `tbId` / `tbgSize` fields to `Operation` so the device kernel can apply TBG slicing. |
| `src/core/executor/execution_plan.cc` | Parses `tbg_info` from JSON and stores it into `Operation` instead of pre-slicing offsets/sizes on the host. |

Copilot · 2026-04-11T21:29:06Z

+MSCCLPP_DEVICE_INLINE uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
+  constexpr uint32_t alignment = 16;
+  uint32_t nelems = size / alignment;
+  uint32_t minNelems = nelems / slices;
+  uint32_t remainder = nelems % slices;
+  uint32_t off = index * minNelems + (index < remainder ? index : remainder);
+  return off * alignment;
+}
+
+MSCCLPP_DEVICE_INLINE uint32_t calcSize(uint32_t size, uint32_t index, uint32_t slices) {
+  return calcOffset(size, index + 1, slices) - calcOffset(size, index, slices);
+}


calcOffset/calcSize compute nelems = size / 16, which drops any remaining bytes (<16). In pipeline mode this can silently skip data whenever unitSize (or the remaining bytes in the last iteration) isn’t a multiple of 16, and can even make calcSize() return 0 for size < 16.
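The truncation can be reproduced on the host with a small check. The helpers below are copied from the diff with the device qualifier removed; `coveredBytes` is a hypothetical name added here only to sum what all slices together would touch.

```cpp
#include <cassert>
#include <cstdint>

// Host-side copy of the PR's helper (MSCCLPP_DEVICE_INLINE dropped for illustration).
inline uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
  assert(slices > 0);
  constexpr uint32_t alignment = 16;
  uint32_t nelems = size / alignment;  // integer division: bytes past the last 16B unit are dropped
  uint32_t minNelems = nelems / slices;
  uint32_t remainder = nelems % slices;
  uint32_t off = index * minNelems + (index < remainder ? index : remainder);
  return off * alignment;
}

inline uint32_t calcSize(uint32_t size, uint32_t index, uint32_t slices) {
  return calcOffset(size, index + 1, slices) - calcOffset(size, index, slices);
}

// Hypothetical helper: total bytes covered when every slice index is processed.
inline uint32_t coveredBytes(uint32_t size, uint32_t slices) {
  uint32_t total = 0;
  for (uint32_t i = 0; i < slices; ++i) total += calcSize(size, i, slices);
  return total;
}
```

With these definitions, `coveredBytes(100, 2)` is 96 (the last 4 bytes are never assigned to a slice) and `coveredBytes(8, 1)` is 0, while a multiple of 16 such as `coveredBytes(128, 4)` is covered in full.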

Consider either (a) enforcing/validating that unitSize and all buffer sizes processed here are multiples of 16, or (b) updating the slicing logic to distribute the tail bytes so the full size is covered across slices.
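Both options might look roughly like the sketch below. It is one possible shape under stated assumptions, not the fix adopted in the PR: `kAlignment`, `isSliceable`, `calcOffsetCovering`, and `calcSizeCovering` are hypothetical names, and option (b) here simply lets the last slice absorb the tail, which assumes the handlers can process a final slice whose size is not a multiple of 16.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kAlignment = 16;  // same value the PR hardcodes

// Option (a): a validation predicate the host could apply to unitSize and buffer sizes.
inline bool isSliceable(uint32_t size) { return size % kAlignment == 0; }

// Option (b): same boundaries as the PR's calcOffset for every slice start,
// but the end boundary of the last slice is the full size, so tail bytes are kept.
inline uint32_t calcOffsetCovering(uint32_t size, uint32_t index, uint32_t slices) {
  assert(slices > 0 && index <= slices);
  if (index == slices) return size;  // end of the last slice includes the unaligned tail
  uint32_t nelems = size / kAlignment;
  uint32_t minNelems = nelems / slices;
  uint32_t remainder = nelems % slices;
  uint32_t off = index * minNelems + (index < remainder ? index : remainder);
  return off * kAlignment;
}

inline uint32_t calcSizeCovering(uint32_t size, uint32_t index, uint32_t slices) {
  assert(index < slices);
  return calcOffsetCovering(size, index + 1, slices) - calcOffsetCovering(size, index, slices);
}
```

Under this variant a 100-byte buffer split in two gives slices of 48 and 52 bytes, and every slice still starts on a 16-byte boundary relative to the buffer start.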

Copilot · 2026-04-11T21:29:07Z

+MSCCLPP_DEVICE_INLINE uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
+  constexpr uint32_t alignment = 16;
+  uint32_t nelems = size / alignment;
+  uint32_t minNelems = nelems / slices;
+  uint32_t remainder = nelems % slices;
+  uint32_t off = index * minNelems + (index < remainder ? index : remainder);
+  return off * alignment;


calcOffset hardcodes alignment = 16, but ExecutionPlan supports a configurable buffer_alignment (default 16). If a plan sets buffer_alignment != 16, host-side offsets/sizes are computed on that alignment while device-side TBG slicing is computed on 16B units, which can produce inconsistent slicing and misaligned accesses.

Consider plumbing the plan’s bufferAlignment to device code (e.g., store it in DeviceExecutionPlan/Operation) or explicitly enforcing buffer_alignment == 16 when tbg_info is used.
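To make the mismatch concrete, here is a sketch of the first suggestion with the alignment passed in as an argument. The function name `calcOffsetAligned` and the extra parameter are assumptions for illustration; how the value would actually reach the kernel (for example through `DeviceExecutionPlan` or `Operation`) is not shown.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical variant of calcOffset that takes the plan's buffer alignment
// instead of hardcoding 16.
inline uint32_t calcOffsetAligned(uint32_t size, uint32_t index, uint32_t slices, uint32_t alignment) {
  assert(slices > 0 && alignment > 0);
  uint32_t nelems = size / alignment;
  uint32_t minNelems = nelems / slices;
  uint32_t remainder = nelems % slices;
  uint32_t off = index * minNelems + (index < remainder ? index : remainder);
  return off * alignment;
}
```

For a 192-byte buffer split across two thread blocks, `calcOffsetAligned(192, 1, 2, 64)` places the second slice at byte 128, whereas the 16-byte logic (`calcOffsetAligned(192, 1, 2, 16)`) places it at byte 96, which is not a multiple of 64.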

Binyang2014 · 2026-06-10T05:06:58Z

+      inputOffset += calcOffset(inputBufferSize, 0, 1);
+      inputBufferSize = calcSize(inputBufferSize, 0, 1);


What's the meaning of 0 and 1 here? Maybe give them meaningful names?
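One way to address this would be to name the two literals, as in the sketch below. The names `kFirstSlice`, `kNoSplit`, `Range`, and `wholeBufferRange` are hypothetical and the author's intent is not stated in the thread; what the code itself shows is that with index 0 and a single slice, `calcOffset` returns 0 and `calcSize` returns the size rounded down to a multiple of 16.

```cpp
#include <cassert>
#include <cstdint>

// Host-side copy of the PR's helper (MSCCLPP_DEVICE_INLINE dropped for illustration).
inline uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
  assert(slices > 0);
  constexpr uint32_t alignment = 16;
  uint32_t nelems = size / alignment;
  uint32_t minNelems = nelems / slices;
  uint32_t remainder = nelems % slices;
  uint32_t off = index * minNelems + (index < remainder ? index : remainder);
  return off * alignment;
}

inline uint32_t calcSize(uint32_t size, uint32_t index, uint32_t slices) {
  return calcOffset(size, index + 1, slices) - calcOffset(size, index, slices);
}

// Hypothetical names for the literals 0 and 1 used at the call site.
constexpr uint32_t kFirstSlice = 0;  // slice index
constexpr uint32_t kNoSplit = 1;     // number of slices: the buffer is not divided

struct Range {
  uint32_t offset;
  uint32_t size;
};

// Hypothetical wrapper mirroring the two lines under review.
inline Range wholeBufferRange(uint32_t offset, uint32_t size) {
  return {offset + calcOffset(size, kFirstSlice, kNoSplit), calcSize(size, kFirstSlice, kNoSplit)};
}
```

For instance, `wholeBufferRange(64, 100)` keeps the offset at 64 and reduces the size to 96.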

Binyang2014 · 2026-06-10T05:07:27Z

+      outputOffset += calcOffset(outputBufferSize, 0, 1);
+      outputBufferSize = calcSize(outputBufferSize, 0, 1);


Binyang2014 · 2026-06-10T05:09:16Z

 }

+MSCCLPP_DEVICE_INLINE uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
+  constexpr uint32_t alignment = 16;


Do we need alignment here?

caiomcbr added 4 commits April 6, 2026 23:18

wip

9cbe78c

wip

0a26f9d

Merge branch 'main' into caiorocha/support_tbg_pipeline

7f5ff93

wip

bc62843

caiomcbr requested review from Binyang2014 and Copilot April 11, 2026 21:23

Copilot started reviewing on behalf of caiomcbr April 11, 2026 21:24

Copilot AI reviewed Apr 11, 2026


caiomcbr added 2 commits April 13, 2026 21:19

Merge branch 'main' into caiorocha/support_tbg_pipeline

85bcc9a

Merge branch 'main' into caiorocha/support_tbg_pipeline

555ab5c

caiomcbr requested a review from a team April 23, 2026 21:39

Binyang2014 reviewed Jun 10, 2026


caiomcbr wants to merge 6 commits into main from caiorocha/support_tbg_pipeline


3 participants
