Skip to content

Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream#769

Open
WenqingLan1 wants to merge 17 commits into
microsoft:mainfrom
WenqingLan1:wenqinglan/refine-gpu-stream
Open

Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream#769
WenqingLan1 wants to merge 17 commits into
microsoft:mainfrom
WenqingLan1:wenqinglan/refine-gpu-stream

Conversation

@WenqingLan1
Copy link
Copy Markdown
Contributor

@WenqingLan1 WenqingLan1 commented Dec 19, 2025

Refinements:

  • Add support for float (fp32) execution and --data_type <float|double> CLI option for runtime type selection.
  • Refactor CUDA kernels to use 128-bit vectorized accesses (double2 / float4) and move template kernel implementations into a header for cross-TU instantiation. (Required for CUDA template instantiation across compilation units.)
  • Fix allocation buf size bug, args->size is buf size in bytes, not number of elements.
  • Adjust execution/output to single visible GPU (device 0 via CUDA_VISIBLE_DEVICES) and update metric/tag formats (removing gpu_id) plus docs/examples/test log.
  • Updated numa assignment from hard coded numa_alloc_onnode to numa_alloc_local to optimize memory allocation.
  • Rename entry point file from gpu_stream_test.cpp to gpu_stream_main.cpp.

New config:

    gpu-stream:fp64:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 1308622848
        data_type: double
    gpu-stream:fp64-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: double
        check_data: true
    gpu-stream:fp32:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 654311424
        data_type: float
    gpu-stream:fp32-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: float
        check_data: true

New rule:

    gpu-stream:
      statistics:
        - mean
      categories: GPU-STREAM
      aggregate: True
      metrics:
        - gpu-stream:fp(?:32|64)/STREAM_.*_(?:bw|ratio):(\d+)

Example results:

"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:0": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:1": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:2": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:3": 1234

Processed by rules:

| gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw | mean | 1234|

@WenqingLan1 WenqingLan1 requested a review from a team as a code owner December 19, 2025 20:05
@WenqingLan1 WenqingLan1 added the micro-benchmarks Micro Benchmark Test for SuperBench Benchmarks label Dec 19, 2025
@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.69%. Comparing base (700d650) to head (5cec42c).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #769   +/-   ##
=======================================
  Coverage   85.69%   85.69%           
=======================================
  Files         103      103           
  Lines        7890     7891    +1     
=======================================
+ Hits         6761     6762    +1     
  Misses       1129     1129           
Flag Coverage Δ
cpu-python3.10-unit-test 70.42% <50.00%> (+<0.01%) ⬆️
cpu-python3.12-unit-test 70.42% <50.00%> (+<0.01%) ⬆️
cpu-python3.7-unit-test 69.85% <50.00%> (+<0.01%) ⬆️
cuda-unit-test 83.60% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@guoshzhao guoshzhao self-assigned this Dec 19, 2025
Copilot AI review requested due to automatic review settings February 3, 2026 22:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the GPU STREAM microbenchmark to support runtime-selectable FP32/FP64 execution and improve GPU memory bandwidth utilization, while aligning SuperBench integration (CLI, output tags, docs, and tests) to the new behavior.

Changes:

  • Add --data_type <float|double> to select FP32/FP64 at runtime and propagate it through the Python benchmark wrapper + unit tests.
  • Refactor CUDA kernels to use 128-bit vectorized accesses (double2 / float4) and move template kernel implementations into a header for cross-TU instantiation.
  • Adjust execution/output to single visible GPU (device 0 via CUDA_VISIBLE_DEVICES) and update metric/tag formats (removing gpu_id) plus docs/examples/test log.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/data/gpu_stream.log Updates golden log output to include data type and new tag format (no gpu_id).
tests/benchmarks/micro_benchmarks/test_gpu_stream.py Extends command-generation assertions to include --data_type (currently only covers double).
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp Removes NUMA/GPU iteration fields from args and adds Opts::data_type.
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp Adds CLI parsing/printing for --data_type.
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp New entry point replacing the previous main file.
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp Introduces vector-type mapping and templated kernel definitions (128-bit loads/stores).
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu Keeps a CUDA compilation unit and moves template implementations to the header.
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp Expands bench-args variant to support float and double.
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Uses local NUMA allocation, enforces 16B/thread sizing, launches templated vectorized kernels, updates tag format, and runs only CUDA device 0.
superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt Switches target sources to the new gpu_stream_main.cpp.
superbench/benchmarks/micro_benchmarks/gpu_stream.py Adds --data_type argument and forwards it to the binary.
examples/benchmarks/gpu_stream.py Updates example invocation to include --data_type double.
docs/user-tutorial/benchmarks/micro-benchmarks.md Updates gpu-stream metric patterns to include `(double

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread docs/user-tutorial/benchmarks/micro-benchmarks.md Outdated
Copilot AI review requested due to automatic review settings February 6, 2026 00:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 14 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
@guoshzhao guoshzhao requested a review from abuccts February 13, 2026 00:11
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp Outdated
Copilot AI review requested due to automatic review settings April 8, 2026 20:27
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99

  • ParseOpts intends to error out when required options are not provided, but size_specified is initialized to true, so missing --size will never be detected by the if (!size_specified || ...) check. Initialize it to false (like the other flags) or remove the required-argument check if defaults are intended.
    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;
    while (true) {
        getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
        if (getopt_ret == -1) {
            if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
                parse_err = true;
            }
            break;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp Outdated
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Copilot AI review requested due to automatic review settings April 9, 2026 21:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99

  • ParseOpts initializes size_specified=true, which makes --size effectively optional, but PrintUsage presents --size as required and the end-of-parse validation still checks size_specified. Either initialize size_specified=false to enforce explicit --size, or update the usage/validation logic to reflect that the default buffer size is acceptable.
    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;
    while (true) {
        getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
        if (getopt_ret == -1) {
            if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
                parse_err = true;
            }
            break;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated
Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream.py
@microsoft microsoft deleted a comment from Copilot AI Apr 9, 2026
…-stream

# Conflicts:
#	tests/benchmarks/micro_benchmarks/test_gpu_stream.py
Copilot AI review requested due to automatic review settings April 22, 2026 18:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99

  • ParseOpts sets size_specified to true initially, which makes the required-argument validation (if (!size_specified || ...)) ineffective for --size. Either initialize size_specified to false (to truly require --size) or remove size_specified from the required check if --size is intended to be optional via the default.
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;
    while (true) {
        getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
        if (getopt_ret == -1) {
            if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
                parse_err = true;
            }
            break;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

polarG

This comment was marked as duplicate.

@polarG polarG dismissed their stale review May 13, 2026 22:56

Superseded by updated review (model attribution removed).

// Generate data to copy
args->sub.data_buf = static_cast<T *>(numa_alloc_onnode(args->size * sizeof(T), args->numa_id));
// Generate data to copy - use local NUMA node for best CPU access
args->sub.data_buf = static_cast<T *>(numa_alloc_local(args->size));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SHOULD-FIX] Unchecked numa_alloc_local return → nullptr deref

numa_alloc_local() returns nullptr on failure (insufficient memory, NUMA disabled, etc.). The next loop dereferences args->sub.data_buf[j] without a null check — and the same is true for check_buf on L247. Since this PR is editing these lines (switching from numa_alloc_onnode), it's a good chance to add the check.

args->sub.data_buf = static_cast<T *>(numa_alloc_local(args->size));
if (args->sub.data_buf == nullptr) {
    std::cerr << "PrepareBufAndStream::numa_alloc_local data_buf failed" << std::endl;
    return -1;
}
// ... and after the second alloc:
if (args->sub.check_buf == nullptr) {
    std::cerr << "PrepareBufAndStream::numa_alloc_local check_buf failed" << std::endl;
    numa_free(args->sub.data_buf, args->size);
    return -1;
}

template <typename T> void GpuStream::CreateBenchArgs() {
auto args = std::make_unique<BenchArgs<T>>();
args->gpu_id = 0;
cudaGetDeviceProperties(&args->gpu_device_prop, 0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SHOULD-FIX] cudaGetDeviceProperties return value not checked

If cudaGetDeviceProperties fails, args->gpu_device_prop is left uninitialized; subsequent reads of memoryBusWidth, ECCEnabled, name, etc. in PrintCudaDeviceInfo and the peak-bandwidth math then operate on garbage.

cudaError_t cuda_err = cudaGetDeviceProperties(&args->gpu_device_prop, 0);
if (cuda_err != cudaSuccess) {
    std::cerr << "CreateBenchArgs::cudaGetDeviceProperties error: " << cuda_err << std::endl;
    return; // and propagate failure (e.g., via a return-code or by leaving variant empty)
}

The void return type makes propagation awkward — consider returning int so Run() can bail out cleanly.

std::cerr << "Run::numa_node_of_cpu failed for cpu " << cpu << std::endl;
return -1;
}
if (numa_run_on_node(local_node) != 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SHOULD-FIX] NUMA pinning targets the host process's current node, not the GPU's affinity node

sched_getcpu() + numa_node_of_cpu() returns wherever the kernel happened to schedule this process. Under default_local_mode (8 parallel processes, one per CUDA_VISIBLE_DEVICES={proc_rank}), the OS may pin all 8 processes to the same socket initially, even when the assigned GPU lives on the other socket. The result is that numa_alloc_local allocates on a node that is remote to the GPU — exactly the host↔device path measured by --check_data (cudaMemcpy(..., cudaMemcpyDefault)).

Recommend pinning to the GPU's affinity node instead:

// Query GPU 0's preferred CPU NUMA node (e.g., via NVML)
int gpu_numa = -1;
nvmlDevice_t dev;
if (nvmlInit() == NVML_SUCCESS &&
    nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
    nvmlDeviceGetNumaNodeId(dev, &gpu_numa);
    nvmlShutdown();
}
int target = (gpu_numa >= 0) ? gpu_numa : numa_node_of_cpu(sched_getcpu());
if (numa_run_on_node(target) != 0) { ... }

Also note: this numa_run_on_node happens after cudaGetDeviceProperties (which can implicitly init the CUDA context). Migrating threads after CUDA init may cause TLB / socket-affinity surprises.

"_buffer_" + std::to_string(args->size);
for (int j = 0; j < args->sub.times_in_ms[i].size(); j++) {
// STREAM_<Kernelname>_datatype_buffer_<buffer_size>_block_<block_size>
for (size_t i = 0; i < args->sub.times_in_ms.size(); i++) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SHOULD-FIX] Metric tag format is a breaking change — flag it explicitly

The output format dropped _gpu_<id> from every STREAM tag (and switched to TAB separation):

  • old: STREAM_COPY_double_gpu_0_buffer_<size>_block_<block>
  • new: STREAM_COPY_double_buffer_<size>_block_<block>

This silently breaks any downstream consumers — existing baseline JSONs, result_summary rules, dashboards, and external scripts that key on the old metric name. The PR description includes a new aggregation rule for the new format, but neither the diff nor the docs include a deprecation note or migration guidance.

Please:

  1. Call out the format change in the PR description / release notes / CHANGELOG.
  2. Update or add the YAML rule (the gpu-stream: aggregation) referenced in the PR description — it isn't in the diff today.
  3. Confirm no in-repo baselines reference the old format (none were found in superbench/config/*.yaml, but please check internal/example baselines).

uint64_t num_bytes_in_thread_block = num_elements_in_thread_block * sizeof(T);
// Each thread processes 128 bits (16 bytes) for optimal memory bandwidth.
// For double: uses double2 (16 bytes). For float: would use float4 (16 bytes).
constexpr uint64_t kBytesPerThread = 16; // 128-bit aligned access
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SHOULD-FIX] kBytesPerThread = 16 is decoupled from VecT<T>

The host-side block-size math hard-codes 16 bytes per thread, while the actual vector type is decided in gpu_stream_kernels.hpp via VectorType<T>. If anyone changes double2 → double4, or float4 → float2, or adds a new dtype, this constant silently goes wrong and num_thread_blocks = args->size / num_bytes_in_thread_block will produce a mismatched grid — at best a launch failure, at worst silent OOB.

Recommend deriving from the same source of truth:

constexpr uint64_t kBytesPerThread = sizeof(VecT<T>);
static_assert(kBytesPerThread == 16, "Vector type must be 128-bit aligned for current PTX");

This gives both compile-time safety and a single point of change.

Comment thread tests/data/gpu_stream.log
@@ -2,40 +2,23 @@ STREAM Benchmark
Buffer size(bytes): 4294967296
Number of warm up runs: 10
Number of loops: 40
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SHOULD-FIX] Float code path is not exercised end-to-end

The new fp32 path is added in C++ and CLI, but the test data here only contains Data type: double. As a result:

  • _test_gpu_stream_result_parsing only validates parsing of STREAM_*_double_* lines; the float tag format (STREAM_*_float_*) is never exercised.
  • test_gpu_stream_command_generation_cuda_float only verifies that --data_type float appears in the command string — it does not verify that the binary or parser actually handles a float result.

Recommend either:

  1. Adding a second log fixture (e.g. tests/data/gpu_stream_float.log) plus a parametrised parsing test, or
  2. Appending a few STREAM_*_float_* lines to this file and asserting the parser emits both _float_*_bw and _float_*_ratio keys.

uint64_t index = blockIdx.x * blockDim.x + threadIdx.x;
VecT<T> val;
Fetch(val, src + index);
Store(tgt + index, val);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NOTED] Loop-unroll (kNumLoopUnroll = 2) removed in favor of a single 128-bit vector op

Each thread now performs exactly one Fetch + one Store (vs. two unrolled scalar pairs before). On bandwidth-bound STREAM workloads this is usually a wash or a slight win because:

  • 128-bit ld.volatile.global.v2.f64 / v4.f32 matches the L2 → L1 sector size on modern GPUs.
  • volatile qualifiers are preserved, so the compiler still cannot coalesce or cache.

However, please measure on representative GPUs (V100 / A100 / H100 and MI250 / MI300 if HIP support is reactivated later) before merge to confirm no regression on the latency-bound side (e.g., consumer GPUs without HBM).

if (gpu_count < 1) {
std::cerr << "Run::No GPU available" << std::endl;
return -1;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NOTED] Single-GPU execution by design — relies on default_local_mode

Multi-GPU iteration was removed; the binary now only benchmarks cudaSetDevice(0). Under default_local_mode (prefix: CUDA_VISIBLE_DEVICES={proc_rank}, proc_num: 8), this is correct — Superbench fans out one process per GPU. The change does, however:

  • Break standalone usage (./bin/gpu_stream no longer measures every GPU on the host).
  • Couple the binary to Superbench's process model (now reflected in the function comment).

This is intentional per the PR description; the only ask is to call this behavior change out in the user-tutorial doc (docs/user-tutorial/benchmarks/micro-benchmarks.md) so users running the binary directly are not surprised.

val_b.x += (val_a.x * scalar);
val_b.y += (val_a.y * scalar);
} else if constexpr (std::is_same<T, float>::value) {
val_b.x += (val_a.x * scalar);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NON-BLOCKING] memcmp validation of float results is fragile

CheckBuf (in gpu_stream.cu, unchanged but newly reachable for T = float) compares result buffers byte-for-byte via memcmp. For TRIAD specifically, the GPU may compile val_b.x += val_a.x * scalar to a single FMA (one rounding) while the host validation buffer in PrepareValidationBuf is computed with two separate roundings.

For the current test values (j % 256 ∈ [0, 255], scalar = 11.0) every intermediate is exactly representable in IEEE-754 single precision, so memcmp does agree on this PR. But this is a fragile invariant — increasing kUInt8Mod or changing scalar could silently break --check_data for fp32 only.

Recommend, when convenient, switching to a tolerance-based comparison for float:

constexpr float kRelTol = 1e-6f;
for (size_t i = 0; i < n; ++i) {
    if (std::abs(actual[i] - expected[i]) > kRelTol * std::abs(expected[i])) ...
}

@@ -235,15 +236,15 @@ template <typename T> int GpuStream::PrepareBufAndStream(std::unique_ptr<BenchAr
cudaError_t cuda_err = cudaSuccess;

if (args->check_data) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NOTED] Host-RAM cost when --check_data is on

With the default --size = 4 GiB, every process allocates data_buf + check_buf = 8 GiB of host memory plus three GPU buffers. Under default_local_mode × 8 processes × --check_data, that's ~64 GiB of host RAM just for STREAM validation — enough to OOM smaller dev nodes.

The new YAML configs in the PR description avoid this by overriding to ~1.2 GiB, so this is not a blocker, just a watch-out: please consider documenting the per-process memory cost of --check_data next to the --size flag in the user tutorial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

micro-benchmarks Micro Benchmark Test for SuperBench Benchmarks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants