Rewrite mul reduce to use fdot2 instructions #4787
Conversation
Pull request overview
This PR adds a GPU-side optimization to fuse reduce_sum(pointwise(mul(x,y))) into a specialized reduction path that can leverage vector dot-product instructions (e.g., fdot2) during codegen, and adds kernel-level tests for the new vector dot helper.
Changes:
- Add `migraphx::vec_dot` (generic + HW-specialized overloads) to support dot-product style reads in GPU reductions.
- Introduce `gpu::mul_reduce_sum` and a `prepare_reduce` rewrite to replace `reduce_sum(pointwise(mul))` with the specialized op.
- Extend GPU reduce codegen to emit `vec_dot` as the read function for `gpu::mul_reduce_sum`, plus add kernel tests for `vec.hpp`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `test/gpu/kernels/vec.cpp` | Adds kernel-level tests covering vec utilities and new `vec_dot` behavior. |
| `src/targets/gpu/prepare_reduce.cpp` | Adds the `gpu::mul_reduce_sum` op and a rewrite pass to detect and rewrite `mul` + `reduce_sum` patterns. |
| `src/targets/gpu/kernels/include/migraphx/kernels/vec.hpp` | Adds `vec_dot` (generic + half/bf16/int8 HW overloads). |
| `src/targets/gpu/kernels/include/migraphx/kernels/type_traits.hpp` | Tightens the `MIGRAPHX_REQUIRES` macro by parenthesizing the condition. |
| `src/targets/gpu/compile_gen.cpp` | Updates reduce codegen to handle `gpu::mul_reduce_sum` by using `vec_dot` as the read. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
```cpp
// rewrite argmin/argmax to handle tuples
rewrite_arg_reduce(m);
rewrite_mul_reduce_sum(m);
fuse_reductions(m);
```
This PR introduces a new `rewrite_mul_reduce_sum` transformation that changes the reduce kernel IR (introducing `gpu::mul_reduce_sum`), but there are no corresponding unit tests validating when the rewrite should and should not apply. There is already test coverage for `gpu::prepare_reduce` in `test/gpu/prepare_reduce.cpp`; adding cases for the `mul` + `reduce_sum` pattern (including a negative case for dynamic/runtime axes) would help prevent regressions.
```cpp
if(ins->name() != "reduce_sum")
    continue;
auto pw = ins->inputs().front();
if(pw->name() != "pointwise")
    continue;
if(not is_only_mul(*pw->module_inputs().front()))
    continue;
auto axes = ins->get_operator().to_value()["axes"].to_vector<std::int64_t>();
auto mrs  = m.insert_instruction(ins, mul_reduce_sum{std::move(axes)}, pw->inputs());
m.replace_instruction(ins, mrs);
```
`rewrite_mul_reduce_sum` rewrites every `reduce_sum` whose first input is a `pointwise(mul)` without checking whether the reduction axes are provided as a runtime input (i.e., the `axes` attribute is empty and `reduce_sum` has a second input). In that dynamic-axes case the rewrite drops the axes input entirely (since it always uses `pw->inputs()`), changing semantics and potentially producing an incorrect output shape.
Consider guarding the rewrite so it only triggers when `reduce_sum` has static axes (e.g., `ins->inputs().size() == 1` and/or the `axes` attribute is non-empty), or alternatively preserve the axes input and implement `mul_reduce_sum` to support runtime axes as well.
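One possible shape for such a guard, inside the rewrite loop shown above, is the following fragment. This is a hedged sketch against the instruction API visible in the diff, not a tested patch:

```cpp
// Sketch: skip reduce_sum instructions whose axes arrive as a runtime
// input (empty "axes" attribute plus a second input), so the rewrite
// only fires for the static-axes form it can actually represent.
if(ins->inputs().size() != 1)
    continue;
auto axes = ins->get_operator().to_value()["axes"].to_vector<std::int64_t>();
if(axes.empty())
    continue;
```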
Motivation
Technical Details
Changelog Category
Add a `CHANGELOG.md` entry for any option other than `Not Applicable`.