Conversation
Pull request overview
This PR introduces HIP graph capture support and new algebraic rewrites intended to improve GPU inference performance (notably for yolov10n), along with a couple of GPU-heuristic tweaks.
Changes:
- Extend the (type-erased) `migraphx::context` API to support a capture lifecycle (`start_capture`/`end_capture`/`get_capture`) and implement it for the GPU context using HIP graphs.
- Add a `program::eval` fast path that replays a previously captured HIP graph when `MIGRAPHX_ENABLE_HIP_GRAPH` is enabled.
- Add/adjust GPU optimization heuristics, including a new convolution + 1x1-convolution fusion in `simplify_algebra` and an `is_mlir_conv` policy change.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tools/include/context.hpp | Extends generated context interface with capture-related virtuals. |
| src/include/migraphx/context.hpp | Adds capture hooks to the type-erased context implementation. |
| src/targets/gpu/include/migraphx/gpu/context.hpp | Implements HIP stream capture + graph instantiate/launch and stores a replay callback. |
| src/program.cpp | Uses capture hooks in program::eval to cache and replay a HIP graph. |
| src/include/migraphx/program.hpp | Declares MIGRAPHX_ENABLE_HIP_GRAPH env var. |
| src/targets/gpu/compile_hip_code_object.cpp | Adjusts global sizing logic when HIP graph mode is enabled. |
| src/driver/verify.cpp | Adds an extra warm-up eval to populate the HIP graph capture. |
| src/simplify_algebra.cpp | Adds a new conv->1x1 conv fusion that composes weights via reshape/transpose/dot. |
| src/targets/gpu/fuse_mlir.cpp | Changes MLIR-conv eligibility heuristic in fast mode (and leaves old logic commented). |
| src/layout_convolution.cpp | Simplifies redundant layout removal condition. |
```cpp
auto mw_dims = wnxn->get_shape().lens();
mw_dims[1] *= groups;

auto w1x1_reshaped = m.insert_instruction(
    ins,
    make_op("reshape", {{"dims", {out_channels, groups, mid_channels / groups}}}),
    w1x1);
auto w1x1_grouped = m.insert_instruction(
    ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), w1x1_reshaped);

auto wnxn_reshaped = m.insert_instruction(
    ins,
    make_op("reshape",
            {{"dims", {groups, mid_channels / groups, in_channels_per_group * w_size}}}),
    wnxn);

auto mw = m.insert_instruction(ins, make_op("dot"), w1x1_grouped, wnxn_reshaped);
auto mw_transposed =
    m.insert_instruction(ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), mw);
auto mw_reshaped =
    m.insert_instruction(ins, make_op("reshape", {{"dims", mw_dims}}), mw_transposed);

auto op = input->get_operator();
op.from_value({{"group", 1}});
auto conv = m.insert_instruction(ins, op, x_ins, mw_reshaped);
```
find_conv_conv_1x1 builds mw_dims from the first convolution’s weights (wnxn) and only updates mw_dims[1]. This leaves mw_dims[0] as the inner conv’s output channels, but the fused weights produced by the dot have leading dimension out_channels (from the 1x1 conv). If out_channels != mid_channels (common), the final reshape will be invalid (element-count mismatch) and/or the replacement convolution will produce the wrong output shape. Update the fused weight shape so dim-0 matches out_channels (and add shape/divisibility checks such as mid_channels % groups == 0) before inserting reshapes/dot.
```cpp
// (x * w1) * w2 => x * (w1 * w2)
struct find_conv_conv_1x1
{
    auto matcher() const
    {
        return conv_1x1(
            match::arg(0)(match::used_once(), match::name("convolution").bind("input")));
    }

    void apply(module& m, const match::matcher_result& r) const
    {
        auto ins   = r.result;
        auto input = r.instructions["input"];
        auto x_ins = input->inputs().front();
        auto wnxn  = input->inputs()[1];
        auto w1x1  = ins->inputs()[1];

        auto out_channels          = w1x1->get_shape().lens()[0];
        auto mid_channels          = w1x1->get_shape().lens()[1];
        auto in_channels_per_group = wnxn->get_shape().lens()[1];
        auto groups = x_ins->get_shape().lens()[1] / in_channels_per_group;
        auto w_size = std::accumulate(wnxn->get_shape().lens().begin() + 2,
                                      wnxn->get_shape().lens().end(),
                                      std::size_t{1},
                                      std::multiplies<>{});

        auto mw_dims = wnxn->get_shape().lens();
        mw_dims[1] *= groups;

        auto w1x1_reshaped = m.insert_instruction(
            ins,
            make_op("reshape", {{"dims", {out_channels, groups, mid_channels / groups}}}),
            w1x1);
        auto w1x1_grouped = m.insert_instruction(
            ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), w1x1_reshaped);

        auto wnxn_reshaped = m.insert_instruction(
            ins,
            make_op("reshape",
                    {{"dims", {groups, mid_channels / groups, in_channels_per_group * w_size}}}),
            wnxn);

        auto mw = m.insert_instruction(ins, make_op("dot"), w1x1_grouped, wnxn_reshaped);
        auto mw_transposed =
            m.insert_instruction(ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), mw);
        auto mw_reshaped =
            m.insert_instruction(ins, make_op("reshape", {{"dims", mw_dims}}), mw_transposed);

        auto op = input->get_operator();
        op.from_value({{"group", 1}});
        auto conv = m.insert_instruction(ins, op, x_ins, mw_reshaped);
        m.replace_instruction(ins, conv);
    }
```
This new convolution+1x1-convolution fusion introduces new behavior in simplify_algebra, but there’s no corresponding test coverage validating the rewrite (especially for the common case where the 1x1 conv changes channel count and/or where the first conv is grouped/depthwise). Please add a unit test in test/simplify_algebra_test.cpp that exercises the rewrite and checks the optimized module is equivalent and preserves output shapes.
```cpp
auto graph = share(hip_graph_ptr{raw_graph});
if(status != hipSuccess)
    MIGRAPHX_THROW("Failed: hipStreamEndCapture: " + hip_error(status));

// auto log = make_shared_array<char>(1024);
hipGraphExec_t raw_graph_exec = nullptr;
status = hipGraphInstantiate(&raw_graph_exec, graph.get(), nullptr, nullptr, 0);
auto graph_exec = share(hip_graph_exec_ptr{raw_graph_exec});
if(status != hipSuccess)
    MIGRAPHX_THROW("Failed: hipGraphInstantiate: " + hip_error(status));
saved_graph = [this, graph, graph_exec, args] {
```
end_capture wraps raw_graph in hip_graph_ptr (and raw_graph_exec in hip_graph_exec_ptr) before checking the HIP return status. If hipStreamEndCapture/hipGraphInstantiate fails, these RAII deleters may run on an invalid/null handle during stack unwinding. Construct the managed pointers only after confirming status == hipSuccess (and consider explicitly handling a null returned graph).
Suggested change:

```diff
-auto graph = share(hip_graph_ptr{raw_graph});
-if(status != hipSuccess)
-    MIGRAPHX_THROW("Failed: hipStreamEndCapture: " + hip_error(status));
-// auto log = make_shared_array<char>(1024);
-hipGraphExec_t raw_graph_exec = nullptr;
-status = hipGraphInstantiate(&raw_graph_exec, graph.get(), nullptr, nullptr, 0);
-auto graph_exec = share(hip_graph_exec_ptr{raw_graph_exec});
-if(status != hipSuccess)
-    MIGRAPHX_THROW("Failed: hipGraphInstantiate: " + hip_error(status));
-saved_graph = [this, graph, graph_exec, args] {
+if(status != hipSuccess)
+    MIGRAPHX_THROW("Failed: hipStreamEndCapture: " + hip_error(status));
+if(raw_graph == nullptr)
+    MIGRAPHX_THROW("Failed: hipStreamEndCapture returned a null graph");
+auto graph = share(hip_graph_ptr{raw_graph});
+// auto log = make_shared_array<char>(1024);
+hipGraphExec_t raw_graph_exec = nullptr;
+status = hipGraphInstantiate(&raw_graph_exec, graph.get(), nullptr, nullptr, 0);
+if(status != hipSuccess)
+    MIGRAPHX_THROW("Failed: hipGraphInstantiate: " + hip_error(status));
+if(raw_graph_exec == nullptr)
+    MIGRAPHX_THROW("Failed: hipGraphInstantiate returned a null graph exec");
+auto graph_exec = share(hip_graph_exec_ptr{raw_graph_exec});
+saved_graph = [this, graph, graph_exec, args] {
```
```cpp
saved_graph = [this, graph, graph_exec, args] {
    auto status2 = hipGraphLaunch(graph_exec.get(), get_stream().get());
    if(status2 != hipSuccess)
        MIGRAPHX_THROW("Failed: hipGraphLaunch: " + hip_error(status2));
    return args;
};
}
```
saved_graph captures this in the lambda ([this, graph, graph_exec, args]). If gpu::context is copied (e.g., via the type-erased migraphx::context clone path), the copied std::function will still hold the original object pointer and can become a use-after-free. Avoid capturing this; instead capture the stream handle (or other required state) by value so the callback remains valid after copies/moves.
```cpp
if(contexts.size() == 1 and enabled(MIGRAPHX_ENABLE_HIP_GRAPH{}))
{
    auto& ctx = contexts.front();
    auto run  = ctx.get_capture();
    if(run != nullptr)
        return run();
    ctx.start_capture();
}
```
When a capture already exists, program::eval returns run() before handling exec_env.async (wait_for/finish_on) and before applying any MIGRAPHX_TRACE_EVAL behavior. This changes observable semantics for callers using async execution environments or tracing. Consider disabling the hip-graph fast path when exec_env.async or tracing is enabled, or make the cached path honor the same synchronization/tracing behavior.
```cpp
if(contexts.size() == 1 and enabled(MIGRAPHX_ENABLE_HIP_GRAPH{}))
{
    auto& ctx = contexts.front();
    ctx.end_capture(ret);
}
```
If ctx.start_capture() is called and generic_eval throws (e.g., missing/shape-mismatched parameters), the stream capture is left open and end_capture is never reached, which can break subsequent executions on that stream. This needs RAII/try-catch cleanup to end/abort capture on exceptions (and to only call end_capture when a capture is actually active).
```cpp
p.eval(m); // run once for hip graph capture
```
This warm-up p.eval(m) runs unconditionally and will execute the model twice even when hip-graph capture is disabled, which significantly slows verification and can skew perf numbers. Gate this extra run behind enabled(MIGRAPHX_ENABLE_HIP_GRAPH{}) (and/or only when the compiled target is GPU).
```diff
 // No winograd for group convolution
-if(group > 1)
-    return true;
-auto w = ins->inputs().at(1)->get_shape();
-if(w.lens().size() != 4)
-    return true;
-if(w.lens()[2] != w.lens()[3])
-    return true;
-return (w.lens()[3] % 3) != 0;
+return group == 1;
+// if(group > 1)
+//     return true;
+// auto w = ins->inputs().at(1)->get_shape();
+// if(w.lens().size() != 4)
+//     return true;
+// if(w.lens()[2] != w.lens()[3])
+//     return true;
+// return (w.lens()[3] % 3) != 0;
```
is_mlir_conv now unconditionally returns group == 1 in fast mode and leaves the previous heuristic commented out below a return, making the intent hard to interpret (the nearby “No winograd for group convolution” comment is also ambiguous with the new logic). Please either remove the dead commented code or replace it with a clear, active heuristic (and update the comment) so future changes don’t accidentally reintroduce the old behavior.
This should be set to draft, as we cannot merge the experimental HIP graph support.
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #4785   +/-  ##
===========================================
+ Coverage    92.29%   92.33%   +0.05%
===========================================
  Files          580      583       +3
  Lines        28688    29627     +939
===========================================
+ Hits         26475    27355     +880
- Misses        2213     2272      +59
```
Motivation
Improve yolov10n Performance
Technical Details
Changelog Category
Add a `CHANGELOG.md` entry for any option other than `Not Applicable`.