
Improve yolov10n Performance #4785

Draft
ahsan-ca wants to merge 11 commits into develop from investigate_yolov10n_perf_rebase

Conversation

@ahsan-ca
Contributor

Motivation

Improve yolov10n Performance

Technical Details

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

@ahsan-ca ahsan-ca requested a review from kahmed10 April 14, 2026 21:19
@ahsan-ca ahsan-ca self-assigned this Apr 14, 2026
@ahsan-ca ahsan-ca requested a review from causten as a code owner April 14, 2026 21:19
Copilot AI review requested due to automatic review settings April 14, 2026 21:19

Copilot AI left a comment


Pull request overview

This PR introduces HIP graph capture support and new algebraic rewrites intended to improve GPU inference performance (notably for yolov10n), along with a couple of GPU-heuristic tweaks.

Changes:

  • Extend the (type-erased) migraphx::context API to support capture lifecycle (start_capture/end_capture/get_capture) and implement it for the GPU context using HIP graphs.
  • Add a program::eval fast-path that replays a previously captured HIP graph when MIGRAPHX_ENABLE_HIP_GRAPH is enabled.
  • Add/adjust GPU optimization heuristics, including a new convolution-convolution(1x1) fusion in simplify_algebra and an is_mlir_conv policy change.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

File Description
tools/include/context.hpp Extends generated context interface with capture-related virtuals.
src/include/migraphx/context.hpp Adds capture hooks to the type-erased context implementation.
src/targets/gpu/include/migraphx/gpu/context.hpp Implements HIP stream capture + graph instantiate/launch and stores a replay callback.
src/program.cpp Uses capture hooks in program::eval to cache and replay a HIP graph.
src/include/migraphx/program.hpp Declares MIGRAPHX_ENABLE_HIP_GRAPH env var.
src/targets/gpu/compile_hip_code_object.cpp Adjusts global sizing logic when HIP graph mode is enabled.
src/driver/verify.cpp Adds an extra warm-up eval to populate the HIP graph capture.
src/simplify_algebra.cpp Adds a new conv->1x1 conv fusion that composes weights via reshape/transpose/dot.
src/targets/gpu/fuse_mlir.cpp Changes MLIR-conv eligibility heuristic in fast mode (and leaves old logic commented).
src/layout_convolution.cpp Simplifies redundant layout removal condition.


Comment thread src/simplify_algebra.cpp
Comment on lines +1139 to +1163
        auto mw_dims = wnxn->get_shape().lens();
        mw_dims[1] *= groups;

        auto w1x1_reshaped = m.insert_instruction(
            ins,
            make_op("reshape", {{"dims", {out_channels, groups, mid_channels / groups}}}),
            w1x1);
        auto w1x1_grouped = m.insert_instruction(
            ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), w1x1_reshaped);

        auto wnxn_reshaped = m.insert_instruction(
            ins,
            make_op("reshape",
                    {{"dims", {groups, mid_channels / groups, in_channels_per_group * w_size}}}),
            wnxn);

        auto mw = m.insert_instruction(ins, make_op("dot"), w1x1_grouped, wnxn_reshaped);
        auto mw_transposed =
            m.insert_instruction(ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), mw);
        auto mw_reshaped =
            m.insert_instruction(ins, make_op("reshape", {{"dims", mw_dims}}), mw_transposed);

        auto op = input->get_operator();
        op.from_value({{"group", 1}});
        auto conv = m.insert_instruction(ins, op, x_ins, mw_reshaped);

Copilot AI Apr 14, 2026


find_conv_conv_1x1 builds mw_dims from the first convolution’s weights (wnxn) and only updates mw_dims[1]. This leaves mw_dims[0] as the inner conv’s output channels, but the fused weights produced by the dot have leading dimension out_channels (from the 1x1 conv). If out_channels != mid_channels (common), the final reshape will be invalid (element-count mismatch) and/or the replacement convolution will produce the wrong output shape. Update the fused weight shape so dim-0 matches out_channels (and add shape/divisibility checks such as mid_channels % groups == 0) before inserting reshapes/dot.
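The element-count argument can be checked with plain shape arithmetic. The sketch below uses hypothetical sizes; `elems` and `reshape_valid` are illustration helpers, not MIGraphX APIs. It shows that the dot output and the `mw_dims` reshape only agree when `out_channels == mid_channels`:

```cpp
#include <cassert>
#include <cstddef>

// Element count of a 4-D weight tensor [n, c, kh, kw].
std::size_t elems(std::size_t n, std::size_t c, std::size_t kh, std::size_t kw)
{
    return n * c * kh * kw;
}

// The dot yields groups * out * (in_per_group * k * k) elements, while mw_dims
// as written keeps dim 0 = mid; the final reshape is valid only if they match.
bool reshape_valid(std::size_t out,
                   std::size_t mid,
                   std::size_t in_per_group,
                   std::size_t groups,
                   std::size_t k)
{
    std::size_t dot_elems     = groups * out * in_per_group * k * k;
    std::size_t mw_dims_elems = elems(mid, in_per_group * groups, k, k);
    return dot_elems == mw_dims_elems;
}
```

For example, `reshape_valid(64, 32, 16, 1, 3)` is false (9216 vs 4608 elements), which is the `out_channels != mid_channels` case the comment describes.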

Comment thread src/simplify_algebra.cpp
Comment on lines +1113 to +1165
// (x * w1) * w2 => x * (w1 * w2)
struct find_conv_conv_1x1
{
    auto matcher() const
    {
        return conv_1x1(
            match::arg(0)(match::used_once(), match::name("convolution").bind("input")));
    }

    void apply(module& m, const match::matcher_result& r) const
    {
        auto ins   = r.result;
        auto input = r.instructions["input"];
        auto x_ins = input->inputs().front();
        auto wnxn  = input->inputs()[1];
        auto w1x1  = ins->inputs()[1];

        auto out_channels          = w1x1->get_shape().lens()[0];
        auto mid_channels          = w1x1->get_shape().lens()[1];
        auto in_channels_per_group = wnxn->get_shape().lens()[1];
        auto groups                = x_ins->get_shape().lens()[1] / in_channels_per_group;
        auto w_size                = std::accumulate(wnxn->get_shape().lens().begin() + 2,
                                                     wnxn->get_shape().lens().end(),
                                                     std::size_t{1},
                                                     std::multiplies<>{});

        auto mw_dims = wnxn->get_shape().lens();
        mw_dims[1] *= groups;

        auto w1x1_reshaped = m.insert_instruction(
            ins,
            make_op("reshape", {{"dims", {out_channels, groups, mid_channels / groups}}}),
            w1x1);
        auto w1x1_grouped = m.insert_instruction(
            ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), w1x1_reshaped);

        auto wnxn_reshaped = m.insert_instruction(
            ins,
            make_op("reshape",
                    {{"dims", {groups, mid_channels / groups, in_channels_per_group * w_size}}}),
            wnxn);

        auto mw = m.insert_instruction(ins, make_op("dot"), w1x1_grouped, wnxn_reshaped);
        auto mw_transposed =
            m.insert_instruction(ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), mw);
        auto mw_reshaped =
            m.insert_instruction(ins, make_op("reshape", {{"dims", mw_dims}}), mw_transposed);

        auto op = input->get_operator();
        op.from_value({{"group", 1}});
        auto conv = m.insert_instruction(ins, op, x_ins, mw_reshaped);
        m.replace_instruction(ins, conv);
    }

Copilot AI Apr 14, 2026


This new convolution+1x1-convolution fusion introduces new behavior in simplify_algebra, but there’s no corresponding test coverage validating the rewrite (especially for the common case where the 1x1 conv changes channel count and/or where the first conv is grouped/depthwise). Please add a unit test in test/simplify_algebra_test.cpp that exercises the rewrite and checks the optimized module is equivalent and preserves output shapes.

Comment on lines +398 to +408
    auto graph = share(hip_graph_ptr{raw_graph});
    if(status != hipSuccess)
        MIGRAPHX_THROW("Failed: hipStreamEndCapture: " + hip_error(status));

    // auto log = make_shared_array<char>(1024);
    hipGraphExec_t raw_graph_exec = nullptr;
    status = hipGraphInstantiate(&raw_graph_exec, graph.get(), nullptr, nullptr, 0);
    auto graph_exec = share(hip_graph_exec_ptr{raw_graph_exec});
    if(status != hipSuccess)
        MIGRAPHX_THROW("Failed: hipGraphInstantiate: " + hip_error(status));
    saved_graph = [this, graph, graph_exec, args] {

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end_capture wraps raw_graph in hip_graph_ptr (and raw_graph_exec in hip_graph_exec_ptr) before checking the HIP return status. If hipStreamEndCapture/hipGraphInstantiate fails, these RAII deleters may run on an invalid/null handle during stack unwinding. Construct the managed pointers only after confirming status == hipSuccess (and consider explicitly handling a null returned graph).

Suggested change. Replace:

    auto graph = share(hip_graph_ptr{raw_graph});
    if(status != hipSuccess)
        MIGRAPHX_THROW("Failed: hipStreamEndCapture: " + hip_error(status));

    // auto log = make_shared_array<char>(1024);
    hipGraphExec_t raw_graph_exec = nullptr;
    status = hipGraphInstantiate(&raw_graph_exec, graph.get(), nullptr, nullptr, 0);
    auto graph_exec = share(hip_graph_exec_ptr{raw_graph_exec});
    if(status != hipSuccess)
        MIGRAPHX_THROW("Failed: hipGraphInstantiate: " + hip_error(status));
    saved_graph = [this, graph, graph_exec, args] {

with:

    if(status != hipSuccess)
        MIGRAPHX_THROW("Failed: hipStreamEndCapture: " + hip_error(status));
    if(raw_graph == nullptr)
        MIGRAPHX_THROW("Failed: hipStreamEndCapture returned a null graph");
    auto graph = share(hip_graph_ptr{raw_graph});
    // auto log = make_shared_array<char>(1024);
    hipGraphExec_t raw_graph_exec = nullptr;
    status = hipGraphInstantiate(&raw_graph_exec, graph.get(), nullptr, nullptr, 0);
    if(status != hipSuccess)
        MIGRAPHX_THROW("Failed: hipGraphInstantiate: " + hip_error(status));
    if(raw_graph_exec == nullptr)
        MIGRAPHX_THROW("Failed: hipGraphInstantiate returned a null graph exec");
    auto graph_exec = share(hip_graph_exec_ptr{raw_graph_exec});
    saved_graph = [this, graph, graph_exec, args] {

Comment on lines +408 to +414
    saved_graph = [this, graph, graph_exec, args] {
        auto status2 = hipGraphLaunch(graph_exec.get(), get_stream().get());
        if(status2 != hipSuccess)
            MIGRAPHX_THROW("Failed: hipGraphLaunch: " + hip_error(status2));
        return args;
    };
}

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

saved_graph captures this in the lambda ([this, graph, graph_exec, args]). If gpu::context is copied (e.g., via the type-erased migraphx::context clone path), the copied std::function will still hold the original object pointer and can become a use-after-free. Avoid capturing this; instead capture the stream handle (or other required state) by value so the callback remains valid after copies/moves.
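One way to address this, sketched below with made-up names (`replay_state`, `make_replay`), is to build the callback from a shared handle captured by value rather than from `this`:

```cpp
#include <functional>
#include <memory>

// Hypothetical stand-in for the state the replay callback actually needs
// (stream handle, instantiated graph exec, output args).
struct replay_state
{
    int launches = 0;
};

// The lambda captures a shared_ptr by value: copies of the context copy the
// std::function, and the shared state outlives any one copy, so no dangling
// `this` pointer is involved.
std::function<void()> make_replay(std::shared_ptr<replay_state> state)
{
    return [state] { state->launches += 1; };
}
```

In the real context this would mean capturing the stream handle and `graph_exec` by value in `saved_graph` instead of calling `get_stream()` through `this`.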

Comment thread src/program.cpp
Comment on lines +593 to +600
    if(contexts.size() == 1 and enabled(MIGRAPHX_ENABLE_HIP_GRAPH{}))
    {
        auto& ctx = contexts.front();
        auto run  = ctx.get_capture();
        if(run != nullptr)
            return run();
        ctx.start_capture();
    }

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a capture already exists, program::eval returns run() before handling exec_env.async (wait_for/finish_on) and before applying any MIGRAPHX_TRACE_EVAL behavior. This changes observable semantics for callers using async execution environments or tracing. Consider disabling the hip-graph fast path when exec_env.async or tracing is enabled, or make the cached path honor the same synchronization/tracing behavior.
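The suggested gating can be expressed as a single predicate. This is a sketch with an illustrative name; in the real code the inputs would come from `exec_env.async` and the `MIGRAPHX_TRACE_EVAL` check:

```cpp
// Take the graph-replay fast path only when replay cannot change observable
// semantics: no async execution environment and no eval tracing.
bool use_graph_fast_path(bool graph_enabled, bool has_async_env, bool tracing)
{
    return graph_enabled and not has_async_env and not tracing;
}
```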

Comment thread src/program.cpp
Comment on lines +676 to +680
    if(contexts.size() == 1 and enabled(MIGRAPHX_ENABLE_HIP_GRAPH{}))
    {
        auto& ctx = contexts.front();
        ctx.end_capture(ret);
    }

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ctx.start_capture() is called and generic_eval throws (e.g., missing/shape-mismatched parameters), the stream capture is left open and end_capture is never reached, which can break subsequent executions on that stream. This needs RAII/try-catch cleanup to end/abort capture on exceptions (and to only call end_capture when a capture is actually active).
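A small RAII guard is one way to guarantee the capture is closed on every path. This is a generic sketch; the real abort action would end the stream capture (e.g. via hipStreamEndCapture) and discard the partial graph:

```cpp
#include <functional>
#include <stdexcept>
#include <utility>

// Minimal RAII guard: runs the abort action on scope exit unless dismissed,
// so an exception thrown while a capture is open still closes it.
class capture_guard
{
    std::function<void()> abort_fn;
    bool active = true;

    public:
    explicit capture_guard(std::function<void()> f) : abort_fn(std::move(f)) {}
    capture_guard(const capture_guard&) = delete;
    ~capture_guard()
    {
        if(active and abort_fn)
            abort_fn(); // e.g. end + discard the partially captured HIP graph
    }
    void dismiss() { active = false; } // call after a successful end_capture
};
```

In program::eval this would be constructed right after `ctx.start_capture()` and dismissed once `ctx.end_capture(ret)` succeeds, so a throw from generic_eval no longer leaves the stream capturing.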

Comment thread src/driver/verify.cpp
Comment on lines +141 to +142
p.eval(m); // run once for hip graph capture


Copilot AI Apr 14, 2026


This warm-up p.eval(m) runs unconditionally and will execute the model twice even when hip-graph capture is disabled, which significantly slows verification and can skew perf numbers. Gate this extra run behind enabled(MIGRAPHX_ENABLE_HIP_GRAPH{}) (and/or only when the compiled target is GPU).
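The proposed gating amounts to: one eval normally, two only when capture is enabled. A trivial sketch (in the real code the flag would come from enabled(MIGRAPHX_ENABLE_HIP_GRAPH{})):

```cpp
// How many times verify would evaluate the program per input set under the
// proposed gating: the warm-up eval runs only when HIP graph capture is on.
int eval_count(bool hip_graph_enabled)
{
    return hip_graph_enabled ? 2 : 1;
}
```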

Comment on lines 374 to +383
// No winograd for group convolution
if(group > 1)
return true;
auto w = ins->inputs().at(1)->get_shape();
if(w.lens().size() != 4)
return true;
if(w.lens()[2] != w.lens()[3])
return true;
return (w.lens()[3] % 3) != 0;
return group == 1;
// if(group > 1)
// return true;
// auto w = ins->inputs().at(1)->get_shape();
// if(w.lens().size() != 4)
// return true;
// if(w.lens()[2] != w.lens()[3])
// return true;
// return (w.lens()[3] % 3) != 0;

Copilot AI Apr 14, 2026


is_mlir_conv now unconditionally returns group == 1 in fast mode and leaves the previous heuristic commented out below a return, making the intent hard to interpret (the nearby “No winograd for group convolution” comment is also ambiguous with the new logic). Please either remove the dead commented code or replace it with a clear, active heuristic (and update the comment) so future changes don’t accidentally reintroduce the old behavior.
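To see the behavioral change the comment describes, the two predicates can be put side by side. These are free-function sketches with illustrative names; in the source this logic lives inside is_mlir_conv and the boolean's meaning follows the surrounding code:

```cpp
#include <cstddef>
#include <vector>

// Previous fast-mode check, as in the removed lines: fires for grouped,
// non-4D, non-square, or non-multiple-of-3 kernels.
bool old_check(std::size_t group, const std::vector<std::size_t>& wlens)
{
    if(group > 1)
        return true;
    if(wlens.size() != 4)
        return true;
    if(wlens[2] != wlens[3])
        return true;
    return (wlens[3] % 3) != 0;
}

// New fast-mode check from this PR: depends only on the group count,
// and flips the result for both grouped and plain 3x3 convolutions.
bool new_check(std::size_t group) { return group == 1; }
```

For a grouped conv (group = 2) the old check returns true and the new one false; for a square 3x3 conv with group = 1 the old check returns false and the new one true, so the heuristic is inverted rather than merely simplified.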

Copilot uses AI. Check for mistakes.
@pfultz2
Collaborator

pfultz2 commented Apr 14, 2026

This should be set to draft, as we cannot merge the experimental HIP graph support.

@codecov

codecov Bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 23.18841% with 53 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/simplify_algebra.cpp 31.71% 28 Missing ⚠️
src/include/migraphx/context.hpp 0.00% 20 Missing ⚠️
src/program.cpp 28.57% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4785      +/-   ##
===========================================
+ Coverage    92.29%   92.33%   +0.05%     
===========================================
  Files          580      583       +3     
  Lines        28688    29627     +939     
===========================================
+ Hits         26475    27355     +880     
- Misses        2213     2272      +59     
Files with missing lines Coverage Δ
src/include/migraphx/program.hpp 100.00% <ø> (ø)
src/layout_convolution.cpp 95.65% <100.00%> (-0.12%) ⬇️
src/program.cpp 71.57% <28.57%> (-0.30%) ⬇️
src/include/migraphx/context.hpp 53.76% <0.00%> (-14.73%) ⬇️
src/simplify_algebra.cpp 96.35% <31.71%> (-2.12%) ⬇️

... and 35 files with indirect coverage changes


@ahsan-ca ahsan-ca marked this pull request as draft April 15, 2026 14:49
4 participants