Conversation
Pull request overview
This PR introduces HIP graph capture support and new algebraic rewrites intended to improve GPU inference performance (notably for yolov10n), along with a couple of GPU-heuristic tweaks.
Changes:
- Extend the (type-erased) `migraphx::context` API to support a capture lifecycle (`start_capture`/`end_capture`/`get_capture`) and implement it for the GPU context using HIP graphs.
- Add a `program::eval` fast path that replays a previously captured HIP graph when `MIGRAPHX_ENABLE_HIP_GRAPH` is enabled.
- Add/adjust GPU optimization heuristics, including a new convolution + 1x1-convolution fusion in `simplify_algebra` and an `is_mlir_conv` policy change.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tools/include/context.hpp | Extends generated context interface with capture-related virtuals. |
| src/include/migraphx/context.hpp | Adds capture hooks to the type-erased context implementation. |
| src/targets/gpu/include/migraphx/gpu/context.hpp | Implements HIP stream capture + graph instantiate/launch and stores a replay callback. |
| src/program.cpp | Uses capture hooks in program::eval to cache and replay a HIP graph. |
| src/include/migraphx/program.hpp | Declares MIGRAPHX_ENABLE_HIP_GRAPH env var. |
| src/targets/gpu/compile_hip_code_object.cpp | Adjusts global sizing logic when HIP graph mode is enabled. |
| src/driver/verify.cpp | Adds an extra warm-up eval to populate the HIP graph capture. |
| src/simplify_algebra.cpp | Adds a new conv->1x1 conv fusion that composes weights via reshape/transpose/dot. |
| src/targets/gpu/fuse_mlir.cpp | Changes MLIR-conv eligibility heuristic in fast mode (and leaves old logic commented). |
| src/layout_convolution.cpp | Simplifies redundant layout removal condition. |
```cpp
auto mw_dims = wnxn->get_shape().lens();
mw_dims[1] *= groups;

auto w1x1_reshaped = m.insert_instruction(
    ins,
    make_op("reshape", {{"dims", {out_channels, groups, mid_channels / groups}}}),
    w1x1);
auto w1x1_grouped = m.insert_instruction(
    ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), w1x1_reshaped);

auto wnxn_reshaped = m.insert_instruction(
    ins,
    make_op("reshape",
            {{"dims", {groups, mid_channels / groups, in_channels_per_group * w_size}}}),
    wnxn);

auto mw = m.insert_instruction(ins, make_op("dot"), w1x1_grouped, wnxn_reshaped);
auto mw_transposed =
    m.insert_instruction(ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), mw);
auto mw_reshaped =
    m.insert_instruction(ins, make_op("reshape", {{"dims", mw_dims}}), mw_transposed);

auto op = input->get_operator();
op.from_value({{"group", 1}});
auto conv = m.insert_instruction(ins, op, x_ins, mw_reshaped);
```
find_conv_conv_1x1 builds mw_dims from the first convolution’s weights (wnxn) and only updates mw_dims[1]. This leaves mw_dims[0] as the inner conv’s output channels, but the fused weights produced by the dot have leading dimension out_channels (from the 1x1 conv). If out_channels != mid_channels (common), the final reshape will be invalid (element-count mismatch) and/or the replacement convolution will produce the wrong output shape. Update the fused weight shape so dim-0 matches out_channels (and add shape/divisibility checks such as mid_channels % groups == 0) before inserting reshapes/dot.
```cpp
// (x * w1) * w2 => x * (w1 * w2)
struct find_conv_conv_1x1
{
    auto matcher() const
    {
        return conv_1x1(
            match::arg(0)(match::used_once(), match::name("convolution").bind("input")));
    }

    void apply(module& m, const match::matcher_result& r) const
    {
        auto ins   = r.result;
        auto input = r.instructions["input"];
        auto x_ins = input->inputs().front();
        auto wnxn  = input->inputs()[1];
        auto w1x1  = ins->inputs()[1];

        auto out_channels          = w1x1->get_shape().lens()[0];
        auto mid_channels          = w1x1->get_shape().lens()[1];
        auto in_channels_per_group = wnxn->get_shape().lens()[1];
        auto groups = x_ins->get_shape().lens()[1] / in_channels_per_group;
        auto w_size = std::accumulate(wnxn->get_shape().lens().begin() + 2,
                                      wnxn->get_shape().lens().end(),
                                      std::size_t{1},
                                      std::multiplies<>{});

        auto mw_dims = wnxn->get_shape().lens();
        mw_dims[1] *= groups;

        auto w1x1_reshaped = m.insert_instruction(
            ins,
            make_op("reshape", {{"dims", {out_channels, groups, mid_channels / groups}}}),
            w1x1);
        auto w1x1_grouped = m.insert_instruction(
            ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), w1x1_reshaped);

        auto wnxn_reshaped = m.insert_instruction(
            ins,
            make_op("reshape",
                    {{"dims", {groups, mid_channels / groups, in_channels_per_group * w_size}}}),
            wnxn);

        auto mw = m.insert_instruction(ins, make_op("dot"), w1x1_grouped, wnxn_reshaped);
        auto mw_transposed =
            m.insert_instruction(ins, make_op("transpose", {{"permutation", {1, 0, 2}}}), mw);
        auto mw_reshaped =
            m.insert_instruction(ins, make_op("reshape", {{"dims", mw_dims}}), mw_transposed);

        auto op = input->get_operator();
        op.from_value({{"group", 1}});
        auto conv = m.insert_instruction(ins, op, x_ins, mw_reshaped);
        m.replace_instruction(ins, conv);
    }
```
This new convolution+1x1-convolution fusion introduces new behavior in simplify_algebra, but there’s no corresponding test coverage validating the rewrite (especially for the common case where the 1x1 conv changes channel count and/or where the first conv is grouped/depthwise). Please add a unit test in test/simplify_algebra_test.cpp that exercises the rewrite and checks the optimized module is equivalent and preserves output shapes.
```cpp
auto graph = share(hip_graph_ptr{raw_graph});
if(status != hipSuccess)
    MIGRAPHX_THROW("Failed: hipStreamEndCapture: " + hip_error(status));

// auto log = make_shared_array<char>(1024);
hipGraphExec_t raw_graph_exec = nullptr;
status = hipGraphInstantiate(&raw_graph_exec, graph.get(), nullptr, nullptr, 0);
auto graph_exec = share(hip_graph_exec_ptr{raw_graph_exec});
if(status != hipSuccess)
    MIGRAPHX_THROW("Failed: hipGraphInstantiate: " + hip_error(status));
saved_graph = [this, graph, graph_exec, args] {
```
end_capture wraps raw_graph in hip_graph_ptr (and raw_graph_exec in hip_graph_exec_ptr) before checking the HIP return status. If hipStreamEndCapture/hipGraphInstantiate fails, these RAII deleters may run on an invalid/null handle during stack unwinding. Construct the managed pointers only after confirming status == hipSuccess (and consider explicitly handling a null returned graph).
Suggested change:

```diff
-auto graph = share(hip_graph_ptr{raw_graph});
-if(status != hipSuccess)
-    MIGRAPHX_THROW("Failed: hipStreamEndCapture: " + hip_error(status));
-// auto log = make_shared_array<char>(1024);
-hipGraphExec_t raw_graph_exec = nullptr;
-status = hipGraphInstantiate(&raw_graph_exec, graph.get(), nullptr, nullptr, 0);
-auto graph_exec = share(hip_graph_exec_ptr{raw_graph_exec});
-if(status != hipSuccess)
-    MIGRAPHX_THROW("Failed: hipGraphInstantiate: " + hip_error(status));
-saved_graph = [this, graph, graph_exec, args] {
+if(status != hipSuccess)
+    MIGRAPHX_THROW("Failed: hipStreamEndCapture: " + hip_error(status));
+if(raw_graph == nullptr)
+    MIGRAPHX_THROW("Failed: hipStreamEndCapture returned a null graph");
+auto graph = share(hip_graph_ptr{raw_graph});
+// auto log = make_shared_array<char>(1024);
+hipGraphExec_t raw_graph_exec = nullptr;
+status = hipGraphInstantiate(&raw_graph_exec, graph.get(), nullptr, nullptr, 0);
+if(status != hipSuccess)
+    MIGRAPHX_THROW("Failed: hipGraphInstantiate: " + hip_error(status));
+if(raw_graph_exec == nullptr)
+    MIGRAPHX_THROW("Failed: hipGraphInstantiate returned a null graph exec");
+auto graph_exec = share(hip_graph_exec_ptr{raw_graph_exec});
+saved_graph = [this, graph, graph_exec, args] {
```
```cpp
saved_graph = [this, graph, graph_exec, args] {
    auto status2 = hipGraphLaunch(graph_exec.get(), get_stream().get());
    if(status2 != hipSuccess)
        MIGRAPHX_THROW("Failed: hipGraphLaunch: " + hip_error(status2));
    return args;
};
}
```
saved_graph captures this in the lambda ([this, graph, graph_exec, args]). If gpu::context is copied (e.g., via the type-erased migraphx::context clone path), the copied std::function will still hold the original object pointer and can become a use-after-free. Avoid capturing this; instead capture the stream handle (or other required state) by value so the callback remains valid after copies/moves.
```cpp
if(contexts.size() == 1 and enabled(MIGRAPHX_ENABLE_HIP_GRAPH{}))
{
    auto& ctx = contexts.front();
    auto run  = ctx.get_capture();
    if(run != nullptr)
        return run();
    ctx.start_capture();
}
```
When a capture already exists, program::eval returns run() before handling exec_env.async (wait_for/finish_on) and before applying any MIGRAPHX_TRACE_EVAL behavior. This changes observable semantics for callers using async execution environments or tracing. Consider disabling the hip-graph fast path when exec_env.async or tracing is enabled, or make the cached path honor the same synchronization/tracing behavior.
```cpp
if(contexts.size() == 1 and enabled(MIGRAPHX_ENABLE_HIP_GRAPH{}))
{
    auto& ctx = contexts.front();
    ctx.end_capture(ret);
}
```
If ctx.start_capture() is called and generic_eval throws (e.g., missing/shape-mismatched parameters), the stream capture is left open and end_capture is never reached, which can break subsequent executions on that stream. This needs RAII/try-catch cleanup to end/abort capture on exceptions (and to only call end_capture when a capture is actually active).
```cpp
p.eval(m); // run once for hip graph capture
```
This warm-up p.eval(m) runs unconditionally and will execute the model twice even when hip-graph capture is disabled, which significantly slows verification and can skew perf numbers. Gate this extra run behind enabled(MIGRAPHX_ENABLE_HIP_GRAPH{}) (and/or only when the compiled target is GPU).
```diff
 // No winograd for group convolution
-if(group > 1)
-    return true;
-auto w = ins->inputs().at(1)->get_shape();
-if(w.lens().size() != 4)
-    return true;
-if(w.lens()[2] != w.lens()[3])
-    return true;
-return (w.lens()[3] % 3) != 0;
+return group == 1;
+// if(group > 1)
+//     return true;
+// auto w = ins->inputs().at(1)->get_shape();
+// if(w.lens().size() != 4)
+//     return true;
+// if(w.lens()[2] != w.lens()[3])
+//     return true;
+// return (w.lens()[3] % 3) != 0;
```
is_mlir_conv now unconditionally returns group == 1 in fast mode and leaves the previous heuristic commented out below a return, making the intent hard to interpret (the nearby “No winograd for group convolution” comment is also ambiguous with the new logic). Please either remove the dead commented code or replace it with a clear, active heuristic (and update the comment) so future changes don’t accidentally reintroduce the old behavior.
This should be set to draft, as we cannot merge the experimental HIP graph support.
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #4785   +/-  ##
===========================================
+ Coverage    92.29%   92.33%   +0.05%
===========================================
  Files          580      583       +3
  Lines        28688    29627     +939
===========================================
+ Hits         26475    27355     +880
- Misses        2213     2272      +59
```
Motivation
Improve yolov10n Performance
Technical Details
Changelog Category
Add a `CHANGELOG.md` entry for any option other than `Not Applicable`.