Conversation
|
!test |
|
Review updated until commit 42937d3 Description
|
| Relevant files | |||
|---|---|---|---|
| Enhancement |
| ||
| Error handling |
| ||
| Tests |
|
PR Reviewer Guide
Here are some key observations to aid the review process:
| 🧪 PR contains tests |
| ⚡ Recommended focus areas for review |
Error Handling Regression
|
Test failures
-
(Medium, 34)
NVFuser TMA load & inner-reduction tests hitting internal assertions (validator_utils.cpp, indexing.cpp) across multiple runnersTest Name GB200 H100 Source TMASimpleLdstTest.Load/1D_128B___half ❌ ❌ Link TMASimpleLdstTest.Load/1D_128B_float ❌ ❌ Link TMASimpleLdstTest.Load/1D_32B___half ❌ ❌ Link TMASimpleLdstTest.Load/1D_32B_float ❌ ❌ Link TMASimpleLdstTest.Load/1D_64B___half ❌ ❌ Link TMASimpleLdstTest.Load/1D_64B_float ❌ ❌ Link TmaInnerReductionManualTest.Basic/ndim_2_inner_size_1048576 ❌ ❌ Link TmaInnerReductionManualTest.Basic/ndim_2_inner_size_131072 ❌ ❌ Link TmaInnerReductionManualTest.Basic/ndim_2_inner_size_524288 ❌ ❌ Link TmaInnerReductionManualTest.Basic/ndim_2_inner_size_65536 ❌ ❌ Link ... with 7 more test failures omitted. Check internal logs.
|
!test |
|
@naoyam while I'm cleaning things up and verifying tests, do you think it's moving to the right direction? |
|
!test |
|
Looks good overall. |
|
I'm running into some interesting test failures. One of them is an validation error: The symptom is around this TensorView The new code maps This is mathematically correct because these two IterDomains can share the same index -- the index is 0 all the time. However, codegen doesn't seem to like the mapping. Before I throw more if-elses at it, what's the right contract so people can DbC? cc @naoyam |
|
Can you show the diff of generated codes? I'm guessing something isn't working around predicates. |
cc @naoyam |
|
As @naoyam requested: |
|
I'd like to see the diff result comparing the generated kernels. Please run the test with |
|
NVFUSER_DUMP=cuda_kernel without this PR: I think you are right about predication: cc @naoyam |
|
Copying messages from @naoyam for https://abseil.io/resources/swe-book/html/ch03.html I looked into the issue. The issue happens due to the predication for non-divisible splits. https://github.com/NVIDIA/Fuser/blob/main/csrc/id_model/indexing.cpp#L778 IIRC, Xiang had some writeup. https://github.com/NVIDIA/Fuser/blob/main/doc/reading/divisibility-of-split.md In this case, T2 has a non-divisible split with is17: iS17 is split by 8, which effectively expands the domain by a factor of 8, and so we would need to make sure indexing would not go beyond the original extent of iS17, which is just 1. getNonDivisibleIdsToPredicate used here returns iS17 in this case. In main, this line creates this predicate: ( ( ( threadIdx.x / 8 ) / 8 ) < 1 ) Now, the PR adds another mapping: iS17 and iS19 . When we do the traversal, iS19 simply gets assigned with index value of zero. That is because of Merge: iS19{1} and iS25{8} -> iS27{8}. Here, iS25 and iS27 are mapped as part of the almost-exact mappings, so we simply forward the assigned index of iS27 to iS25, and for iS19, I think we simply assign zero (I need to confirm this). Since iS19 gets zero, so does iS17. This results in the non-divisible split predicate of 0 < 1, instead of ( ( ( threadIdx.x / 8 ) / 8 ) < 1 ) . As a result, since 0 < 1 is always true, the resulting code doesn't get any predicate for the non-divisible split. The almost-exact mapping is used for indexing traversal, so its mapping needs to take indexing equality into consideration. Even if two iter domains have the same extent, it doesn't automatically mean they should use the same index. In this case, for the purpose of indexing, I'd question if iS17 and iS19 should be mapped. |
|
!test |
|
@naoyam this is ready for review |
|
!test |
Greptile SummaryThis PR extends the almost-exact graph construction in Key changes:
Confidence Score: 4/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[buildAlmostExactGraph] --> B[Copy EXACT graph]
B --> C[setUnmappable on root/logical/loop domains]
C --> D[Collect trivially-mapped ID pairs\nvia getTriviallyMappedIds]
D --> E[Apply trivial mappings\nalmost_exact_graph.mapVals]
E --> F[mapDivisibleSplits]
subgraph mapDivisibleSplits
F1[For each ValGroup root] --> F2[Find divisible splits of root]
F2 --> F3[Follow outer output → find\ndivisible splits of outer\nCollect outermost_grand groups]
F3 --> F4[For each outermost_grand\ncompare extent to outer outputs\nof all splits of root]
F4 --> F5{Same extent\nAND both splits\ndivisible?}
F5 -- Yes --> F6[Queue ids_to_map pair]
F5 -- No --> F4
F6 --> F7[Apply all deferred mapVals]
end
F --> G[validateConsistency]
G --> H[assertNoSelfMapping]
Last reviewed commit: 40a2cc6 |
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
| for (const ValGroup& outermost_grand : outermost_grands) { | ||
| Val* extent_of_grand = | ||
| outermost_grand->front()->as<IterDomain>()->extent(); | ||
|
|
||
| for (const ExprGroup& use_of_root : uses_of_root) { | ||
| auto* split = dynamic_cast<Split*>(use_of_root->front()); | ||
| if (split == nullptr || !is_divisible(split)) { | ||
| continue; | ||
| } | ||
|
|
||
| const ValGroup& outer = graph.toGroup(split->outer()); | ||
| if (outer->front()->as<IterDomain>()->extent()->sameAs( | ||
| extent_of_grand)) { | ||
| ids_to_map.emplace_back(outermost_grand->front(), outer->front()); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Duplicate mapping entries possible
When multiple split-split paths from the same root share the same outermost_grand, that group gets added to outermost_grands more than once. The second loop will then emit duplicate (id1, id2) pairs into ids_to_map for every repeated grand. mapVals is idempotent, so correctness is preserved, but the deduplication of outermost_grands (e.g. using an UnorderedSetOfValGroup rather than a std::vector) would prevent the redundant work and keep ids_to_map minimal.
| for (const ValGroup& outermost_grand : outermost_grands) { | ||
| Val* extent_of_grand = | ||
| outermost_grand->front()->as<IterDomain>()->extent(); | ||
|
|
||
| for (const ExprGroup& use_of_root : uses_of_root) { | ||
| auto* split = dynamic_cast<Split*>(use_of_root->front()); | ||
| if (split == nullptr || !is_divisible(split)) { | ||
| continue; | ||
| } | ||
|
|
||
| const ValGroup& outer = graph.toGroup(split->outer()); | ||
| if (outer->front()->as<IterDomain>()->extent()->sameAs( | ||
| extent_of_grand)) { | ||
| ids_to_map.emplace_back(outermost_grand->front(), outer->front()); |
There was a problem hiding this comment.
Second loop can self-map outermost_grand through its own ancestor
The second loop re-scans every use of root and maps outer->front() to outermost_grand->front() whenever their extents agree. Nothing prevents that outer from being the very outer that feeds into split1 to produce outermost_grand — i.e. the direct parent of outermost_grand in the split chain.
In practice this case cannot occur when split1.factor > 1 (the parent has a strictly larger extent), but when split1.factor == 1 (a degenerate, length-1 inner split), outermost_grand.extent == outer.extent. Then ids_to_map receives a pair (outermost_grand->front(), outer->front()) that maps the outer output of split1 to its own input. mapVals will not fire because is_divisible checks split1->isDivisible(), and a factor-1 split is always divisible, so the pair would be queued.
A guard comparing the two ValGroup pointers before emitting the pair would make the intent explicit and protect against future degenerate fusions:
if (outer != outermost_grand &&
outer->front()->as<IterDomain>()->extent()->sameAs(extent_of_grand)) {
ids_to_map.emplace_back(outermost_grand->front(), outer->front());
}| const ExprGroups& ValGraph::getUses(const ValGroup& val_group) const { | ||
| NVF_ERROR(val_group, "Nullptr not allowed"); | ||
|
|
||
| static const ExprGroups empty_expr_groups; | ||
| const auto it = unique_uses_.find(val_group); | ||
| NVF_ERROR( | ||
| it != unique_uses_.end(), | ||
| "Use group not found for ", | ||
| nvfuser::toString(val_group)); | ||
| if (it == unique_uses_.end()) { | ||
| return empty_expr_groups; | ||
| } | ||
| return it->second; | ||
| } |
There was a problem hiding this comment.
Behavioral change silently relaxes a previous invariant
The old implementation treated a missing val_group entry in unique_uses_ as a hard error (via NVF_ERROR). The new implementation silently returns an empty set. While this is functionally required by mapDivisibleSplits (which calls getUses on leaf nodes that have no entries), it also removes the diagnostic for callers that previously relied on the error to detect graphs built incorrectly. By contrast, getDefinitions still throws on a missing entry.
Consider whether a comment, or a separate hasUses()/tryGetUses() accessor, would make the relaxed contract explicit without silently hiding misuse.
|
!test |
A spin-off from #4404
For #3987