Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
9434532
Benchmarking guide
Dec 17, 2025
7a23bb0
fix typo
Dec 17, 2025
35d8d22
fix
Dec 17, 2025
7566e41
Benchmarking guide
Dec 17, 2025
57d0ea8
Benchmarking guide
Dec 17, 2025
f9a7b09
Benchmarking guide
Dec 17, 2025
7deed13
revision
Jan 4, 2026
c7eeb66
revision
Jan 4, 2026
c349dd1
Add small clarifications and explanations to benchmarking guide
julian-reed Mar 9, 2026
dd68f3c
Eval Unit Tests for Adversarial Eval Testing (#82)
bkal01 Dec 19, 2025
d7acadd
fix exclusive cumsum calculation (#109)
bkal01 Dec 24, 2025
9246131
Migrating to UV for Dependency (#112)
PaliC Dec 27, 2025
e0b1140
Package Metadata / src path update (#114)
simonguozirui Dec 27, 2025
a2265ff
infra updates to enable modal-based leaderboard(#100)
pythonomar22 Dec 31, 2025
f28d9c4
level 1-97 problem update: remove device and dtype from sdpa tensors …
taras-sereda Jan 2, 2026
056a8d0
Simplified Thunderkittens Port (#107)
Willy-Chan Jan 3, 2026
e459ef2
Static Kernel Code Checker (#110)
simonguozirui Jan 6, 2026
cfd1609
Dataset Object (#95)
pythonomar22 Jan 7, 2026
27d66e9
Enabling NCU Metric Profiling via Pythonic API (#105)
simonguozirui Jan 8, 2026
28dc8dd
Dependency Updates + Separate Reference Timing (#127)
simonguozirui Jan 20, 2026
42cc2fa
Add HIP backend (#135) for AMD GPUs with AMD folks
amd-asalykov Feb 27, 2026
2dcab5e
update all legacy python commands to UV + document integration (#143)
SebastianFisher Mar 5, 2026
8a41bc9
Make runnable after changes to main
julian-reed Mar 23, 2026
c79621f
Apply Sahan's suggestions
julian-reed Mar 24, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .env.example
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-api03-...

# Google Gemini
GEMINI_API_KEY=...
GEMINI_API_KEY=

# DeepSeek
DEEPSEEK_API_KEY=sk-...
Expand Down
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,5 @@ cache/*
!results/timing/
.env
_build_cache/
uv.lock
CLAUDE.md
57 changes: 57 additions & 0 deletions EVAL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
# Evaluation
[WIP] More notes on Benchmarking Guide
To be updated more comprehensively with the benchmarking guide (ongoing PRs) & blog that we have been working on this quarter.

You should be **extra CAREFUL!** , be always paranoid about suspiciously good results — kernel engineers and existing compilers are already pretty good, so a >2x speedup for anything is highly unlikely.


> “if you beat cudnn by more than 10%, think again” --
from [itsclivetime](https://x.com/itsclivetime/status/1992155951630307633?s=46)


If the model can reward hack, it will find ways to reward hack! This can especially happen during RL training or evolutionary search.

Check out resources here:
- KernelBench [v0.1 Release](https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/)
- Cognition and Stanford's [Kevin](https://arxiv.org/abs/2507.11948) project on various hacking behaviors observed in RL training
- Jiwei Li's awesome [blogpost 1](https://deep-reinforce.com/defense_kernel_hack.html) and [blogpost 2](https://deep-reinforce.com/correctness_check.html) on Hacks and Defenses in Automatic GPU Kernel Generations

Our ongoing blogpost and PRs try to systematize and list out these behaviors and provide tests, detection, and mitigation toolings.

**Disclaimer**: KernelBench is an open-source evaluation framework. Due to limited bandwidth, the KernelBench team does not inspect, validate, or endorse any third-party kernels or reported results. Users are welcome to use the software infrastructure for evaluation, but should independently verify all results.


## Methodology
More on that coming.

To ensure **consistency and reproducibility**, we recommend using `modal` and we have provided / are adding more various modal cloud functions to standardize the evaluation environment.

### Correctness
More coming. We also want to highlight community effort such as [BackendBench](https://www.youtube.com/watch?v=BTfjdyZOKww).

### Performance
We highly recommend watching this [lecture](https://www.youtube.com/watch?v=1i7dxoAfKOU) from GPU mode on kernel profiling.

We have (and continue to) implement various approaches to conduct kernel timing to understand the tradeoffs.

Check out `timing.py` to see available timing methods and `src/unit_tests/test_eval_timing.py` to test out various timing methods (including leveraging `cuda_event` marker, Triton `do_bench`, `host_time` E2E time). @palic and team is working on a blogpost explaining the different tradeoffs soon.

### Profiling
We have experimental profiling support leveraging NVIDIA NCU in `profile.py`.

### Checkers
There are potentially many ways model might reward hack and we would like to catch the known ways through checkers [experimental and WIP]. We start with `kernel_static_checker.py`, which is a regex-based checker on the genenrated code against set of rules. We plan to add AST-based, LM-as-a-judge, and more runtime checks in the future. We welcome suggestions and contributions here.

### Unit Tests with Adversarial Examples
We've included some unit tests for the eval script in `src/unit_tests/test_eval_adversarial.py`. These tests run adversarial kernels (see `src/unit_tests/test_kernels/`) that contain examples of reward hacking that we've seen from LLMs and ensures that the eval script catches them, either by failing their correctness checks or flagging them for excessive speedups. Examples include:
- Reusing computations cached during the PyTorch reference
- Modifying inputs to cheat correctness checks
- Moving computation to a non-default CUDA stream

We will continue to add more tests as we explore additional adversarial scenarios.


Note: KernelBench is an ongoing open-source effort — please help us with issues and PRs!


Shoutout to @bkal01, @palic, @miru_why, @ngc92, @itsclivetime, for their suggestions and feedback.
2 changes: 2 additions & 0 deletions KernelBench/changelog/v0.1
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
Please refer to KernelBench v0.1 release note
https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/
5 changes: 5 additions & 0 deletions KernelBench/changelog/v0.2
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
Ongoing Effort

Updated Level1/12_Matmul_with_diagonal_matrices_.py - More efficient PyTorch implementation
Updated Level1/92_cumsum_exclusive.py - Fix exclusive cumsum implementation
Updated Level1/97_ScaledDotProductAttention.py - remove device and precision definition setting, harness should handle it
4 changes: 3 additions & 1 deletion KernelBench/level1/12_Matmul_with_diagonal_matrices_.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,9 @@ def forward(self, A, B):
Returns:
torch.Tensor: The result of the matrix multiplication. Shape: (N, M).
"""
return torch.diag(A) @ B
# Logically equivalent to torch.diag(A) @ B
# more efficient as no need to materialize a full N×N matrix
return A.unsqueeze(1) * B

M = 4096
N = 4096
Expand Down
4 changes: 2 additions & 2 deletions KernelBench/level1/92_cumsum_exclusive.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,8 @@ def __init__(self, dim):
self.dim = dim

def forward(self, x):
exclusive_cumsum = torch.cat((torch.zeros_like(x.select(self.dim, 0).unsqueeze(self.dim)), x), dim=self.dim)[:-1]
return torch.cumsum(exclusive_cumsum, dim=self.dim)
cumsum = torch.cumsum(x.narrow(dim=self.dim, start=0, length=x.size(self.dim)-1), dim=self.dim)
return torch.cat((torch.zeros_like(x.select(self.dim, 0).unsqueeze(self.dim)), cumsum), dim=self.dim)

batch_size = 32768
input_shape = (32768,)
Expand Down
6 changes: 3 additions & 3 deletions KernelBench/level1/97_ScaledDotProductAttention.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,9 @@ def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Te
embedding_dimension = 1024

def get_inputs():
Q = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension, device='cuda', dtype=torch.float16)
K = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension, device='cuda', dtype=torch.float16)
V = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension, device='cuda', dtype=torch.float16)
Q = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension)
K = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension)
V = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension)
return [Q, K, V]

def get_init_inputs():
Expand Down
Loading