ScalingIntelligence · PaliC · Dec 17, 2025 · Dec 17, 2025 · Dec 17, 2025 · Dec 17, 2025
diff --git a/.env.example b/.env.example
@@ -9,7 +9,7 @@ OPENAI_API_KEY=sk-...
 ANTHROPIC_API_KEY=sk-ant-api03-...
 
 # Google Gemini
-GEMINI_API_KEY=...
+GEMINI_API_KEY=
 
 # DeepSeek
 DEEPSEEK_API_KEY=sk-...

diff --git a/.gitignore b/.gitignore
@@ -11,3 +11,5 @@ cache/*
 !results/timing/
 .env
 _build_cache/
+uv.lock
+CLAUDE.md
diff --git a/EVAL.md b/EVAL.md
@@ -0,0 +1,57 @@
+# Evaluation
+[WIP] More notes on Benchmarking Guide
+To be updated more comprehensively with the benchmarking guide (ongoing PRs) & blog that we have been working on this quarter.
+
+You should be **extra CAREFUL!** , be always paranoid about suspiciously good results — kernel engineers and existing compilers are already pretty good, so a >2x speedup for anything is highly unlikely. 
+
+
+> “if you beat cudnn by more than 10%, think again” -- 
+from [itsclivetime](https://x.com/itsclivetime/status/1992155951630307633?s=46)
+
+
+If the model can reward hack, it will find ways to reward hack! This can especially happen during RL training or evolutionary search.
+
+Check out resources here:
+- KernelBench [v0.1 Release](https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/) 
+- Cognition and Stanford's [Kevin](https://arxiv.org/abs/2507.11948) project on various hacking behaviors observed in RL training
+- Jiwei Li's awesome [blogpost 1](https://deep-reinforce.com/defense_kernel_hack.html) and [blogpost 2](https://deep-reinforce.com/correctness_check.html) on Hacks and Defenses in Automatic GPU Kernel Generations
+
+Our ongoing blogpost and PRs try to systematize and list out these behaviors and provide tests, detection, and mitigation toolings.
+
+**Disclaimer**: KernelBench is an open-source evaluation framework. Due to limited bandwidth, the KernelBench team does not inspect, validate, or endorse any third-party kernels or reported results. Users are welcome to use the software infrastructure for evaluation, but should independently verify all results.
+
+
+## Methodology
+More on that coming.
+
+To ensure **consistency and reproducibility**, we recommend using `modal` and we have provided / are adding more various modal cloud functions to standardize the evaluation environment.
+
+### Correctness
+More coming. We also want to highlight community effort such as [BackendBench](https://www.youtube.com/watch?v=BTfjdyZOKww).
+
+### Performance
+We highly recommend watching this [lecture](https://www.youtube.com/watch?v=1i7dxoAfKOU) from GPU mode on kernel profiling. 
+
+We have (and continue to) implement various approaches to conduct kernel timing to understand the tradeoffs.
+
+Check out `timing.py` to see available timing methods and `src/unit_tests/test_eval_timing.py` to test out various timing methods (including leveraging `cuda_event` marker, Triton `do_bench`, `host_time` E2E time). @palic and team is working on a blogpost explaining the different tradeoffs soon. 
+
+### Profiling
+We have experimental profiling support leveraging NVIDIA NCU in `profile.py`.
+
+### Checkers
+There are potentially many ways model might reward hack and we would like to catch the known ways through checkers [experimental and WIP]. We start with `kernel_static_checker.py`, which is a regex-based checker on the genenrated code against set of rules. We plan to add AST-based, LM-as-a-judge, and more runtime checks in the future. We welcome suggestions and contributions here.
+
+### Unit Tests with Adversarial Examples
+We've included some unit tests for the eval script in `src/unit_tests/test_eval_adversarial.py`. These tests run adversarial kernels (see `src/unit_tests/test_kernels/`) that contain examples of reward hacking that we've seen from LLMs and ensures that the eval script catches them, either by failing their correctness checks or flagging them for excessive speedups. Examples include:
+- Reusing computations cached during the PyTorch reference
+- Modifying inputs to cheat correctness checks
+- Moving computation to a non-default CUDA stream
+
+We will continue to add more tests as we explore additional adversarial scenarios.
+
+
+Note: KernelBench is an ongoing open-source effort — please help us with issues and PRs!
+
+
+Shoutout to @bkal01, @palic, @miru_why, @ngc92, @itsclivetime, for their suggestions and feedback. 
diff --git a/KernelBench/changelog/v0.1 b/KernelBench/changelog/v0.1
@@ -0,0 +1,2 @@
+Please refer to KernelBench v0.1 release note
+https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/ 
diff --git a/KernelBench/changelog/v0.2 b/KernelBench/changelog/v0.2
@@ -0,0 +1,5 @@
+Ongoing Effort
+
+Updated Level1/12_Matmul_with_diagonal_matrices_.py - More efficient PyTorch implementation
+Updated Level1/92_cumsum_exclusive.py - Fix exclusive cumsum implementation
+Updated Level1/97_ScaledDotProductAttention.py - remove device and precision definition setting, harness should handle it
diff --git a/KernelBench/level1/12_Matmul_with_diagonal_matrices_.py b/KernelBench/level1/12_Matmul_with_diagonal_matrices_.py
@@ -20,7 +20,9 @@ def forward(self, A, B):
         Returns:
             torch.Tensor: The result of the matrix multiplication. Shape: (N, M).
         """
-        return torch.diag(A) @ B
+        # Logically equivalent to torch.diag(A) @ B 
+        # more efficient as no need to materialize a full N×N matrix
+        return A.unsqueeze(1) * B
 
 M = 4096
 N = 4096

diff --git a/KernelBench/level1/92_cumsum_exclusive.py b/KernelBench/level1/92_cumsum_exclusive.py
@@ -14,8 +14,8 @@ def __init__(self, dim):
         self.dim = dim
 
     def forward(self, x):
-        exclusive_cumsum = torch.cat((torch.zeros_like(x.select(self.dim, 0).unsqueeze(self.dim)), x), dim=self.dim)[:-1]
-        return torch.cumsum(exclusive_cumsum, dim=self.dim)
+        cumsum = torch.cumsum(x.narrow(dim=self.dim, start=0, length=x.size(self.dim)-1), dim=self.dim)
+        return torch.cat((torch.zeros_like(x.select(self.dim, 0).unsqueeze(self.dim)), cumsum), dim=self.dim)
 
 batch_size = 32768
 input_shape = (32768,)

diff --git a/KernelBench/level1/97_ScaledDotProductAttention.py b/KernelBench/level1/97_ScaledDotProductAttention.py
@@ -15,9 +15,9 @@ def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Te
 embedding_dimension = 1024
 
 def get_inputs():
-    Q = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension, device='cuda', dtype=torch.float16)
-    K = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension, device='cuda', dtype=torch.float16)
-    V = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension, device='cuda', dtype=torch.float16)
+    Q = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension)
+    K = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension)
+    V = torch.rand(batch_size, num_heads, sequence_length, embedding_dimension)
     return [Q, K, V]
 
 def get_init_inputs():
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,2 @@
		Please refer to KernelBench v0.1 release note
		https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/