
Standard Kernel Rubric: Evaluating Kernel Generation Systems

Kernel generation, once a niche systems topic, has moved into the mainstream of AI research and engineering since the release of KernelBench, which has now become a staple benchmark for LLMs and agentic coding systems. New papers, demos, and repos are constantly tackling kernels directly and reporting increasingly strong results, prompting the question: Is kernel generation solved?

The growing body of results can be difficult to interpret, especially for those not immersed in GPU systems. Not all ‘wins’ are equivalent. Achieving a 4x speedup on LayerNorm is far easier than squeezing a 1.05x gain over cuBLAS on GEMM, since the latter already represents one of the most heavily optimized kernels in computing. Similarly, emitting a FP32 GEMM kernel is easier than generating an FP16/FP8/FP4 tensor-core-optimized implementation that fully utilizes specialized hardware (KernelBench uses FP32 by default). Producing Triton code is also an easier problem than generating CUDA C++ with inline PTX, which requires direct control of low-level hardware instructions. Finally, generating kernels with substantial human guidance is different from building fully autonomous systems that consistently reach state-of-the-art performance.

To make this landscape easier to reason about, we introduce the Standard Kernel Rubric, a structured way to evaluate kernel generation systems along five axes:

  • Kernel Complexity (K) — what type of computation the kernel implements
  • Representation Level (R) — the abstraction level used to express the kernel
  • Hardware Specialization (H) — the degree to which the kernel targets specialized hardware features
  • Performance Target (P) — the performance bar the system aims to reach
  • Automation Level (A) — how much human involvement is required

This rubric helps clarify what has actually been achieved, what remains open, and where the frontier of automated kernel generation lies. In the sections that follow, we describe each axis in more detail.

Table summarizing the Standard Kernel Rubric:

| Dimension | Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- | --- |
| Kernel Complexity (K) | Simple / Memory-Bound: Elementwise ops and basic reductions (add, mul, sum, mean, GELU, small fusions). | Structured Primitives: Moderate structure with reductions and elementwise ops (LayerNorm, RMSNorm, Softmax). | Dense Core Linear Algebra: High arithmetic intensity and regular tiling under standard precision (MatMul, Conv2D, batched GEMM, naïve attention in FP32 or FP16/BF16). | Frontier / Architecture-Coupled: Deeply fused or novel operators tightly coupled to hardware (custom attention, block-sparse kernels, mega-kernels, FP8/FP6/FP4 dense linear algebra). |
| Representation Level (R) | Library Composition: Compose existing primitives; execution handled by libraries (cuBLAS, cuDNN, CUTLASS). | High-Level DSL: Kernels written in DSLs where the compiler manages scheduling and mapping (Triton, CuTile). | Lower-Level DSL: Explicit tiling, data movement, and compute primitives with some abstraction (CUTLASS CuTe). | Instruction-Level Programming: Direct hardware control via CUDA C++ and inline PTX with explicit thread orchestration. |
| Hardware Specialization (H) | Portable Implementation: General constructs with no architecture-specific features. | Accelerator-Aware: Explicitly targets specialized compute units (e.g., Tensor Cores via WMMA/MMA). | Architecture-Optimized Pipelines: Uses architecture-specific memory and execution mechanisms (e.g., cp.async, SM-tuned pipelines). | Frontier Hardware Features: Exploits newest capabilities (e.g., WGMMA, TMA, tcgen05, cluster execution). |
| Performance Target (P) | Functional: Produces correct results. | Loosely Competitive: Within ~50% of state-of-the-art. | Near State-of-the-Art: Within ~5–10% of best known implementations. | State-of-the-Art: Matches or exceeds the best known implementations. |
| Automation Level (A) | Expert Co-Design: Human defines detailed strategy and guides optimization. | Guided Optimization: Human specifies high-level strategy; AI system implements and iterates. | Minor Corrections: AI system generates kernel; human only fixes small errors. | Fully Autonomous: AI system generates a correct and performant kernel from minimal specification. |

1. Kernel Complexity (K)

What type of computation does the kernel implement?

Different kernels impose very different algorithmic and optimization challenges, ranging from simple memory-bound operations to complex, deeply fused compute pipelines.

Level 1: Simple / Memory-Bound

Kernels dominated by simple elementwise operations or basic reductions with little structural complexity. Examples include add, multiply, sum, mean, GELU, and small fusions such as bias plus activation.
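To make the "memory-bound" character concrete, here is a minimal pure-Python sketch of one such fusion, bias plus GELU (using the common tanh approximation). The function name is illustrative, not from any particular library; the point is that each output element depends only on its matching input element, so the kernel does roughly one read and one write per element with only a handful of flops in between.

```python
import math

def bias_gelu(x, bias):
    """Fused bias + GELU (tanh approximation) over flat lists.

    A K1 kernel: purely elementwise, so performance is bounded by
    memory bandwidth (one load and one store per element), not compute.
    """
    out = []
    for xi, bi in zip(x, bias):
        v = xi + bi
        # tanh-approximation GELU, common in transformer implementations
        out.append(0.5 * v * (1.0 + math.tanh(
            math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3))))
    return out
```

Fusing the bias add into the activation saves a full round trip to memory, which is typically the only optimization that matters at this level.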

Level 2: Structured Primitives

Kernels with moderate structure and soft data dependencies, typically involving reductions across one dimension combined with elementwise operations. Examples include LayerNorm, RMSNorm, Softmax.
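The characteristic structure at this level (a reduction along one dimension interleaved with elementwise work) can be sketched with a numerically stable row-wise softmax in plain Python; on a GPU, each row would typically map to one thread block:

```python
import math

def softmax_rows(x):
    """Row-wise numerically stable softmax over a list of rows.

    A K2 kernel: two reductions per row (max, then sum) interleaved
    with elementwise ops; rows are independent, so they parallelize
    cleanly, but the reductions create soft data dependencies.
    """
    out = []
    for row in x:
        m = max(row)                         # reduction 1: row max, for stability
        e = [math.exp(v - m) for v in row]   # elementwise exp
        s = sum(e)                           # reduction 2: row sum
        out.append([v / s for v in e])       # elementwise normalize
    return out
```

Subtracting the row max before exponentiating is what keeps the kernel correct for large inputs, and it is exactly the kind of detail a generation system must get right at this level.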

Level 3: Dense Core Linear Algebra

Kernels implementing dense linear algebra with high arithmetic intensity and regular tiling patterns under standard precision. Examples include matrix multiplication (MatMul), Conv2D, batched GEMM, and naïve attention. Both FP32 and FP16/BF16 fall into this category, although achieving peak performance with FP16/BF16 typically requires greater hardware specialization (discussed further in the Hardware Specialization section).
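The defining optimization at this level is tiling. The pure-Python sketch below (function name and tile size are illustrative) shows the loop structure: each tile of the output is computed by streaming matching tiles of the inputs, which on a GPU would be staged in shared memory and reused, raising arithmetic intensity well above the K1/K2 kernels.

```python
def tiled_matmul(A, B, tile=2):
    """Tiled matrix multiply C = A @ B over square tiles.

    A K3 kernel skeleton: the three outer loops walk tiles of C and of
    the shared dimension; each A/B tile pair is reused across a whole
    output tile, which is what gives dense GEMM its high arithmetic
    intensity compared to memory-bound kernels.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # accumulate the contribution of one (tile x tile) block pair
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0.0
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] += acc
    return C
```

Real GEMM kernels layer register blocking, vectorized loads, and tensor-core instructions on top of this skeleton, which is why beating cuBLAS here is so much harder than winning on LayerNorm.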

Level 4: Frontier / Architecture-Coupled Kernels

Kernels implementing novel operators or techniques that require tight coupling to hardware behavior or emerging numerical formats. Examples include custom attention variants, block-sparse kernels, large fused mega-kernels, and dense linear algebra kernels targeting emerging low-precision formats such as FP8, FP6, or FP4. These kernels often introduce additional complexity due to quantization schemes, scaling strategies, packing formats, and hardware-specific instructions.
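One source of that extra complexity, block scaling, can be illustrated with a toy pure-Python sketch. This simulates a generic per-block scaled quantization scheme in the spirit of FP8/FP4 block-scaling formats; the exact code widths and scale formats are hardware-specific and this is an assumption-laden illustration, not any vendor's actual format.

```python
def quantize_block(x, bits=8):
    """Simulate per-block scaled quantization of one block of values.

    Each block stores low-precision integer codes plus one shared scale.
    K4 kernels must thread these per-block scales through the matmul
    epilogue, which is a large part of their added complexity.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit signed codes
    scale = max(abs(v) for v in x) / qmax or 1.0  # avoid div-by-zero on all-zero blocks
    codes = [round(v / scale) for v in x]      # low-precision codes
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate values from codes and the block scale."""
    return [c * scale for c in codes]
```

The round trip is lossy but bounded by the scale, and a generation system targeting K4 must reason about exactly these error and packing trade-offs alongside the hardware instructions themselves.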

2. Representation Level (R)

At what representation level is the kernel expressed?

Kernels can be generated at different representation levels, from composing library primitives to directly programming hardware instructions. Higher-level representations compress the design space by delegating scheduling and hardware mapping to compilers or libraries, which simplifies generation but limits the ability to fully explore hardware-specific optimizations and reach peak performance.

Level 1: Library Composition

The implementation composes existing high-performance primitives rather than implementing kernels directly. Execution strategy and optimization are handled by libraries. On GPUs, this typically means calling cuBLAS, cuDNN, or CUTLASS kernels.

Level 2: High-Level DSL

Kernels are written in high-level domain-specific languages where scheduling, memory hierarchy usage, and hardware instruction selection are largely handled by the compiler. On GPUs, examples include Triton or CuTile.

Level 3: Lower-Level DSLs

Kernels are written in structured DSLs that expose tiling, data movement, and compute primitives while still providing some abstraction such as layout calculation. On GPUs, examples include frameworks such as CUTLASS CuTe.

Level 4: Instruction-Level Kernel Programming

Kernels are implemented with full control over execution and hardware instructions. On GPUs, this typically means writing CUDA C++ with inline PTX, explicitly managing threads, synchronization, memory movement, and architecture-specific features.

3. Hardware Specialization (H)

How tightly does the kernel target a specific hardware architecture?

Implementations may range from portable kernels to highly architecture-specific implementations that exploit specialized instructions and execution mechanisms. This dimension is often correlated with achieving high performance, as modern accelerators expose specialized hardware units and memory mechanisms that must often be explicitly targeted in lower-level implementations (but it's also possible to use these features inefficiently). In higher-level representations (R1 and R2), much of this specialization is delegated to the compiler and runtime, so this dimension is most meaningful for lower-level implementations (R3 and R4).

Level 1: Portable Implementation

The kernel uses general programming constructs expected to run across multiple hardware architectures and avoids architecture-specific instructions or tuning. On GPUs, this corresponds to standard CUDA kernels that do not explicitly use Tensor Cores or specialized memory instructions.

Level 2: Accelerator-Aware Implementation

The kernel explicitly targets specialized compute units available on the hardware. The implementation manages data layouts and tiling to map computation efficiently onto these units. On GPUs, this corresponds to explicitly using Tensor Core instructions such as WMMA or MMA.

Level 3: Architecture-Optimized Pipelines

The kernel is tuned to a specific architecture using specialized instructions for memory movement, synchronization, and execution scheduling. On GPUs, examples include asynchronous memory operations (cp.async), shared-memory staging, and software pipelines optimized for a specific SM generation.

Level 4: Frontier Hardware Features

The kernel exploits the newest architecture-specific capabilities to maximize performance. On GPUs, this includes features such as warpgroup matrix instructions (WGMMA), Tensor Memory Accelerator (TMA), Blackwell tensor instructions (e.g., tcgen05), cluster-level execution, and advanced warpgroup or cluster-level coordination.

4. Performance Target (P)

What level of performance is required for success?

Generated kernels can vary from merely functional implementations to ones that match or exceed highly optimized expert kernels. Generating correct code is often much easier than producing implementations that approach state-of-the-art efficiency.

Level 1: Functional

The kernel compiles and produces correct results for the intended inputs.

Level 2: Loosely Competitive

The kernel achieves performance within roughly 50% of state-of-the-art implementations.

Level 3: Near State-of-the-Art

The kernel performs within approximately 5–10% of the best known implementations.

Level 4: State-of-the-Art

The kernel matches or exceeds the performance of the best known implementations.
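The four thresholds above can be made operational with a small helper. The cutoffs below (0.90 for "within ~5–10%", 0.50 for "within ~50%") are one concrete reading of the rubric's approximate ranges, and the function name is illustrative:

```python
def performance_level(kernel_time, sota_time):
    """Map a measured runtime to a P level under this rubric's thresholds.

    Compares speedup relative to the best known implementation:
    >= 1.0 is P4, >= 0.90 (within ~5-10%) is P3, >= 0.50 (within ~50%)
    is P2, and anything slower but correct is P1.
    """
    speedup = sota_time / kernel_time
    if speedup >= 1.0:
        return "P4: state-of-the-art"
    if speedup >= 0.90:
        return "P3: near state-of-the-art"
    if speedup >= 0.50:
        return "P2: loosely competitive"
    return "P1: functional"
```

For example, a kernel running in 1.05 ms against a 1.0 ms cuBLAS baseline sits at P3, while one taking 3 ms is merely functional.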

5. Automation Level (A)

How much human involvement is required?

Kernel generation systems vary from expert-guided co-design to fully autonomous generation from a simple specification.

Level 1: Expert Co-Design

A human provides a detailed implementation strategy and guides the optimization process with fine-grained feedback. The AI system primarily assists by translating the design into code, fixing syntax, and supporting iterative edits; in some cases the human effectively writes most of the kernel.

Level 2: Guided Optimization

A human specifies the high-level strategy, such as the tiling scheme, parallelization pattern, or optimization techniques. The AI system implements the detailed kernel logic, produces correct code, and performs iterative refinements.

Level 3: Minor Corrections

The kernel is largely generated by the AI system. A human only makes small corrections, such as fixing minor errors or adjusting parameters, without changing the overall strategy or structure of the implementation.

Level 4: Fully Autonomous

The kernel is generated from a simple task description (e.g., “write a fast kernel for X” or a PyTorch reference implementation) with no human edits. The AI system independently produces a correct and performant implementation.

Classifications

In the remainder of this post, we use this rubric to discuss what we believe is solved as of March 10th, 2026, what remains open, and what comes next.