We're building AI infrastructure with AI.

Our journey starts with kernels: the low-level code that executes computations on hardware accelerators like GPUs and TPUs. Kernels power all modern AI workloads by efficiently running operations such as matrix multiplications (matmuls), convolutions, and attention mechanisms. Well-optimized kernels are critical to unlocking the full potential of hardware, yet kernel development today remains a long, tedious cycle of profiling, tuning, and low-level optimization that demands deep expertise. Kernel engineers are scarce and often command seven-figure compensation, putting them out of reach for most organizations. Even with extensive kernel engineering expertise, optimal implementations usually arrive only late in a project (if at all) due to limited bandwidth and long development cycles. As a result, AI performance continually lags behind what silicon is capable of.
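
To make "kernel" concrete, here is a deliberately naive sketch of one: a single CUDA C kernel computing an FP32 matmul with one thread per output element. The names and shapes are ours, for illustration only; real high-performance kernels restructure this around tiling, shared memory, tensor cores, and asynchronous copies.

    // Illustrative only: the simplest possible CUDA kernel for C = A x B
    // (FP32, row-major), one thread per element of C.
    __global__ void naive_matmul(const float* A, const float* B, float* C,
                                 int M, int N, int K) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= M || col >= N) return;

        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];  // dot product of a row of A and a column of B
        C[row * N + col] = acc;
    }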

We think solving this problem is profoundly impactful, considering the gains that can be unlocked across all levels of the AI stack once kernels are no longer a bottleneck:

  • Model: Kernels enable rapid iteration to explore novel model architectures. Emerging AI models often introduce operations that don't map cleanly to existing hardware libraries; without custom kernels, they can be too inefficient to run.
  • Deployment: Optimized kernels are essential to real-time AI deployments, especially in systems with low-latency requirements. Better kernels let smarter models fit within the same latency budget, enabling new and useful applications.
  • Cloud Infrastructure: Neo-cloud providers compete on efficiency. Even modest kernel-level improvements compound at scale, directly lowering serving costs and enabling faster services.
  • Hardware: Kernels are critical not only to GPUs but also to the adoption of emerging accelerators. Automatic kernel generation shortens the time from new hardware designs to their practical deployment in the field.

We are proud to have already contributed to pioneering research in the AI kernel generation space. KernelBench, released almost a year ago, provided the first standardized benchmark for AI kernel generation and has since attracted a wide range of solutions built on top of it. "Surprisingly Fast AI-Generated Kernels..." showed that, with the right scaffolding, LLMs perform well on memory-bound workloads and on older chip architectures with well-documented patterns and abundant training data: for example, it achieved 484% of the PyTorch baseline on FP32 LayerNorm on an NVIDIA L40S GPU and 290% on FP32 Conv+ReLU+Bias. The main limitations of early kernel generation solutions, however, appeared in three areas:

  1. Compute-bound workloads involving matmuls that require specialized hardware units
  2. Newer hardware such as the NVIDIA H100
  3. Lower-precision formats such as FP16, FP8, and beyond

Achieving peak efficiency here requires specialized hardware units such as tensor cores and tensor memory accelerators. These units are sparsely documented, complex to use, and often directly programmable only through inline assembly (PTX) instructions, which makes them especially difficult for LLMs to optimize. Yet they are essential to modern AI workloads, where most of the speedup comes from exploiting them effectively.
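
For a sense of what "programmable only through inline PTX" means in practice, below is a generic sketch (not one of our building blocks) of a single warp-level BF16 tensor-core MMA issued via inline assembly. The fragment packing and ldmatrix loads that feed it are omitted, and it must be compiled for sm_80 or newer.

    // Sketch: one warp-synchronous 16x8x16 BF16 MMA on tensor cores via inline PTX.
    // Per the PTX fragment layout, each thread holds 4 x 32-bit registers of A
    // (8 packed bf16 values), 2 registers of B, and 4 FP32 accumulator registers.
    // Real kernels wrap this in ldmatrix loads and software pipelining.
    __device__ void mma_m16n8k16_bf16(const unsigned (&a)[4],
                                      const unsigned (&b)[2],
                                      float (&acc)[4]) {
        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 "
            "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
            : "+f"(acc[0]), "+f"(acc[1]), "+f"(acc[2]), "+f"(acc[3])
            : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
              "r"(b[0]), "r"(b[1]));
    }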

We're tackling this harder problem: generating high-performance kernels for newer architectures that fully exploit their hardware features. We opted to work directly in CUDA C and assembly rather than relying on higher-level abstractions like domain-specific languages (e.g., Triton) or libraries (e.g., CUTLASS). This approach gives us fine-grained control over the hardware and frees us from depending on compiler or library updates, which often take time to support the latest hardware features. To enable our system to generate this kind of code, we engineered a set of core building blocks in pure CUDA C and PTX that achieve 102%-105% of cuBLAS performance on H100 BF16 matmul in only 100 lines of code, and 104% of the Hopper FlashAttention-3 forward pass on H100 BF16 attention in 500 lines. These concise, high-performance building blocks enable the automatic generation of fused kernels over arbitrary sets of operations, including matmuls. Notably, our system delivered a real-world 20% speedup on the Llama 3 feed-forward network (FFN) through an automatically discovered fusion strategy involving two compute-bound matmuls and a SiLU, surprising us with how much performance remains to be captured on one of the most heavily used AI workloads today.
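
To illustrate the kind of fusion involved (the actual discovered strategy and generated kernel are not shown here), below is a naive sketch of the fusion target in a Llama-style FFN: the gate and up projections plus the SiLU computed in one pass, so the two intermediate activation matrices never make separate round trips through global memory. Shapes and names are hypothetical.

    // Naive illustration of the fusion target, not an optimized or generated kernel:
    // hidden[m, j] = SiLU(x @ W_gate)[m, j] * (x @ W_up)[m, j]
    // Assumed (hypothetical) shapes: x [M, d], W_gate and W_up [d, h], hidden [M, h].
    __global__ void gate_up_silu_fused(const float* __restrict__ x,
                                       const float* __restrict__ w_gate,
                                       const float* __restrict__ w_up,
                                       float* __restrict__ hidden,
                                       int M, int d, int h) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= M || col >= h) return;

        float g = 0.0f, u = 0.0f;
        for (int k = 0; k < d; ++k) {
            float xv = x[row * d + k];       // one load of x feeds both matmuls
            g += xv * w_gate[k * h + col];
            u += xv * w_up[k * h + col];
        }
        float silu = g / (1.0f + expf(-g));  // SiLU(g) = g * sigmoid(g)
        hidden[row * h + col] = silu * u;    // input to the down projection
    }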

We're excited to be building AI infrastructure with AI, and we're hiring. We have raised funding from a set of exceptional investors who share our vision, including General Catalyst, Felicis, and amazing angels. Our team has deep experience across CUDA, AI, and computer systems. If you share our excitement and want to build with us, we'd love to hear from you!

🔗 Join us!

hi@standardkernel.com @Standard_Kernel