We're building AI infrastructure with AI.
Our journey starts with kernels: the low-level code that executes computations on hardware accelerators like GPUs and TPUs. Kernels power all modern AI workloads by efficiently running operations such as matrix multiplications (matmuls), convolutions, and attention mechanisms. Well-optimized kernels are critical to unlocking the full potential of the hardware, yet kernel development today is a long, tedious cycle of profiling, tuning, and low-level optimization that demands deep expertise. Kernel engineers are scarce and often command high compensation packages, putting them out of reach for most organizations. Even with extensive kernel engineering expertise, optimal implementations usually arrive only late in a project (if at all) due to limited bandwidth and long development cycles. As a result, AI performance continually lags behind what the silicon is capable of.
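To make the term concrete, here is a deliberately naive matmul kernel, a sketch for illustration only (names and layout are ours): one thread per output element, no tiling, no shared memory, no tensor cores. Closing the gap between code like this and the hardware's peak throughput is exactly the work that kernel engineering entails.

```cuda
// Naive single-precision matmul: C = A * B, with A (M x K), B (K x N),
// C (M x N), all row-major. One thread computes one element of C.
// Illustrative only: a well-optimized kernel would use tiling, shared
// memory, and tensor cores to approach the hardware's peak throughput.
__global__ void naive_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}
```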
Kernels are the inner loop of AI infrastructure.
We think solving kernel engineering is profoundly impactful, given the gains that can be unlocked across every level of the AI stack once kernels are no longer a bottleneck:
- Model: Kernels enable rapid iteration to explore novel model architectures. Emerging AI models often introduce operations that don't map cleanly to existing hardware libraries; without custom kernels, they can be too inefficient to run.
- Deployment: Optimized kernels are essential to real-time AI deployments, especially in systems with strict latency requirements. Better kernels make it feasible to serve smarter models, which in turn unlock new and useful applications.
- Cloud Infrastructure: Neo-cloud providers compete on efficiency. Even modest kernel-level improvements compound at scale, directly lowering serving expenses and enabling faster services.
- Hardware: Kernels are critical not only to GPUs but also to the adoption of emerging accelerators. Automatic kernel generation shortens the time between new hardware designs and their practical deployment in the field.
We are working at the frontier of AI and computer systems.
Members of our team have backgrounds in AI and computer systems from Stanford, MIT, UIUC, and other leading institutions, and work at the intersection of machine learning and high-performance systems. We approach this space with a research-driven mindset, valuing rigor and understanding as well as outcomes. We are proud to have contributed to pioneering research in the AI kernel generation space: KernelBench provided the first standardized benchmark for AI kernel generation and has since attracted a wide range of solutions built on top of it. "Surprisingly Fast AI-Generated Kernels..." showed that with the right scaffolding, LLMs perform well on memory-bound workloads and older chip architectures with well-documented patterns and abundant training data.
The harder challenge for kernel generation appears with compute-bound workloads, newer hardware, and lower-precision formats. Achieving peak efficiency there requires specialized hardware units such as tensor cores and the Tensor Memory Accelerator (TMA), which are sparsely documented, complex to use, and often directly programmable only through inline PTX assembly, making them especially difficult for LLMs to optimize. Yet these features are essential to modern AI workloads, where most of the available speedup comes from exploiting them effectively.
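As a rough illustration of what "programmable only through inline PTX" means in practice, here is a sketch of a single warp-level tensor-core instruction issued from CUDA C++ (an mma shape available on Ampere and Hopper, sm_80+; Hopper's wgmma and TMA paths are more involved still). The function name is ours, and the fragment loading that must surround an instruction like this is where much of the real difficulty lives.

```cuda
// One warp-level tensor-core instruction: D = A * B + C for a 16x8x16 tile,
// bf16 inputs with f32 accumulation (requires sm_80 or newer).
// The a/b registers hold packed bf16 pairs distributed across the 32 threads
// of the warp according to the fragment layout in the PTX ISA; producing that
// layout correctly is omitted here and is a large part of the complexity.
__device__ void mma_m16n8k16_bf16(float d[4], const unsigned a[4],
                                  const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```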
We believe in solving kernel generation at a fundamental level, with full control of the hardware.
Our goal is to generate high-performance kernels for modern architectures that fully exploit new hardware features. While higher-level abstractions such as domain-specific languages (e.g., Triton) simplify certain classes of kernels, they abstract away many low-level decisions. This limits hardware control (lowering the achievable performance ceiling) and constrains the design space available to the AI system, since much of the optimization is delegated to the compiler. By operating closer to the metal at the CUDA+PTX level, we retain precise control over architectural details and can leverage new silicon capabilities sooner, without waiting for compiler or ecosystem support to catch up.
Some of our results that would not have been possible with higher-level abstractions:
- H100 BF16 matmul: 99.08%-115.77% of the performance of cuBLAS
- H100 BF16 attention: 104% of the performance of the Hopper FlashAttention3 forward pass
- Llama 3 FFN: 20% speedup on the feed-forward network (FFN) of Llama 3 through an automatically discovered fusion strategy for compute-bound kernels involving two matmuls and a SiLU (the unfused reference computation is sketched below)
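For context on that last result, the sketch below shows only the reference math of the gate/up half of the Llama 3 FFN, written as one naive CUDA kernel (fp32 for simplicity; all names and shapes are ours). It illustrates what the two matmuls and the SiLU compute, not the discovered fusion strategy or the tensor-core implementation behind the speedup.

```cuda
#include <math.h>

// Reference math for the Llama 3 FFN gate/up half:
//   H = SiLU(X * Wg) * (X * Wu)   (elementwise product)
// X is (M x K), Wg and Wu are (K x N), H is (M x N), all row-major fp32.
// Naive, one thread per output element; for illustration of the math only.
__device__ __forceinline__ float silu(float x) {
    return x / (1.0f + expf(-x));   // SiLU(x) = x * sigmoid(x)
}

__global__ void ffn_gate_up_reference(const float* X, const float* Wg,
                                      const float* Wu, float* H,
                                      int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc_g = 0.0f, acc_u = 0.0f;
    for (int k = 0; k < K; ++k) {
        float x = X[row * K + k];
        acc_g += x * Wg[k * N + col];   // gate projection
        acc_u += x * Wu[k * N + col];   // up projection
    }
    H[row * N + col] = silu(acc_g) * acc_u;  // SiLU gating
}
```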
We're growing — come build with us!
We have raised funding from a set of exceptional investors who share our vision, including General Catalyst, Felicis, and amazing angels. Our team has deep experience across CUDA, AI, and computer systems. If you share our excitement and want to build with us, we'd love to hear from you!