Blog
Reimagining Kernel Generation at the PTX Layer: An LLM System Learning from DSLs to Outperform Them
We built a hybrid system where program analysis and LLMs work together to transform and optimize PTX. By operating and reasoning at the PTX layer shared across DSLs (e.g., Triton, TileLang, ThunderKittens, CUTLASS), our system learns and combines their optimizations to generate kernels that outperform those produced by any individual DSL.
Announcing Our $20M Seed Round — Is Kernel Generation a Solved Problem?
We're excited to share that we've raised a $20M seed round, led by Jump Capital. We also introduce a framework for understanding what has actually been achieved in kernel generation and what remains open.
Standard Kernel Rubric: Evaluating Kernel Generation Systems
Is kernel generation solved? We introduce the Standard Kernel Rubric — a structured framework for evaluating kernel generation systems along five axes: kernel complexity, representation level, hardware specialization, performance target, and automation level.
“This Kernel Was Faster Yesterday” — In Pursuit of High-Fidelity GPU Kernel Benchmarking
GPU timing is deceptively hard. Power limits, thermal state, clock behavior, caching, and measurement method all affect results in ways that aren't obvious. We explored the sources of timing variation to make kernel benchmarks more reliable, which matters especially for automated RL-based kernel optimization systems.