The Result
A research collaboration between Stanford, Nvidia, and Together AI has produced GPU kernel code that outperforms expert-written implementations by 15-50%, depending on hardware. Their kernel for TriMul, the triangular-multiplication operation used in models like AlphaFold, runs in 1,161 microseconds on H100 GPUs, compared to 1,371 microseconds for the previous best human submission.
The technique, called Test-Time Training to Discover (TTT-Discover), was published January 22 as an arXiv paper. It represents a departure from current reasoning model approaches.
How It Works
Standard reasoning models keep their parameters fixed during inference, drawing only on what they learned in training. TTT-Discover instead updates the model's weights in real time, treating each problem as an environment to master rather than a query to answer.
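To make that distinction concrete, here is a deliberately tiny, self-contained toy in Python, not the paper's algorithm: the "model" is just a softmax over a handful of made-up candidate kernels, and each test-time step moves its weights toward whatever earned the most reward on this single problem.

    import math, random

    # Toy sketch only, not the paper's algorithm: the "model" is a softmax over
    # five made-up candidate kernels (runtimes in microseconds), and each test-time
    # step shifts its weights toward candidates that score well on this one problem.
    random.seed(0)
    runtimes = [1371.0, 1520.0, 1290.0, 1205.0, 1161.0]
    logits = [0.0] * len(runtimes)           # the weights being updated at inference time

    def sample(logits):
        weights = [math.exp(l) for l in logits]
        return random.choices(range(len(logits)), weights=weights)[0]

    baseline = runtimes[0]
    for step in range(50):                   # on the order of 50 training steps per run
        i = sample(logits)
        advantage = baseline / runtimes[i] - 1.0   # continuous reward: speedup over baseline
        logits[i] += advantage                     # crude stand-in for a gradient update

    favorite = max(range(len(logits)), key=lambda i: logits[i])
    print(f"after test-time training the model favors candidate {favorite} ({runtimes[favorite]} microseconds)")

The tuned "model" is then thrown away; only the best artifact it found matters.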
The method uses two key mechanisms. First, an "entropic objective" exponentially rewards high-performing outliers rather than optimizing for the average result. Second, a PUCT search algorithm (borrowed from AlphaZero) explores solution paths and builds a training dataset during inference.
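Neither mechanism is defined in detail here, so the sketch below leans on standard textbook forms as an assumption: an exponential tilt over rollout rewards, so one standout dominates the average, and the AlphaZero-style PUCT score for deciding which branch of the search to expand next.

    import math

    def entropic_weights(rewards, temperature=0.1):
        """Exponentially up-weight the best rollouts so a single high-performing
        outlier dominates, instead of being averaged away."""
        m = max(rewards)                      # subtract the max for numerical stability
        w = [math.exp((r - m) / temperature) for r in rewards]
        total = sum(w)
        return [x / total for x in w]

    def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
        """AlphaZero-style PUCT: exploit branches with high observed value while
        the prior and visit counts keep under-explored branches in play."""
        return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

    # Three ordinary rollouts and one standout: the outlier takes nearly all the weight.
    print([round(w, 3) for w in entropic_weights([1.00, 1.02, 0.98, 1.45])])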
The approach requires continuous reward signals like runtime metrics or error rates, not binary pass/fail feedback. This works well for optimization problems but limits applicability elsewhere.
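The contrast is easiest to see side by side. The helpers below are illustrative only (the paper's actual reward definition is not given here), but they show why a graded signal gives the search something to climb while pass/fail does not.

    def continuous_reward(candidate_runtime_us, reference_runtime_us, output_correct):
        """Graded feedback: any correct kernel scores in proportion to its speedup
        over the reference, so 'slightly faster' is visibly better."""
        if not output_correct:
            return 0.0
        return reference_runtime_us / candidate_runtime_us

    def binary_reward(output_correct):
        """Pass/fail feedback: correct-but-slow and correct-and-fast look identical,
        leaving the search nothing to optimize."""
        return 1.0 if output_correct else 0.0

    print(continuous_reward(1161, 1371, True))   # ~1.18: rewarded for the speedup
    print(binary_reward(True))                   # 1.0 either way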
The Infrastructure Angle
Kernels optimized on H100 hardware generalized to A100, B200, and MI300X architectures without retraining. The researchers used an open-weight model (gpt-oss-120b), not a proprietary frontier model.
Each discovery run costs approximately $500, involving 50 training steps and thousands of rollouts. The economics make sense for static, high-value assets. A 1% improvement in a nightly data pipeline processing petabytes translates to significant annual compute savings.
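A back-of-the-envelope version of that arithmetic, with entirely hypothetical numbers:

    # Hypothetical figures: a pipeline costing $2M/year in compute, a 1% speedup,
    # and one $500 discovery run.
    annual_pipeline_cost = 2_000_000
    improvement = 0.01
    discovery_cost = 500

    annual_savings = annual_pipeline_cost * improvement   # $20,000 per year
    payback = annual_savings / discovery_cost             # 40x in the first year
    print(f"saves ${annual_savings:,.0f} per year for a one-time ${discovery_cost} run ({payback:.0f}x)")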
What's Missing
How well the method works on problems beyond GPU kernel optimization remains an open question. Test-time compute costs need to be weighed against the performance gains for each use case. The authors say code will be released, but independent verification across diverse hardware is still pending.
Because GPUMode's server could not accept kernel submissions for B200 and MI300X hardware, the researchers ran those trials manually rather than through the standard submission channel. Expert organizers across mathematics, GPU engineering, and algorithm design validated the solutions.
Context
This competes with established approaches: manual CUTLASS optimization, TVM's Ansor auto-tuning, and various ML-based kernel compilation frameworks. The difference is TTT-Discover's willingness to overfit completely to a single problem instance, discarding the model after producing the optimized artifact.
Mert Yuksekgonul, a Stanford PhD student and co-author, framed it clearly: "Thinking models wouldn't be able to prove P != NP without test-time training, just like Andrew Wiles wouldn't prove Fermat's Last Theorem without 7 years pursuing this single problem."
The real question is whether enterprises running mission-critical workloads will pay $500 per optimization for kernels that might save millions annually. Early signs suggest yes, for the right problems.