Why timing ML training isn't enough: roofline models reveal hardware bottlenecks
You're training an LLM. Your matmul kernel takes 1ms. Is that good? Most engineering teams can't answer.
The roofline model, a visualization technique from high-performance computing, is finding new life in ML ops. It plots attainable compute throughput (flops per second) against operational intensity (flops per byte of memory traffic). The sloped memory-bandwidth line and the flat peak-compute ceiling together form the "roof," and the ridge point where they meet separates memory-bound kernels from compute-bound ones.
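The whole model reduces to a single min(). A minimal sketch, using illustrative (not measured) peak numbers for a modern accelerator:

```python
def attainable_flops(intensity_flops_per_byte, peak_flops, peak_bw_bytes_per_s):
    """Roofline ceiling: you get whichever limit you hit first."""
    return min(peak_flops, peak_bw_bytes_per_s * intensity_flops_per_byte)

# Illustrative placeholder peaks, not specs for any particular GPU.
PEAK_FLOPS = 300e12   # 300 Tflops of dense matmul throughput
PEAK_BW = 2e12        # 2 TB/s of HBM bandwidth

ridge = PEAK_FLOPS / PEAK_BW  # intensity where memory-bound turns compute-bound
print(f"ridge point: {ridge:.0f} flops/byte")

for intensity in (10, 150, 1000):
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    tflops = attainable_flops(intensity, PEAK_FLOPS, PEAK_BW) / 1e12
    print(f"{intensity:>5} flops/byte -> {tflops:.0f} Tflops ({bound})")
```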
Three factors drive GPU kernel performance: memory access time (moving data to and from HBM), compute time (the chip's peak arithmetic throughput, i.e. Tensor Core limits), and overhead (kernel launches, CPU-side coordination). The roofline approach focuses on the first two, which dominate in production training runs.
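The same decomposition works as a back-of-the-envelope time estimate. The sketch below assumes memory and compute overlap fully and uses a made-up launch-overhead constant, so treat it as a sanity check, not a profiler replacement:

```python
def kernel_time_estimate(flops, bytes_moved, peak_flops, peak_bw, overhead_s=5e-6):
    """Crude latency model: memory and compute overlap, overhead does not.

    The 5 microsecond launch overhead is an assumed placeholder, not a
    measured value for any particular GPU or framework.
    """
    t_compute = flops / peak_flops
    t_memory = bytes_moved / peak_bw
    bottleneck = "compute" if t_compute >= t_memory else "memory"
    return max(t_compute, t_memory) + overhead_s, bottleneck

# Example: a bfloat16 matmul C[M,N] = A[M,K] @ B[K,N]
M = N = K = 4096
flops = 2 * M * N * K                      # each multiply-add counts as 2 flops
bytes_moved = 2 * (M * K + K * N + M * N)  # 2 bytes per bfloat16 element

t, bottleneck = kernel_time_estimate(flops, bytes_moved, peak_flops=300e12, peak_bw=2e12)
print(f"~{t * 1e3:.3f} ms, {bottleneck}-bound")
```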
For enterprise ML teams, this matters because wall-clock time alone masks critical inefficiencies. A kernel might be slow because it's waiting on memory bandwidth, not because compute is maxed out. Different fixes entirely.
The technique has been validated on hardware ranging from dual-socket CPU servers to NVIDIA V100 GPUs. ML adaptations account for numeric precision: an int8 matmul moves 1 byte per parameter versus 2 bytes for bfloat16, which changes the operational intensity of the same operation. Teams at NERSC run roofline analysis on production GPU workloads.
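A quick illustration of the precision effect, using arbitrary shapes: halving the bytes per element doubles the operational intensity of the same matmul, which can move it from one side of the ridge point to the other.

```python
def matmul_intensity(m, n, k, bytes_per_element):
    """Ops per byte of HBM traffic, assuming each matrix is read or written once."""
    ops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return ops / bytes_moved

for label, elem_bytes in (("bfloat16", 2), ("int8", 1)):
    print(f"{label}: {matmul_intensity(1024, 1024, 1024, elem_bytes):.0f} ops/byte")
```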
Distributed training complicates the picture. Single-chip rooflines don't capture inter-node communication in multi-TPU or multi-GPU setups, where interconnect bandwidth between devices becomes its own ceiling. Some teams extend the model to estimate collective-operation time as data size over interconnect bandwidth.
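One such extension, sketched here for a ring all-reduce under assumed (not measured) link bandwidth, is a bandwidth-only estimate of collective time:

```python
def ring_allreduce_time(buffer_bytes, num_devices, link_bw_bytes_per_s):
    """Bandwidth-only estimate for a ring all-reduce (ignores latency terms).

    Each device sends and receives roughly 2 * (N - 1) / N of the buffer.
    """
    traffic = 2 * (num_devices - 1) / num_devices * buffer_bytes
    return traffic / link_bw_bytes_per_s

# Illustrative numbers: gradients for 7e9 bf16 parameters over assumed 400 GB/s links.
grad_bytes = 7e9 * 2
t = ring_allreduce_time(grad_bytes, num_devices=8, link_bw_bytes_per_s=400e9)
print(f"~{t * 1e3:.0f} ms per step")
```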
Tools exist but require work. You need peak hardware numbers (flops per second and memory bandwidth) plus per-kernel operational intensity measurements. NVIDIA's Nsight Compute includes roofline views. PyTorch and TensorFlow teams typically write custom profiling scripts, calculating arithmetic intensity manually from kernel traces.
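A minimal sketch of that pattern with PyTorch's profiler: with_flops=True reports estimated flop counts for supported ops, while the bytes-moved figure has to be estimated by hand from tensor shapes (a CUDA device is assumed).

```python
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

# with_flops=True asks the profiler to annotate supported ops (e.g. matmul)
# with estimated flop counts.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             with_flops=True) as prof:
    torch.matmul(a, b)
    torch.cuda.synchronize()

# Bytes moved are not in the trace; estimate them by hand from the shapes.
bytes_moved = 2 * (a.numel() + b.numel() + 4096 * 4096)  # 2 bytes per bf16 element

for evt in prof.key_averages():
    if evt.flops:
        print(f"{evt.key}: {evt.flops:.3g} flops, "
              f"~{evt.flops / bytes_moved:.0f} flops/byte")
```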
The real value: knowing whether to optimize memory access patterns or increase batch size for better compute utilization. Different bottlenecks, different solutions. Wall-clock timing won't tell you which.
History suggests performance modeling survives initial skepticism when it solves a measurement problem teams actually have. Whether roofline becomes standard ML tooling or remains a specialist technique depends on how much pain teams feel from not understanding their hardware limits.