HBM Bandwidth Console
Most AI kernels are limited by memory bandwidth, not FLOPS. Plot a workload on the roofline by its arithmetic intensity, find the ridge point, and see whether it's memory- or compute-bound — and how much of peak it can actually reach.
GPU & workload arithmetic intensity → bound & attainable.
H100: 990 TFLOPS · 3.35 TB/s HBM
Roofline console
At 1.5 FLOP/byte, well below the ridge of 296, this kernel reaches only about 5 TFLOPS (roughly 0.5% of peak). It's starved for bandwidth; adding compute does nothing. Raise intensity (batching, fusion) or move to hardware with more HBM bandwidth.
To become compute-bound at this intensity you'd need 660.0 TB/s of bandwidth — vs 3.35 available.
For LLM decode, batching lifts intensity — model the serving in the LLM Serving console; size HBM in HBM Cost.
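The console figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the console's actual implementation: the function name `roofline` and its dictionary keys are illustrative assumptions, and the inputs are the H100 figures quoted on this page.

```python
def roofline(peak_tflops: float, bandwidth_tbs: float, intensity: float) -> dict:
    """Evaluate the roofline for one kernel (hypothetical helper).

    peak_tflops   : peak compute in TFLOPS
    bandwidth_tbs : peak memory bandwidth in TB/s
    intensity     : arithmetic intensity in FLOP/byte
    """
    # TFLOPS / (TB/s) = FLOP/byte, so the units cancel cleanly.
    ridge = peak_tflops / bandwidth_tbs
    # Attainable performance is capped by whichever roof is lower.
    attainable = min(peak_tflops, intensity * bandwidth_tbs)
    return {
        "ridge": ridge,
        "attainable_tflops": attainable,
        "fraction_of_peak": attainable / peak_tflops,
        "regime": "memory-bound" if intensity < ridge else "compute-bound",
        # Bandwidth at which this intensity would sit exactly on the ridge.
        "required_bandwidth_tbs": peak_tflops / intensity,
    }


# H100 figures from this page: 990 TFLOPS, 3.35 TB/s, kernel at 1.5 FLOP/byte.
result = roofline(peak_tflops=990.0, bandwidth_tbs=3.35, intensity=1.5)
for key, value in result.items():
    print(f"{key}: {value}")
```

With these inputs the ridge comes out near 296 FLOP/byte, the attainable rate near 5 TFLOPS, and the required bandwidth at 660 TB/s, matching the console readout.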
Why bandwidth is the real bottleneck
Below the ridge point a kernel is memory-bound: limited by how fast data moves, not how fast the chip computes. Above it, the kernel is compute-bound. The ridge point, peak FLOPS ÷ peak bandwidth, tells you which regime you're in.
Generating tokens one at a time has an arithmetic intensity near 1 — far below any GPU's ridge of hundreds — so decode runs at a tiny fraction of peak FLOPS. The bottleneck is HBM bandwidth, not compute.
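A rough way to see where "an intensity near 1" comes from is to count FLOPs and bytes per weight. The sketch below rests on simplifying assumptions that are not from this page: each weight is read from HBM once per generated token, each weight contributes one multiply-add (2 FLOPs), and KV-cache and activation traffic are ignored. The helper name `decode_intensity` is hypothetical.

```python
def decode_intensity(bytes_per_param: float, flops_per_param: float = 2.0) -> float:
    """Approximate FLOP/byte for unbatched decode (hypothetical helper).

    Assumes one multiply-add (2 FLOPs) per weight and one HBM read per weight
    per token; KV-cache and activation traffic are ignored.
    """
    return flops_per_param / bytes_per_param


H100_RIDGE = 990.0 / 3.35  # FLOP/byte, from the figures on this page

for label, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("FP8/INT8", 1)]:
    ai = decode_intensity(nbytes)
    # Below the ridge, the attainable fraction of peak is intensity / ridge.
    print(f"{label}: {ai:.2f} FLOP/byte, {ai / H100_RIDGE:.2%} of peak")
```

Under these assumptions, 16-bit weights give exactly 1 FLOP/byte, and even 8-bit weights only reach 2 FLOP/byte, still far below the ridge.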
For a memory-bound kernel, adding compute does nothing — only more bandwidth helps. This is why HBM generations (and their bandwidth) matter as much as the FLOPS headline for real AI workloads.
Raising arithmetic intensity — by batching, fusing kernels, or reusing data — pushes a memory-bound workload toward the ridge, recovering compute that was sitting idle. It's the main software lever.
The chip is usually waiting for memory
The headline number on an accelerator is its FLOPS, but for a great deal of real AI work that number is a fiction — the compute units sit idle, waiting for data to arrive from memory. The roofline model makes this concrete by plotting attainable performance against arithmetic intensity, the FLOPs a kernel does per byte it moves. The result has two regimes divided by a ridge point, and which side you're on determines everything about how to make it faster.
Below the ridge — which on a modern GPU sits at hundreds of FLOPs per byte — a kernel is memory-bound: its speed is the bandwidth times its intensity, and the expensive compute units are starved. Above the ridge it's compute-bound, finally saturating those units. The ridge itself is peak FLOPS divided by peak bandwidth, and because FLOPS have grown faster than bandwidth for years, that ridge keeps rising — pushing more and more kernels into the memory-bound regime where the bandwidth number, not the FLOPS number, is the performance number.
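The rising ridge can be illustrated by computing it for a few GPU generations. This is only a sketch: the H100 figures are the ones quoted on this page, while the V100 and A100 numbers are approximate dense 16-bit tensor throughput and HBM bandwidth values assumed for illustration, and should be checked against the vendor datasheets before being relied on.

```python
# (name, peak TFLOPS, peak bandwidth in TB/s)
# V100 and A100 rows are approximate, assumed figures; H100 is from this page.
GPUS = [
    ("V100", 125.0, 0.9),
    ("A100 40GB", 312.0, 1.555),
    ("H100", 990.0, 3.35),
]

# Ridge point in FLOP/byte for each generation.
ridges = [(name, tflops / tbs) for name, tflops, tbs in GPUS]

for name, ridge in ridges:
    print(f"{name}: ridge = {ridge:.0f} FLOP/byte")
```

Under these assumed figures the ridge climbs with each generation, so a kernel with a fixed intensity falls further below it over time.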
The starkest example is LLM token generation. Generating one token reads the entire model's weights from HBM but does only a little arithmetic with each byte: an intensity near one, hundreds of times below the ridge. So unbatched decode runs at well under one percent of the chip's peak FLOPS (about 0.3% at 1 FLOP/byte against the H100's ridge of 296), bottlenecked entirely on how fast weights stream out of memory. This is why two accelerators with very different FLOPS can serve tokens at nearly the same speed if their bandwidth is similar, and why HBM bandwidth gains matter as much as compute gains.
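The claim that bandwidth, not FLOPS, sets decode speed can be checked with a back-of-the-envelope bound. The sketch below is illustrative only: the 7-billion-parameter model, the 16-bit weights, and the second accelerator ("B", with half the compute but equal bandwidth) are hypothetical, and the estimate ignores KV-cache traffic, kernel overheads, and achievable (as opposed to peak) bandwidth.

```python
def decode_tokens_per_second(n_params: float, bytes_per_param: float,
                             peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-sequence decode rate (hypothetical helper).

    Time per token is the slower of streaming all weights once and
    performing 2 FLOPs per weight.
    """
    memory_time = n_params * bytes_per_param / bandwidth_bytes_per_s
    compute_time = 2.0 * n_params / peak_flops
    return 1.0 / max(memory_time, compute_time)


N_PARAMS = 7e9       # hypothetical model size
BYTES_PER_PARAM = 2  # assumed 16-bit weights

# Accelerator A uses this page's H100 figures; B is hypothetical.
rate_a = decode_tokens_per_second(N_PARAMS, BYTES_PER_PARAM, 990e12, 3.35e12)
rate_b = decode_tokens_per_second(N_PARAMS, BYTES_PER_PARAM, 495e12, 3.35e12)

print(f"A: {rate_a:.1f} tokens/s, B: {rate_b:.1f} tokens/s")
```

Both accelerators land on the same bound because the memory term dominates in each case; halving the FLOPS changes nothing until the compute term becomes the larger of the two.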
The good news is that arithmetic intensity is a software lever. Batching (serving many requests per weight load), kernel fusion, tiling, and data reuse all raise the FLOPs per byte, shifting a workload rightward along the roofline toward the ridge and reclaiming idle compute. For LLM inference, batching is the dominant technique, which is exactly why serving throughput improves so much with concurrency. Model that in the LLM Serving console, and size the memory itself in the HBM Cost console.
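The effect of batching on the roofline can be sketched with the same arithmetic. The model below is a deliberate simplification, not the LLM Serving console's method: it assumes each weight is loaded once per step and reused across every request in the batch, with 2 FLOPs per weight per request, 16-bit weights, and no KV-cache or activation traffic. Real serving falls short of this because KV-cache reads grow with batch size and context length. The helper name `batched_decode` is hypothetical.

```python
def batched_decode(batch: int, peak_tflops: float = 990.0,
                   bandwidth_tbs: float = 3.35, bytes_per_param: float = 2.0):
    """Return (intensity, attainable TFLOPS, fraction of peak) for a batch size.

    Assumes weights are read once per step and shared by the whole batch,
    so intensity grows linearly with batch size (idealized).
    """
    intensity = 2.0 * batch / bytes_per_param
    attainable = min(peak_tflops, intensity * bandwidth_tbs)
    return intensity, attainable, attainable / peak_tflops


for batch in (1, 8, 64, 296, 512):
    intensity, attainable, fraction = batched_decode(batch)
    print(f"batch {batch:>3}: {intensity:6.1f} FLOP/byte, "
          f"{attainable:6.1f} TFLOPS, {fraction:.1%} of peak")
```

In this idealized model the workload crosses the ridge at a batch size close to the ridge value itself (about 296 for the H100 figures on this page), after which extra batching no longer raises attainable FLOPS.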
Trusted by Kernel, Performance & Systems Teams
“Ridge point, regime, and attainable percentage of peak in one screen is exactly the first analysis I do on any kernel. Seeing LLM decode at single-digit percent of peak FLOPS — purely bandwidth-bound — is the result that reframes where to optimize. Matches my profiler's roofline.”
“The 'more FLOPS is wasted on a memory-bound kernel' point is the one that changes hardware decisions — for our inference workload HBM bandwidth is the spec that matters, not TFLOPS. Batching to move up the roofline is the lever we pull, and this shows it. Pairs perfectly with the model-fit and serving tools.”
“Clean roofline with GPU and op presets — decode vs GEMM vs elementwise is instantly clear. The required-bandwidth-to-be-compute-bound figure is a nice touch. Would love measured-intensity import, but as a bounds-and-direction tool it's exactly right.”
“Explaining to leadership why a faster-FLOPS GPU didn't speed up inference — because we're memory-bound — is a one-chart conversation here. The ridge point per GPU is the number. Fast, exact, and the regime call is always right.”
ridge = peak FLOPS ÷ peak bandwidth · attainable = min(peak FLOPS, intensity × bandwidth) · Last reviewed: 2026-06