Model Fit Checker
Will the model fit? Sum the weights, KV cache and optimizer states, compare the total against your GPUs' HBM, and get the minimum GPU count plus a parallelism or quantization recommendation when it doesn't fit.
Params, precision & GPU HBM → does it fit?
Memory budget console
100% of 160 GB available across 2 × 80 GB GPUs. Leave headroom — fitting to 100% fails in practice.
The 70B model in inference needs 319 GB (140 weights, 172 KV). Weights exceed one GPU — tensor-parallel across ≥2 GPUs, or quantize to a smaller precision.
Quantizing to int4 would cut weights to 35 GB.
Check the bandwidth bottleneck in the HBM Bandwidth console; size serving in LLM Serving.
Why memory, not compute, is the first wall
A 70B model in fp16 is ~140 GB — already past a single 80 GB GPU before any cache or activations. Model size, not compute, is frequently the first wall.
Dropping from fp16 to int8 halves the weight memory; int4 quarters it. A model that needs two GPUs at fp16 can fit on one at int4 — often the difference between feasible and not.
Mixed-precision Adam keeps fp16 weights, fp16 gradients, an fp32 master copy and two moment estimates — roughly 16 bytes per parameter versus 2 for fp16 inference. That's why training a model takes far more GPUs than serving it.
Long contexts and large batches inflate the key/value cache, which can rival the weights for big-context serving. It's the memory term that scales with how you use the model, not just its size.
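A minimal sketch of how the KV-cache term can be estimated, assuming the common accounting of one key and one value vector per layer, per KV head, per token. The function name and the model shape (80 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values taken from the checker.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV-cache size in bytes.

    Each token stores one key and one value vector (the factor of 2)
    per layer and per KV head, at bytes_per_elem bytes per element.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed, illustrative shape for a 70B-class model with grouped-query attention.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1, batch_size=1)
long_context = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=32_768, batch_size=16)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"32k context x 16 sequences: {long_context / 1e9:.1f} GB")
```

Under these assumptions the cache is about 172 GB at fp16, more than the 140 GB of fp16 weights, and it doubles whenever context length or batch size doubles.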
Counting bytes before FLOPs
The first question in deploying a large model isn't how fast it runs — it's whether it fits. Memory, not compute, is usually the wall you hit first, because a model's parameters have to physically live in the accelerator's HBM, and modern models are enormous. A 70-billion-parameter model in half precision is 140 gigabytes of weights alone, already past a single 80 GB GPU before a single token is processed.
Precision is the lever that moves that number most. Each parameter takes four bytes in fp32, two in fp16 or bf16, one in int8 and half a byte in int4, so quantizing fp16 weights to int8 halves the memory and int4 quarters it; a model that needs two GPUs at fp16 can fit on one at int4. With modern quantization methods the accuracy cost is often small, which is why int8 and int4 deployment is so widespread: it's the cheapest capacity win available, and stepping down through the precisions is the first thing to try when a model doesn't fit.
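The bytes-per-parameter figures above translate directly into weight memory. The sketch below uses hypothetical names (`BYTES_PER_PARAM`, `weight_gb`) and decimal gigabytes, which matches the 140 GB figure quoted for a 70B model in fp16.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(n_params, precision):
    """Weight memory in decimal GB for n_params parameters at the given precision."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


N_PARAMS = 70e9  # the 70B example used throughout this page

for precision in ("fp32", "fp16", "int8", "int4"):
    print(f"{precision}: {weight_gb(N_PARAMS, precision):.0f} GB")
```

For 70B parameters this gives 280, 140, 70 and 35 GB. It counts weights only; quantization formats that store per-group scales need slightly more than the nominal half byte per parameter.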
Inference adds the KV cache — the stored attention keys and values that grow with context length and batch size, and for long-context, high-concurrency serving can rival the weights themselves. Training is a different regime entirely: mixed-precision Adam keeps fp16 weights, fp16 gradients, an fp32 master copy and two moment estimates, roughly sixteen bytes per parameter against two for inference. That eightfold overhead is why training a model takes far more GPUs than serving it, and why sharding techniques like ZeRO and FSDP exist.
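The sixteen-bytes-per-parameter figure can be tallied term by term. This is a sketch that assumes both Adam moment estimates are kept in fp32, the usual mixed-precision setup; the names are hypothetical, and activations are deliberately left out.

```python
# Per-parameter training state for mixed-precision Adam, in bytes.
# Assumption: both moment estimates are stored in fp32.
TRAINING_BYTES = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

INFERENCE_BYTES = 2  # fp16 weights only


def training_state_gb(n_params):
    """Weights, gradients and optimizer state in decimal GB (no activations)."""
    return n_params * sum(TRAINING_BYTES.values()) / 1e9


bytes_per_param = sum(TRAINING_BYTES.values())
print(f"{bytes_per_param} bytes/param, {bytes_per_param // INFERENCE_BYTES}x inference")
print(f"70B training state: {training_state_gb(70e9):.0f} GB")
```

For a 70B model that is 1,120 GB of state before any activations, which is why the state is sharded across GPUs rather than replicated on each one.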
When the total exceeds one GPU, you split the model — tensor parallelism across fast interconnect for low latency, pipeline parallelism for less bandwidth pressure, ZeRO/FSDP to shard training state. This checker tells you which regime you're in and what to do. Once it fits, the next bottleneck is often bandwidth, not capacity — check the HBM Bandwidth console — and size the deployment in the LLM Serving console.
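The capacity side of the fit check reduces to a division with headroom. The sketch below is an assumption about how such a check could look, not the checker's actual logic; `usable_fraction` is a hypothetical knob standing in for the headroom the console recommends leaving.

```python
import math


def min_gpus(total_gb, hbm_gb_per_gpu, usable_fraction=0.9):
    """Smallest GPU count whose usable HBM covers total_gb.

    usable_fraction is an assumed headroom factor, not a value from the checker.
    """
    usable_per_gpu = hbm_gb_per_gpu * usable_fraction
    return math.ceil(total_gb / usable_per_gpu)


def fits(total_gb, n_gpus, hbm_gb_per_gpu, usable_fraction=0.9):
    """True if total_gb fits in the usable HBM of n_gpus devices."""
    return total_gb <= n_gpus * hbm_gb_per_gpu * usable_fraction


# The console example: 319 GB of inference memory on 80 GB GPUs.
print(fits(319, n_gpus=2, hbm_gb_per_gpu=80))  # False
print(min_gpus(319, hbm_gb_per_gpu=80))        # 5
# int4 weights alone (35 GB) on the same GPU.
print(min_gpus(35, hbm_gb_per_gpu=80))         # 1
```

This counts capacity only. A real deployment may round the count up further, for example to a tensor-parallel degree that divides the model's attention heads evenly.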
memory = weights (params × bytes per param) + KV cache + optimizer states (training only), compared against GPUs × HBM per GPU · Last reviewed: 2026-06