Weights + KV + optimizer vs HBM · parallelism advice

Model Fit Checker

Will the model fit? Sum the weights, KV cache and optimizer states against your GPUs' HBM, then get the minimum GPU count and, when it doesn't fit, a parallelism or quantization recommendation.

01 · Quick check

Params, precision & GPU HBM → does it fit?

Fits? NO
Memory: 319 GB
Min GPUs: 4

02 · Deep analysis

Memory budget console

Memory breakdown
319 / 160 GB
Weights 140 GB
KV cache 172 GB
Workspace 7 GB

100% of 160 GB available across 2 × 80 GB GPUs. Leave headroom — fitting to 100% fails in practice.

Weights: 140 GB (fp16)
KV cache: 172 GB (8192 ctx × 8)
Min GPUs: 4 (80 GB each)
Does not fit 2 × 80 GB: needs 4

The 70B model in inference needs 319 GB (140 weights, 172 KV, 7 workspace). The weights alone exceed one 80 GB GPU, so split them tensor-parallel across at least 2 GPUs or quantize to a lower precision.
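
As a rough illustration of the arithmetic behind this verdict, the sketch below sums the three terms from the memory breakdown and divides by per-GPU HBM. The function name `min_gpus` and the `usable_fraction` headroom knob are assumptions made for this example, not the checker's actual implementation.

```python
import math


def min_gpus(total_gb: float, hbm_gb: float, usable_fraction: float = 1.0) -> int:
    """Smallest GPU count whose combined usable HBM covers total_gb.

    usable_fraction is a hypothetical headroom knob: 1.0 means every byte
    of HBM is counted as available, 0.9 reserves 10% per GPU.
    """
    return math.ceil(total_gb / (hbm_gb * usable_fraction))


# Figures from the memory breakdown, in GB.
weights_gb, kv_gb, workspace_gb = 140, 172, 7
total_gb = weights_gb + kv_gb + workspace_gb

fits_on_two = total_gb <= 2 * 80  # the 2 x 80 GB configuration being checked

print(total_gb, fits_on_two, min_gpus(total_gb, 80))
print(min_gpus(total_gb, 80, usable_fraction=0.9))
```

With no headroom the count comes out at 4, matching the result above. Reserving 10% of each GPU pushes it to 5, which is the practical meaning of the headroom warning.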

Quantizing to int4 would cut weights to 35 GB.

Check the bandwidth bottleneck in the HBM Bandwidth console; size serving in LLM Serving.

Why it matters

Why memory, not compute, is the first wall

Weights alone often exceed one GPU

A 70B model in fp16 is ~140 GB — already past a single 80 GB GPU before any cache or activations. Model size, not compute, is frequently the first wall.

Quantization is the cheapest capacity win

Dropping from fp16 to int8 halves the weight memory; int4 quarters it. A model that needs two GPUs at fp16 can fit on one at int4 — often the difference between feasible and not.
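
The scaling is simple enough to sketch: weight memory is parameter count times bytes per parameter. The byte-width table and the helper `weight_gb` are illustrative names, and the sketch ignores the scale and zero-point metadata that real int8 and int4 formats store alongside the weights.

```python
# Bytes per parameter at each precision (int4 packs two weights per byte).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight memory in decimal GB: billions of parameters x bytes each."""
    return params_billion * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_gb(70, precision):.0f} GB")
```

For a 70B model this gives 280, 140, 140, 70 and 35 GB respectively.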

Training needs ~8× the memory of inference

Mixed-precision Adam keeps fp16 weights, fp16 gradients, an fp32 master copy and two fp32 moment estimates: roughly 16 bytes per parameter (2 + 2 + 4 + 4 + 4) versus 2 for fp16 inference. That's why training a model takes far more GPUs than serving it.
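
The per-parameter tally can be written out as a small sketch. It assumes the two Adam moments are stored in fp32, which is what makes the total come to 16 bytes, and it leaves out activations, which add to the training total. The names `TRAINING_BYTES` and `training_state_gb` are made up for this example.

```python
# Bytes per parameter held during mixed-precision Adam training.
# Assumption: both moment estimates are kept in fp32.
TRAINING_BYTES = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master copy": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}
INFERENCE_BYTES = 2  # fp16 weights only


def training_state_gb(params_billion: float) -> float:
    """Weights, gradients and optimizer state in decimal GB (no activations)."""
    return params_billion * sum(TRAINING_BYTES.values())


bytes_per_param = sum(TRAINING_BYTES.values())
print(bytes_per_param, bytes_per_param / INFERENCE_BYTES, training_state_gb(70))
```

For a 70B model that is 1,120 GB of state before any activations, eight times the 140 GB of fp16 inference weights.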

The KV cache grows with context and batch

Long contexts and large batches inflate the key/value cache, which can rival or exceed the weights in long-context serving: in the example above it takes 172 GB against 140 GB of weights. It's the memory term that scales with how you use the model, not just its size.
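
A common way to estimate this term is two tensors (keys and values) per layer, per token, per sequence in the batch. The sketch below uses that formula. The model shape (80 layers, 64 key/value heads of dimension 128, fp16 cache) and the reading of "× 8" as a batch of 8 are assumptions chosen for illustration; they happen to reproduce the 172 GB figure shown for an 8192-token context, but the checker's real formula is not documented here.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, batch: int, bytes_per_value: int = 2) -> float:
    """KV cache size in decimal GB for a full context on every sequence."""
    # Factor of 2: one key tensor plus one value tensor per layer.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context * batch / 1e9


# Assumed shape for a 70B-class model; not taken from the checker itself.
kv = kv_cache_gb(layers=80, kv_heads=64, head_dim=128, context=8192, batch=8)
print(f"{kv:.1f} GB")

# Doubling the batch (or the context) doubles the cache.
print(f"{kv_cache_gb(80, 64, 128, 8192, 16):.1f} GB")
```

The `kv_heads` argument is smaller than the attention head count in models that share key/value heads across query heads, which shrinks the cache proportionally.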

Field notes

Counting bytes before FLOPs

The first question in deploying a large model isn't how fast it runs — it's whether it fits. Memory, not compute, is usually the wall you hit first, because a model's parameters have to physically live in the accelerator's HBM, and modern models are enormous. A 70-billion-parameter model in half precision is 140 gigabytes of weights alone, already past a single 80 GB GPU before a single token is processed.

Precision is the lever that moves that number most. Each parameter takes four bytes in fp32, two in fp16 or bf16, one in int8 and half a byte in int4, so quantizing weights from fp16 halves or quarters the memory, and a model that needs two GPUs at fp16 can fit on one at int4. With modern quantization methods the accuracy cost is often small, which is why int8 and int4 deployment is everywhere: it's the cheapest capacity win available, and stepping down through the precisions is the first thing to try when a model doesn't fit.

Inference adds the KV cache — the stored attention keys and values that grow with context length and batch size, and for long-context, high-concurrency serving can rival the weights themselves. Training is a different regime entirely: mixed-precision Adam keeps fp16 weights, fp16 gradients, an fp32 master copy and two moment estimates, roughly sixteen bytes per parameter against two for inference. That eightfold overhead is why training a model takes far more GPUs than serving it, and why sharding techniques like ZeRO and FSDP exist.

When the total exceeds one GPU, you split the model — tensor parallelism across fast interconnect for low latency, pipeline parallelism for less bandwidth pressure, ZeRO/FSDP to shard training state. This checker tells you which regime you're in and what to do. Once it fits, the next bottleneck is often bandwidth, not capacity — check the HBM Bandwidth console — and size the deployment in the LLM Serving console.
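
One way to picture the regime decision is as a short chain of comparisons. The function below is a hypothetical sketch, not the checker's logic: the name `recommend`, the order of the checks, and the advice for the case where weights fit but the total does not are all assumptions made for illustration.

```python
def recommend(weights_gb: float, total_gb: float, hbm_gb: float,
              training: bool = False) -> str:
    """Return a coarse regime label for a single-GPU HBM budget."""
    if total_gb <= hbm_gb:
        return "fits on one GPU"
    if training:
        # Optimizer state dominates, so shard it across GPUs.
        return "shard training state with ZeRO/FSDP"
    if weights_gb > hbm_gb:
        return "weights exceed one GPU: tensor-parallel or quantize"
    # Assumed advice: the overflow comes from KV cache and workspace.
    return "weights fit, total does not: shrink batch or context, or add GPUs"


print(recommend(weights_gb=140, total_gb=319, hbm_gb=80))
print(recommend(weights_gb=35, total_gb=214, hbm_gb=80))
print(recommend(weights_gb=140, total_gb=1120, hbm_gb=80, training=True))
```

The first call uses the 70B fp16 figures from the example and lands in the tensor-parallel-or-quantize regime. The second assumes int4 weights with the KV cache and workspace unchanged.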

Trusted by ML Systems & Deployment Teams

4.8
Based on 3,090 reviews

Weights + KV + optimizer against HBM with a parallelism recommendation is exactly the fit check I do before any deployment. The 70B-fp16-needs-two-GPUs result is instant, and switching to int4 to fit one is the lever we use daily. Training's 8× memory is correctly modeled.

Dr. Arjun Rao
ML systems engineer
June 14, 2026

The KV-cache-grows-with-context-and-batch point is the one that bites long-context serving, and this surfaces it. Min-GPU-count and the tensor-vs-pipeline-vs-ZeRO guidance match our deployment playbook. Pairs perfectly with the LLM serving and HBM bandwidth tools.

Sophie Tan
Inference infrastructure
May 14, 2026

Optimizer states dominating training memory, with the ZeRO/FSDP recommendation, is spot on — it's why our 70B run needs far more GPUs than serving it. Would love activation-checkpointing modeling, but as a first-order fit checker it's exactly right.

Marcus Klein
LLM training lead
March 24, 2026

The int4 edge preset nails our constraint — does the model fit a 24 GB device. Quantization as the cheapest capacity win is the daily reality, and seeing the memory drop across precisions is perfect. Fast and accurate.

Lena Park
Edge AI deployment
December 30, 2025

Connected instruments

Related tools

Similar Calculators

More tools in the same category

Inference Cost Calculator

Estimate deployment costs for AI models across cloud, edge, and hybrid infrastructures with per-query, per-token, and per-hour pricing models. Integrates GPU/ASIC rental rates, network egress, storage, and scaling overhead for accurate inference TCO analysis.

Training Cost Calculator

Calculate AI model training expenses including GPU cluster rental, data transfer, checkpoint storage, and engineering time with distributed-training overhead modeling. Supports LLM, vision, and multimodal training with FLOPs-to-cost mapping and carbon-footprint estimation.

GPU Cluster Sizing

Determine optimal GPU cluster configurations for training and inference workloads with interconnect topology modeling, memory-bandwidth balancing, and fault-tolerance planning. Supports NVIDIA, AMD, and custom accelerator clusters with InfiniBand and NVLink network analysis.

HBM Bandwidth Calculator

Estimate memory bandwidth requirements for AI workloads with operation-type analysis, data-movement profiling, and roofline model integration. Calculates HBM generation selection, channel count, and clock-speed requirements to eliminate memory-bound bottlenecks.

AI Chip Comparator

Compare AI accelerators across performance, cost, power, and software-ecosystem metrics with normalized benchmarking for training and inference workloads. Supports NVIDIA, AMD, Intel, Google TPU, Amazon Trainium, and custom ASICs with TCO-per-FLOP analysis.

Token Cost Estimator

Calculate infrastructure costs per token generated for LLM serving with batch-size optimization, KV-cache management, and speculative decoding impact. Models pricing for API providers and self-hosted deployments with demand-spike handling and multi-model routing.

memory = weights (params × bytes) + KV cache + optimizer (training) vs GPUs × HBM · Last reviewed: 2026-06