Question 1

How do I check whether a model fits on my GPUs?

Accepted Answer

Add up the memory it needs and compare the total to your available HBM. For inference: weight memory (parameters × bytes per parameter, set by precision) plus the KV cache (which grows with layers, hidden size, context length and batch size) plus a small activation workspace. For training, add gradients and optimizer states; the optimizer states dominate. Then compare the total to GPUs × HBM per GPU. If it exceeds a single GPU's HBM, you need model parallelism or quantization. This calculator computes each memory term, the total, whether it fits, the minimum GPU count, and a parallelism recommendation.
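As a rough illustration of the inference check described above, here is a minimal Python sketch. The function name `inference_fit` and the example figures (10 GB of KV cache, 2 GB of workspace) are assumptions chosen for illustration, not values taken from the calculator.

```python
def inference_fit(weights_gb, kv_cache_gb, workspace_gb, num_gpus, hbm_per_gpu_gb):
    """Return (total_gb, fits) for an inference deployment.

    All inputs are in decimal gigabytes; the names are illustrative only.
    """
    total_gb = weights_gb + kv_cache_gb + workspace_gb
    capacity_gb = num_gpus * hbm_per_gpu_gb
    return total_gb, total_gb <= capacity_gb


# Hypothetical example: 140 GB of fp16 weights, 10 GB KV cache, 2 GB workspace.
total, fits = inference_fit(140.0, 10.0, 2.0, num_gpus=2, hbm_per_gpu_gb=80.0)
print(f"total {total:.0f} GB, fits on 2 x 80 GB: {fits}")
```

The same call with `num_gpus=1` reports that the deployment does not fit, which is the signal to quantize or add GPUs.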

Question 2

How much memory do model weights need?

Accepted Answer

Weight memory = parameters × bytes per parameter, where the bytes depend on precision: 4 for fp32, 2 for fp16/bf16, 1 for int8, 0.5 for int4. So a 70-billion-parameter model needs about 280 GB in fp32, 140 GB in fp16, 70 GB in int8, or 35 GB in int4. This is just the weights — inference adds the KV cache and a workspace, and training adds far more for optimizer states. Weight memory is usually the first thing to check, because for large models it alone can exceed a single GPU's capacity.
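The arithmetic above can be reproduced with a few lines of Python. This is a sketch only: the table and function names are made up for illustration, and gigabytes are taken as decimal (1 GB = 1e9 bytes), matching the figures in the answer.

```python
# Bytes per parameter for each precision, as listed in the answer above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# A 70-billion-parameter model at each precision.
for precision in ("fp32", "fp16", "int8", "int4"):
    print(f"{precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```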

Question 3

What is the KV cache and how big is it?

Accepted Answer

The key-value cache stores the attention keys and values for every token already processed, so the model doesn't recompute them at each generation step. Its size is roughly 2 (K and V) × number of layers × hidden size × context length × batch size × bytes per element. That formula assumes standard multi-head attention; models that use grouped-query or multi-query attention keep fewer key/value heads, so their cache is proportionally smaller. For long contexts and large batches it can grow to tens of gigabytes, sometimes rivaling the weights. It's the memory term that scales with how you use the model (context, concurrency), not just its parameter count, which is why serving many long-context users is memory-hungry. This calculator computes it from your layers, hidden size, context and batch.
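The KV cache formula can be sketched in Python as follows. The function name and the example model shape (80 layers, hidden size 8192) are hypothetical, and the sketch assumes every layer caches full-width keys and values at 2 bytes per element.

```python
def kv_cache_gb(num_layers, hidden_size, context_len, batch_size, bytes_per_elem=2.0):
    """KV cache size in decimal gigabytes.

    Implements 2 (K and V) x layers x hidden size x context x batch x bytes.
    Assumes full-width keys and values in every layer (an assumption; models
    with fewer key/value heads need less).
    """
    elements = 2 * num_layers * hidden_size * context_len * batch_size
    return elements * bytes_per_elem / 1e9


# Hypothetical model shape: 80 layers, hidden size 8192, 4096-token context.
for batch in (1, 8):
    print(f"batch {batch}: {kv_cache_gb(80, 8192, 4096, batch):.1f} GB")
```

Under these assumptions a single sequence needs about 10.7 GB and a batch of eight about 85.9 GB, which shows how concurrency, not parameter count, drives this term.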

Question 4

Why does training need so much more memory than inference?

Accepted Answer

Because of optimizer states. Inference needs only the weights (plus KV cache and a small workspace). Training with mixed-precision Adam keeps the fp16 weights, the fp16 gradients, an fp32 master copy of the weights, and two fp32 moment estimates (momentum and variance) — roughly 16 bytes per parameter, versus 2 for fp16 inference. That's about 8× the memory just for the model state, before activations. This is why training a model takes many more GPUs than serving it, and why memory-saving techniques like ZeRO/FSDP sharding and optimizer offload are essential for large-model training.
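The 16-bytes-per-parameter figure can be tallied explicitly. The sketch below uses illustrative names and counts model state only; activations are left out, as in the answer above.

```python
# Per-parameter bytes held by mixed-precision Adam, as itemized above.
MIXED_PRECISION_ADAM_BYTES = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 momentum": 4,
    "fp32 variance": 4,
}


def training_state_gb(num_params):
    """Model-state memory for training in decimal gigabytes (no activations)."""
    return num_params * sum(MIXED_PRECISION_ADAM_BYTES.values()) / 1e9


bytes_per_param = sum(MIXED_PRECISION_ADAM_BYTES.values())
print(f"{bytes_per_param} bytes per parameter, {bytes_per_param // 2}x fp16 inference")
print(f"70B model state: {training_state_gb(70e9):.0f} GB")
```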

Question 5

How does quantization help a model fit?

Accepted Answer

Quantization reduces the bytes per parameter: fp16 (2 bytes) → int8 (1) → int4 (0.5), halving or quartering the weight memory. A 70B model's weights take 140 GB in fp16 but 35 GB in int4, so they fit on a single 80 GB GPU instead of needing two or more. It's usually the cheapest and fastest way to make a model fit, and it is widely used for inference (int8/int4 with minimal quality loss using modern methods). The trade-off is some accuracy loss and the need for inference kernels that support the quantized format. This calculator lets you switch precision and see the memory and GPU-count change immediately.

Question 6

What is model parallelism and when do I need it?

Accepted Answer

When a model doesn't fit on one GPU, you split it across several. Tensor parallelism splits individual layers' matrices across GPUs (low latency, needs fast interconnect like NVLink); pipeline parallelism puts different layers on different GPUs (higher latency, less bandwidth-hungry); and for training, ZeRO/FSDP shards the weights, gradients and optimizer states across GPUs. You need parallelism whenever the total memory exceeds a single GPU's HBM — which this calculator flags, along with which approach suits your case (weights-bound → tensor parallel; optimizer-bound → ZeRO).

Question 7

What is ZeRO / FSDP and why does it matter for training?

Accepted Answer

ZeRO (Zero Redundancy Optimizer) and FSDP (Fully Sharded Data Parallel) shard the training state — optimizer states, gradients, and optionally weights — across the data-parallel GPUs instead of replicating them on each. Since optimizer states dominate training memory (~12 of the ~16 bytes per parameter), sharding them across N GPUs cuts per-GPU memory dramatically, making it possible to train models that would never fit otherwise. This calculator flags when optimizer states dominate and recommends ZeRO/FSDP, which is the standard solution for large-model training memory.
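To see how sharding changes the per-GPU figure, here is a minimal sketch using the ~16-byte breakdown (2 for fp16 weights, 2 for fp16 gradients, 12 for the fp32 optimizer states). The stage numbering follows the common ZeRO convention (1 shards optimizer states, 2 adds gradients, 3 adds weights); the function name and the 7B/8-GPU example are assumptions, and activations and communication buffers are ignored.

```python
def zero_state_gb_per_gpu(num_params, num_gpus, stage):
    """Per-GPU model-state memory in decimal GB for mixed-precision Adam.

    stage 0: everything replicated; 1: optimizer states sharded;
    2: gradients also sharded; 3: weights also sharded.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus
    return num_params * (weights + grads + optim) / 1e9


# Hypothetical 7B-parameter model trained on 8 data-parallel GPUs.
for stage in range(4):
    print(f"stage {stage}: {zero_state_gb_per_gpu(7e9, 8, stage):.2f} GB per GPU")
```

In this example the per-GPU model state drops from 112 GB with full replication to 14 GB with everything sharded.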

Question 8

How many GPUs do I need to serve a large model?

Accepted Answer

Enough total HBM to hold the weights plus the KV cache plus workspace, at minimum — and often more for throughput. A 70B model in fp16 (140 GB weights) needs at least two 80 GB GPUs just to fit; quantized to int4 (35 GB) it fits on one. Adding concurrent users or long contexts grows the KV cache and may require more. This calculator gives the minimum GPU count to fit; production serving sizes up from there for latency and throughput, which the LLM serving calculator addresses.
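A minimum GPU count is a ceiling division of total memory by per-GPU capacity. The sketch below is illustrative: `min_gpus` and its `usable_fraction` parameter are hypothetical names, the latter added to model headroom, and the result assumes memory divides evenly across GPUs, which real parallelism schemes only approximate.

```python
import math


def min_gpus(total_gb, hbm_per_gpu_gb=80.0, usable_fraction=1.0):
    """Smallest GPU count whose combined usable HBM holds total_gb.

    usable_fraction is a hypothetical headroom knob: 0.9 means only 90%
    of each GPU's HBM is counted as available.
    """
    usable_per_gpu = hbm_per_gpu_gb * usable_fraction
    return math.ceil(total_gb / usable_per_gpu)


print(min_gpus(140.0))  # 70B weights in fp16 on 80 GB GPUs
print(min_gpus(35.0))   # the same weights in int4
```

Passing the full total (weights plus KV cache plus workspace) rather than weights alone gives the figure that matters for serving.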

Question 9

What precision should I use for deployment?

Accepted Answer

For inference, bf16/fp16 is the quality baseline; int8 typically loses little accuracy with modern quantization and halves memory; int4 quarters memory with a small, often acceptable quality cost using methods like GPTQ/AWQ. For training, bf16 mixed precision is standard (with fp32 master weights). The right choice balances memory/cost against accuracy for your task — many production deployments use int8 or int4 to fit larger models on fewer GPUs. This calculator lets you compare the memory and GPU count across all precisions to inform that trade-off.

Question 10

How accurate is this memory estimate?

Accepted Answer

The weight and optimizer-state calculations are exact for the precision and method (mixed-precision Adam) modeled, and the KV cache formula is the standard one. Real usage adds framework overhead, fragmentation, temporary buffers, and activation memory that varies with implementation (activation checkpointing reduces it), so leave headroom — fitting to 100% of HBM will fail in practice. Use this for first-order feasibility and GPU-count planning; profile the actual deployment for precise memory. The conclusions — weights vs one GPU, quantization savings, training's 8× overhead — are robust.

Question 11

Does this tool send my data anywhere?

Accepted Answer

No. All memory math runs entirely in your browser in JavaScript — nothing is uploaded and there's no telemetry.

Model Fit Checker

