Inference Cost Console
A model is trained once but serves queries forever — so inference cost dominates. Compute the cost per inference (hardware ÷ throughput, plus egress overhead), per 1,000 inferences, and the monthly bill, in any currency.
Throughput, hardware cost & utilization → cost per inference.
Inference unit-economics console
Compute is 3% of the per-inference cost. The hardware serves 0.58M inferences/hour at 80% utilization.
At 200 inf/s and 80% utilization, the hardware ($2.04/hr) serves 0.58M inferences/hour, which puts compute at roughly $0.0035/1k. With overhead added, each inference costs about $0.0001 ($0.1035/1k). At 50M queries/month that's $5,177.
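The arithmetic behind these figures can be reproduced with a short Python sketch. It is an illustration, not the console's actual implementation: the function names are hypothetical, `hourly_cost` is treated as the all-in hardware cost, and the overhead of $0.0001 per inference is an assumption inferred from the totals shown above.

```python
SECONDS_PER_HOUR = 3600


def inferences_per_hour(throughput_per_s: float, utilization: float) -> float:
    """Inferences actually served in one hour at the given utilization (0-1)."""
    return throughput_per_s * SECONDS_PER_HOUR * utilization


def cost_per_inference(
    hourly_cost: float,
    throughput_per_s: float,
    utilization: float,
    overhead_per_inference: float = 0.0,
) -> float:
    """All-in hourly hardware cost spread over served inferences, plus overhead."""
    compute = hourly_cost / inferences_per_hour(throughput_per_s, utilization)
    return compute + overhead_per_inference


# Inputs from the example above; the overhead value is an assumption.
served = inferences_per_hour(200, 0.80)
unit_cost = cost_per_inference(2.04, 200, 0.80, overhead_per_inference=0.0001)
per_1k = unit_cost * 1000
monthly = unit_cost * 50_000_000

print(f"{served / 1e6:.2f}M inferences/hour")  # 0.58M inferences/hour
print(f"${per_1k:.4f} per 1k inferences")      # $0.1035 per 1k inferences
print(f"${monthly:,.0f} per month")            # $5,177 per month
```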
Doubling utilization or throughput roughly halves the compute portion of the per-inference cost. These are the main levers on compute, since the hourly cost is fixed; the overhead portion does not shrink with them.
For LLM token-based pricing use the Token Cost console; size the fleet in LLM Serving.
Currency conversion uses indicative rates — verify against a live source for contracts.
Why inference is the real bill
A model is trained once but serves queries for its whole life, so the cumulative inference cost dwarfs training. Cost per inference, multiplied by billions of queries, is the real spend.
Per-inference cost is the hardware's hourly cost divided by the queries it serves in that hour, so an under-loaded GPU makes every query more expensive. Throughput and utilization are the levers.
Network egress, storage, and load balancing add a per-query cost on top of compute. It is negligible per call but real across billions, and it is the part a naive compute-only estimate misses.
Pricing, budgeting and unit economics for inference are expressed in cost per thousand (or million) inferences. Computing it from throughput and hardware cost is the basis of every inference business case.
The bill that never stops
Training a model is a dramatic, one-time expense; serving it is a quiet bill that arrives with every single query, forever. For any successful AI product the cumulative inference cost overtakes training quickly and then keeps growing with usage, which is why the operating metric that matters most is not the cost of the training run but the cost per inference, multiplied by the billions of queries a deployed model handles.
That cost has a simple core: the all-in hourly cost of the serving hardware divided by the number of inferences it produces in that hour. Because the hourly cost is essentially fixed, the denominator is everything — throughput and utilization. A well-batched, fully-loaded accelerator spreads its cost over enormous query volume and drives the per-inference cost down; an under-utilized one pays the same hourly cost for far fewer queries, and each one costs more. Keeping inference hardware busy is the heart of cheap serving.
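To make the effect of the denominator concrete, here is a minimal Python sketch that sweeps utilization while holding the hourly cost fixed. The function name is hypothetical, and the $2.04/hr and 200 inf/s inputs are borrowed from the console example purely for illustration; overhead is left out so that only the compute term is visible.

```python
def compute_cost_per_1k(
    hourly_cost: float, peak_throughput_per_s: float, utilization: float
) -> float:
    """Compute-only cost per 1,000 inferences at a given utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    served_per_hour = peak_throughput_per_s * 3600 * utilization
    return hourly_cost / served_per_hour * 1000


# Same hardware and hourly cost, three load levels.
for util in (0.2, 0.4, 0.8):
    cost = compute_cost_per_1k(2.04, 200, util)
    print(f"{util:.0%} utilization: ${cost:.4f} per 1k")
```

Each doubling of utilization halves the compute cost per thousand, because the same hourly cost is divided across twice as many served queries.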
The part a compute-only estimate misses is overhead. Network egress, storage, load balancing — each is negligible on a single query but real across billions, and ignoring it understates the true cost. A complete per-inference figure adds that overhead on top of compute, which is why this console separates the two and shows compute's share: when overhead becomes a meaningful slice, it's a signal to optimize data movement, not just the model.
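The compute-versus-overhead split can be sketched in a few lines of Python. This is a hedged illustration rather than the console's own code: `cost_breakdown` is a hypothetical helper, and the overhead of $0.0001 per inference is an assumed value chosen to match the console example.

```python
def cost_breakdown(
    hourly_cost: float,
    throughput_per_s: float,
    utilization: float,
    overhead_per_inference: float,
) -> tuple[float, float, float]:
    """Return (compute, overhead, total) cost per inference."""
    compute = hourly_cost / (throughput_per_s * 3600 * utilization)
    return compute, overhead_per_inference, compute + overhead_per_inference


# Baseline: console example inputs with an assumed overhead per inference.
compute, overhead, total = cost_breakdown(2.04, 200, 0.80, 0.0001)
share = compute / total
print(f"compute share of per-inference cost: {share:.0%}")

# Doubling throughput halves compute but leaves overhead untouched.
_, _, total_fast = cost_breakdown(2.04, 400, 0.80, 0.0001)
# Halving overhead (for example, less data moved per query) leaves compute untouched.
_, _, total_lean = cost_breakdown(2.04, 200, 0.80, 0.00005)

print(f"saving from 2x throughput:    {1 - total_fast / total:.1%}")
print(f"saving from halving overhead: {1 - total_lean / total:.1%}")
```

When compute is only a small share of the total, as in this example, speeding up the model barely moves the per-inference cost, while trimming overhead moves it substantially.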
Expressed per thousand or per million inferences, this is the unit every inference business case runs on — pricing, budgeting, and margin all derive from it. For generative LLMs where output length varies, the natural unit is the token instead — use the Token Cost console — and size the serving fleet that sets your throughput and utilization in the LLM Serving console.
Trusted by Inference Economics & Product Teams
“Cost per 1k inferences from hardware cost ÷ throughput, with utilization as the hinge, is exactly the operating number our pricing rests on. Including egress overhead beyond compute is the part naive estimates miss. Seeing it in euros and dollars settles cross-region unit economics.”
“The inference-dwarfs-training framing is the truth that justifies our serving-optimization roadmap. Per-query cost falling with batching/utilization is the lever, and this quantifies it. Pairs perfectly with the token-cost and accelerator-ROI tools for the full cost picture.”
“Clean per-inference and monthly cost with the compute-vs-overhead split. The utilization sensitivity is the reality check for our autoscaling. Would love cold-start and demand-variability modeling, but as a unit-economics tool it's exactly right.”
“Cost per thousand inferences is the unit we budget and price on, and this computes it honestly with overhead. Multi-currency matters for our global product. The vision-model preset matches our measured cost closely. Excellent.”
Related tools
Similar Calculators
More tools in the same category
Training Cost Calculator
Calculate AI model training expenses including GPU cluster rental, data transfer, checkpoint storage, and engineering time with distributed-training overhead modeling. Supports LLM, vision, and multimodal training with FLOPs-to-cost mapping and carbon-footprint estimation.
GPU Cluster Sizing
Determine optimal GPU cluster configurations for training and inference workloads with interconnect topology modeling, memory-bandwidth balancing, and fault-tolerance planning. Supports NVIDIA, AMD, and custom accelerator clusters with InfiniBand and NVLink network analysis.
Model Fit Checker
Verify whether AI models fit within hardware constraints including GPU HBM capacity, on-chip SRAM, and interconnect bandwidth with layer-wise memory profiling. Supports model parallelism, pipeline parallelism, and ZeRO optimization recommendations for large-model deployment.
HBM Bandwidth Calculator
Estimate memory bandwidth requirements for AI workloads with operation-type analysis, data-movement profiling, and roofline model integration. Calculates HBM generation selection, channel count, and clock-speed requirements to eliminate memory-bound bottlenecks.
AI Chip Comparator
Compare AI accelerators across performance, cost, power, and software-ecosystem metrics with normalized benchmarking for training and inference workloads. Supports NVIDIA, AMD, Intel, Google TPU, Amazon Trainium, and custom ASICs with TCO-per-FLOP analysis.
Token Cost Estimator
Calculate infrastructure costs per token generated for LLM serving with batch-size optimization, KV-cache management, and speculative decoding impact. Models pricing for API providers and self-hosted deployments with demand-spike handling and multi-model routing.
cost/inference = (GPU $/hr + power) ÷ (throughput × 3600 × util) + overhead · cost per 1k = cost/inference × 1000 · Last reviewed: 2026-06