GPU Cluster Sizing Console
Turn a GPU count into a datacenter build. Compute the nodes, racks, power and network. Power, not floor space, is the constraint, and the interconnect is what makes the GPUs actually scale.
GPU count & topology → nodes, racks and power.
Cluster build console
1,024 GPUs is 128 nodes across 32 racks, drawing 909 kW IT (1.18 MW facility). At 37 kW/rack, air cooling is feasible.
GPUs share an all-to-all NVLink domain within each 8-GPU node; a fat-tree of 128 inter-node links gives ~26 TB/s bisection for all-reduce. Under-provision it and scaling stalls.
Get the GPU count from Training Cost; cost the facility power in Data Center Power.
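The example figures above can be reproduced with a short sketch. Only the 8 GPUs per node comes from the text. The other parameter values (700 W per GPU, 1,500 W of non-GPU overhead per node, 4 nodes per rack, PUE of 1.3) are assumptions back-solved from the example output, and the function and field names are hypothetical, not the console's own.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterScope:
    nodes: int
    racks: int
    it_kw: float
    facility_kw: float
    kw_per_rack: float


def scope_cluster(gpus, gpus_per_node=8, nodes_per_rack=4,
                  gpu_watts=700.0, node_overhead_watts=1500.0, pue=1.3):
    """First-order cluster scope. All defaults except gpus_per_node are assumed."""
    nodes = math.ceil(gpus / gpus_per_node)
    racks = math.ceil(nodes / nodes_per_rack)
    # IT power: GPU draw plus per-node overhead (CPUs, NICs, fans, storage).
    it_kw = (gpus * gpu_watts + nodes * node_overhead_watts) / 1000.0
    # Facility power scales IT power by PUE to cover cooling and distribution.
    facility_kw = it_kw * pue
    # Assumption: per-rack power is facility power spread evenly over racks.
    return ClusterScope(nodes, racks, it_kw, facility_kw, facility_kw / racks)


scope = scope_cluster(1024)
print(f"{scope.nodes} nodes, {scope.racks} racks, "
      f"{scope.it_kw:.0f} kW IT, {scope.facility_kw / 1000:.2f} MW facility, "
      f"{scope.kw_per_rack:.0f} kW/rack")
```

Under these assumed values the 37 kW/rack figure corresponds to facility power divided by rack count; IT power alone would come to roughly 28 kW per rack.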
Why a cluster is power and network
A rack of kilowatt-class GPUs draws tens of kilowatts, far beyond traditional ~10 kW racks. Modern AI datacenters are designed around power and cooling density, with far fewer servers per rack than legacy halls.
Training spreads a model across thousands of GPUs that must exchange gradients constantly. The network (NVLink within a node, InfiniBand or Ethernet between nodes) determines whether the GPUs actually scale or stall on communication.
Within a node, GPUs share an all-to-all high-bandwidth NVLink domain; across nodes, a fat-tree of InfiniBand switches provides full bisection bandwidth. The two-tier hierarchy is the standard AI-cluster topology.
How much data can cross the middle of the network at once — the bisection bandwidth — sets the ceiling on collective operations like all-reduce. Under-provision it and adding GPUs stops helping.
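As a rough illustration, the first-order bisection estimate given in the formula line at the end of this page (GPUs × NIC ÷ 2) can be sketched as follows. The 400 Gb/s NIC per GPU is an assumption chosen because it reproduces the ~26 TB/s example, and the `oversubscription` parameter is a hypothetical addition that the console itself does not model.

```python
def bisection_tb_per_s(gpus, nic_gbit_per_s=400.0, oversubscription=1.0):
    """First-order bisection bandwidth in TB/s.

    Assumes one NIC per GPU and a fat-tree; oversubscription=1.0 means
    non-blocking, 2.0 means half the uplink capacity, and so on.
    """
    nic_gbyte_per_s = nic_gbit_per_s / 8.0           # bits to bytes
    full_bisection = gpus * nic_gbyte_per_s / 2.0    # half the endpoints send across the cut
    return full_bisection / oversubscription / 1000.0  # GB/s to TB/s


print(f"non-blocking:       {bisection_tb_per_s(1024):.1f} TB/s")
print(f"2:1 oversubscribed: {bisection_tb_per_s(1024, oversubscription=2.0):.1f} TB/s")
```

Halving the uplink capacity halves the bandwidth available to cluster-wide collectives, which is the under-provisioning risk described above.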
From a pile of GPUs to a machine
A GPU count is not a cluster. Turning thousands of accelerators into a machine that trains a model is a problem of power and network far more than of the chips themselves, and getting either wrong wastes the silicon. The first surprise for anyone from traditional IT is power density: a rack of kilowatt-class GPUs draws tens of kilowatts, several times what legacy racks were built for, so AI datacenters are designed around delivering and removing that power — fewer servers per rack, liquid cooling, dedicated substations.
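To make the density point concrete, here is a minimal sketch of how many 8-GPU nodes fit under a given rack power budget. The 700 W per GPU and 1,500 W of per-node overhead are illustrative assumptions, not figures from the text, and the function name is hypothetical.

```python
def nodes_per_rack(rack_budget_kw, gpus_per_node=8,
                   gpu_watts=700.0, node_overhead_watts=1500.0):
    """Whole nodes that fit in a rack's IT power budget (assumed wattages)."""
    node_kw = (gpus_per_node * gpu_watts + node_overhead_watts) / 1000.0
    return int(rack_budget_kw // node_kw)


# Compare a legacy-style budget with two higher-density budgets.
for budget_kw in (10, 30, 40):
    print(f"{budget_kw} kW rack -> {nodes_per_rack(budget_kw)} node(s)")
```

With these assumptions a single node draws about 7 kW, so a legacy ~10 kW rack holds one node, while a higher-density rack holds several.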
The second, deeper truth is that the interconnect is the cluster. Distributed training synchronizes gradients across every GPU on every step, so the machine is only as fast as its slowest communication path. The standard answer is a two-tier hierarchy: within a node, eight GPUs share an all-to-all NVLink domain at terabytes per second; across nodes, a fat-tree of InfiniBand switches provides full bisection bandwidth so any node can talk to any other at full rate simultaneously.
That bisection bandwidth, meaning how much data can cross the middle of the network at once, is the number that gates scaling. Collective operations like all-reduce move data across the whole cluster, and if the network can't carry it, communication overwhelms computation and adding GPUs stops helping: both scaling efficiency and model FLOPs utilization (MFU) collapse. This is why frontier clusters spend enormously on networking, and why a sizing exercise must include the network, not just the node and rack counts.
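A simple way to see communication overtaking computation is a bandwidth-only model of ring all-reduce, in which each GPU sends and receives 2(N−1)/N times the gradient buffer per step. The sketch below is a deliberately pessimistic illustration: it ignores latency, the faster intra-node NVLink tier, and any overlap of communication with compute, and the workload numbers (20 GB of gradients, 1 s of compute per step) are hypothetical.

```python
def ring_allreduce_seconds(gradient_bytes, gpus, per_gpu_bytes_per_s):
    """Bandwidth-only time for one ring all-reduce (latency ignored)."""
    if gpus < 2:
        return 0.0
    # Reduce-scatter plus all-gather: each moves (N-1)/N of the buffer per GPU.
    return 2.0 * (gpus - 1) / gpus * gradient_bytes / per_gpu_bytes_per_s


def scaling_efficiency(compute_s, comm_s):
    """Fraction of step time spent computing, assuming no compute/comm overlap."""
    return compute_s / (compute_s + comm_s)


gradient_bytes = 20e9   # hypothetical gradient volume per step
compute_s = 1.0         # hypothetical compute time per step

for gbyte_per_s in (50, 25, 12.5):  # effective per-GPU network bandwidth
    comm_s = ring_allreduce_seconds(gradient_bytes, 1024, gbyte_per_s * 1e9)
    eff = scaling_efficiency(compute_s, comm_s)
    print(f"{gbyte_per_s:>5} GB/s per GPU: comm {comm_s:.2f} s, efficiency {eff:.0%}")
```

Each halving of effective network bandwidth doubles the communication time in this model, so efficiency falls steadily as the fabric is under-provisioned. Real frameworks overlap communication with the backward pass, so actual efficiency is typically better than this model suggests.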
This console scopes that build from a GPU count: nodes, racks, the power envelope, and a first-order bisection bandwidth. The count itself comes from the workload in the Training Cost console, the facility power flows into the Data Center Power console for energy and cost, and the own-vs-rent question is handled in the Accelerator ROI console.
Trusted by Datacenter & HPC Infrastructure Teams
“Nodes, racks, power-per-rack and bisection bandwidth from a GPU count is exactly the first-pass cluster scope. The power-not-floor-space point is the one facility teams must internalize — 37kW/rack is a different building than legacy. The two-tier NVLink/InfiniBand model is right.”
“The 16k-GPU at 25MW figure is the utility-scale reality that drives our site selection. Bisection bandwidth gating all-reduce is the insight that justifies the InfiniBand spend. Pairs perfectly with the training-cost and data-center power tools.”
“Clean node/rack/power scoping with the NVLink-domain and fat-tree framing. Power-per-rack against our density limit is the check I run first. Would love switch-count and oversubscription modeling, but as a first-pass sizing tool it's exactly right.”
“We scope clusters off this before detailed design — node count, megawatts, racks. The interconnect-is-the-cluster framing reframes it from a pile of GPUs to a network. Feeds straight into the data-center power estimator. Excellent.”
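Formulas: nodes = GPUs ÷ GPUs per node · facility power = (GPUs × W + nodes × overhead) × PUE · bisection ≈ GPUs × NIC ÷ 2, where W is the power draw per GPU, overhead is the non-GPU power per node, and NIC is the network bandwidth per GPU.
Last reviewed: 2026-06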