You might have heard that we can simply keep making large language models bigger to get them smarter. It sounds like a straightforward path: throw more parameters at the problem, and performance improves. But there is a hard stop coming up fast. We are hitting a physical wall.
The promise of Large Language Models relies on scaling laws: the idea that model capability grows predictably with compute. However, the hardware required to run these models faces immutable physical limits. From the silicon inside your graphics card to the electricity grid powering entire cities, every layer of infrastructure is straining under the weight of modern AI. If you are trying to scale an LLM, you aren't just fighting code; you are fighting physics.
The Memory-Compute Imbalance
The biggest bottleneck right now isn't how fast your processor can calculate numbers. It's how fast it can move those numbers around. This is known as the memory-compute imbalance. Modern GPUs like the NVIDIA H100 or the newer H200 are incredibly powerful calculators. They can perform trillions of operations per second. But they have a limited amount of High-Bandwidth Memory (HBM).
An H100 has 80 GB of HBM. The H200 bumps this to 141 GB. Sounds like a lot? Not when you consider what a large model needs. Storing the weights of a 70-billion-parameter model in full precision (FP32) takes roughly 280 GB on its own; add the gradients and optimizer states needed for training and the total climbs past a terabyte. You cannot fit that into a single chip. Even if you could, the data wouldn't move fast enough.
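To make that arithmetic concrete, here is a minimal sketch of the estimate, assuming FP32 weights and gradients plus Adam-style optimizer state (two extra FP32 tensors per parameter) and ignoring activations and framework overhead.

```python
def training_memory_gb(params_billion: float,
                       bytes_per_weight: int = 4,     # FP32 weights
                       bytes_per_grad: int = 4,       # FP32 gradients
                       bytes_per_opt_state: int = 8   # Adam: two FP32 moment tensors
                       ) -> float:
    """Rough training-memory estimate, ignoring activations and buffers."""
    params = params_billion * 1e9
    total_bytes = params * (bytes_per_weight + bytes_per_grad + bytes_per_opt_state)
    return total_bytes / 1e9  # decimal gigabytes

print(f"Weights only (FP32): {70e9 * 4 / 1e9:.0f} GB")                    # ~280 GB
print(f"Weights + grads + Adam states: {training_memory_gb(70):.0f} GB")  # ~1120 GB
```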
Think of it like a kitchen. The GPU is the chef. The memory is the counter space. If the chef is super fast but the counter is tiny, the chef spends most of their time waiting for ingredients to be brought in from the pantry (the main system RAM). In AI terms, the compute units sit idle while waiting for data. This makes the system "memory-bound" rather than "compute-bound." No matter how much faster you make the processor, if the memory bandwidth doesn't increase proportionally, you hit a ceiling.
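One way to quantify "memory-bound" is to compare a workload's arithmetic intensity (FLOPs per byte of data moved) against the hardware's compute-to-bandwidth ratio. The sketch below plugs in rough published peak figures for an H100 SXM; the exact values matter less than the size of the gap.

```python
# Illustrative roofline-style check using rough published H100 SXM peak numbers.
peak_flops = 990e12        # ~990 TFLOPS dense FP16 tensor-core throughput
peak_bandwidth = 3.35e12   # ~3.35 TB/s HBM bandwidth

machine_balance = peak_flops / peak_bandwidth  # FLOPs the chip can do per byte it moves
print(f"Machine balance: ~{machine_balance:.0f} FLOPs/byte")

# Single-request token generation is essentially a matrix-vector product:
# ~2 FLOPs per weight, and each FP16 weight is 2 bytes read from HBM.
decode_intensity = 2 / 2
print(f"Decode arithmetic intensity: ~{decode_intensity:.0f} FLOP/byte")
# ~1 FLOP/byte is far below ~295 FLOPs/byte, so the GPU stalls on memory, not math.
```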
| GPU Model | HBM Capacity | Memory Bandwidth | Approx. Max Model Size (FP16 weights only) |
|---|---|---|---|
| NVIDIA A100 | 80 GB | 1.9 TB/s | ~40 Billion Parameters |
| NVIDIA H100 | 80 GB | 3.35 TB/s | ~40 Billion Parameters |
| NVIDIA H200 | 141 GB | 4.8 TB/s | ~70 Billion Parameters |
This table shows why researchers are forced to use tricks like quantization (storing weights at lower precision, such as FP16, FP8, or INT8) or distributed sharding (splitting the model across many chips). These techniques save space but introduce their own problems, such as accuracy loss from reduced precision and the extra complexity and communication overhead of keeping data synchronized across multiple devices.
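As a quick illustration of why both tricks matter, this sketch estimates the per-GPU weight footprint of a hypothetical 70B-parameter model at different precisions and shard counts (weights only; activations and KV cache come on top).

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def per_gpu_weight_gb(params_billion: float, precision: str, num_gpus: int) -> float:
    """Weight memory per GPU when the model is sharded evenly across devices."""
    total_gb = params_billion * BYTES_PER_PARAM[precision]  # billions of params * bytes/param = GB
    return total_gb / num_gpus

for precision in ("fp32", "fp16", "int8"):
    for num_gpus in (1, 8):
        gb = per_gpu_weight_gb(70, precision, num_gpus)
        fits = "fits" if gb <= 80 else "does NOT fit"
        print(f"70B @ {precision}, {num_gpus} GPU(s): {gb:.0f} GB per GPU ({fits} in 80 GB HBM)")
```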
The Power and Thermal Wall
Even if you solve the memory issue, you will likely burn down your building before you finish training. Power consumption is no longer a side effect; it is a primary design constraint. An NVIDIA H100 GPU can draw up to 700 watts under full load. That is nearly twice the power of a high-end desktop PC, packed into a single card.
Now scale that up. A standard AI cluster might contain 1,024 GPUs. At 700 watts each, that is over 700 kilowatts of continuous power draw for the accelerators alone. Add in CPUs, networking equipment, storage servers, and cooling systems, and the total facility power requirement skyrockets. Many existing data centers are capped by their electrical contracts. They physically cannot plug in more servers because the local grid cannot deliver the current.
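A rough facility-level estimate looks like the sketch below. The 700 W figure is the H100 SXM's published maximum; the per-node overhead factor and the PUE (Power Usage Effectiveness) value are illustrative assumptions, not measurements.

```python
gpus = 1024
gpu_watts = 700        # H100 SXM maximum power draw
node_overhead = 1.5    # assumed: CPUs, NICs, fans, storage per node (illustrative)
pue = 1.3              # assumed Power Usage Effectiveness (cooling, power delivery losses)

it_load_kw = gpus * gpu_watts * node_overhead / 1000
facility_kw = it_load_kw * pue
print(f"GPU draw alone:    {gpus * gpu_watts / 1000:.0f} kW")
print(f"IT load estimate:  {it_load_kw:.0f} kW")
print(f"Facility estimate: {facility_kw:.0f} kW (~{facility_kw / 1000:.1f} MW)")
```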
Then there is heat. Dissipating 700 watts per GPU requires serious engineering. Air cooling is reaching its limit. Most new AI facilities are switching to liquid cooling, where cold plates attach directly to the GPU dies. This technology works, but it is expensive. Liquid cooling systems can cost over $50,000 per cabinet. This adds a massive capital expenditure hurdle. You aren't just buying chips; you are building a thermal management plant.
Network Interconnect Bottlenecks
When one GPU isn't enough, you link many together. This is called distributed training. But linking GPUs introduces a new enemy: communication latency. Inside a single server, GPUs talk to each other via NVLink, which offers up to 900 GB/s of bandwidth. That is fast.
But once you go between servers (cross-node), you drop to technologies like InfiniBand, where even a node packed with multiple network cards typically tops out at a few hundred GB/s of aggregate bandwidth, and a single link carries far less depending on the generation. When you are training a massive model, every GPU needs to synchronize its gradients with its peers at every optimization step. If the network is slow, the GPUs spend more time talking than calculating.
This creates a "communication bound" scenario. As clusters grow larger, the ratio of computation to communication worsens. Researchers using frameworks like MoE-Lens have found that optimizing this communication overhead is critical. Their 2025 research showed that careful modeling of these bottlenecks could improve throughput by up to 25.5x. But even with optimization, the physical speed of light and the capacity of copper or fiber cables impose a hard limit on how efficiently you can scale out.
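To get a feel for the cost, here is a back-of-the-envelope estimate of the per-step time for an idealized ring all-reduce over FP16 gradients of a 70B-parameter model. The per-GPU network bandwidths are assumptions, and real systems hide much of this by overlapping communication with computation.

```python
def ring_allreduce_seconds(grad_bytes: float, num_gpus: int, bw_bytes_per_s: float) -> float:
    """Ideal ring all-reduce: each GPU moves ~2*(N-1)/N of the buffer over the network."""
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return traffic_per_gpu / bw_bytes_per_s

grad_bytes = 70e9 * 2            # 70B parameters, FP16 gradients (~140 GB)
for bw_gb in (50, 400):          # assumed per-GPU network bandwidth, GB/s
    t = ring_allreduce_seconds(grad_bytes, num_gpus=1024, bw_bytes_per_s=bw_gb * 1e9)
    print(f"All-reduce at {bw_gb} GB/s per GPU: ~{t:.1f} s per step (before overlap)")
```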
Mixture of Experts: A Partial Fix?
To cope with these constraints, the industry has moved toward Mixture of Experts (MoE) architectures. Instead of activating every neuron in a model for every input, MoE routes specific inputs to specialized "expert" sub-networks. This means you can have a model with trillions of parameters, but only activate a fraction of them for any given query.
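A minimal sketch of the routing step described above, written as an illustrative toy in NumPy rather than any particular production implementation:

```python
import numpy as np

def top_k_route(token_logits: np.ndarray, k: int = 2):
    """Pick the k highest-scoring experts per token and softmax-normalize their weights."""
    top_experts = np.argsort(token_logits, axis=-1)[:, -k:]            # (tokens, k) expert ids
    top_scores = np.take_along_axis(token_logits, top_experts, -1)     # (tokens, k) raw scores
    weights = np.exp(top_scores - top_scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                     # mixing weights per token
    return top_experts, weights

# 6 tokens routed across 8 experts, 2 experts active per token.
logits = np.random.randn(6, 8)
experts, weights = top_k_route(logits, k=2)
print(experts)  # only these experts' parameters are exercised for each token
```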
On paper, this looks like a magic bullet. It reduces the compute load per request. However, MoE introduces new hardware challenges. Expert routing requires dynamic decision-making, which adds latency. More importantly, it causes load balancing issues. Some experts get hammered with traffic while others sit idle. This uneven distribution makes it harder to utilize the full power of your GPU cluster efficiently.
Furthermore, MoE increases communication overhead. Because different experts live on different devices, routing tokens between them requires more network traffic. So while MoE lets you grow parameter counts without growing per-token compute, it shifts the bottleneck from raw FLOPs to memory capacity, memory access patterns, and network congestion. It is a trade-off, not a solution.
Economic Limits and Scaling Laws
All these technical constraints boil down to money. Hardware costs are astronomical. An H100 GPU costs approximately $40,000. Training a frontier model like GPT-4 is estimated to have cost between $100 million and $1 billion in compute resources alone. But the GPU is only part of the bill.
Infrastructure (power, cooling, networking, and facility construction) eats up 30-40% of the total budget. This means that for every dollar you spend on actual compute, you spend roughly another 50 cents keeping it running. As models grow, the marginal cost of capability climbs steeply: under compute-optimal training, doubling the parameter count also means roughly doubling the training data, so compute roughly quadruples, while the performance gains keep shrinking.
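A common rule of thumb puts training compute at roughly 6 FLOPs per parameter per training token. The sketch below turns that into GPU-hours and dollars under an assumed utilization and an assumed per-GPU-hour price, so treat the output as an order-of-magnitude illustration rather than a quote.

```python
def training_cost_usd(params: float, tokens: float,
                      peak_flops: float = 990e12,    # H100 dense FP16 peak throughput
                      utilization: float = 0.4,      # assumed model FLOPs utilization
                      usd_per_gpu_hour: float = 2.5  # assumed amortized/cloud price
                      ) -> tuple[float, float]:
    total_flops = 6 * params * tokens                # ~6 FLOPs per parameter per token
    gpu_seconds = total_flops / (peak_flops * utilization)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

hours, cost = training_cost_usd(params=70e9, tokens=1.4e12)  # 70B model, 1.4T tokens
print(f"~{hours:,.0f} GPU-hours, ~${cost:,.0f} in compute alone")
```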
This economic reality forces a shift in strategy. We can no longer rely solely on brute-force scaling. The next era of AI development will depend on algorithmic efficiency-getting more intelligence out of fewer parameters-and architectural innovations that respect these physical boundaries.
Inference vs. Training: Different Problems
It is also important to distinguish between training and inference. Training is about learning; it is batch-heavy and can tolerate some latency. Inference is about serving users. Here, the constraint is latency. Users expect answers in milliseconds.
A single H100 running a heavily quantized 70B model might serve 100-200 concurrent users at acceptable speeds; at FP16, the same model already needs a multi-GPU node. To serve 1 million users, you need on the order of 5,000 to 10,000 GPUs. This scales linearly in theory, but in practice, managing state across thousands of nodes introduces synchronization delays. Additionally, longer context windows (100k+ tokens) inflate the KV cache linearly with sequence length and drive attention compute up quadratically. This means serving long-context requests is significantly more expensive and slower than short ones, creating a tiered service problem where high-quality, long-context interactions become prohibitively costly.
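The KV-cache arithmetic makes the long-context cost concrete. The dimensions below (80 layers, 64 key/value heads of size 128) are an assumption roughly in line with a dense 70B-class transformer; grouped-query attention and cache quantization shrink these numbers considerably.

```python
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 64,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size for one sequence: keys + values, all layers, FP16 by default."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token / 1e9

for ctx in (4_000, 32_000, 100_000):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.0f} GB of KV cache per sequence")
# A single 100k-token request can exceed an entire H100's HBM without these optimizations.
```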
Why can't we just add more RAM to GPUs?
You can't simply swap in standard DDR5 RAM like a desktop PC. GPUs use High-Bandwidth Memory (HBM), which is stacked vertically on the chip package to reduce distance and increase speed. Increasing HBM capacity requires redesigning the entire chip die and packaging process, which is extremely difficult and expensive. There are physical limits to how many layers you can stack before heat and signal integrity become unmanageable.
What is the biggest bottleneck for LLM inference?
For inference, memory bandwidth is usually the primary bottleneck. The model weights must be loaded from VRAM into the compute units for every token generated. If the memory bus is saturated, the GPU sits idle waiting for data. This is why techniques like KV-cache optimization and quantization are so popular: they reduce the amount of data that needs to be moved.
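This also yields a simple upper bound on single-stream generation speed: every new token requires streaming the full set of weights out of HBM at least once, so tokens per second cannot exceed bandwidth divided by model size. The figures below use the H200's published 4.8 TB/s bandwidth as an illustration.

```python
def max_tokens_per_second(model_gb: float, bandwidth_tb_s: float) -> float:
    """Bandwidth-limited ceiling for batch-1 decoding: one full weight read per token."""
    return (bandwidth_tb_s * 1e12) / (model_gb * 1e9)

for model_gb, label in ((140, "70B @ FP16"), (35, "70B @ INT4")):
    print(f"{label}: <= {max_tokens_per_second(model_gb, 4.8):.0f} tokens/s per stream on an H200")
```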
How does MoE affect hardware requirements?
Mixture of Experts (MoE) reduces the active compute per request but increases memory requirements and network communication. You need to store all expert weights in memory, even if only a few are used. Additionally, routing inputs to specific experts requires fast inter-GPU communication, making network bandwidth a critical resource.
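For a sense of scale, the sketch below contrasts resident weight memory with the weights actually touched per token for a hypothetical MoE with 8 experts and top-2 routing; all sizes are made-up illustrative values, not those of any released model.

```python
# Hypothetical MoE budget (illustrative numbers, FP16 weights).
shared_params = 10e9     # attention + embeddings, always active
expert_params = 6e9      # parameters per expert (summed over all layers)
num_experts = 8
active_experts = 2       # top-2 routing

resident_gb = (shared_params + num_experts * expert_params) * 2 / 1e9
active_gb = (shared_params + active_experts * expert_params) * 2 / 1e9
print(f"Must be stored in HBM: ~{resident_gb:.0f} GB")  # all experts, whether used or not
print(f"Touched per token:     ~{active_gb:.0f} GB")    # compute scales with this, memory does not
```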
Can liquid cooling solve the power problem?
Liquid cooling solves the *thermal* problem, allowing higher power densities without overheating. However, it does not solve the *energy* problem. The electricity still needs to come from somewhere. While it enables denser packing of GPUs, it does not reduce the total wattage consumed by the silicon. In fact, it may enable higher TDP designs, increasing total power draw.
What is the future of LLM hardware scaling?
The future likely involves a mix of specialized AI accelerators (like TPUs or custom ASICs) that are more efficient than general-purpose GPUs, better memory architectures (such as HBM3E or CXL), and algorithmic improvements that reduce the need for raw parameter counts. We will see a shift from brute-force scaling to efficiency-driven scaling.