
Goldman Sachs released new estimates this week projecting $200+ billion in AI infrastructure spending for 2025—data centers, chips, networking, power systems, everything needed to train and run AI models. But here's what the headline misses: roughly 70% of that spending flows through just a few companies. NVIDIA gets the chips. Microsoft, Amazon, and Google get the cloud. Everyone else fights over scraps.

The Numbers Behind the Boom

Hyperscalers (Microsoft, Amazon, Google) are collectively spending $130+ billion on AI infrastructure in 2025. That's data centers, custom chips, networking equipment, and power infrastructure. Individual numbers: Microsoft ~$50B, Amazon ~$45B, Google ~$35B.

NVIDIA will do $60-70 billion in data center revenue from AI chips alone. That's roughly 85% of the AI accelerator market. AMD has maybe $5-7 billion. Intel is trying but losing ground. Everyone else combined is a rounding error.

The remaining $50-60 billion is fragmented across cooling systems, networking gear, power infrastructure, data center construction, and various specialized components. Dozens of companies are competing for pieces of that pie.
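
A quick back-of-envelope tally of the figures quoted above, using only this piece's own estimates. Note that much of NVIDIA's $60-70 billion in chip revenue shows up inside the hyperscalers' capex rather than on top of it, so it isn't added separately here:

```python
# Back-of-envelope tally of the 2025 estimates quoted above (all figures in $B).
hyperscaler_capex = {"Microsoft": 50, "Amazon": 45, "Google": 35}
fragmented_rest = (50, 60)  # cooling, networking, power, construction, components

hyperscalers = sum(hyperscaler_capex.values())    # ~130
total_low = hyperscalers + fragmented_rest[0]     # ~180
total_high = hyperscalers + fragmented_rest[1]    # ~190

print(f"Hyperscaler capex: ~${hyperscalers}B")
print(f"Implied buildout: ~${total_low}-{total_high}B of the ~$200B estimate")
print(f"Hyperscaler share of $200B: ~{hyperscalers / 200:.0%}")  # ~65%
```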

The Vertical Integration Race

What's driving the spending isn't just AI demand—it's vertical integration. Hyperscalers are building entire stacks from chips to software to avoid dependency on any single vendor (mainly NVIDIA).

Google has TPUs. Amazon has Trainium and Inferentia. Microsoft is developing its own accelerators. Meta has MTIA. All of them are trying to reduce NVIDIA dependence even though they're still buying massive quantities of NVIDIA chips.

This creates a weird dynamic: NVIDIA's biggest customers today are positioning themselves to be its biggest competitors tomorrow. Amazon is negotiating with OpenAI to use Trainium chips specifically to reduce NVIDIA's leverage. That should terrify NVIDIA shareholders even as revenue hits record highs.

The Power Problem Nobody's Solving

Data centers need power. Lots of power. AI workloads consume 3-5x the power density of traditional cloud computing. Training frontier models requires tens of megawatts continuously for months. Inference at scale needs similar capacity.
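
To make "tens of megawatts" concrete, here's a rough estimate. The cluster size, per-accelerator draw, and overhead factor are illustrative assumptions, not figures from any disclosed deployment:

```python
# Rough power draw for a hypothetical frontier-scale training cluster.
# Every input below is an illustrative assumption, not a disclosed figure.
gpus = 25_000            # hypothetical accelerator count for a frontier training run
watts_per_gpu = 1_000    # ~1 kW per accelerator including board power (assumed)
overhead = 1.4           # PUE-style multiplier for cooling, networking, losses (assumed)

megawatts = gpus * watts_per_gpu * overhead / 1e6
print(f"Continuous draw: ~{megawatts:.0f} MW, sustained for the length of the run")
# ~35 MW -- the "tens of megawatts, continuously, for months" scale described above
```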

U.S. power grids weren't designed for this. Data centers are locating near available power, not near population centers. That's why xAI built in Memphis (TVA has capacity), why Northern Virginia is packed with data centers (cheap power), and why companies are investing in small modular nuclear reactors (SMRs) and gas turbines.

The infrastructure bottleneck is shifting from compute to power. You can manufacture more chips faster than you can build power plants. Google, Amazon, and Microsoft are all signing 10- to 15-year power contracts and investing directly in generation capacity.

The Chip Supply Constraint

NVIDIA can't manufacture chips fast enough. TSMC's advanced nodes are booked out 18+ months. Advanced packaging capacity and HBM (high-bandwidth memory) supply are both limited. These constraints mean that even if you can afford NVIDIA's premium prices, you might not get chips when you need them.

That's driving the vertical integration push. If you can't reliably get NVIDIA chips at the scale you need, building your own starts making sense despite the massive upfront R&D investment.

Custom chips also provide economic benefits. Google claims TPUs deliver better price-performance for their specific workloads. Amazon says similar things about Trainium. Even if those claims are somewhat exaggerated, the strategic value of supply chain independence is worth the investment.

The Software Lock-In Problem

NVIDIA's real moat isn't hardware—it's CUDA. The entire AI software ecosystem is built on CUDA. Libraries, frameworks, tools, developer knowledge—all NVIDIA-centric. Switching to alternative accelerators means rewriting substantial portions of your ML stack.
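
To show what that lock-in looks like at the code level, here's a small PyTorch sketch. The real cost sits lower in the stack (custom CUDA kernels, cuDNN, NCCL), but even at this level every hard-coded device reference has to be found and abstracted:

```python
import torch

# Hard-wired to NVIDIA's stack -- the kind of line scattered through most ML code:
#   model = torch.nn.Linear(1024, 1024).to("cuda")
# Porting to another accelerator means finding and abstracting every such call,
# in your own code and in every library underneath it.

def pick_device() -> torch.device:
    """Pick whatever accelerator backend is present, falling back to CPU."""
    if torch.cuda.is_available():            # NVIDIA (ROCm builds also expose this API)
        return torch.device("cuda")
    if torch.backends.mps.is_available():    # Apple silicon, one non-CUDA example
        return torch.device("mps")
    return torch.device("cpu")

model = torch.nn.Linear(1024, 1024).to(pick_device())
print(f"Running on: {next(model.parameters()).device}")
```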

That's why AMD struggles despite having competitive hardware. Their ROCm software stack isn't as mature as CUDA. Tools don't work as well. Documentation is weaker. Developers don't have the same expertise. The switching costs are massive even if the chips are comparable.

Hyperscalers can absorb those costs because they have ML engineering teams. Startups and smaller companies can't. So NVIDIA maintains market dominance even as alternatives emerge.

The OpenAI-Amazon Deal Context

Amazon's reported $10 billion investment in OpenAI, with requirements to use Trainium chips, is Amazon trying to break NVIDIA's CUDA lock-in at the highest-profile target possible. If OpenAI—the most prominent AI lab—runs on Trainium, that validates both the chips and the software stack.

Other companies would follow: "If OpenAI can do it, we can do it." That's worth $10 billion to Amazon if it works. If it doesn't, it's an expensive bet Amazon will regret as NVIDIA's dominance persists.

The Marginal Cost Collision

Here's the uncomfortable truth: AI inference costs are dropping fast. Models are getting more efficient. Custom chips are getting better. Competition is increasing. The marginal cost of running AI workloads is approaching the marginal cost of traditional cloud computing.
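
The compression is easy to see in a toy serving-cost model. The hourly rate and throughput figures below are illustrative assumptions, not measurements:

```python
# Toy inference cost model: $/1M tokens = hourly accelerator cost / tokens served per hour.
# Both inputs are illustrative assumptions, not measured figures.
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000

# Same $2/hr accelerator, before and after efficiency gains (better kernels,
# quantization, batching): a 5x throughput gain cuts marginal cost 5x.
print(cost_per_million_tokens(gpu_hour_usd=2.0, tokens_per_second=500))    # ~$1.11 per 1M tokens
print(cost_per_million_tokens(gpu_hour_usd=2.0, tokens_per_second=2500))   # ~$0.22 per 1M tokens
```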

That's terrible for everyone who invested billions in AI-specific infrastructure. If inference becomes commoditized, the $200 billion infrastructure buildout doesn't generate the returns everyone expects.

Training frontier models will remain expensive, but training is a one-time cost; inference runs continuously and generates most of the revenue. If inference margins compress, the business case for massive infrastructure spending weakens.

The Geopolitical Dimension

China is building AI infrastructure at comparable scale despite U.S. export restrictions on advanced chips. They're using older-generation chips, algorithmic efficiency improvements, and domestic alternatives. It's working well enough to be competitive.

That means U.S. companies are in an arms race not just with each other but with Chinese tech giants. Hence the $200 billion in spending—it's not just about building capacity; it's about maintaining a technological lead.

The problem is that the lead is shrinking. China's infrastructure investment is comparable, their algorithmic innovations are catching up, and they have advantages in power availability and regulatory flexibility that the U.S. doesn't.

The Bubble Question

Is $200 billion in AI infrastructure spending rational, or is it a bubble? The bull case: AI will transform everything, demand will grow exponentially, this infrastructure will generate returns for decades.

The bear case: current AI capabilities are overhyped, demand won't justify the investment, much of this infrastructure will be stranded assets when the bubble pops.

The truth is probably somewhere in the middle. Some of this spending is rational investment in genuinely useful technology. Some is competitive pressure to not fall behind. Some is executives fearing they'll look stupid if they don't invest and AI does take off.

The Three-Company Problem

The concentration is the real story. NVIDIA, Microsoft, Amazon. Those three companies control the infrastructure layer for nearly all AI development globally. That's dangerous from a competition perspective, a national security perspective, and an innovation perspective.

If NVIDIA decides to prioritize certain customers or raises prices, entire segments of the AI industry are affected. If Amazon or Microsoft de-prioritize certain workloads, companies dependent on their clouds have limited alternatives. That's a lot of power concentrated in very few hands.

Regulators are noticing. Antitrust scrutiny of AI infrastructure deals is increasing. The question is whether regulatory intervention happens fast enough to prevent lock-in, or if by the time rules get written, the market structure is already entrenched.

My Take

$200 billion in infrastructure spending might be justified if AI delivers on its potential. But the concentration of that spending in three companies and the marginal cost compression happening in inference suggest we're building excess capacity that won't generate the expected returns.

NVIDIA's dominance feels less secure than their stock price suggests. They have one or two more years of supply constraints protecting margins; after that, alternatives become viable and customers start diversifying. That doesn't kill NVIDIA, but it hurts growth.

The hyperscalers building their own chips are making the right strategic move even if it's expensive. Dependency on NVIDIA is untenable long-term. Better to invest billions now in developing alternatives than pay NVIDIA's premiums forever.

What worries me is the power situation. We're building massive AI compute capacity without equivalent investment in power generation. That mismatch will bite us—either through brownouts, infrastructure failures, or regulatory constraints that limit data center expansion.

The next 2-3 years determine whether this $200 billion buildout was visionary or wasteful. If AI adoption matches expectations, it'll look smart. If AI capabilities plateau or demand doesn't materialize, it'll be one of the biggest capital misallocations in tech history.

Place your bets.