Compute

Compute is the capacity to do the math that trains and runs AI models. At the frontier, it is an industrial system made up of specialized chips, fast memory, servers, networking, storage, power delivery, cooling, buildings, software, and skilled people. It converts data and electricity into trained models and deployed services.

The raw ingredients of modern AI are data, algorithms, hardware, electricity, and human expertise. Data provides examples to learn from. Algorithms define the model and how it learns. Hardware executes the calculations. Electricity powers the machines. People design the systems, run the training, and operate the infrastructure.

Hardware is the largest capital component. Frontier systems use high-end data center accelerators, typically GPUs or custom AI chips. Public price references for leading accelerators are often about $30,000 to $50,000 per unit, though actual deal prices vary. A fully configured multi-accelerator server, commonly built around eight accelerators, typically costs roughly $400,000 to $600,000. Large clusters can reach tens of thousands to over 100,000 accelerators; at rough public pricing, 100,000 accelerators implies about $3 billion to $5 billion in chips alone. Networking gear, switches, optical links, storage, racks, and integration can add billions more.
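
As a sanity check on that arithmetic, here is a back-of-the-envelope sketch using the rough public unit prices above; the function and its default prices are illustrative, not quoted deal terms:

```python
# Back-of-the-envelope cluster chip cost, using the rough public
# price range cited above (not actual negotiated prices).
def chip_cost_range(num_accelerators, unit_low=30_000, unit_high=50_000):
    """Return (low, high) estimated spend on accelerators alone, in dollars."""
    return num_accelerators * unit_low, num_accelerators * unit_high

low, high = chip_cost_range(100_000)
print(f"Chips alone: ${low / 1e9:.1f}B to ${high / 1e9:.1f}B")
# Chips alone: $3.0B to $5.0B
```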

Power and cooling impose hard limits. A single leading-edge accelerator can draw roughly 700 to 1,000 watts at peak. A 100,000-accelerator cluster therefore implies roughly 70 to 100+ megawatts for accelerators alone, with total facility demand materially higher after adding CPUs, networking, storage, and cooling. These sites often require dedicated substations and advanced liquid cooling.
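
The same kind of sketch works for power. The 1.4x facility overhead factor below is an assumption standing in for CPUs, networking, storage, and cooling; real overhead depends on site design:

```python
# Rough power estimate: accelerators alone, then total facility demand
# after an assumed overhead multiplier for everything else in the building.
def cluster_power_mw(num_accelerators, watts_per_chip, overhead_factor=1.4):
    accel_mw = num_accelerators * watts_per_chip / 1e6
    return accel_mw, accel_mw * overhead_factor

for watts in (700, 1000):
    accel, facility = cluster_power_mw(100_000, watts)
    print(f"{watts} W/chip: {accel:.0f} MW accelerators, ~{facility:.0f} MW facility")
# 700 W/chip: 70 MW accelerators, ~98 MW facility
# 1000 W/chip: 100 MW accelerators, ~140 MW facility
```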

For frontier training runs, amortized hardware cost—especially the accelerators—is typically the primary cost driver. With depreciation over roughly three to five years, hardware amortization commonly accounts for about 50% to 70% of total training cost. Electricity is usually the next largest category at about 10% to 30%, depending on power prices and runtime. Facilities, networking, storage, and labor make up the remaining share. For a rough $500 million training program, it is plausible that $250 million to $350 million reflects amortized hardware and $50 million to $150 million reflects electricity, with the remainder spread across other costs.
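
The amortization mechanics behind these shares can be sketched directly. Every input below is an assumed value chosen for illustration; the electricity share in particular is very sensitive to power price and to total runtime across a program's experiments and retries, which is why the cited ranges are wide:

```python
# Illustrative amortization math for a single training run. All inputs
# are assumptions for illustration, not measured data.
hardware_capex = 4e9      # assumed cluster cost, dollars
depreciation_years = 4    # within the three-to-five-year range above
run_days = 120            # hypothetical run length
power_mw = 100            # assumed average facility draw during the run
price_per_mwh = 80        # assumed electricity price, $/MWh

# Charge the run the fraction of the hardware's depreciable life it consumes.
amortized_hw = hardware_capex * run_days / (depreciation_years * 365)
electricity = power_mw * 24 * run_days * price_per_mwh

print(f"Amortized hardware: ${amortized_hw / 1e6:.0f}M")
print(f"Electricity:        ${electricity / 1e6:.1f}M")
```

With these particular assumptions, a 120-day run consumes about 8% of a four-year depreciation schedule, or roughly $330 million of a $4 billion cluster, before any electricity or other costs are counted.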

Human talent converts capital into results. Research scientists design model architectures and training methods. Applied researchers adapt models to specific domains and products. Data engineers collect, clean, and manage large training datasets. Distributed systems and training engineers make thousands of chips work together reliably. Performance engineers optimize code and memory use to increase utilization. Infrastructure engineers build and operate clusters and capacity planning systems. Reliability engineers monitor failures and maintain uptime. Inference engineers deploy models into production systems that meet latency and cost targets. Senior staff in major labs often earn from the mid six figures into the millions annually, and total technical payroll for a large frontier program can reach tens to hundreds of millions per year.

Software makes the hardware usable. Low-level drivers and math libraries allow accelerators to run efficiently. Machine learning frameworks define models and compute gradients automatically. Distributed training systems coordinate thousands of devices and handle synchronization and failures. Scheduling software allocates workloads. Inference systems batch requests, route traffic, and apply safeguards. Inefficient software directly increases effective cost because depreciation and energy continue regardless of utilization.
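
That last point can be stated numerically. A minimal sketch, with an assumed all-in hourly cost per accelerator:

```python
# Amortization and power accrue whether or not the chips do useful work,
# so effective cost per useful hour scales inversely with utilization.
def effective_cost_per_useful_hour(hourly_cost, utilization):
    return hourly_cost / utilization

hourly_cost = 2.0  # assumed all-in cost per accelerator-hour, dollars
for util in (0.30, 0.45, 0.60):
    cost = effective_cost_per_useful_hour(hourly_cost, util)
    print(f"{util:.0%} utilization -> ${cost:.2f} per useful hour")
```

Doubling utilization halves the effective price of every useful computation, which is why performance engineering pays for itself at this scale.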

Training begins with collecting and preparing data. Researchers define the model and initialize parameters. In each step, the model processes a batch of data, produces predictions, measures error, computes gradients, updates parameters, and synchronizes updates across devices. This loop repeats for weeks or months, with checkpoints saved regularly and metrics monitored for stability. After large-scale pretraining, models are often further refined with curated data or feedback-based techniques.
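
The loop itself is compact. Here is a minimal single-device sketch in PyTorch, with a hypothetical toy model and random stand-in data; frontier runs wrap the same skeleton in data pipelines, mixed precision, and synchronization across thousands of devices:

```python
import torch
from torch import nn

# Toy stand-in for a real model architecture.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(1000):
    # One step: batch in, predictions out, error measured,
    # gradients computed, parameters updated.
    inputs = torch.randn(32, 64)            # stand-in for a real data batch
    targets = torch.randint(0, 10, (32,))
    logits = model(inputs)
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 200 == 0:
        # Regular checkpoints and metric monitoring, as described above.
        torch.save({"step": step, "model": model.state_dict()}, "ckpt.pt")
        print(f"step {step}: loss {loss.item():.3f}")
```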

Inference uses the trained model. An input is tokenized and passed through the model in a forward computation, and output tokens are generated one at a time. In production, this runs continuously and must meet latency and reliability targets, which makes utilization and power efficiency important.
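
A minimal sketch of that generation loop follows. The model here is a placeholder returning random logits; a real service would run the trained network's forward pass, batch concurrent requests, and cache attention state:

```python
import torch

VOCAB_SIZE = 50_000

def model_forward(token_ids):
    # Placeholder standing in for a trained network's forward pass:
    # returns one logit per vocabulary entry for the next token.
    return torch.randn(VOCAB_SIZE)

tokens = [101, 2023, 2003]  # hypothetical token IDs for a tokenized input
for _ in range(20):
    logits = model_forward(torch.tensor(tokens))
    next_id = int(torch.argmax(logits))  # greedy decoding, for simplicity
    tokens.append(next_id)

print(tokens)
```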

Bottlenecks include accelerator supply, memory bandwidth, network scaling, hardware failures at large cluster sizes, power availability, and data quality. Trends point toward larger and denser clusters with multi-billion-dollar capital expenditure per site. While cost per unit of capability declines due to hardware and algorithmic improvements, total spending rises because models are larger and usage expands. Across frontier training economics, amortized chip cost remains the central driver, with electricity the next major component.
