The real limit on data is not how much text exists but how much new information each training example provides. Most online text repeats patterns the model has already absorbed, so the effective training set is far smaller than the raw token count. As models improve, they predict common patterns so well that those tokens barely reduce the model's remaining uncertainty. What matters now is access to high-signal data: expert workflows, step-by-step reasoning traces, interactions where people explain how they think, correct errors, or plan over multiple steps. Synthetic data helps only when it exposes reasoning paths or tasks the model has not yet mastered, rather than regenerating what it already predicts. The durable constraint is that we are running out of novel information per token, not out of text itself, and each additional bit of genuine information becomes harder to find as models grow more capable.
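The idea that a well-predicted token carries almost no new information can be made concrete with surprisal, the standard information-theoretic measure. A minimal sketch, with hypothetical probabilities standing in for a real model's predictions:

```python
import math

def surprisal_bits(p):
    """Bits of information in observing an event the model assigns probability p."""
    return -math.log2(p)

# Hypothetical numbers: an early model is unsure about a boilerplate phrase;
# a capable model predicts it with near certainty, so the same token teaches
# the stronger model almost nothing.
early_p, late_p = 0.25, 0.99
print(surprisal_bits(early_p))  # 2.0 bits: still informative
print(surprisal_bits(late_p))   # ~0.014 bits: nearly no signal left
```

The per-token signal shrinks toward zero exactly as the model's predictions sharpen, which is why raw token counts overstate the effective training set.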
The real limit on compute is not how many operations a chip can perform but how efficiently thousands of chips can work together without idling while they wait on each other. When a model is split across many devices, those devices must constantly exchange parameters, gradients, and intermediate activations. The cost of this "all-reduce" communication grows with cluster size, and eventually moving data between devices takes more time than the math itself. Better parallelism schemes, selective activation of parts of the model (as in mixture-of-experts architectures), and improved memory layouts can reduce the waste, but none remove the underlying physics: inter-chip bandwidth improves slowly, while raw arithmetic throughput improves fast. The durable constraint is that beyond a certain cluster size, each additional chip adds very little real training progress, because the system is bottlenecked on moving information, not on a lack of FLOPs.
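The crossover from compute-bound to communication-bound scaling can be sketched with a toy per-step cost model. All numbers below are hypothetical, and the communication term uses the standard ring all-reduce cost shape (a bandwidth term that approaches a constant plus a latency term that grows with ring size), not measurements from any real cluster:

```python
def step_time(n_chips, total_flop=1e18, flops_per_chip=1e15,
              grad_bytes=7e11, link_bandwidth=5e10, latency_per_hop=5e-6):
    """Toy per-step wall time: compute shrinks as work is split; all-reduce does not."""
    compute = total_flop / (n_chips * flops_per_chip)
    # Ring all-reduce: each chip sends ~2*(n-1)/n of the gradient bytes over
    # its link, plus one latency hop per ring step.
    comm = (2 * (n_chips - 1) / n_chips * grad_bytes / link_bandwidth
            + (n_chips - 1) * latency_per_hop)
    return compute + comm

for n in (8, 64, 512, 4096):
    print(n, round(step_time(n), 2))
```

Under these assumptions the compute term falls from ~125 s on 8 chips to a fraction of a second on 4096, while the communication term stays near ~28 s, so doubling the cluster at the high end buys almost nothing: the shape of the diminishing returns described above.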
Frontier training has become a physical-infrastructure problem as much as a research problem. A large training run requires a massive, stable electricity supply, heavy cooling systems to remove waste heat, and high-bandwidth fiber networks, all of which take years to build. Substations, transformers, and transmission lines have 2–6 year lead times; datacenter construction and permitting can take just as long; water or advanced air-cooling systems require local environmental approvals; and the site must sit close enough to fiber backbones to support low-latency communication. These constraints move on civil-engineering timelines, not semiconductor timelines. Even if algorithms become dramatically more efficient, total demand for power, cooling, and specialized sites scales with the economic value of larger models. The durable constraint is that you cannot scale training faster than you can expand the physical grid and cooling infrastructure that support it, and those limits are set by construction speed and regional utilities, not by research breakthroughs.
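Because these build-out tracks run in parallel, site readiness is gated by the slowest one, not the sum: a critical-path calculation. A back-of-envelope sketch with hypothetical lead times (the 4-year figure is drawn from the 2–6 year range above; the rest are illustrative):

```python
# Hypothetical lead times in years for parallel infrastructure tracks.
lead_times_years = {
    "substation_and_transmission": 4.0,  # within the 2-6 year range cited
    "datacenter_construction": 3.0,
    "cooling_permits": 2.0,
    "fiber_buildout": 1.5,
}

# The binding constraint is the maximum, not the total: shortening any
# non-bottleneck track changes nothing until the grid work catches up.
bottleneck = max(lead_times_years, key=lead_times_years.get)
print(bottleneck, lead_times_years[bottleneck])
```

This is why algorithmic efficiency gains do not shorten the timeline: they change how much compute a site delivers, not how fast the slowest civil-engineering track completes.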