The Golem in the Server Rack

Author's note: I build software, not chips, but you don't need to be a silicon fabricator to see when an industry is pouring concrete for the wrong building. The trillion-dollar bet currently being placed on hyperscale GPU infrastructure is the wrong bet, and what makes the claim defensible isn't a hot take — it's the alternative the same companies are already, quietly, building for themselves. More on that below.

The dominant cultural frame for AI is the Terminator/Frankenstein story: we built something that's going to wake up, develop a will, and turn on us. It's a compelling myth. It's also the wrong reference for understanding what's actually sitting in the server racks.

A better folk reference is the Golem of Prague. In the legend, the Golem is a creature shaped from clay and animated by a script inscribed on its forehead. It has no will, no malice, no inner life. It is pure, relentless execution. If you tell a Golem to clean the house, it tears the load-bearing walls down to get at the dust behind them. The danger isn't that the Golem rebels. The danger is that it is perfectly obedient to imperfect instructions.

That framing matters, because it shifts the conversation from "will the model become conscious" to "what is this thing actually doing, and what does it cost to do it." Both questions are worth asking. Only the second one is answerable today — and the answer ought to be making investors very nervous.

The Capital Expenditure Bet

Right now, somewhere north of a trillion dollars in capital expenditure is being committed across Microsoft, Meta, Google, Amazon, Oracle, OpenAI, and the rest over the 2024-2028 horizon. Microsoft has guided to roughly $80 billion in FY25 AI infrastructure spending, Meta finished 2025 at $72 billion and has since raised its 2026 guidance to $125-145 billion, and the OpenAI/Stargate consortium signalled $500 billion across four years when announced in January 2025. (The Stargate number is worth flagging as already-moving: the flagship Abilene expansion was cancelled in early 2026, several senior infrastructure leaders left for Meta, and OpenAI has since announced an eight-year, $100 billion AWS deal alongside contracts with Google Cloud and CoreWeave. The CapEx isn't as locked-in as the headline numbers imply, which is itself part of the story.) The bet rests on a specific assumption: that the path forward in AI is to buy as many high-end NVIDIA chips as possible, rack them into hyperscale datacentres, draw gigawatts off the grid, and rent access to centralised models.

The bet is going to age badly. Not because AI is overhyped — it isn't — and not because NVIDIA disappears or these datacentres go dark; they'll produce real compute and real revenue for years. The claim is narrower than that. The return-on-CapEx assumptions currently being modelled won't hold, because by the time the most ambitious of these datacentres finish commissioning, three things will likely already be true:

The chips inside them will have been outclassed on power and cost by purpose-built silicon — including, critically, silicon the hyperscalers themselves are already building.

The grid those datacentres depend on still won't have the headroom to run them at planned utilisation.

The workloads they were designed to serve will have been migrating, for years already, onto custom ASICs — Microsoft's Maia, AWS's Trainium, Google's TPU — with some of the next generation of that silicon running in low Earth orbit.

All three are happening already. Hold that for a moment; I'll come back to the details.

The Memory Wall

When the transformer architecture took off in 2017, the existing tool that happened to be good at large parallel matrix multiplication was the GPU. Researchers used what was on the shelf, NVIDIA was rewarded handsomely for it, and the path-dependency calcified from there.

In fairness to NVIDIA: modern datacentre chips like the H100 and B200 are not the gaming cards they evolved from. Most of the rendering pipeline has been stripped out and the silicon area is now dominated by tensor cores and HBM (high-bandwidth memory) tuned for AI workloads. The "graphics cards repurposed for AI" critique applied more to 2018 than to 2026.

But the architecture still inherits a real constraint. Large language model inference is autoregressive: the model produces one token, reads the whole context back, predicts the next token, repeats. For low-batch, latency-sensitive inference — the kind that dominates interactive chat workloads — that loop is memory-bandwidth bound, not compute bound. The expensive floating-point engines spend a meaningful percentage of every cycle waiting on the memory bus. The industry calls this the memory wall. Higher-batch serving recovers some of this efficiency, but at the cost of latency. The aggregate result is that even a fully-loaded GPU produces tokens at a rate that, divided by the wattage, leaves significant headroom for a purpose-built circuit that doesn't have to be programmable.
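The memory-wall argument above can be put into rough numbers. What follows is a back-of-envelope sketch, not a benchmark: it assumes every decode step has to stream all the model weights from memory once, ignores KV-cache traffic, compute limits, and whether the model even fits on one device, and uses illustrative values (a 70B-parameter model, 16-bit weights, 3.35 TB/s of memory bandwidth) that are assumptions rather than figures from this article. The function name is hypothetical.

```python
def decode_tokens_per_second(param_count, bytes_per_param,
                             mem_bandwidth_bytes_s, batch_size=1):
    """Optimistic ceiling on decode throughput when weight reads dominate.

    Each decode step streams every weight from memory once, and that single
    pass is shared by all sequences in the batch. KV-cache reads and compute
    limits are ignored, so real systems land below this number.
    """
    weight_bytes = param_count * bytes_per_param
    steps_per_second = mem_bandwidth_bytes_s / weight_bytes
    return steps_per_second * batch_size


# Illustrative assumptions, not measurements:
PARAMS = 70e9          # 70B-parameter model
BYTES_PER_PARAM = 2    # 16-bit weights
BANDWIDTH = 3.35e12    # 3.35 TB/s of HBM bandwidth (assumed)

single_user = decode_tokens_per_second(PARAMS, BYTES_PER_PARAM, BANDWIDTH)
batched = decode_tokens_per_second(PARAMS, BYTES_PER_PARAM, BANDWIDTH,
                                   batch_size=32)

print(f"batch 1:  ~{single_user:.0f} tokens/s for the single user")
print(f"batch 32: ~{batched:.0f} tokens/s aggregate, same per-user rate")
```

Under these assumptions a single interactive stream tops out at a couple of dozen tokens per second no matter how many floating-point units sit idle beside the memory bus. Batching raises aggregate throughput but not the per-user rate, which is the latency trade described above.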

This is the constraint several companies are now trying to bypass. Cerebras and Groq are the best-known names doing it, with wafer-scale and SRAM-deterministic designs respectively, and both are further along commercially than the three examples below. But they're both making variations of the same bet at the same layer: specialise the architecture, keep it broadly programmable. The three examples that follow are more interesting precisely because they bet on three different layers of the stack.

Bet One: Specialise the Architecture (Etched)

Etched is building Sohu, an ASIC that hard-codes the transformer architecture directly into silicon. No instruction set, no programmability: the attention mechanism is the chip. The company claims an 8-chip server can produce roughly 500,000 tokens per second on Llama 70B¹, which would put it well ahead of an equivalent GPU cluster on transformer workloads. The chip is built on TSMC 4nm and paired with 144GB of HBM3E, the same memory class as the B200.

The bet here is that transformers (or close cousins of them) will dominate AI for long enough to justify giving up all flexibility. If the bet pays off, the throughput-per-watt advantage is substantial. If a new architecture displaces transformers in the next three years, the chip becomes a paperweight. Sohu hasn't shipped to customers yet and no independent benchmarks exist — the performance numbers are still Etched's own.

Bet Two: Specialise the Model (Taalas)

Taalas² takes specialisation a step further. Instead of building a chip optimised for transformers in general, they bake an entire specific model into silicon. Their HC1 product hardcodes Meta's Llama 3.1 8B into an 815mm² die on TSMC's 6nm process. The weights aren't loaded from memory at runtime; they're physically present in the metal layers of the chip. No HBM, no CUDA, no liquid cooling. The model is the chip.

The reported numbers are striking: roughly 17,000 tokens per second per user at around 200 watts. Taalas claims a 20x reduction in build cost and 10x reduction in energy versus a comparable GPU setup, with a two-month turnaround to retool the chip for a new model by changing only the top metal masks.
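Those two headline figures convert directly into energy per token. The sketch below is only arithmetic on the vendor's own claims as quoted above. It treats the per-user rate as if it were the whole chip's output, which is an assumption, and the "implied baseline" is simply the claimed 10x energy reduction run backwards, not a measured GPU figure. Function names are hypothetical.

```python
def joules_per_token(power_watts, tokens_per_second):
    """Energy per generated token; one watt is one joule per second."""
    return power_watts / tokens_per_second


def implied_baseline_joules(j_per_token, claimed_reduction):
    """Energy per token a baseline would need for the claimed reduction to hold."""
    return j_per_token * claimed_reduction


# Vendor-reported figures quoted in the text (not independently verified).
HC1_WATTS = 200.0
HC1_TOKENS_PER_S = 17_000.0
CLAIMED_ENERGY_REDUCTION = 10.0

hc1_j = joules_per_token(HC1_WATTS, HC1_TOKENS_PER_S)
baseline_j = implied_baseline_joules(hc1_j, CLAIMED_ENERGY_REDUCTION)

print(f"HC1 (claimed):    {hc1_j * 1000:.1f} mJ per token")
print(f"Implied baseline: {baseline_j * 1000:.1f} mJ per token")
```

Energy per token is the useful unit here because it carries straight into an operating-cost estimate: multiply by tokens served and by the price of a joule.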

The obvious tradeoff is brittleness. An HC1 is locked to Llama 3.1 8B forever. When a better 8B model drops — and one will — the silicon has to be respun. Taalas argues their respin cycle makes this viable as a seasonal hardware refresh. Whether the economics actually work at scale is unproven. But the physics — eliminate the memory transfer, eliminate most of the energy cost — is real, and the demonstration matters even if Taalas isn't the company that takes it mainstream.

Bet Three: Specialise the Geometry (Huawei)

The third bet doesn't try to specialise the workload at all. It tries to change how chips are physically built.

At ISCAS 2026³ in Shanghai earlier this week, Huawei's He Tingbo introduced the Tau (τ) Scaling Law, paired with a fabrication approach called LogicFolding. The pitch: instead of optimising for transistor size (the Moore's Law treadmill that requires EUV lithography China can't legally buy), optimise for signal propagation time. LogicFolding stacks logic circuits across two wafer layers connected by hybrid bonding at a 1.5-micrometre pitch. Signals that previously travelled hundreds of micrometres across a planar die now travel tens of micrometres vertically through the stack.

The reported result: a 55% increase in transistor density without moving to a more advanced node (the quoted figures, 155 to 238 MTr/mm², work out to roughly 54%). Kirin chips launching in autumn 2026 will be the first commercial silicon to use it. The 2031 target is 1.4nm-equivalent density.

Caveats apply. The numbers are not yet independently verified, the announcement is partly geopolitical (a workaround for export controls), and "equivalent density" via stacking is genuinely useful but isn't the same as a true sub-2nm planar process. But the direction matters. It says the gains that are still on the table aren't only in workload specialisation, but in chip geometry itself — and that's a much harder thing for NVIDIA to copy.

The Bet the Hyperscalers Are Already Placing

The clearest evidence that the GPU buildout isn't the destination isn't coming from startups in Toronto or Shanghai. It's coming from the hyperscalers themselves, with their own money, in production, right now.

In January 2026, Microsoft launched Maia 200, a TSMC 3nm custom inference ASIC with 140 billion transistors and 216GB of HBM3E, deployed first in Iowa and Arizona.⁴ It is now running GPT-5.2 inference, Microsoft Foundry, and Microsoft 365 Copilot. Microsoft's published numbers claim roughly 30 per cent better performance per dollar than the previous generation of hardware in their own fleet. They have pointedly not published a direct NVIDIA comparison, but the chip exists precisely because, on Microsoft's own internal math, the in-house silicon beat the NVIDIA alternative. The same company that guided to roughly $80 billion of FY25 AI infrastructure spending is also routing its highest-volume production workloads off NVIDIA, onto silicon it designed in-house.

AWS is further along. Trainium2 launched in late 2024; AWS has deployed more than 500,000 chips in production and claims 30-40 per cent better price-performance versus comparable GPU instances.⁵ Project Rainier, the half-million-chip Trainium cluster activated in October 2025, runs Anthropic's Claude workloads, training included. The notion that frontier model training is an NVIDIA monopoly is already out of date.

Google has been furthest along the whole time. Gemini Ultra was trained on a mix of TPU v4 and v5e, never on NVIDIA silicon. TPU v5p pods now coordinate at the 8,960-chip scale, more than double the previous generation. Gemini, serving billions of inference queries daily, has never depended on NVIDIA at either layer.

This is the part that ought to be alarming for the case that the trillion-dollar GPU buildout is the right bet. The hyperscalers are not sceptical bystanders watching ASIC startups from a distance. They are actively building the alternative silicon themselves, in production, at scale, paid for out of the same capital budgets that are also still buying NVIDIA Blackwells. They are running both bets simultaneously, and they are routing their own most expensive workloads — Copilot, Gemini, Alexa, Anthropic's Claude training — onto the silicon they built rather than the silicon they bought.

The mainstream industry reading of this is that custom silicon "supplements rather than replaces" GPU infrastructure. That reading is correct today, and is going to age the same way every other "supplements rather than replaces" prediction in technology history has aged.

The Orbital Footnote

The orbital story is a smaller piece of evidence pointing in the same direction, and it's worth being precise about it because it gets oversold.

In January 2026, SpaceX filed with the FCC to launch up to one million orbital data centre satellites. Musk's "AI Sat Mini" is sized to provide 100 kilowatts of AI compute per satellite, solar-powered, networked via laser mesh through the existing Starlink fleet.⁶ Google has separately confirmed Project Suncatcher, putting TPUs in low Earth orbit. Bezos's Blue Origin has its own variant. The IEEE Spectrum analysis of these proposals concludes that orbital compute might cost roughly three times its terrestrial equivalent in the best case, so this isn't a story of "orbit is cheaper." It's a story of "orbit is being seriously considered despite being more expensive, because the terrestrial constraints are bad enough that a 3x premium is starting to look reasonable."⁷ Those constraints are real: the median wait for large data centre grid interconnection in the US now runs five to six years, with Google reporting timelines up to twelve years for some sites.

A few honest caveats. The million-satellite figure is a regulatory ceiling request, not a deployment plan; SpaceX has asked the FCC to waive its normal deployment milestones precisely because they can't meet them. The whole pivot is contingent on Starship reaching reliable cadence, which it hasn't yet. And Google's TPU is not the same kind of narrow ASIC that Etched or Taalas are building — it's a general-purpose AI accelerator that sits somewhere between a GPU and a transformer-only chip.

What is still notable is the silicon choice. Aerospace engineer Andrew McCalip, whose costing the IEEE Spectrum analysis draws on, describes it plainly: "you just start putting some radiation-resistant ASIC chips on the Starlink fleet and you start growing edge capacity organically."⁷ A 100kW satellite power envelope makes high-end NVIDIA GPUs a non-starter on watts per token, never mind radiation hardening. Whatever ends up in orbit (Google's TPUs, SpaceX's whatever-comes-next) will be more specialised than what the terrestrial datacentres are being built around. The same companies building the trillion-dollar GPU campuses are designing their orbital plans around silicon they explicitly cannot use in those campuses.

That's not a smoking gun. It's another data point in the same direction as Maia, Trainium, TPU, Etched, Taalas, and Huawei: when the engineering constraint actually bites, the answer is specialised silicon, not more general-purpose compute.

The Silicon Pattern

Every silicon-based technology I have ever watched ship has followed roughly the same arc. The first generation is bulky, expensive, fragile, and centralised. The mature generation is small, cheap, robust, and everywhere. There are exceptions outside of silicon — fusion is still twenty years away, supersonic passenger flight regressed, nuclear got more expensive over decades — but inside the domain of digital silicon, the pattern is remarkably consistent.

The drone is the cleanest recent example. A first-generation consumer drone in the early 2010s was a remote-control helicopter that got stuck in trees, drifted on the wind, lost signal at fifty feet, and crashed if you looked at it wrong. A 2025 consumer drone costs three hundred dollars, climbs to a kilometre, autonomously navigates around obstacles, returns home if the battery gets low, and is so reliable that hobbyists routinely fly them over a mile away without thinking twice.

Same arc with GPS: a backpack-sized military device in the 1980s; a $400 dashboard unit in the early 2000s; free, automatic, and accurate to the metre on a phone in your pocket today. Same with solar PV modules: over $75 per watt in 1977, around 30 cents per watt today.
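The solar figures give a feel for how steep such a curve is once it is expressed as a yearly rate. This is a minimal sketch that assumes "today" means 2025 and that the decline was smooth, which it was not; the two prices are the ones quoted above and the function names are hypothetical.

```python
import math


def annual_change(start_price, end_price, years):
    """Constant yearly rate that turns start_price into end_price over `years`."""
    return (end_price / start_price) ** (1 / years) - 1


def halvings(start_price, end_price):
    """How many times the price halved between the two points."""
    return math.log2(start_price / end_price)


START_PRICE = 75.0     # dollars per watt in 1977, as quoted in the text
END_PRICE = 0.30       # dollars per watt "today", as quoted in the text
YEARS = 2025 - 1977    # assumption: "today" is taken to be 2025

rate = annual_change(START_PRICE, END_PRICE, YEARS)
n_halvings = halvings(START_PRICE, END_PRICE)

print(f"average change per year: {rate:.1%}")
print(f"price halvings: {n_halvings:.1f}, one every {YEARS / n_halvings:.1f} years")
```

On those assumptions the price fell by roughly 11 per cent a year, compounding to about eight halvings. The point of the comparison is the compounding, not any single year's drop.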

There is no good reason to believe AI inference silicon is exempt from this pattern. The specialised hardware needed to bring it down the curve already exists in prototype form from at least three independent companies and in production from all three of the largest hyperscalers. The question isn't whether AI inference follows the silicon-commoditisation curve. The question is how soon, and how far up the workload stack it eats before stopping.

The Cost of a Bolt

It is worth distinguishing where this commoditisation actually hits, because "AI inference" is not one market. It is at least three.

The edge — AI capability shipping on phones, in cars, in industrial sensors — is already a commodity-silicon market. Apple's Neural Engine, Qualcomm's Hexagon, Samsung's NPU; this is the bolt world that already exists. Hundreds of millions of inference ASICs were sold last year, mostly invisible to anyone who doesn't read teardowns.

The mid-tier — the workloads served by mainstream chatbots, coding assistants, image generation, the bulk of what people actually use AI for through a screen — is where the migration is happening right now, on Trainium and Maia and TPU. This is the layer the trillion dollars is being spent on, and it is also the layer most exposed to displacement.

The frontier — 500B+ parameter agentic reasoning, video generation at Sora-class quality, multi-step scientific research workflows — genuinely still needs every joule of programmable flexibility a Blackwell can provide. Reasoning models that spend 100x more compute per query are a real trend, and they push the frontier upward as fast as the commodity end pushes from below. The trillion-dollar GPU CapEx makes sense for that top slice. It does not make sense, at this scale, for the much larger middle.

When inference stops being a scarce, centralised resource rented through an API, and starts being a line item that gets dropped into devices the way an accelerometer or a Bluetooth radio gets dropped in today, the middle of the market is the one that goes first. The frontier persists. The edge expands. The middle, currently the most expensive layer to run and the one being built for, is the one that gets eaten.

A purpose-built inference ASIC, manufactured at volume on a mature process node, has no fundamental reason to cost meaningfully more than other commodity silicon in the same size class. Not next year — but on a five-to-ten year horizon, the cost of putting a useful, narrow AI capability into a product becomes the cost of a sensor, a regulator, a passive component. It becomes the cost of a bolt. You don't rent it. You don't subscribe to it. You buy it once, you drop it into your zero-trust network, and you own the execution.

That is the world the orbital filings, the Etched roadmap, the Taalas respin cycle, the Huawei Tau Law, and the Maia / Trainium / TPU programmes are all, separately, pointing towards. None of those organisations are coordinating. They are all responding to the same set of physical constraints from different angles, and the answer they are each converging on is some version of specialise the silicon, distribute the compute, and stop trying to brute-force the workload through a programmable general-purpose chip.

The trillion dollars of hyperscale GPU CapEx currently being poured is a transitional artefact. It will produce real revenue for some years, because the alternatives aren't yet at full scale and the existing software ecosystem is still built around CUDA. But it is not the destination. The destination is the commodity bolt at the bottom, the specialised hyperscaler ASIC in the middle, and the genuinely frontier GPU at the top. And the timeline for the middle to migrate is, I would bet, shorter than the timeline to finish building and powering the datacentres that CapEx is paying for.

Where This Prediction Could Be Wrong

I want to be explicit that the argument above is a prediction, not a description, and predictions can fail. A few specific things would have to be true for the case to be wrong:

If the next five years look like the last five, the thesis is wrong. Cerebras and Groq have been making variations of this argument since 2018-2019 and NVIDIA's revenue has multiplied roughly tenfold over that period. Buyers have consistently chosen flexibility over peak efficiency, on the entirely rational grounds that they don't know which model they'll be running in eighteen months. The reason I think this time is different is that the hyperscalers themselves have now stopped waiting — Maia, Trainium, and TPU at production scale is a structural change, not another round of startup pitches. But it's a judgement call, and a reasonable person can read the same evidence and conclude the GPU buyers have priced this correctly.

If models keep getting structurally larger and reasoning workloads keep multiplying compute per query (and the trajectories of o-series reasoners, video generation, and multi-step agentic workflows all point that way), the frontier might keep outrunning the commodity curve. The bolt would still arrive. The frontier would just keep moving faster than the bolt could catch.

If CUDA's software moat proves stickier than the underlying silicon advantage justifies, migration to specialised hardware will be slower than the hardware progress would predict. Compilers like XLA are eroding that moat — but they have been eroding it for years, and most developers still live in CUDA.

If the orbital plans collapse — Starship doesn't reach reliable cadence, a million-satellite constellation doesn't get past regulators, the cost math gets worse than 3x — the orbital flank of the argument weakens. The hyperscaler-internal ASIC programmes still stand without it, but the broader pattern looks thinner.

The honest version: the cautious reading of the same evidence — hyperscalers will overspend somewhat, ASICs will take meaningful share of mid-tier inference, frontier training will stay on programmable accelerators for some time, edge AI will keep getting cheaper on commodity silicon — is the consensus, and it is probably also true. The disagreement between the cautious version and the argument I'm making is about magnitude and timeline, not direction. If I'm right, the trillion-dollar CapEx is the wrong shape and the wrong scale and ROI will compress hard. If the cautious version is right, the same CapEx is roughly the right size, GPUs hold the centre, and ASICs nibble around the edges. Both worlds are coherent. I think the first one is more likely. I'd want anyone reading this to know the second one is also a serious possibility, and to weight accordingly.

What the Golem Was Trying to Tell Us

The Golem story isn't really about clay. It's about the gap between what a mechanism does and what its operators think it does. The thing in the server rack is not a digital god, and it is not a soul. It is a very expensive piece of stochastic pattern-completion, currently being run on hardware that spends much of its electricity moving numbers across a bus.

The Golem doesn't need to be smarter. It needs to be cheaper. And it is about to be.