The Chip That Ate Its Own Story
Trainium won training. It won prefill. It couldn't win decode. The person who sold the full-stack narrative just left.
Gadi Hutt, director of product and customer engineering at Amazon’s Annapurna Labs, has left the company. [1] The news, reported by The Information on March 26, landed thirteen days after AWS announced a partnership with Cerebras Systems to disaggregate inference — splitting prefill from decode across two vendors’ chips and routing it all through Amazon Bedrock. [2]
Whether the departure was triggered by the Cerebras announcement or was already in motion, the structural alignment remains the same.
I worked with Gadi and his team across three companies. At AWS, where I spent six years. At Hugging Face, where Inferentia and Trainium adoption were part of the ecosystem play I led as Chief Evangelist. And at Arcee AI, where making custom silicon work in production was a practical, daily question. Gadi ran the engineering and solutions architecture teams responsible for making Annapurna’s chips work for customers, not just pitching them. The story he carried held up for a long time: Trainium is a full-stack AI chip that competes with Nvidia on training and inference. One product. One pitch.
That story changed on March 13.
The pivot nobody is naming. The AWS-Cerebras deal is either smart engineering or a structural concession — and the org chart tells you which one Amazon thinks it is. [3] Trainium handles prefill, the compute-bound phase. Cerebras’s WSE-3 handles decode, the memory-bandwidth-bound phase where the model generates tokens sequentially. [4] David Brown, VP of Compute & ML Services, is the named spokesperson for the new architecture, not anyone from Annapurna’s product organization. [5]
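Why the two phases bind on different resources comes down to arithmetic intensity. A back-of-envelope sketch (my own illustrative numbers, not vendor specs): each forward pass does roughly two FLOPs per weight per token in flight, while the weights themselves are streamed from memory once per pass regardless of token count. Prefill batches the whole prompt into one pass; decode runs one token at a time.

```python
# Back-of-envelope roofline sketch (illustrative, not vendor specs):
# per forward pass, FLOPs scale with params * tokens, while weight
# traffic is params * bytes_per_param no matter how many tokens are in flight.

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weight traffic for one transformer forward pass."""
    flops_per_param = 2 * tokens_per_pass      # one multiply-accumulate per weight per token
    bytes_moved_per_param = bytes_per_param    # weights streamed once per pass (fp16/bf16)
    return flops_per_param / bytes_moved_per_param

prefill = arithmetic_intensity(tokens_per_pass=4096)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=1)      # one token per pass

print(f"prefill: {prefill:.0f} FLOPs/byte")  # thousands of FLOPs per byte: compute-bound
print(f"decode:  {decode:.1f} FLOPs/byte")   # ~1 FLOP per byte: memory-bandwidth-bound
```

That three-orders-of-magnitude gap is the whole argument for disaggregation: no single memory hierarchy is optimal at both ends of it.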
Four days before the departure news, TechCrunch ran an exclusive tour of the Trainium lab in Austin. The guides were Kristopher King and Mark Carroll, engineering leadership. [6] Gadi, who had been the external face of Trainium at re:Invent, in Time, in Fortune, was already absent. The phase-out was complete before The Information made it official.
The org shift behind the architecture shift. Annapurna was not originally structured as a service team. It operated as an R&D center — designing chips and delivering them to AWS service teams, with its own leadership, customer relationships, and external voice. That changed roughly two years ago, when Brown’s compute and ML services organization absorbed the product and go-to-market layer that sat between Annapurna’s engineering and external customers. Then, in December 2025, Peter DeSantis was elevated to lead a unified org spanning AI models, custom silicon, and quantum, reporting directly to Jassy. [7] Each step moved Annapurna closer to becoming an internal component supplier rather than an independent product shop. By the time the Cerebras deal was announced, the organizational structure that had given Gadi his role — an Annapurna that spoke for itself — no longer existed. Amazon spent a decade acquiring Annapurna. It spent the last two years digesting it.
What the deal reveals. Gadi told Time last year that “Stargate is easy to announce — let’s see it implemented first.” [8] That confidence reflected the old story: AWS builds the chips, builds the servers, builds the datacenter, runs the whole stack. The Cerebras deal breaks it — the inference pipeline now runs on someone else’s silicon for its most demanding phase.
There are 1.4 million Trainium chips deployed across three generations; Anthropic’s Claude runs on over one million of them. [9] OpenAI has committed to two gigawatts of Trainium capacity — a commitment, not yet a deployment. [10] Trainium succeeded at training and at prefill. This is not a failure of the chip.
But AWS had made a consequential bet along the way: it discontinued Inferentia entirely. [11] The rationale was sound — Trainium1 was actually better at inference than Inferentia2. Inf2 was designed as a lower-cost chip, optimized to crush inference costs as a slower but more cost-effective alternative to GPUs. When your training chip outperforms your dedicated inference chip at inference, you consolidate. AWS did.
Then the market changed beneath the consolidated architecture. Agentic AI made inference the dominant workload — generating 15x more tokens per query than conversational chat [12] — and decode became the binding constraint on cost and latency. The Reasoning Tax breaks the monolithic chip by concentrating costs on the phase the chip handles worst. [13] Arm launched its first production CPU this week for “agentic AI infrastructure,” with Meta, OpenAI, and Cerebras as launch partners. [14] Trainium could win inference when it meant prefill. It could not win when it meant decode at reasoning scale. Whether killing Inferentia was yet another AWS miscalculation or the deliberate first step toward a disaggregated architecture, the result is the same: the full-stack chip story ended.
AWS was not the only one that noticed. Nvidia shipped Dynamo, the open-source framework for orchestrating disaggregated prefill and decode, which all four major hyperscalers are adopting. [15] Then at GTC 2026, four days after the Cerebras deal, Nvidia launched the Groq 3 LPX — its first rack built around a non-GPU chip. [16] Rubin GPUs handle prefill. Groq’s SRAM-based LPUs handle decode. Same split, different partners. When Nvidia and AWS reach the same architectural conclusion in the same month — one through a reported $20 billion Groq licensing deal, the other through Cerebras — that is not two companies making independent bets. That is an industry settling a technical argument. For reasoning-heavy inference at scale, the “one chip does everything” era ended in March 2026.
Gadi’s departure is the personnel signal that matches the architectural one. The positioning has shifted from “Trainium competes with Nvidia at the chip layer” to “AWS orchestrates the best inference pipeline using multiple silicon architectures” — a stronger market position, but one that requires a platform architect who orchestrates silicon from multiple sources, not a leader who owns the chip end-to-end.
For all my frustration and cursing at the Neuron SDK, I have a lot of respect for what Gadi and the Annapurna team built. The Inferentia-to-Trainium arc is the most ambitious custom silicon program any cloud provider has shipped, and the adoption numbers vindicate the engineering. [17] The departure is not a verdict on the person. It is a verdict on the narrative.
Watch where Gadi surfaces next. Another custom silicon company would signal frustration with a strategy he disagreed with. A platform or orchestration role would signal alignment with the industry’s direction. Either way, the person who embodied the full-stack chip story at AWS is the clearest evidence that this story is over — not just at AWS, but everywhere.
Notes
[1] The Information, “Amazon AI Chip Product Leader Departs,” March 26, 2026. Hutt’s title was Director of Product and Customer Engineering at Annapurna Labs.
[2] AWS press release, “AWS and Cerebras Collaboration Aims to Set a New Standard for AI Inference Speed and Performance in the Cloud,” March 13, 2026.
[3] Brown, quoted in [2]: the disaggregated architecture means “each system does what it’s best at.” That is legitimate engineering. It is also a departure from the Trainium-does-everything positioning that defined Annapurna’s external narrative for years. Both readings are correct; the org changes determine which one is operative.
[4] Cerebras, “Cerebras Is Coming to AWS,” March 13, 2026. In disaggregated mode, Trainium handles prefill (computing the KV cache), which is sent to the WSE over EFA. The WSE exclusively performs decode. The WSE-3 houses 44 GB of on-chip SRAM — no HBM — eliminating the memory-bandwidth bottleneck that constrains conventional GPU decode.
[5] Brown was the named AWS spokesperson in the March 13 joint announcement [2], the AWS Silicon Innovation Day keynote (2023), and the Peter DeSantis/Dave Brown infrastructure keynote at re:Invent 2025. His title evolved from VP, Amazon EC2 to VP, Compute & ML Services. Gadi Hutt ran engineering and solutions architecture teams at Annapurna and served as the external face of the chips at re:Invent 2022, in Time (April 2025), and in Fortune (April 2025). His departure removes both the engineering bridge and the customer-facing narrative from Annapurna’s product layer.
[6] TechCrunch, “An Exclusive Tour of Amazon’s Trainium Lab,” March 22, 2026. Tour led by Kristopher King (lab director) and Mark Carroll (director of engineering). Hutt absent.
[7] Andy Jassy, “Amazon Leadership Update,” aboutamazon.com, December 17, 2025. “I’ve asked Peter DeSantis to lead a new organization that drives our most expansive AI models (e.g. Nova—and the team we’ve called ‘AGI’), silicon development (e.g. Graviton, Trainium, Nitro), and quantum computing.” Jassy confirms DeSantis “spearheaded the acquisition of Annapurna Labs” in 2015 and “continues to manage that team.” The organizational absorption of Annapurna’s product and go-to-market layer into Brown’s compute org occurred approximately 2 years earlier — the author’s direct knowledge from the customer and partner sides, working with Annapurna across AWS, Hugging Face, and Arcee AI.
[8] Gadi Hutt, quoted in Time, “Inside Amazon’s Race to Build the AI Industry’s Biggest Datacenters,” April 2, 2025.
[9] TechCrunch, March 22, 2026. Company-reported figure: “1.4 million Trainium chips deployed across all three generations, and Anthropic’s Claude runs on over 1 million of the Trainium2 chips deployed.”
[10] AWS-Cerebras joint press release [2], March 13, 2026. “OpenAI will consume 2 gigawatts of Trainium capacity through AWS infrastructure.” This is a commitment, not a deployment. See also Jassy’s CNBC interview on the OpenAI-Trainium relationship.
[11] Next Platform, “With Trainium4, AWS Will Crank Up Everything But The Clocks,” December 3, 2025. “With Trainium2...AWS moved on to the NeuronCore-v3 architecture and stopped making Inferentia chips because inference started becoming more like training.” There is no Inferentia3. Practitioner context: Trainium1 was already outperforming Inferentia2 at inference. Inf2 was designed as a lower-cost, lower-performance chip optimized to reduce inference costs — a slower but cheaper alternative to GPUs, not a faster one. Consolidating around Trainium was the rational engineering decision given the performance gap. The question the Cerebras deal answers is which kind of inference Trainium wins: prefill (yes), decode at reasoning scale (no). Author’s direct knowledge.
[12] Cerebras [4], March 13, 2026. “Agentic coding generates approximately 15x more tokens per query.” Vendor-published figure; directionally consistent with Nvidia Dynamo documentation describing the same workload shift.
[13] Julien Simon, “AWS Built Its Own AI Chip. Now It Needs Someone Else’s,” The AI Realist, March 15, 2026. Introduces the Reasoning Tax framework, the Platform Absorption Test, and the three-ecosystem convergence (AWS-Cerebras, Nvidia-Groq, Huawei Ascend 950). Covers the Inferentia-to-Trainium naming evolution and discontinuation in detail. The present note is a personnel coda to that structural analysis.
[14] Arm Holdings, “Arm Expands Compute Platform to Silicon Products,” March 24, 2026. First production silicon in Arm’s 35-year history. 136 Neoverse V3 cores, TSMC 3nm, positioned for “agentic AI infrastructure.” Meta is lead co-developer; OpenAI, Cerebras, Cloudflare among launch partners. See also CNBC exclusive, “Arm Launches Its Own CPU, with Meta as First Customer,” March 24, 2026.
[15] Nvidia, “NVIDIA Enters Production With Dynamo, the Broadly Adopted Inference Operating System for AI Factories,” investor relations press release, March 16, 2026. Dynamo 1.0 is open-source, production-grade, integrated by AWS, Microsoft Azure, Google Cloud, and OCI. Jensen Huang: “With NVIDIA Dynamo, we’ve created the first-ever ‘operating system’ for AI factories.” See also Nvidia developer blog and glossary entry on disaggregated serving.
[16] Nvidia, Groq 3 LPX announced at GTC 2026, March 17, 2026. Rubin CPX GPU racks handle prefill; LPX handles decode. Based on a reported $20 billion licensing agreement with Groq — figure not confirmed by Nvidia in filings; if accurate, it would be material enough to require disclosure. Coverage: WinBuzzer, March 17, 2026. Nvidia’s own GTC materials confirm the product and disaggregated architecture.
[17] TechCrunch [6], March 22, 2026. Apple’s director of AI publicly described Apple’s use of Graviton, Inferentia, and Trainium at an AWS event. Anthropic and OpenAI commitments sourced in [9] and [10].