The AI chip acronym soup of CPUs, GPUs, TPUs, and more shows how the computing landscape has continued to expand and change over the past decade. At Google Cloud Next, Google announced two distinct TPUs (Tensor Processing Units) instead of one: TPU-8t, built for training, and TPU-8i, built for inference and the emerging demands of agentic workloads. The launch highlights an architectural decision driven by the divergence of AI workloads, with real implications for how enterprise buyers should think about AI infrastructure strategy.
What Google Actually Announced
During a press and analyst session at Google Cloud Next, Amin Vahdat, Google’s SVP and Chief Technologist for AI Infrastructure, introduced the eighth-generation TPUs, and the plural was deliberate: Vahdat said the two chips were designed separately, from the ground up.
TPU-8t is the training workhorse. Compared to last year’s Ironwood generation, it delivers roughly three times the floating-point compute per pod, twice the network bandwidth per chip, and four times the bandwidth at scale-out — all with approximately the same pod size of 9,600 chips, but with denser, faster interconnects.
TPU-8i is the inference and agent engine. It quadruples the pod size to 1,152 chips, delivers 10x the FP8 compute and 7x the HBM memory capacity, and adds bidirectional scale-out bandwidth. The design priority is latency, not just throughput, a meaningful distinction as enterprises move from batch processing toward real-time agentic workloads.
Vahdat put the pace of progress plainly: “2x, 4x, 8x, 10x all in one year — the rate of progress, the rate of advancement is just stunning.”
That’s impressive on paper. The more important question for enterprise buyers is what it means for how they plan and procure AI infrastructure.
The Specialization Signal
The two-chip decision acknowledges that training and inference have different physics.
Training is throughput-bound: you’re moving enormous amounts of data through interconnected chips in a coordinated, largely predictable batch process. Inference, especially for the coming wave of agentic systems, is latency-bound: chips need to respond in near-real time as agents plan, act, evaluate, and route across multiple tools and workflows.
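To make that distinction concrete, consider how latency compounds when an agent’s steps are sequential, something a batch job never has to worry about. The sketch below is a rough, purely illustrative calculation with made-up numbers, not figures from Google’s announcement.

```python
# Back-of-the-envelope sketch: why sequential agent steps make per-call latency,
# not aggregate throughput, the binding constraint. All numbers are illustrative
# assumptions, not figures from Google's announcement.

def agent_wall_clock(steps: int, per_call_latency_s: float) -> float:
    """An agent that plans, calls a model or tool, and evaluates the result
    must wait for each call before issuing the next, so latencies add up."""
    return steps * per_call_latency_s

def batch_wall_clock(requests: int, throughput_rps: float) -> float:
    """A batch workload cares about aggregate throughput: requests overlap
    freely, so wall-clock time is roughly volume divided by throughput."""
    return requests / throughput_rps

if __name__ == "__main__":
    # Hypothetical agent task: 25 sequential model/tool calls.
    for latency_s in (0.50, 0.25, 0.10):
        print(f"agent task at {latency_s * 1000:.0f} ms/call: "
              f"{agent_wall_clock(25, latency_s):.1f} s end to end")

    # Hypothetical batch job: 1M requests at 10,000 requests/sec take ~100 s
    # of wall-clock time regardless of any single request's latency.
    print(f"batch job: {batch_wall_clock(1_000_000, 10_000):.0f} s end to end")
```

Cutting per-call latency is the only lever that shortens the agent’s end-to-end time, which is why a chip built for inference prioritizes it.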
To address the latency problem directly, Google and DeepMind collaborated on a new “boardfly” network topology for TPU-8i, designed to reduce the number of hops between any two chips and significantly cut chip-to-chip latency. As Vahdat described it: “Our default way of connecting them didn’t support latency. It supported bandwidth. What you really care about in the age of agents is latency — the minimum time it takes to get the data.”
This mirrors a trend Jensen Huang has highlighted at NVIDIA, where chip-to-chip connectivity is increasingly central to total system performance, not just an afterthought to compute specs. The implication: network topology is now a first-class variable in AI infrastructure design, not just chip count or memory.
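One way to see why hop count matters: for the same number of chips, different topologies yield very different average chip-to-chip distances, and every extra hop adds latency. The toy comparison below uses generic textbook topologies (a ring and a 2D torus) purely for illustration; Google did not disclose the details of the boardfly design.

```python
# Toy illustration of why topology is a first-class variable: fewer hops between
# any two chips means less time spent on the network. These are generic textbook
# topologies, not the "boardfly" design, whose details weren't disclosed.

def ring_hops(n: int, i: int, j: int) -> int:
    """Shortest-path hop count between nodes i and j on a bidirectional ring."""
    d = abs(i - j)
    return min(d, n - d)

def avg_hops_ring(n: int) -> float:
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(ring_hops(n, i, j) for i, j in pairs) / len(pairs)

def avg_hops_torus(rows: int, cols: int) -> float:
    """A 2D torus routes each dimension independently, so the hop count is the
    sum of two much smaller ring distances for the same total chip count."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(ring_hops(rows, a[0], b[0]) + ring_hops(cols, a[1], b[1])
               for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    # Same 64 chips, very different average chip-to-chip distances.
    print(f"64-chip ring:   {avg_hops_ring(64):.1f} average hops")
    print(f"8x8-chip torus: {avg_hops_torus(8, 8):.2f} average hops")
```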
Vahdat was direct about the broader trajectory: “The age of specialization is going to continue.” His prediction for the industry — not just Google — is that workloads will continue diverging, and two chips may eventually become more. General-purpose improvements, he noted, are now yielding roughly 5% annual performance gains normalized to cost. Specialization is how you get past that ceiling.
What This Means for Enterprise Buyers
Enterprise buyers don’t purchase TPUs. They consume AI services through public cloud, SaaS platforms running on cloud infrastructure, and increasingly through hybrid architectures spanning on-premises and cloud. Even so, there are at least three reasons this chip announcement matters.
- AI infrastructure costs are becoming a material business decision. Google runs AI inference on TPUs across Search, YouTube, Gmail, and its enterprise Gemini services, so the efficiency of that infrastructure directly affects the cost structure of AI-powered services. When Google cuts inference costs through better hardware, the economics of running AI at scale improve for Google and for its cloud customers. Citadel, the securities trading firm, was cited as a TPU customer that cut costs by 30% and achieved a two- to four-times efficiency improvement on its trading systems, a sign that specialized hardware can deliver value well beyond its original design targets.
- Inference is where AI delivers the most value to most enterprise buyers. For several years, we’ve been discussing the shift from large-scale frontier model training and enterprise AI fine-tuning toward inference. That shift is finally here, and there are now multiple ways to improve inference, including new TPUs designed specifically for it. As Vahdat noted, drawing a historical parallel to web search, the heavy lifting happens in training, but the value is created in serving: “Serving is where the value is created for Gemini Enterprise and search, and ads and YouTube.” Enterprise AI budgets and infrastructure roadmaps need to weight inference infrastructure in proportion to where the value is actually produced.
- Reliability at scale is still an unsolved problem, and it matters. Vahdat was candid about a challenge the industry rarely advertises: at the scale of tens of thousands of chips working in coordination, chip failures occur several times per day. If a human has to detect and recover from each failure, the minimum response time is about 30 minutes, enough to halt progress entirely. Google’s approach detects and remediates failures automatically, which is what lets it deliver more than 97% goodput (useful computational throughput); see the back-of-the-envelope sketch after this list. Even so, Google Cloud said enterprises don’t want to see any failures at all. For enterprises evaluating AI infrastructure providers, reliability and observability at scale are now table-stakes questions, not nice-to-haves.
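Here is the back-of-the-envelope sketch referenced above. All inputs (pod size, per-chip mean time between failures, recovery times) are assumptions chosen for illustration, not Google’s figures, but they show why automated recovery is the difference between high-90s goodput and losing a meaningful share of a very expensive pod.

```python
# Back-of-the-envelope sketch of the reliability math. Chip count, per-chip MTBF,
# and recovery times below are illustrative assumptions, not Google's numbers.

def expected_failures_per_day(num_chips: int, mtbf_hours: float) -> float:
    """With tens of thousands of chips, even very reliable chips fail somewhere
    in the pod several times a day."""
    return num_chips * 24.0 / mtbf_hours

def goodput(failures_per_day: float, recovery_minutes: float) -> float:
    """Fraction of useful compute time if the whole job stalls for
    recovery_minutes every time a chip fails."""
    lost_minutes = failures_per_day * recovery_minutes
    return max(0.0, 1.0 - lost_minutes / (24 * 60))

if __name__ == "__main__":
    # Hypothetical pod: 50,000 chips, each with a 200,000-hour mean time between failures.
    failures = expected_failures_per_day(50_000, 200_000)
    print(f"expected chip failures per day: {failures:.1f}")

    # A 30-minute human-in-the-loop recovery vs. a ~5-minute automated recovery.
    print(f"goodput, 30-min manual recovery:   {goodput(failures, 30):.1%}")
    print(f"goodput, 5-min automated recovery: {goodput(failures, 5):.1%}")
```

Under these assumed numbers, manual recovery caps goodput well below 90%, while automated recovery keeps it in the high 90s, consistent with the figure Google cited.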
The Agentic Infrastructure Shift Is Already Here
A surprising, forward-looking element of Vahdat’s remarks was a prediction about CPUs. As agentic systems grow, general-purpose compute will make a comeback, not to replace specialized chips but to orchestrate them. Agents require sandboxed environments, virtual machines, code execution, and dynamic routing across inference calls. That, Vahdat said, is CPU work.
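A minimal sketch helps show where the boundary sits. In the hypothetical agent loop below, only the model call would land on specialized accelerators; everything around it (planning state, routing, sandboxed code execution) is ordinary CPU work. The function and class names are placeholders invented for illustration, not a real Google or TPU API.

```python
# Minimal, hypothetical sketch of an agent loop. Only call_model() would hit a
# specialized accelerator; the planning state, routing, and sandboxed code
# execution around it run as ordinary general-purpose CPU code.

import subprocess
import sys
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)

def call_model(prompt: str) -> str:
    """Placeholder for the one step that runs on a specialized accelerator,
    e.g. an inference endpoint backed by TPUs or GPUs."""
    return f"PLAN: execute print('hello') to satisfy: {prompt}"

def run_in_sandbox(code: str) -> str:
    """Code execution in an isolated process: classic CPU-bound orchestration."""
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, timeout=10)
    return result.stdout

def agent_step(state: AgentState) -> AgentState:
    plan = call_model(state.goal)                  # accelerator-bound inference
    if "print(" in plan:                           # CPU-bound routing decision
        output = run_in_sandbox("print('hello')")  # CPU-bound sandboxed execution
        state.history.append((plan, output.strip()))
    return state

if __name__ == "__main__":
    final = agent_step(AgentState(goal="say hello"))
    print(final.history)
```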
Enterprise infrastructure planners should take note: agentic AI isn’t just an inference problem. It’s a systems design problem that spans specialized accelerators, general-purpose compute, network topology and, increasingly, identity and governance layers sitting above the hardware. The organizations Google cited as running on TPUs today, from its own consumer services to financial services firms, are already thinking holistically about infrastructure.
The infrastructure decisions enterprises make now will determine how quickly and cost-effectively they can deploy agentic systems at scale. Building on platforms engineered for latency, reliability, and specialization is a different starting point than building on platforms that aren’t.
Google Cloud’s eighth-generation TPUs are a signal that the advancement of AI infrastructure is far from over.