At the end of 2025, I was involved in a capacity planning effort for a global retailer. The company had integrated a 70B model into its pipeline for product search and recommendations, and every search query triggered an inference call. Over the Christmas season, this added up to a GPU-hour bill that caused the finance team physical discomfort.
The company had already scaled from 24 to 48 NVIDIA H100 GPUs – yet latency still spiked under peak load. That is when I was called in to answer a simple question: Given the ongoing problems, will 96 GPUs be needed for the January sale – or is something else at play here?
So I started where I always start on projects like this: with profiling. I instrumented the serving layer and broke the usage data down by inference phase. The result fundamentally changed my perspective on GPU infrastructure. During prompt processing – the phase in which the model reads the entire user input in parallel – the H100s ran at 92 percent compute utilization; the Tensor cores were fully loaded. That is what you expect from a $30,000 GPU. However, this phase took only around 200 milliseconds per request. The next phase, token generation, took three to nine seconds. During that time, utilization of the same GPUs dropped to 30 percent: the compute cores sat largely idle while the memory bus ran at full speed reading out the attention cache.
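I cannot reproduce the customer's actual instrumentation here, but the core idea is easy to sketch. The snippet below is a minimal, hypothetical Python example: it wraps a streaming inference call and treats the time to the first token as the prompt-processing (prefill) phase and everything after it as token generation (decode). The function name and the `stream` iterator are illustrative assumptions, not the customer's code.

```python
import time

def profile_request(stream):
    """Split one streaming inference call into prefill and decode timings.

    `stream` is assumed to be any iterator that yields generated tokens.
    The gap before the first token approximates prompt processing (prefill);
    everything after it is token generation (decode).
    """
    start = time.perf_counter()
    first_token_at = None
    tokens = 0

    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # prefill ends at the first token
        tokens += 1

    end = time.perf_counter()
    first_token_at = first_token_at or end
    decode_s = end - first_token_at
    return {
        "prefill_s": first_token_at - start,      # ~0.2 s in the cluster described above
        "decode_s": decode_s,                     # 3-9 s in the cluster described above
        "decode_tokens_per_s": tokens / decode_s if decode_s else 0.0,
    }
```

Aggregating those two numbers separately per request is all it takes to make the bimodal profile described below visible.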
The customer was paying full H100 hourly rates, but got peak performance for only around five percent of each request's duration. The remaining 95 percent was eaten up by a memory bandwidth bottleneck.
Splitting workloads is one solution
LLM inference involves two workloads “pretending” to be one (a back-of-envelope sketch of the difference follows this list):
- Prompt processing (also known as “prefill”) is dense matrix multiplication that utilizes every core on the chip.
- Token generation (also “decode”) is sequential, memory-bound work that needs only a fraction of that compute.
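To see why the two phases behave so differently, a rough roofline-style estimate helps. The numbers below are illustrative assumptions for a 70B model on an H100 (bf16 weights, roughly 2 FLOPs per parameter per token, one full weight read per forward pass, KV-cache traffic ignored) – not measurements from the customer's cluster:

```python
PARAMS = 70e9              # model parameters (assumed)
BYTES_PER_PARAM = 2.0      # bf16 weights
PEAK_FLOPS = 990e12        # rough H100 dense bf16 tensor-core peak, FLOP/s
PEAK_BW = 3.35e12          # rough H100 HBM3 bandwidth, bytes/s

def phase_time(tokens: int) -> tuple[float, float]:
    """Return (compute-bound time, memory-bound time) in seconds for one
    forward pass over `tokens` tokens: ~2 FLOPs per parameter per token,
    plus one full read of the weights from HBM."""
    flops = 2 * PARAMS * tokens
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return flops / PEAK_FLOPS, bytes_moved / PEAK_BW

# Prefill: ~2,000 prompt tokens in one pass -> compute dominates (~0.3 s).
print("prefill:", phase_time(2000))

# Decode: one token per step -> the weight read dominates (~40 ms per token),
# so a few hundred output tokens take several seconds at batch size 1.
print("decode :", phase_time(1))
```

Whichever of the two times is larger is the binding constraint: prefill is limited by the Tensor cores, decode by memory bandwidth – which matches the 200-millisecond versus multi-second split the profiling showed.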
These two workloads alternate on the same hardware within the same scheduling loop. I have worked with carrier-scale Kubernetes clusters and high-throughput data pipelines, but I had never seen such a bimodal workload profile on such expensive hardware. If a database behaved this way, you would not hesitate to split it into primary write servers and read replicas.
Unfortunately, most teams running LLMs have not yet made this connection. And monitoring tools do not help: every inference dashboard I looked at reported a single number for “GPU utilization” – the average across both phases. The dashboards essentially hid the bimodal distribution behind that number.
Researchers at UC San Diego's Hao AI Lab have run into this problem too – and have a proposed solution. It goes by the somewhat unwieldy name “disaggregated inference” and relies on setting up two GPU pools instead of running both phases on one:
- one pool is optimized for compute throughput (prompt processing), while
- the other is optimized for memory bandwidth (token generation).
An upstream routing layer forwards every request to the right pool at the right time; the attention cache is handed off between the pools over a fast network connection.
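In practice, the routing layer is just a thin service in front of the two pools. The sketch below is a hypothetical illustration of the flow, not any particular product's API: the endpoints, payload fields, and the `kv_cache_handle` are all assumed names.

```python
import httpx  # any async HTTP client would do

PREFILL_POOL = "http://prefill-pool.internal:8000"   # hypothetical endpoints
DECODE_POOL = "http://decode-pool.internal:8000"

async def handle_request(prompt: str, max_tokens: int) -> str:
    """Route one request through a disaggregated serving setup:
    1. the prefill pool processes the prompt, materializes the attention
       (KV) cache and returns a handle to it;
    2. the decode pool pulls that cache over the fast interconnect and
       generates the output tokens from it."""
    async with httpx.AsyncClient() as client:
        prefill = await client.post(f"{PREFILL_POOL}/prefill",
                                    json={"prompt": prompt})
        cache_handle = prefill.json()["kv_cache_handle"]

        decode = await client.post(f"{DECODE_POOL}/decode",
                                   json={"kv_cache_handle": cache_handle,
                                         "max_tokens": max_tokens})
        return decode.json()["text"]
```

Production implementations move the cache itself over a fast interconnect such as RDMA rather than through the router; the router only carries the handle.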
When I first proposed this solution to the customer, he was skeptical. After all, two pools also mean more operational complexity. And a cache transfer protocol introduces a network dependency that does not exist with monolithic serving. Legitimate objections – so I showed him who is already using this approach in production:
- Perplexity has built its entire production serving stack on disaggregated inference, using RDMA for cache transfers. Meta, LinkedIn, and Mistral use the approach as well.
- Earlier in 2025, NVIDIA launched an orchestration framework called Dynamo that treats prefill and decode as first-class pool types.
- The open-source engines vLLM and SGLang have added native disaggregated serving modes.
- Red Hat and IBM Research have released an open source Kubernetes-native implementation called llm-d that maps the architecture to standard cluster management workflows.
So this is not a research prototype, but the standard architecture at the companies serving the majority of global LLM traffic. That ultimately convinced the customer – I got the green light for a two-week proof of concept.
GPU savings in practice
I started by splitting the cluster into two pools: eight GPUs dedicated solely to prompt processing, while the remaining 40 handled token generation. No new hardware or clusters were needed – just a configuration change in the serving layer and a routing policy that sent each request to the correct pool based on its inference phase.
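What that configuration change looks like depends entirely on the serving stack in use (vLLM, SGLang, Dynamo, llm-d, or something homegrown). The snippet below is a purely hypothetical sketch of such a pool split – every key name is invented for illustration:

```python
# Hypothetical two-pool configuration mirroring the split described above.
CLUSTER_CONFIG = {
    "model": "llama-70b",                    # assumed model
    "pools": {
        "prefill": {
            "gpus": 8,                       # dedicated to prompt processing
            "max_batch_tokens": 16_384,      # large parallel batches of prompt tokens
        },
        "decode": {
            "gpus": 40,                      # remaining GPUs generate tokens
            "max_concurrent_sequences": 512, # many parallel decode streams
        },
    },
    "routing": {
        "policy": "phase_based",             # route each request by inference phase
        "kv_transfer": "rdma",               # hand the attention cache between pools
    },
}
```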
The prompt-processing pool consistently hit 90 to 95 percent compute utilization because that was all it did: no token generation competing for scheduling slots, no decode requests sitting idle while a prefill burst consumed the cores. The token-generation pool held the bigger surprise, though: by batching hundreds of parallel decode requests, memory accesses were spread far more evenly, and bandwidth utilization climbed from 30 to over 70 percent. Overall compute efficiency roughly doubled.
Then came the cost calculation. Previously, the customer had spent around two million dollars a year on GPU hours for inference. After the split, he was on track to bring those costs down to between $600,000 and $800,000 – with identical request volumes and latency targets. Speaking of latency: there was a significant improvement there too. In the monolithic configuration, every new prompt blocked the token-generation requests that were already running; for users, that meant output stalling mid-sentence so that another user's prompt could be processed. After the transition, we saw a consistent token rate with no prefill-related stalls – the P99 inter-token latency flattened out completely.
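Inter-token latency is worth instrumenting explicitly, because an average tokens-per-second figure hides exactly these prefill-induced stalls. A tiny sketch, assuming the serving layer records a timestamp for every emitted token (names are illustrative):

```python
import numpy as np

def inter_token_p99(token_timestamps: list[float]) -> float:
    """P99 of the gaps between consecutive generated tokens, in seconds.

    In a monolithic setup these gaps spike whenever a new prompt's prefill
    preempts running decodes; after disaggregation the distribution flattens.
    """
    gaps = np.diff(np.asarray(token_timestamps))
    return float(np.percentile(gaps, 99))
```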
However, this solution is not a one-size-fits-all approach either (a rough decision heuristic follows the list):
- Short prompts under 512 tokens with short outputs do not generate enough cache to justify a network transfer.
- Multi-turn conversations, where over 80 percent of the cache from a previous round already exists on the decode worker, are better served locally.
- When there are fewer than a dozen GPUs, the scheduling overhead of having two pools can negate the savings.
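Condensed into code, these caveats amount to a simple go/no-go check. The thresholds are this article's rules of thumb, not universal constants – measure your own workload before acting on them:

```python
def should_disaggregate(prompt_tokens: int,
                        output_is_short: bool,
                        prefix_cache_hit_ratio: float,
                        total_gpus: int) -> bool:
    """Rough heuristic mirroring the caveats listed above."""
    if total_gpus < 12:
        return False          # two pools' scheduling overhead can eat the savings
    if prompt_tokens < 512 and output_is_short:
        return False          # too little attention cache to justify a network transfer
    if prefix_cache_hit_ratio > 0.8:
        return False          # multi-turn chat with a warm cache on the decode worker
    return True
```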
None of these caveats is likely to apply to teams for whom GPU scarcity and cost are a real problem, though. Those teams are typically running tens to hundreds of GPUs at enterprise scale, where utilization losses add up to millions of dollars a year.
An appeal to the industry
Our industry spends a lot of energy on the GPU supply side: building more factories, designing better chips, negotiating more comprehensive cloud contracts. These things are undoubtedly important. However, if the teams running monolithic LLM inference today switched to disaggregated serving, the effectively available GPU supply would double almost overnight.
So if you have not yet broken down your inference workload by phase, I can only recommend doing so. Add per-phase instrumentation to your serving layer and graph prefill and decode utilization separately over a 24-hour period. If the two lines look like they belong on entirely different graphs (spoiler: they will), treat them that way – serve the two phases on separate pools, and you stop paying for computing power you don't use. (fm)
This article was published as part of Foundry's English-language Expert Contributor Network.
