AI / ML· 2 min

Async continuous batching in HF Transformers lifts GPU utilisation from 76% to 99%

A Hugging Face engineering post details a three-stream CUDA architecture that decouples CPU batch preparation from GPU compute — measured 22% faster end-to-end on an 8B-parameter Llama benchmark.

Hugging Face published the second instalment of its LLM-inference engineering series on 14 May 2026, detailing how asynchronous continuous batching has been added to the `transformers` library [1]. The change targets a familiar problem: in synchronous continuous batching, the CPU and GPU alternate — "while the GPU computes, the CPU waits. And while the CPU prepares the next batch, the GPU waits" [1].

The team measured the cost. In their baseline run (an 8B-parameter model, batch size 32, 8K-token outputs, on the H-class GPUs used in the benchmark setup), total generation time was 300.6 seconds, and the GPU sat idle for 24% of that time [1]. After switching to the asynchronous implementation, total time dropped to 234.5 seconds and GPU utilisation rose from 76.0% to 99.4%, a 22% reduction in end-to-end time [1]. The post treats the original 24% idle fraction as the ceiling on the achievable saving and notes that the measured 22% comes close to it.
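
The arithmetic behind those figures can be checked in a few lines. The sketch below uses only the numbers reported in the post [1]; the variable names are illustrative. It also shows that the 22% figure describes time saved, which is a different quantity from the gain in throughput over the same run.

```python
# Figures reported in the post [1]; variable names are illustrative.
sync_seconds = 300.6      # synchronous continuous batching, total time
async_seconds = 234.5     # asynchronous continuous batching, total time
sync_utilisation = 0.760  # GPU utilisation in the synchronous run

# Share of the synchronous run during which the GPU was idle.
# This is the ceiling on how much time overlapping CPU work can recover.
idle_fraction = 1.0 - sync_utilisation

# Fraction of wall-clock time actually removed by the async implementation.
time_saved = 1.0 - async_seconds / sync_seconds

# The same change expressed as extra work completed per unit of time.
throughput_gain = sync_seconds / async_seconds - 1.0

print(f"idle fraction (ceiling): {idle_fraction:.1%}")
print(f"time saved:              {time_saved:.1%}")
print(f"throughput gain:         {throughput_gain:.1%}")
```

The time saved rounds to 22.0% against a 24.0% ceiling, while the equivalent throughput gain is about 28%.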

The mechanism uses three CUDA streams: host-to-device transfer, compute, and device-to-host transfer. Each stream's operations are sequential internally but independent across streams, with event-based synchronisation enforcing the data dependencies. The team pairs this with a double-buffered "slot A / slot B" arrangement for input and output tensors — while the GPU processes batch N out of slot A, the CPU is preparing batch N+1 into slot B [1].
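
To see why two slots are enough to hide the CPU work, a toy timing model helps. The sketch below is not the `transformers` implementation: it is a pure-Python simulation with hypothetical function names and per-step costs, it ignores the transfer streams, and it assumes the CPU can prepare batch N+1 without waiting for the results of batch N. It only models the slot A / slot B rule that a slot cannot be refilled until the GPU has finished with it.

```python
def synchronous_time(cpu_prep, gpu_compute):
    """CPU and GPU alternate, so every step pays for both phases."""
    return sum(c + g for c, g in zip(cpu_prep, gpu_compute))


def double_buffered_time(cpu_prep, gpu_compute):
    """CPU fills one slot while the GPU computes from the other.

    Batch i uses slot i % 2, so the CPU may only start preparing batch i
    once the GPU has finished batch i - 2 (the previous user of that slot).
    """
    cpu_done = []  # time at which batch i is ready in its slot
    gpu_done = []  # time at which the GPU finishes batch i
    for i, (c, g) in enumerate(zip(cpu_prep, gpu_compute)):
        cpu_free = cpu_done[i - 1] if i >= 1 else 0.0
        slot_free = gpu_done[i - 2] if i >= 2 else 0.0
        cpu_done.append(max(cpu_free, slot_free) + c)

        gpu_free = gpu_done[i - 1] if i >= 1 else 0.0
        gpu_done.append(max(gpu_free, cpu_done[i]) + g)
    return gpu_done[-1] if gpu_done else 0.0


# Hypothetical per-step costs chosen to mirror a 24% / 76% split.
steps = 1000
cpu_prep = [0.24] * steps
gpu_compute = [0.76] * steps

sync_total = synchronous_time(cpu_prep, gpu_compute)
async_total = double_buffered_time(cpu_prep, gpu_compute)

print(f"synchronous:     {sync_total:.2f}")
print(f"double-buffered: {async_total:.2f}")
print(f"time saved:      {1 - async_total / sync_total:.1%}")
```

In this model only the first batch's preparation stays exposed; every later preparation overlaps with GPU compute, so the saving approaches the idle fraction as the number of steps grows. When CPU preparation takes longer than GPU compute, the CPU becomes the bottleneck and the GPU idles again.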

There's a memory cost. Double-buffering "doubles the amount of RAM and VRAM used to store the input and output tensors" [1]. The team mitigates this by pairing the technique with FlashAttention, which eliminates the attention mask tensor — described as "by far the largest input tensor" — so the practical overhead is smaller than the doubled-buffer accounting suggests [1].

The post also addresses CUDA-graph constraints. CUDA graphs are recorded against specific memory addresses, so a graph captured for slot A cannot be replayed against slot B's buffers; the implementation captures two graphs. The graphs are placed in a shared CUDA memory pool, so "both graphs together use nearly the same amount of VRAM as one" [1].

The post states the economic motivation explicitly: an H200 on HF Inference Endpoints costs around $5 an hour, "cheap for an hour, but use it for a day and you are already paying $120" [1]. At production scale, a 22% cut in generation time is real money.
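
As a rough illustration of that claim, the sketch below combines the quoted price with the benchmark totals. It assumes that cost scales linearly with wall-clock time for a fixed workload, which is an assumption of this example and not a figure from the post.

```python
hourly_rate_usd = 5.0  # approximate H200 price quoted in the post [1]
daily_cost_usd = hourly_rate_usd * 24

# Benchmark totals from the post [1].
sync_seconds, async_seconds = 300.6, 234.5

# Assumption: a day's worth of synchronous work finishes in proportionally
# less GPU time once the async implementation is used.
async_daily_equivalent = daily_cost_usd * async_seconds / sync_seconds

print(f"one GPU-day:                    ${daily_cost_usd:.2f}")
print(f"same workload, async batching:  ${async_daily_equivalent:.2f}")
```

Under that assumption, the work that filled a $120 GPU-day would cost roughly $94.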

The implementation is in the `transformers` library — the entry point is `continuous_batching.py`, and the async-specific code sits in the `ContinuousBatchingAsyncIOs` class [1]. The team does not explicitly benchmark against vLLM's PagedAttention, TGI, or other competing serving stacks; the comparison provided is against `transformers`'s own previous synchronous continuous batching.

For ML-platform teams running self-hosted inference on `transformers`, upgrading is the obvious move. For those on vLLM or TGI, the engineering details are a useful primer on what synchronous-batching overhead looks like in practice — and a benchmark to compare your current stack against.