
Case Study · Alibaba GenAI Cluster Dataset 2026

GPUs idle at 0% compute
67% of the time
while holding 26 GB of memory

Direct analysis of Alibaba's production Stable Diffusion serving cluster — 143 pods, 161,000+ telemetry samples — shows that inference pods spend the vast majority of their lifetime at zero compute utilisation while keeping 26 GB of GPU memory exclusively reserved. Memory stays pinned whether or not any request is being served.

  • Dataset: Alibaba GenTD26 (SoCC'25)
  • Pods analysed: 143
  • Telemetry rows: 161,413 (memory) · 157,417 (compute)
  • Adapters in fleet: 874 LoRA · 104 base models

  • 66.8% of pod-time at exactly 0% compute
  • 25.9 GB average GPU memory held per idle pod
  • 7.0% mean compute utilisation fleet-wide
  • 87% of pods idle the majority of their lifetime

The dataset

The Alibaba GenTD26 dataset is a production trace from Alibaba's serverless Stable Diffusion image-generation platform. Unlike aggregate cluster statistics, it captures GPU metrics at the per-pod (per-process) level — every individual inference container is measured separately over time.

The cluster serves 874 unique LoRA adapters across 104 base model versions simultaneously. Each pod loads a base model plus one or more adapters into GPU memory and keeps them pinned to avoid cold-start latency on subsequent requests. The result: VRAM is permanently reserved at full capacity by every live pod, regardless of whether inference is actively running.

Telemetry files used

  • pod_gpu_memory_used_bytes_anon.csv — GPU memory (bytes) per pod over time
  • pod_gpu_duty_cycle_anon.csv — GPU compute utilisation (%) per pod
  • data_trace_processed.csv — per-request execution trace with LoRA config
  • Joined on (container_ip, 30 s timestamp bucket)
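The join described above can be sketched in a few lines of pandas. The tiny frames and column names below are illustrative stand-ins for the real files' schema, not the actual headers:

```python
import pandas as pd

# Synthetic stand-ins for the two telemetry files (column names assumed).
mem = pd.DataFrame({
    "container_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "timestamp":    [0, 31, 0],                    # seconds since trace start
    "gpu_memory_used_bytes": [26e9, 26e9, 30e9],
})
duty = pd.DataFrame({
    "container_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "timestamp":    [2, 33, 1],
    "gpu_duty_cycle": [0.0, 0.0, 72.0],
})

# Bucket timestamps into 30 s windows, then join memory and compute
# per (pod, bucket) as the methodology describes.
for df in (mem, duty):
    df["bucket"] = df["timestamp"] // 30

joined = mem.merge(duty, on=["container_ip", "bucket"], how="inner")
print(joined[["container_ip", "bucket", "gpu_memory_used_bytes", "gpu_duty_cycle"]])
```

With the real files, each resulting row pairs one pod's memory reservation with its compute utilisation in the same 30-second window.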

Fleet at a glance

  • 143 inference pods with concurrent memory + compute telemetry
  • 874 unique LoRA adapters + 104 base models pinned in VRAM
  • 68,195 requests in the trace; median inference: 23 s
  • Mean GPU compute utilisation across all pods: 7.0%

Two-thirds of pod-time is at 0% GPU compute

Across all 157,417 (pod, timestamp) samples in the GPU duty-cycle file, the most common state by far is a pod running at exactly zero compute utilisation. The GPU is sitting completely idle — no kernels executing, no tensors being processed — while the pod continues to hold its full memory reservation.

GPU compute utilisation distribution — all (pod × timestamp) samples:

  • = 0% (fully idle): 66.8%
  • 1–5% (near-idle): 5.5%
  • 5–50% (partial use): 25.4%
  • > 50% (active): 2.3%

n = 157,417 samples · 143 pods · source: pod_gpu_duty_cycle_anon.csv

Only 2.3% of all observed (pod, time) samples show a GPU running above 50% compute utilisation. The median GPU duty cycle is exactly 0%. These pods are not lightly loaded — they are overwhelmingly idle, yet they hold their full VRAM allocation around the clock.
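Binning the duty-cycle samples into these four buckets is a one-liner per bucket. The sample values below are synthetic stand-ins for the 157,417 real rows:

```python
import pandas as pd

# Synthetic duty-cycle samples; in the real file each value is one
# (pod, timestamp) observation from pod_gpu_duty_cycle_anon.csv.
duty = pd.Series([0, 0, 0, 0, 0, 0, 0, 2, 3, 10, 30, 45, 60, 80, 0, 0])

bins = {
    "= 0% (fully idle)": (duty == 0).mean(),
    "1-5% (near-idle)":  ((duty > 0) & (duty <= 5)).mean(),
    "5-50% (partial)":   ((duty > 5) & (duty <= 50)).mean(),
    "> 50% (active)":    (duty > 50).mean(),
}
for label, frac in bins.items():
    print(f"{label}: {frac:.1%}")
print("median duty cycle:", duty.median())
```

Because the buckets partition the range, the fractions sum to 100%; on the real data the median lands at exactly 0%.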

Memory doesn't drop when compute goes idle

The critical question is whether pods release GPU memory during idle periods. The data is unambiguous: they do not. Joining the memory and compute files on (pod, 30-second bucket) shows that idle pods hold nearly the same amount of VRAM as actively computing pods.

Average GPU memory per pod — compute-idle vs. compute-active:

  • Compute = 0% (66.8% of time): 25.9 GB reserved while idle (~4.5 GB free)
  • Compute > 50% (2.3% of time): 30.3 GB — base model + active LoRA

Ratio: idle pods hold 85% as much GPU memory as actively-inferring pods. Memory barely changes between idle and active states. Source: pod_gpu_memory_used_bytes_anon.csv ⋈ pod_gpu_duty_cycle_anon.csv

Why the memory doesn't drop: LoRA adapter weights and the base model are loaded into VRAM at pod startup and kept resident. The serving framework treats memory as a static allocation. There is no eviction, no compaction, and no mechanism to temporarily yield VRAM between requests — so 26 GB sits pinned at all times, even during the 67% of pod-lifetime where the GPU computes nothing.
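The idle-vs-active comparison reduces to two conditional means over the joined samples. A minimal sketch with toy rows (column names assumed, values chosen to mirror the fleet averages):

```python
import pandas as pd

# Toy joined (memory, duty-cycle) samples; the real rows come from
# joining the two telemetry files on (container_ip, 30 s bucket).
joined = pd.DataFrame({
    "gpu_memory_gb":  [25.9, 25.8, 26.0, 30.1, 30.5],
    "gpu_duty_cycle": [0.0,  0.0,  0.0,  72.0, 85.0],
})

idle_mem   = joined.loc[joined["gpu_duty_cycle"] == 0, "gpu_memory_gb"].mean()
active_mem = joined.loc[joined["gpu_duty_cycle"] > 50, "gpu_memory_gb"].mean()
print(f"idle: {idle_mem:.1f} GB, active: {active_mem:.1f} GB, "
      f"ratio: {idle_mem / active_mem:.0%}")
```

On the real data the same two conditional means give 25.9 GB vs. 30.3 GB, the 85% ratio quoted above.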

87% of pods are idle more than half their lifetime

The 66.8% fleet-average idle rate is not driven by a few outlier pods. It is the norm. Looking at each pod's individual idle fraction across its full observed lifetime, the distribution is concentrated above 60% for the vast majority of pods.

Per-pod idle fraction (compute = 0%), by percentile:

  • Minimum: 0.0%
  • 25th percentile (p25): 64.4%
  • Median (p50): 65.9%
  • 75th percentile (p75): 68.2%
  • 90th percentile (p90): 92.1%
  • Maximum (p100): 99.8%

  • 125 / 143 pods spend more than half their observed lifetime at exactly 0% GPU compute utilisation — while holding their full GPU memory reservation throughout.

  • 141 / 143 pods spend more than half their lifetime below 5% compute — only two pods in the entire fleet average meaningful GPU activity across their lifetimes.
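The per-pod percentiles and the 125/143 count both fall out of a single groupby over the duty-cycle samples. A toy sketch (synthetic pods and values, assumed column names):

```python
import pandas as pd

# Toy per-sample table: pod id and duty cycle per 30 s bucket,
# standing in for the real duty-cycle telemetry.
df = pd.DataFrame({
    "pod":  ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "duty": [0, 0, 0, 40,   0, 0, 0, 0,   0, 60, 70, 80],
})

# Fraction of each pod's samples at exactly 0% compute
idle_frac = df.groupby("pod")["duty"].apply(lambda s: (s == 0).mean())
print(idle_frac.quantile([0.25, 0.5, 0.75, 0.9]))
print("pods idle > 50% of lifetime:",
      (idle_frac > 0.5).sum(), "/", len(idle_frac))
```

Run against the real file, `idle_frac.quantile(...)` yields the percentile table above and `(idle_frac > 0.5).sum()` yields 125 of 143.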

66.7% of GPU memory-time is wasted across the fleet

Combining memory held per pod with each pod's idle fraction gives a fleet-wide measure of wasted GPU memory: the integral of memory reserved but contributing no inference work, expressed as a share of total memory-time allocated.

Fleet-wide measurement: summing (memory held × time) across all 143 pods, then allocating the fraction corresponding to idle periods: 66.7% of all GPU memory-time in this cluster was wasted — VRAM allocated, pinned, and unavailable to any other workload, while the GPU computed nothing.

This is not a rounding error or a measurement artifact. It reflects a fundamental architectural constraint of static memory allocation in multi-adapter serving: every pod holds every adapter it might ever need, at all times, because there is no mechanism to temporarily return memory to a pool and reclaim it when needed.

  • Total GPU memory-time (GB · 30 s buckets): 3,883,326 (source: pod_gpu_memory_used_bytes_anon)
  • Idle memory-time (compute = 0%, GB · buckets): 2,588,678 (source: joined memory × duty cycle)
  • Fleet memory waste fraction: 66.7% (computed)
  • Mean GPU memory per pod (all time): 24.2 GB (source: pod_gpu_memory_used_bytes_anon)
  • Mean GPU memory per idle pod: 25.9 GB (source: joined, compute = 0% rows)
  • Fleet-wide mean compute utilisation: 7.0% (source: pod_gpu_duty_cycle_anon)
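Since every joined row covers one 30-second bucket, memory-time per row is simply the memory held, and the waste fraction is a ratio of two sums. A toy sketch with synthetic rows and assumed column names:

```python
import pandas as pd

# Toy joined samples: memory held (GB) and duty cycle per (pod, bucket).
joined = pd.DataFrame({
    "gpu_memory_gb": [26, 26, 26, 30],
    "duty":          [0,  0,  0,  80],
})

# Each row is one 30 s bucket, so summing memory over rows integrates
# memory-time; waste is the share attributable to 0%-compute rows.
total_mem_time = joined["gpu_memory_gb"].sum()
idle_mem_time  = joined.loc[joined["duty"] == 0, "gpu_memory_gb"].sum()
waste = idle_mem_time / total_mem_time
print(f"fleet memory-waste fraction: {waste:.1%}")
```

On the real join, the same ratio is 2,588,678 / 3,883,326 = 66.7%.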

874 adapters, pinned forever, on 143 pods

The request trace shows 874 distinct LoRA adapters in use, each identified by a unique model version ID, served across 104 base model variants. Inference itself is fast once memory is loaded — the median request completes in 23 seconds — but requests are sparse. Most pods sit waiting between jobs, holding their full VRAM allocation to avoid the latency of reloading adapters from storage.

Request trace statistics:

  • Total requests in trace: 68,195
  • Unique base models pinned: 104
  • Unique LoRA adapters in use: 874
  • Median inference exec time: 23 s
  • p99 inference exec time: 92 s
  • Requests with 0 LoRA adapters (base model only): 52,993 (77.7%)
  • Requests with 1+ LoRA adapters: 15,202 (22.3%)

The cold-start trap: Each pod pins its model weights to avoid reloading latency. But the data reveals the cost of that choice: a pod that completes a request in 23 seconds then sits idle — holding 26 GB of VRAM — for minutes or hours before the next request arrives. The per-pod idle fractions confirm this is the dominant pattern, not the exception.

What this means for your inference cluster

🧠

Memory, not compute, is the bottleneck

Even at 7% mean compute utilisation, every GPU is fully allocated. The cluster is memory-full, not compute-full. New workloads cannot start because there is no free VRAM — not because GPUs are busy.

💰

66% more capacity, same hardware

If 66.7% of GPU memory-time is idle, reclaiming it via checkpoint-and-restore means the same GPU fleet could host ~2x the number of concurrent pods — without any new hardware spend.

📊

First-party production data

These numbers come from direct analysis of Alibaba's own cluster telemetry, not modelling or surveys. The per-pod granularity makes the waste unambiguous: you can trace each GB to each pod at each second.

Checkpoint the pod, reclaim the memory

The root problem is that GPU memory allocation is binary: a pod either holds all its VRAM or it doesn't exist. There is no middle state where a pod is suspended, its memory freed, and it can be seamlessly resumed when the next request arrives. Affinode introduces exactly that state.

1. Detect idle pods from duty-cycle telemetry. Monitor GPU compute utilisation per pod. When a pod falls below an activity threshold for a configurable window — matching the 0% compute idle state observed in 66.8% of samples — it becomes a candidate for suspension.

2. Checkpoint GPU state via the CUDA checkpoint API. Affinode serialises the pod's complete GPU memory image — base model weights, LoRA adapter tensors, KV caches — to host DRAM or NVMe. The 26 GB VRAM allocation is released back to the pool immediately.

3. Grant freed VRAM to queued workloads. Other pods, new adapter deployments, or batch jobs can now start in the reclaimed memory, with no cold-start overhead for the new consumer.

4. Restore on the next inference request. When a new request arrives for the suspended pod, Affinode restores the GPU checkpoint. The pod resumes with all model weights already in place — no re-download from storage, no model reload, identical to the pre-suspend state.
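The idle-detection side of this flow (steps 1 and 2) can be sketched as a small control loop. This is illustrative only: `read_duty_cycle` and `checkpoint_pod` are hypothetical hooks, not part of any published Affinode API, and the threshold and window values are placeholders:

```python
IDLE_THRESHOLD_PCT = 1.0   # duty cycle below this counts as idle (placeholder)
IDLE_WINDOW_S      = 300   # sustained idle time before suspending (placeholder)

def control_step(pod, now, state, read_duty_cycle, checkpoint_pod):
    """One tick of a per-pod suspension controller.

    read_duty_cycle and checkpoint_pod are hypothetical hooks injected
    by the caller; state tracks when the current idle period began.
    """
    duty = read_duty_cycle(pod)
    if duty >= IDLE_THRESHOLD_PCT:
        state["idle_since"] = None            # pod is active: reset the clock
    elif state["idle_since"] is None:
        state["idle_since"] = now             # idle period begins
    elif now - state["idle_since"] >= IDLE_WINDOW_S and not state["suspended"]:
        checkpoint_pod(pod)                   # serialise VRAM, free the GPU
        state["suspended"] = True
    return state

# Example: a pod that stays at 0% duty for 300 s gets checkpointed once.
state = {"idle_since": None, "suspended": False}
checkpointed = []
for t in [0, 100, 200, 300]:
    control_step("pod-a", t, state, lambda p: 0.0, checkpointed.append)
print(checkpointed)  # ["pod-a"]
```

Restore (step 4) would be the mirror image: an inbound request for a suspended pod triggers the checkpoint load before the request is dispatched.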

Applied to this cluster: suspending pods during their median 65.9% idle fraction would free approximately 17 GB per pod · 143 pods = 2.4 TB of GPU memory that is currently pinned but unused — available for new workloads, without adding a single GPU or changing a line of serving code.

Run the analysis yourself

All data is publicly available directly from the Alibaba GitHub repository. The two core files download in under 2 MB total. The analysis script runs in under 60 seconds on a laptop.

# Download files (stored directly in GitHub, no LFS)
curl -L -o mem.tar.gz \
    https://raw.githubusercontent.com/alibaba/clusterdata/master/cluster-trace-v2026-GenAI/pod_gpu_memory_used_bytes_anon.tar.gz
curl -L -o duty.tar.gz \
    https://raw.githubusercontent.com/alibaba/clusterdata/master/cluster-trace-v2026-GenAI/pod_gpu_duty_cycle_anon.tar.gz

# Extract and run the analysis
tar xzf mem.tar.gz && tar xzf duty.tar.gz
pip install pandas numpy
python3 analyze.py

Dataset: alibaba/clusterdata · GenTD26 · Analysis script and methodology available on request at hello@affinode.io.