Case Study · Alibaba GenAI Cluster Dataset 2026
Direct analysis of Alibaba's production Stable Diffusion serving cluster — 143 pods, 161,000+ telemetry samples — shows that inference pods spend the vast majority of their lifetime at zero compute utilisation while keeping 26 GB of GPU memory exclusively reserved. Memory stays pinned whether or not any request is being served.
Background
The Alibaba GenTD26 dataset is a production trace from Alibaba's serverless Stable Diffusion image-generation platform. Unlike aggregate cluster statistics, it captures GPU metrics at the per-pod (per-process) level — every individual inference container is measured separately over time.
The cluster serves 874 unique LoRA adapters across 104 base model versions simultaneously. Each pod loads a base model plus one or more adapters into GPU memory and keeps them pinned to avoid cold-start latency on subsequent requests. The result: VRAM is permanently reserved at full capacity by every live pod, regardless of whether inference is actively running.
The dataset comprises three files, keyed on (container_ip, 30 s timestamp bucket):

- pod_gpu_memory_used_bytes_anon.csv — GPU memory (bytes) per pod over time
- pod_gpu_duty_cycle_anon.csv — GPU compute utilisation (%) per pod
- data_trace_processed.csv — per-request execution trace with LoRA config

Finding 1
Across all 157,417 (pod, timestamp) samples in the GPU duty-cycle file, the most common state by far is a pod running at exactly zero compute utilisation. The GPU is sitting completely idle — no kernels executing, no tensors being processed — while the pod continues to hold its full memory reservation.
n = 157,417 samples · 143 pods · source: pod_gpu_duty_cycle_anon.csv
Only 2.3% of all observed (pod, time) samples show a GPU running above 50% compute utilisation. The median GPU duty cycle is exactly 0%. These pods are not lightly loaded — they are overwhelmingly idle, yet they hold their full VRAM allocation around the clock.
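The idle, busy, and median statistics above can be sketched with a few lines of stdlib Python. The synthetic sample values below are illustrative only (they are not drawn from the trace), and the exact column layout of pod_gpu_duty_cycle_anon.csv is an assumption; only the per-sample duty-cycle percentages are needed:

```python
from statistics import median

def duty_cycle_stats(samples):
    """Summarise per-(pod, timestamp) GPU duty-cycle samples (percent).

    Returns (idle_fraction, busy_fraction, median_duty), where 'idle'
    means exactly 0% compute and 'busy' means above 50%.
    """
    n = len(samples)
    idle = sum(1 for s in samples if s == 0) / n
    busy = sum(1 for s in samples if s > 50) / n
    return idle, busy, median(samples)

# Tiny synthetic illustration -- not the real trace.
samples = [0, 0, 0, 0, 0, 0, 0, 10, 60, 80]
idle, busy, med = duty_cycle_stats(samples)
# With mostly-zero samples the median lands at 0, mirroring the
# fleet-wide result reported above.
```

On the real file, the same three numbers reproduce the 2.3% busy share and the 0% median duty cycle quoted in this finding.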
Finding 2
The critical question is whether pods release GPU memory during idle periods. The data is unambiguous: they do not. Joining the memory and compute files on (pod, 30-second bucket) shows that idle pods hold nearly the same amount of VRAM as actively computing pods.
Ratio: idle pods hold 85% as much GPU memory as actively-inferring pods. Memory barely changes between idle and active states. Source: pod_gpu_memory_used_bytes_anon.csv ⋈ pod_gpu_duty_cycle_anon.csv
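The join behind that ratio is a lookup on the shared (pod, 30 s bucket) key. A minimal sketch, using dicts keyed the same way and synthetic values chosen only for illustration (the real analysis reads the two CSVs):

```python
def idle_vs_active_memory(duty, mem):
    """Join duty-cycle and memory samples on their shared
    (pod, bucket) keys and compare mean VRAM (GB) held while
    idle (duty == 0) against active (duty > 0) buckets."""
    idle_gb, active_gb = [], []
    for key, d in duty.items():
        if key not in mem:
            continue  # drop buckets missing from either file
        (idle_gb if d == 0 else active_gb).append(mem[key])
    mean = lambda xs: sum(xs) / len(xs)
    return mean(idle_gb), mean(active_gb)

# Synthetic join: two pods, three 30 s buckets each.
duty = {("pod-a", 0): 0, ("pod-a", 1): 0, ("pod-a", 2): 75,
        ("pod-b", 0): 0, ("pod-b", 1): 40, ("pod-b", 2): 0}
mem  = {("pod-a", 0): 25.9, ("pod-a", 1): 25.9, ("pod-a", 2): 30.0,
        ("pod-b", 0): 25.9, ("pod-b", 1): 31.0, ("pod-b", 2): 25.9}
idle_mean, active_mean = idle_vs_active_memory(duty, mem)
ratio = idle_mean / active_mean  # ~0.85 in this toy example
```

The reported 85% figure is exactly this ratio computed over all joined samples in the two files.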
Finding 3
The 66.8% fleet-average idle rate is not driven by a few outlier pods. It is the norm. Looking at each pod's individual idle fraction across its full observed lifetime, the distribution is concentrated above 60% for the vast majority of pods.
| Percentile | Per-pod idle fraction (compute = 0%) |
|---|---|
| Minimum | 0% |
| 25th percentile | p25 |
| Median pod | p50 |
| 75th percentile | p75 |
| 90th percentile | p90 |
| Maximum | p100 |
The vast majority of pods spend more than half their observed lifetime at exactly 0% GPU compute utilisation, while holding their full GPU memory reservation throughout.
Nearly all pods spend more than half their lifetime below 5% compute; only two pods in the entire fleet average meaningful GPU activity across their lifetimes.
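The per-pod distribution above comes from grouping the duty-cycle samples by pod and taking each pod's share of exactly-zero buckets. A stdlib sketch on synthetic (pod, duty) pairs; the real input is the same file as Finding 1:

```python
from collections import defaultdict

def per_pod_idle_fraction(samples):
    """samples: iterable of (pod_id, duty_cycle) pairs.

    Returns {pod: fraction of that pod's samples at exactly 0% compute},
    i.e. the per-pod idle fraction tabulated above."""
    total = defaultdict(int)
    zero = defaultdict(int)
    for pod, duty in samples:
        total[pod] += 1
        if duty == 0:
            zero[pod] += 1
    return {pod: zero[pod] / total[pod] for pod in total}

# Synthetic illustration: pod "a" idle 2 of 3 buckets, "b" 1 of 2.
samples = [("a", 0), ("a", 0), ("a", 30), ("b", 0), ("b", 90)]
fractions = per_pod_idle_fraction(samples)
```

Sorting the resulting fractions and reading off quantiles yields the percentile table directly.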
Finding 4
Combining memory held per pod with each pod's idle fraction gives a fleet-wide measure of wasted GPU memory: the integral of memory reserved but contributing no inference work, expressed as a share of total memory-time allocated.
Fleet-wide measurement: summing (memory held × time) across all 143 pods, then allocating the fraction corresponding to idle periods: 66.7% of all GPU memory-time in this cluster was wasted — VRAM allocated, pinned, and unavailable to any other workload, while the GPU computed nothing.
This is not a rounding error or a measurement artifact. It reflects a fundamental architectural constraint of static memory allocation in multi-adapter serving: every pod holds every adapter it might ever need, at all times, because there is no mechanism to temporarily return memory to a pool and reclaim it when needed.
| Metric | Value | Source |
|---|---|---|
| Total GPU memory-time (GB · 30s buckets) | 3,883,326 | pod_gpu_memory_used_bytes_anon |
| Idle memory-time (compute = 0%, GB · buckets) | 2,588,678 | joined memory × duty cycle |
| Fleet memory waste fraction | 66.7% | computed |
| Mean GPU memory per pod (all time) | 24.2 GB | pod_gpu_memory_used_bytes_anon |
| Mean GPU memory per idle pod | 25.9 GB | joined (compute = 0% rows) |
| Fleet-wide mean compute utilisation | 7.0% | pod_gpu_duty_cycle_anon |
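The waste fraction in the table is memory-time at 0% duty divided by total memory-time, where each joined (pod, bucket) row contributes its held GB for one 30 s bucket. A minimal sketch with synthetic values (the real computation runs over the joined CSVs):

```python
def memory_time_waste(mem_gb, duty):
    """mem_gb, duty: dicts keyed by (pod, 30 s bucket).

    Each bucket contributes mem_gb[key] GB-buckets of memory-time;
    buckets at exactly 0% duty count as waste. Returns
    (total_memory_time, idle_memory_time, waste_fraction)."""
    total = sum(mem_gb.values())
    idle = sum(m for k, m in mem_gb.items() if duty.get(k, 0) == 0)
    return total, idle, idle / total

# Synthetic pod: 24 GB held for 3 buckets, idle in 2 of them.
mem_gb = {("a", 0): 24, ("a", 1): 24, ("a", 2): 24}
duty   = {("a", 0): 0,  ("a", 1): 0,  ("a", 2): 60}
total, idle, frac = memory_time_waste(mem_gb, duty)
# frac is 2/3 here; on the real join it is 2,588,678 / 3,883,326 = 66.7%.
```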
Root Cause
The request trace shows 874 distinct LoRA adapters in use, each identified by a unique model version ID, served across 104 base model variants. Inference itself is fast once memory is loaded — the median request completes in 23 seconds — but requests are sparse. Most pods sit waiting between jobs, holding their full VRAM allocation to avoid the latency of reloading adapters from storage.
| Request trace statistic | Value |
|---|---|
| Total requests in trace | 68,195 |
| Unique base models pinned | 104 |
| Unique LoRA adapters in use | 874 |
| Median inference exec time | 23 s |
| p99 inference exec time | 92 s |
| Requests with 0 LoRA adapters (base model only) | 52,993 (77.7%) |
| Requests with 1+ LoRA adapters | 15,202 (22.3%) |
Implications
Even at 7% mean compute utilisation, every GPU is fully allocated. The cluster is memory-full, not compute-full. New workloads cannot start because there is no free VRAM — not because GPUs are busy.
If 66.7% of GPU memory-time is idle, reclaiming it via checkpoint-and-restore means the same GPU fleet could host ~2x the number of concurrent pods — without any new hardware spend.
These numbers come from direct analysis of Alibaba's own cluster telemetry, not modelling or surveys. The per-pod granularity makes the waste unambiguous: you can trace each GB to each pod at each second.
The Affinode Approach
The root problem is that GPU memory allocation is binary: a pod either holds all its VRAM or it doesn't exist. There is no middle state where a pod is suspended, its memory freed, and it can be seamlessly resumed when the next request arrives. Affinode introduces exactly that state.
1. **Detect.** Monitor GPU compute utilisation per pod. When a pod falls below an activity threshold for a configurable window (matching the 0% compute idle state observed in 66.8% of samples), it becomes a candidate for suspension.
2. **Checkpoint.** Affinode serialises the pod's complete GPU memory image (base model weights, LoRA adapter tensors, KV caches) to host DRAM or NVMe. The 26 GB VRAM allocation is released back to the pool immediately.
3. **Reclaim.** Other pods, new adapter deployments, or batch jobs can now start in the reclaimed memory, with no cold-start overhead for the new consumer.
4. **Restore.** When a new request arrives for the suspended pod, Affinode restores the GPU checkpoint. The pod resumes with all model weights already in place: no re-download from storage, no model reload, identical to the pre-suspend state.
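The detection step can be sketched as a per-pod counter over 30 s duty-cycle buckets. The threshold, window length, and trigger interface below are illustrative assumptions for this sketch, not Affinode's actual API:

```python
# Hypothetical idle-detection policy; values are illustrative.
IDLE_THRESHOLD = 0.0   # duty cycle (%) at or below which a bucket is idle
IDLE_WINDOW = 10       # consecutive 30 s idle buckets before suspending

class IdleWatcher:
    """Tracks consecutive idle buckets per pod and signals when a pod
    should be checkpointed to host DRAM/NVMe and its VRAM released."""

    def __init__(self):
        self.idle_buckets = {}

    def observe(self, pod, duty):
        """Feed one (pod, duty-cycle) sample.

        Returns True once the pod has been idle for IDLE_WINDOW
        consecutive buckets; any activity resets the counter."""
        if duty <= IDLE_THRESHOLD:
            self.idle_buckets[pod] = self.idle_buckets.get(pod, 0) + 1
        else:
            self.idle_buckets[pod] = 0
        return self.idle_buckets[pod] >= IDLE_WINDOW
```

A consecutive-bucket window rather than a moving average avoids suspending a pod that is merely between back-to-back requests (median execution time is 23 s, under one window).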
Applied to this cluster: suspending pods during their median 65.9% idle fraction would free approximately 17 GB per pod · 143 pods = 2.4 TB of GPU memory that is currently pinned but unused — available for new workloads, without adding a single GPU or changing a line of serving code.
Reproducibility
All data is publicly available directly from the Alibaba GitHub repository. The two core files download in under 2 MB total. The analysis script runs in under 60 seconds on a laptop.
Dataset: alibaba/clusterdata · GenTD26 · Analysis script and methodology available on request from hello@affinode.io.