
Case Study · Alibaba GenAI Cluster Dataset 2026

GPUs idle at 0% compute
67% of the time
while holding 26 GB of memory

Direct analysis of Alibaba's production Stable Diffusion serving cluster — 143 pods, 161,000+ telemetry samples — shows that inference pods spend the vast majority of their lifetime at zero compute utilisation while keeping 26 GB of GPU memory exclusively reserved. Memory stays pinned whether or not any request is being served.

  • Dataset: Alibaba GenTD26 (SoCC'25)
  • Pods analysed: 143
  • Telemetry rows: 161,413 (memory) · 157,417 (compute)
  • Adapters in fleet: 874 LoRA · 104 base models

  • 66.8% of pod-time at exactly 0% compute
  • 25.9 GB average GPU memory held per idle pod
  • 7.0% mean compute utilisation fleet-wide
  • 87% of pods idle the majority of their lifetime

The dataset

The Alibaba GenTD26 dataset is a production trace from Alibaba's serverless Stable Diffusion image-generation platform. Unlike aggregate cluster statistics, it captures GPU metrics at the per-pod (per-process) level — every individual inference container is measured separately over time.

The cluster serves 874 unique LoRA adapters across 104 base model versions simultaneously. Each pod loads a base model plus one or more adapters into GPU memory and keeps them pinned to avoid cold-start latency on subsequent requests. The result: VRAM is permanently reserved at full capacity by every live pod, regardless of whether inference is actively running.

Telemetry files used

  • pod_gpu_memory_used_bytes_anon.csv — GPU memory (bytes) per pod over time
  • pod_gpu_duty_cycle_anon.csv — GPU compute utilisation (%) per pod
  • data_trace_processed.csv — per-request execution trace with LoRA config
  • Joined on (container_ip, 30 s timestamp bucket)
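The join described above can be sketched in a few lines of pandas. The tiny frames and column names below are illustrative stand-ins for the real files' schema, not the actual headers:

```python
import pandas as pd

# Synthetic stand-ins for the two telemetry files (column names assumed).
mem = pd.DataFrame({
    "container_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "timestamp":    [0, 31, 0],                    # seconds since trace start
    "gpu_memory_used_bytes": [26e9, 26e9, 30e9],
})
duty = pd.DataFrame({
    "container_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "timestamp":    [2, 33, 1],
    "gpu_duty_cycle": [0.0, 0.0, 72.0],
})

# Bucket timestamps into 30 s windows, then join memory and compute
# per (pod, bucket) as the methodology describes.
for df in (mem, duty):
    df["bucket"] = df["timestamp"] // 30

joined = mem.merge(duty, on=["container_ip", "bucket"], how="inner")
print(joined[["container_ip", "bucket", "gpu_memory_used_bytes", "gpu_duty_cycle"]])
```

With the real files, each resulting row pairs one pod's memory reservation with its compute utilisation in the same 30-second window.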

Fleet at a glance

  • 143 inference pods with concurrent memory + compute telemetry
  • 874 unique LoRA adapters + 104 base models pinned in VRAM
  • 68,195 requests in the trace; median inference: 23 s
  • Mean GPU compute utilisation across all pods: 7.0%

Two-thirds of pod-time is at 0% GPU compute

Across all 157,417 (pod, timestamp) samples in the GPU duty-cycle file, the most common state by far is a pod running at exactly zero compute utilisation. The GPU is sitting completely idle — no kernels executing, no tensors being processed — while the pod continues to hold its full memory reservation.

GPU compute utilisation distribution — all (pod × timestamp) samples:

  • = 0% (fully idle): 66.8%
  • 1–5% (near-idle): 5.5%
  • 5–50% (partial use): 25.4%
  • > 50% (active): 2.3%

n = 157,417 samples · 143 pods · source: pod_gpu_duty_cycle_anon.csv

Only 2.3% of all observed (pod, time) samples show a GPU running above 50% compute utilisation. The median GPU duty cycle is exactly 0%. These pods are not lightly loaded — they are overwhelmingly idle, yet they hold their full VRAM allocation around the clock.
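Binning the duty-cycle samples into these four buckets is a one-liner per bucket. The sample values below are synthetic stand-ins for the 157,417 real rows:

```python
import pandas as pd

# Synthetic duty-cycle samples; in the real file each value is one
# (pod, timestamp) observation from pod_gpu_duty_cycle_anon.csv.
duty = pd.Series([0, 0, 0, 0, 0, 0, 0, 2, 3, 10, 30, 45, 60, 80, 0, 0])

bins = {
    "= 0% (fully idle)": (duty == 0).mean(),
    "1-5% (near-idle)":  ((duty > 0) & (duty <= 5)).mean(),
    "5-50% (partial)":   ((duty > 5) & (duty <= 50)).mean(),
    "> 50% (active)":    (duty > 50).mean(),
}
for label, frac in bins.items():
    print(f"{label}: {frac:.1%}")
print("median duty cycle:", duty.median())
```

Because the buckets partition the range, the fractions sum to 100%; on the real data the median lands at exactly 0%.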

Memory doesn't drop when compute goes idle

The critical question is whether pods release GPU memory during idle periods. The data is unambiguous: they do not. Joining the memory and compute files on (pod, 30-second bucket) shows that idle pods hold nearly the same amount of VRAM as actively computing pods.

Average GPU memory per pod — compute-idle vs. compute-active:

  • Compute = 0% (66.8% of time): 25.9 GB reserved while idle (~4.5 GB free)
  • Compute > 50% (2.3% of time): 30.3 GB — base model + active LoRA

Ratio: idle pods hold 85% as much GPU memory as actively-inferring pods. Memory barely changes between idle and active states. Source: pod_gpu_memory_used_bytes_anon.csv ⋈ pod_gpu_duty_cycle_anon.csv

Why the memory doesn't drop: LoRA adapter weights and the base model are loaded into VRAM at pod startup and kept resident. The serving framework treats memory as a static allocation. There is no eviction, no compaction, and no mechanism to temporarily yield VRAM between requests — so 26 GB sits pinned at all times, even during the 67% of pod-lifetime where the GPU computes nothing.
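The idle-vs-active comparison reduces to two conditional means over the joined samples. A minimal sketch with toy rows (column names assumed, values chosen to mirror the fleet averages):

```python
import pandas as pd

# Toy joined (memory, duty-cycle) samples; the real rows come from
# joining the two telemetry files on (container_ip, 30 s bucket).
joined = pd.DataFrame({
    "gpu_memory_gb":  [25.9, 25.8, 26.0, 30.1, 30.5],
    "gpu_duty_cycle": [0.0,  0.0,  0.0,  72.0, 85.0],
})

idle_mem   = joined.loc[joined["gpu_duty_cycle"] == 0, "gpu_memory_gb"].mean()
active_mem = joined.loc[joined["gpu_duty_cycle"] > 50, "gpu_memory_gb"].mean()
print(f"idle: {idle_mem:.1f} GB, active: {active_mem:.1f} GB, "
      f"ratio: {idle_mem / active_mem:.0%}")
```

On the real data the same two conditional means give 25.9 GB vs. 30.3 GB, the 85% ratio quoted above.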

87% of pods are idle more than half their lifetime

The 66.8% fleet-average idle rate is not driven by a few outlier pods. It is the norm. Looking at each pod's individual idle fraction across its full observed lifetime, the distribution is concentrated above 60% for the vast majority of pods.

Per-pod idle fraction (compute = 0%), by percentile:

  • Minimum: 0.0%
  • 25th percentile (p25): 64.4%
  • Median (p50): 65.9%
  • 75th percentile (p75): 68.2%
  • 90th percentile (p90): 92.1%
  • Maximum (p100): 99.8%

  • 125 / 143 pods spend more than half their observed lifetime at exactly 0% GPU compute utilisation — while holding their full GPU memory reservation throughout.

  • 141 / 143 pods spend more than half their lifetime below 5% compute — only two pods in the entire fleet average meaningful GPU activity across their lifetimes.
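The per-pod percentiles and the 125/143 count both fall out of a single groupby over the duty-cycle samples. A toy sketch (synthetic pods and values, assumed column names):

```python
import pandas as pd

# Toy per-sample table: pod id and duty cycle per 30 s bucket,
# standing in for the real duty-cycle telemetry.
df = pd.DataFrame({
    "pod":  ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "duty": [0, 0, 0, 40,   0, 0, 0, 0,   0, 60, 70, 80],
})

# Fraction of each pod's samples at exactly 0% compute
idle_frac = df.groupby("pod")["duty"].apply(lambda s: (s == 0).mean())
print(idle_frac.quantile([0.25, 0.5, 0.75, 0.9]))
print("pods idle > 50% of lifetime:",
      (idle_frac > 0.5).sum(), "/", len(idle_frac))
```

Run against the real file, `idle_frac.quantile(...)` yields the percentile table above and `(idle_frac > 0.5).sum()` yields 125 of 143.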

66.7% of GPU memory-time is wasted across the fleet

Combining memory held per pod with each pod's idle fraction gives a fleet-wide measure of wasted GPU memory: the integral of memory reserved but contributing no inference work, expressed as a share of total memory-time allocated.

Fleet-wide measurement: summing (memory held × time) across all 143 pods, then allocating the fraction corresponding to idle periods: 66.7% of all GPU memory-time in this cluster was wasted — VRAM allocated, pinned, and unavailable to any other workload, while the GPU computed nothing.

This is not a rounding error or a measurement artifact. It reflects a fundamental architectural constraint of static memory allocation in multi-adapter serving: every pod holds every adapter it might ever need, at all times, because there is no mechanism to temporarily return memory to a pool and reclaim it when needed.

  • Total GPU memory-time (GB · 30 s buckets): 3,883,326 (source: pod_gpu_memory_used_bytes_anon)
  • Idle memory-time (compute = 0%, GB · buckets): 2,588,678 (source: joined memory × duty cycle)
  • Fleet memory waste fraction: 66.7% (computed)
  • Mean GPU memory per pod (all time): 24.2 GB (source: pod_gpu_memory_used_bytes_anon)
  • Mean GPU memory per idle pod: 25.9 GB (source: joined, compute = 0% rows)
  • Fleet-wide mean compute utilisation: 7.0% (source: pod_gpu_duty_cycle_anon)
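Since every joined row covers one 30-second bucket, memory-time per row is simply the memory held, and the waste fraction is a ratio of two sums. A toy sketch with synthetic rows and assumed column names:

```python
import pandas as pd

# Toy joined samples: memory held (GB) and duty cycle per (pod, bucket).
joined = pd.DataFrame({
    "gpu_memory_gb": [26, 26, 26, 30],
    "duty":          [0,  0,  0,  80],
})

# Each row is one 30 s bucket, so summing memory over rows integrates
# memory-time; waste is the share attributable to 0%-compute rows.
total_mem_time = joined["gpu_memory_gb"].sum()
idle_mem_time  = joined.loc[joined["duty"] == 0, "gpu_memory_gb"].sum()
waste = idle_mem_time / total_mem_time
print(f"fleet memory-waste fraction: {waste:.1%}")
```

On the real join, the same ratio is 2,588,678 / 3,883,326 = 66.7%.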

874 adapters, pinned forever, on 143 pods

The request trace shows 874 distinct LoRA adapters in use, each identified by a unique model version ID, served across 104 base model variants. Inference itself is fast once memory is loaded — the median request completes in 23 seconds — but requests are sparse. Most pods sit waiting between jobs, holding their full VRAM allocation to avoid the latency of reloading adapters from storage.

Request trace statistics:

  • Total requests in trace: 68,195
  • Unique base models pinned: 104
  • Unique LoRA adapters in use: 874
  • Median inference exec time: 23 s
  • p99 inference exec time: 92 s
  • Requests with 0 LoRA adapters (base model only): 52,993 (77.7%)
  • Requests with 1+ LoRA adapters: 15,202 (22.3%)

The cold-start trap: Each pod pins its model weights to avoid reloading latency. But the data reveals the cost of that choice: a pod that completes a request in 23 seconds then sits idle — holding 26 GB of VRAM — for minutes or hours before the next request arrives. The per-pod idle fractions confirm this is the dominant pattern, not the exception.

What this means for your inference cluster

🧠

Memory, not compute, is the bottleneck

Even at 7% mean compute utilisation, every GPU is fully allocated. The cluster is memory-full, not compute-full. New workloads cannot start because there is no free VRAM — not because GPUs are busy.

💰

66% more capacity, same hardware

If 66.7% of GPU memory-time is idle, reclaiming it via checkpoint-and-restore means the same GPU fleet could host ~2x the number of concurrent pods — without any new hardware spend.

📊

First-party production data

These numbers come from direct analysis of Alibaba's own cluster telemetry, not modelling or surveys. The per-pod granularity makes the waste unambiguous: you can trace each GB to each pod at each second.

Checkpoint the pod, reclaim the memory

The root problem is that GPU memory allocation is binary: a pod either holds all its VRAM or it doesn't exist. There is no middle state where a pod is suspended, its memory freed, and it can be seamlessly resumed when the next request arrives. Affinode introduces exactly that state.

1. Detect idle pods from duty-cycle telemetry. Monitor GPU compute utilisation per pod. When a pod falls below an activity threshold for a configurable window — matching the 0% compute idle state observed in 66.8% of samples — it becomes a candidate for suspension.

2. Checkpoint GPU state via the CUDA checkpoint API. Affinode serialises the pod's complete GPU memory image — base model weights, LoRA adapter tensors, KV caches — to host DRAM or NVMe. The 26 GB VRAM allocation is released back to the pool immediately.

3. Grant freed VRAM to queued workloads. Other pods, new adapter deployments, or batch jobs can now start in the reclaimed memory, with no cold-start overhead for the new consumer.

4. Restore on the next inference request. When a new request arrives for the suspended pod, Affinode restores the GPU checkpoint. The pod resumes with all model weights already in place — no re-download from storage, no model reload, identical to the pre-suspend state.
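The idle-detection side of this flow (steps 1 and 2) can be sketched as a small control loop. This is illustrative only: `read_duty_cycle` and `checkpoint_pod` are hypothetical hooks, not part of any published Affinode API, and the threshold and window values are placeholders:

```python
IDLE_THRESHOLD_PCT = 1.0   # duty cycle below this counts as idle (placeholder)
IDLE_WINDOW_S      = 300   # sustained idle time before suspending (placeholder)

def control_step(pod, now, state, read_duty_cycle, checkpoint_pod):
    """One tick of a per-pod suspension controller.

    read_duty_cycle and checkpoint_pod are hypothetical hooks injected
    by the caller; state tracks when the current idle period began.
    """
    duty = read_duty_cycle(pod)
    if duty >= IDLE_THRESHOLD_PCT:
        state["idle_since"] = None            # pod is active: reset the clock
    elif state["idle_since"] is None:
        state["idle_since"] = now             # idle period begins
    elif now - state["idle_since"] >= IDLE_WINDOW_S and not state["suspended"]:
        checkpoint_pod(pod)                   # serialise VRAM, free the GPU
        state["suspended"] = True
    return state

# Example: a pod that stays at 0% duty for 300 s gets checkpointed once.
state = {"idle_since": None, "suspended": False}
checkpointed = []
for t in [0, 100, 200, 300]:
    control_step("pod-a", t, state, lambda p: 0.0, checkpointed.append)
print(checkpointed)  # ["pod-a"]
```

Restore (step 4) would be the mirror image: an inbound request for a suspended pod triggers the checkpoint load before the request is dispatched.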

Applied to this cluster: suspending pods during their median 65.9% idle fraction would free approximately 17 GB per pod · 143 pods = 2.4 TB of GPU memory that is currently pinned but unused — available for new workloads, without adding a single GPU or changing a line of serving code.

Run the analysis yourself

All data is publicly available directly from the Alibaba GitHub repository. The two core files download in under 2 MB total. The analysis script runs in under 60 seconds on a laptop.

# Download files (stored directly in GitHub, no LFS)
curl -L -o mem.tar.gz \
    https://raw.githubusercontent.com/alibaba/clusterdata/master/cluster-trace-v2026-GenAI/pod_gpu_memory_used_bytes_anon.tar.gz
curl -L -o duty.tar.gz \
    https://raw.githubusercontent.com/alibaba/clusterdata/master/cluster-trace-v2026-GenAI/pod_gpu_duty_cycle_anon.tar.gz

# Extract and run the analysis
tar xzf mem.tar.gz && tar xzf duty.tar.gz
pip install pandas numpy
python3 analyze.py

Dataset: alibaba/clusterdata · GenTD26 · Analysis script and methodology available on request at hello@affinode.io.