Case Study · Alibaba PAI GPU Cluster Trace v2020 · NSDI'22
Direct analysis of Alibaba's production PAI cluster — 6,742 GPUs, 3,033,232 instance records over two months — shows the median GPU instance averages just 1.5% compute utilisation across its lifetime, while 54% of GPU memory-time is consumed by workloads that barely use the compute they hold.
Background
The Alibaba PAI GPU v2020 trace was published alongside the USENIX NSDI'22 paper "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters". It captures two months of production workloads from Alibaba's internal ML platform — a mix of training and inference jobs across a heterogeneous cluster of V100, T4, P100, and older-generation GPUs.
The key table, pai_sensor_table, records per-instance, lifetime-averaged GPU compute utilisation (gpu_wrk_util, %) and average GPU memory used (avg_gpu_wrk_mem, GB) for every instance on every machine. With 3,033,232 records, it is one of the largest public GPU workload datasets available.
gpu_wrk_util: GPU compute utilisation (%) averaged over instance lifetime
avg_gpu_wrk_mem: average GPU memory used (GB)
max_gpu_wrk_mem: peak GPU memory used (GB)
pai_task_table: for plan_gpu (requested %) and duration
Finding 1
Across all 3,033,232 instance records, GPU compute utilisation is chronically low. The mean is 10.5% and the median is just 1.5%, meaning half of all instances average at or below 1.5% GPU utilisation over their lifetime. About a third (34.5%) average exactly zero.
n = 3,033,232 instances · source: pai_sensor_table (gpu_wrk_util column) · mean = 10.5% · median = 1.5%
Only 5% of instances average ≥ 50% GPU compute utilisation. The other 95% average below that threshold, most of them at single-digit utilisation, while holding GPU memory and their reserved share of the device for the duration of their reservation.
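The headline statistics above are simple summaries of the gpu_wrk_util column. A minimal sketch of how they could be computed, using only the Python standard library; the function name and the toy values are illustrative assumptions, not taken from the trace or the original analysis script:

```python
from statistics import mean, median

def utilisation_summary(utils):
    """Summarise lifetime-averaged GPU utilisation values (percent)."""
    n = len(utils)
    return {
        "mean": mean(utils),
        "median": median(utils),
        # Share of instances whose lifetime average is exactly zero.
        "share_zero": sum(1 for u in utils if u == 0) / n,
        # Share of instances averaging at least 50% utilisation.
        "share_ge_50": sum(1 for u in utils if u >= 50) / n,
    }

# Toy values only; not drawn from the trace.
sample = [0.0, 0.0, 1.0, 2.0, 60.0]
summary = utilisation_summary(sample)
print(summary)
```

A mean far above the median, as in this toy sample, is what a distribution dominated by near-idle instances with a few busy ones looks like.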
Finding 2
For every utilisation bucket, GPU memory is allocated and held throughout the instance lifetime — even for instances that average less than 5% compute. As utilisation rises, the amount of memory used also rises, but the correlation is not linear: a 10–30% utilisation instance holds two-thirds as much memory as an actively-computing one.
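The "Wasted %" figure used in this finding can be written out explicitly. In the notation below (introduced here for illustration), m_i is an instance's avg_gpu_wrk_mem and u_i its gpu_wrk_util. For a single instance with m_i > 0 the ratio reduces to the idle compute fraction; the per-bucket form W_B is an assumption about how the bucket value is aggregated (a memory-weighted mean), since the source does not spell it out.

```latex
\[
w_i \;=\; \frac{m_i \,\bigl(1 - u_i/100\bigr)}{m_i} \;=\; 1 - \frac{u_i}{100},
\qquad
W_B \;=\; \frac{\sum_{i \in B} m_i \,\bigl(1 - u_i/100\bigr)}{\sum_{i \in B} m_i}
\]
```

For example, an instance averaging 20% compute utilisation has w_i = 0.8, whatever amount of memory it holds.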
"Wasted %" = avg_gpu_wrk_mem × (1 − gpu_wrk_util/100) ÷ avg_gpu_wrk_mem · source: pai_sensor_table
Finding 3
Combining each instance's memory allocation with its runtime duration and compute utilisation gives a duration-weighted waste estimate: the fraction of total GPU memory × time that was allocated to instances making little or no use of the compute resource.
Fleet-wide, summing avg_gpu_wrk_mem × duration across the 2,009,243 instances with duration data, and weighting each term by its idle compute fraction (1 − gpu_wrk_util/100), gives 54.4% of all GPU memory-time wasted: memory allocated, held, and unavailable to other workloads, on a fleet whose mean compute utilisation is 10.5%.
| Utilisation bucket | Instances | % of total | Memory waste % | Share of fleet waste |
|---|---|---|---|---|
| = 0% | 394,903 | 19.7% | 100.0% | |
| 1–5% | 687,521 | 34.2% | 97.4% | |
| 5–10% | 288,338 | 14.3% | 92.8% | |
| 10–30% | 374,720 | 18.6% | 80.7% | |
| 30–50% | 147,513 | 7.3% | 61.0% | |
| 50–80% | 89,656 | 4.5% | 36.0% | |
| > 80% | 21,923 | 1.1% | 9.7% | |
The dominant waste contributors are the 1–5% and 10–30% utilisation tiers, which together account for 62.1% of all fleet memory waste — not the zero-utilisation group. These are jobs that are nominally "doing something" but using a fraction of the GPU they hold, for extended periods.
Finding 4
The gap between planned and actual GPU utilisation is nearly universal. Joining the sensor table to the task table on job and task name reveals that 97.9% of instances achieve lower average GPU compute than their plan_gpu reservation. The median shortfall is 25 percentage points.
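A minimal sketch of the join described above, assuming both tables have been read into lists of dicts keyed by column name. The function name and toy rows are hypothetical, and computing the median shortfall only over instances that fall below plan is an assumption; the source does not say which population the median is taken over.

```python
from statistics import median

def plan_vs_actual(sensor_rows, task_rows):
    """Return (share of matched instances below plan, median shortfall in points)."""
    # Look up the requested GPU percentage by (job_name, task_name).
    plan = {
        (t["job_name"], t["task_name"]): float(t["plan_gpu"])
        for t in task_rows
        if t.get("plan_gpu") not in (None, "")
    }
    gaps = []
    for s in sensor_rows:
        key = (s["job_name"], s["task_name"])
        if key in plan and s.get("gpu_wrk_util") not in (None, ""):
            gaps.append(plan[key] - float(s["gpu_wrk_util"]))
    below = [g for g in gaps if g > 0]  # instances using less than planned
    return len(below) / len(gaps), median(below)

# Toy rows only; not drawn from the trace.
tasks = [
    {"job_name": "j1", "task_name": "worker", "plan_gpu": "100"},
    {"job_name": "j2", "task_name": "worker", "plan_gpu": "50"},
]
sensors = [
    {"job_name": "j1", "task_name": "worker", "gpu_wrk_util": "10"},
    {"job_name": "j1", "task_name": "worker", "gpu_wrk_util": "40"},
    {"job_name": "j2", "task_name": "worker", "gpu_wrk_util": "55"},
    {"job_name": "j3", "task_name": "worker", "gpu_wrk_util": "5"},  # no matching task
]
share_below, median_shortfall = plan_vs_actual(sensors, tasks)
```

Instances with no matching task row are dropped, which is one reason a joined analysis can cover fewer records than the sensor table alone.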
There is a wide mean gap between planned GPU allocation and actual GPU utilisation: jobs request 68% of a GPU on average and use an average of 10.5%.
Instances that fall below their plan are over-provisioned: they receive more GPU capacity than they actually use, locking that excess as idle, reserved memory.
| GPU type | Instances | Mean planned GPU % | Mean actual util % | Mean memory held (GB) |
|---|---|---|---|---|
| MISC | 1,887,183 | 47.7% | 6.7% | 1.22 GB |
| T4 | 562,367 | 55.4% | 14.9% | 1.23 GB |
| P100 | 466,273 | 68.9% | 11.8% | 2.44 GB |
| V100 | 63,847 | 138.6% | 22.0% | 3.78 GB |
| V100M32 | 43,075 | 246.8% | 82.8% | 18.75 GB |
Methodology
The PAI v2020 sensor table provides lifetime-averaged metrics per instance — not a time series. The waste estimate is therefore per-instance, not per-second: for each instance, the "wasted" memory is the fraction of its held VRAM that corresponded to idle compute time.
Pipeline
Files are hosted on Aliyun OSS (~442 MB compressed total). Headers are stored separately in the GitHub repo as .header files and injected at load time.
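Header injection at load time might look like the sketch below. It assumes each .header file contains a single comma-separated line of column names; the function name is hypothetical, and the in-memory strings stand in for the real files with a toy subset of columns.

```python
import csv
import io

def read_headerless_csv(data_file, header_file):
    """Yield dict rows from a header-less CSV, taking column names from a separate file."""
    columns = next(csv.reader(header_file))  # first line of the header file
    yield from csv.DictReader(data_file, fieldnames=columns)

# In-memory stand-ins for a .header file and its data file (toy values).
header = io.StringIO("job_name,task_name,gpu_wrk_util,avg_gpu_wrk_mem\n")
data = io.StringIO("j1,worker,12.5,1.2\nj2,ps,0,0.4\n")
rows = list(read_headerless_csv(data, header))
```

All values arrive as strings, so numeric columns need an explicit float() conversion before any arithmetic.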
Seven buckets: 0%, 1–5%, 5–10%, 10–30%, 30–50%, 50–80%, > 80%. Compute per-bucket instance count, mean memory, and total memory held.
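The bucketing step can be sketched as follows. The source does not state how bucket edges are handled, so this sketch assumes inclusive upper bounds and places any non-zero value up to 5% in the 1–5% bucket; the names are illustrative.

```python
from collections import defaultdict

# (label, inclusive upper bound in percent); edge handling is an assumption.
UPPER_BOUNDS = [("1–5%", 5), ("5–10%", 10), ("10–30%", 30),
                ("30–50%", 50), ("50–80%", 80)]

def bucket_of(util_pct):
    """Map a lifetime-averaged utilisation value to one of the seven buckets."""
    if util_pct == 0:
        return "= 0%"
    for label, upper in UPPER_BOUNDS:
        if util_pct <= upper:
            return label
    return "> 80%"

def bucket_stats(instances):
    """instances: iterable of (gpu_wrk_util, avg_gpu_wrk_mem) pairs."""
    acc = defaultdict(lambda: {"count": 0, "total_mem": 0.0})
    for util_pct, mem_gb in instances:
        entry = acc[bucket_of(util_pct)]
        entry["count"] += 1
        entry["total_mem"] += mem_gb
    # Add the per-bucket mean memory alongside count and total.
    return {label: {**e, "mean_mem": e["total_mem"] / e["count"]}
            for label, e in acc.items()}

# Toy values only; not drawn from the trace.
toy = [(0, 0.5), (3, 1.0), (4, 2.0), (25, 3.0), (95, 16.0)]
stats = bucket_stats(toy)
```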
wasted_mem_gb = avg_gpu_wrk_mem × (1 − gpu_wrk_util/100), clipped at zero. Aggregated across all instances for the fleet-wide number.
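Merge on (job_name, task_name). Duration = end_time − start_time for completed tasks. Compute waste_fraction = Σ(wasted_mem × duration) ÷ Σ(total_mem × duration), with both sums taken over all instances that have duration data and total_mem taken as avg_gpu_wrk_mem.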
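A minimal sketch of the duration-weighted estimate, assuming each instance has already been joined to its task's start_time and end_time and that those are numeric timestamps in the same unit. The function names and toy tuples are illustrative, not the original script.

```python
def wasted_mem_gb(avg_mem_gb, util_pct):
    """Memory attributed to idle compute, clipped at zero."""
    return max(0.0, avg_mem_gb * (1 - util_pct / 100))

def fleet_waste_fraction(instances):
    """instances: iterable of (avg_gpu_wrk_mem, gpu_wrk_util, start_time, end_time)."""
    wasted = total = 0.0
    for mem_gb, util_pct, start, end in instances:
        duration = end - start
        wasted += wasted_mem_gb(mem_gb, util_pct) * duration
        total += mem_gb * duration
    # Ratio of two sums, not a mean of per-instance ratios.
    return wasted / total

# Toy values only: one fully idle instance and one at 50% utilisation.
toy = [(2.0, 0, 0, 100), (4.0, 50, 0, 100)]
fraction = fleet_waste_fraction(toy)  # (200 + 200) / (200 + 400)
```

Because the numerator and denominator are summed before dividing, long-running, memory-heavy instances dominate the result, which is the point of weighting by duration.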
Implications
At 10.5% mean utilisation, each GPU is effectively doing the work of roughly 0.1 GPUs on average. At an illustrative price of $3/hr per A100, a 100-GPU cluster reserved year-round would waste ~$2.4M/year in reserved-but-idle compute.
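The cost figure follows from simple arithmetic. The sketch below assumes the cluster is reserved around the clock for a 365-day year and that idle cost scales linearly with (1 − mean utilisation); both are simplifying assumptions made here, not statements from the trace.

```python
HOURS_PER_YEAR = 24 * 365  # assumes year-round reservation

def idle_cost_per_year(num_gpus, price_per_gpu_hour, mean_util_pct):
    """Yearly spend attributable to reserved-but-idle compute."""
    reserved = num_gpus * price_per_gpu_hour * HOURS_PER_YEAR
    return reserved * (1 - mean_util_pct / 100)

cost = idle_cost_per_year(100, 3.0, 10.5)
print(f"${cost:,.0f} per year")
```

With these inputs the total reserved spend is $2,628,000 per year, of which about $2.35M corresponds to idle compute, in line with the rounded figure quoted above.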
Whether a job runs at 1% or 80% compute, its GPU memory allocation does not change. The 54% of memory-time wasted represents real VRAM that could serve other workloads — but is inaccessible.
3,033,232 instances from a two-month production window. Not a sample, not a benchmark — the actual workload of a real multi-thousand-GPU ML platform, published at NSDI'22.
Reproducibility
Data files are hosted on Aliyun OSS (~442 MB). Download script is in the GitHub repository. The analysis runs in under 3 minutes on a laptop.
Dataset: alibaba/clusterdata · PAI GPU v2020 · Weng et al., USENIX NSDI'22 · Analysis script and methodology available on request from hello@affinode.io.