Case Study · Alibaba PAI GPU Cluster Trace v2020 · NSDI'22

Mean GPU utilisation: 10.5% across 3 million production ML instances

Direct analysis of Alibaba's production PAI cluster — 6,742 GPUs, 3,033,232 instance records over two months — shows the median GPU instance averages just 1.5% compute utilisation across its lifetime, while 54% of GPU memory-time is consumed by workloads that barely use the compute they hold.

Dataset: Alibaba PAI GPU v2020 (NSDI'22)

GPUs: 6,742 across 1,897 machines

Instance records: 3,033,232

Period: July – August 2020

Background

The dataset

The Alibaba PAI GPU v2020 trace was published alongside the USENIX NSDI'22 paper "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters". It captures two months of production workloads from Alibaba's internal ML platform — a mix of training and inference jobs across a heterogeneous cluster of V100, T4, P100, and older-generation GPUs.

The key table — pai_sensor_table — records per-instance, lifetime-averaged GPU compute utilisation (gpu_wrk_util, %) and average GPU memory used (avg_gpu_wrk_mem, GB) for every instance on every machine. With 3,033,232 records, it is one of the largest public GPU workload datasets available.

Cluster at a glance

1,897 machines · 6,742 GPUs · 116,192 GB total VRAM
GPU types: MISC (2,240), P100 (1,596), T4 (994), V100M32 (1,080), V100 (832)
3,033,232 instance records in the sensor table
July – August 2020 (2 months), timestamps anonymised

pai_sensor_table columns used

gpu_wrk_util — GPU compute utilisation (%) averaged over instance lifetime
avg_gpu_wrk_mem — average GPU memory used (GB)
max_gpu_wrk_mem — peak GPU memory used (GB)
Joined to pai_task_table for plan_gpu (requested %) and duration

Finding 1

75% of instances average below 10% GPU compute

Across all 3,033,232 instance records, GPU compute utilisation is chronically low. The mean is 10.5% and the median is just 1.5%, meaning half of all instances average 1.5% GPU utilisation or less over their entire lifetime. Just over a third (34.5%) average exactly zero.

GPU compute utilisation distribution — all 3,033,232 instance records

Utilisation bucket	Share of instances
= 0% (fully idle)	34.5%
1–5% (near-idle)	30.0%
5–10%	10.9%
10–30%	14.0%
30–50%	5.5%
> 50% (active)	5.0%

n = 3,033,232 instances · source: pai_sensor_table (gpu_wrk_util column) · mean = 10.5% · median = 1.5%

Only 5% of instances average more than 50% GPU compute utilisation. The other 95% average below that threshold over their lifetimes, and roughly three quarters of all instances sit in single digits, while holding GPU memory and their reserved share of the device for the duration of the reservation.
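The "75% below 10%" headline follows directly from the published bucket shares. A minimal sketch of that arithmetic, using only the percentages listed in the distribution above (the variable and function names are illustrative, not taken from the analysis script):

```python
# Bucket shares (% of instances) as published in the distribution above.
bucket_share = {
    "= 0%": 34.5,
    "1-5%": 30.0,
    "5-10%": 10.9,
    "10-30%": 14.0,
    "30-50%": 5.5,
    "> 50%": 5.0,
}


def share_below_10(shares):
    """Sum the shares of the three buckets that sit below 10% utilisation."""
    return sum(shares[bucket] for bucket in ("= 0%", "1-5%", "5-10%"))


below_10 = share_below_10(bucket_share)
total = sum(bucket_share.values())
print(f"below 10%: {below_10:.1f}%  (all buckets: {total:.1f}%)")
```

The six shares add up to 99.9%; the missing 0.1% is most likely rounding in the published figures.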

Finding 2

GPU memory is held even when compute is near-zero

Every utilisation bucket holds GPU memory throughout the instance lifetime, including instances that average less than 5% compute. Mean memory held rises with utilisation, but the relationship is not linear: a 10–30% utilisation instance holds 2.74 GB on average, about 28% of the 9.87 GB held by an instance above 50%.

Mean GPU memory held (GB) by average compute utilisation bucket

Utilisation bucket	% of instances	Mean memory held (GB)	Note
= 0% compute	34.5%	0.12 GB	allocated, 0% compute
1–5% compute	30.0%	1.29 GB	~97% wasted
5–10% compute	10.9%	2.11 GB	~93% wasted
10–30% compute	14.0%	2.74 GB	~81% wasted
> 50% compute	5.0%	9.87 GB	actively inferring / training

"Wasted %" = avg_gpu_wrk_mem × (1 − gpu_wrk_util/100) ÷ avg_gpu_wrk_mem · source: pai_sensor_table

The memory-compute decoupling: GPU memory is allocated at reservation time and held for the full instance lifetime. Whether the GPU is running at 1% or 80% compute, the memory footprint does not shrink. A job that loads a model into VRAM then sits idle — waiting for data, sleeping between epochs, or idling in an interactive session — keeps that memory pinned throughout, blocking it from any other workload.
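To compare footprints across buckets, each bucket's mean memory can be divided by the mean of the > 50% bucket. The sketch below uses only the five per-bucket means published above; the helper name `relative_to_active` is hypothetical.

```python
# Mean GPU memory held (GB) per utilisation bucket, from the figures above.
mean_mem_gb = {
    "= 0%": 0.12,
    "1-5%": 1.29,
    "5-10%": 2.11,
    "10-30%": 2.74,
    "> 50%": 9.87,
}


def relative_to_active(mem_by_bucket, active_bucket="> 50%"):
    """Express each bucket's mean memory as a fraction of the active bucket's."""
    active = mem_by_bucket[active_bucket]
    return {bucket: mem / active for bucket, mem in mem_by_bucket.items()}


ratios = relative_to_active(mean_mem_gb)
for bucket, ratio in ratios.items():
    print(f"{bucket:>7}: {ratio:.0%} of the active bucket's mean memory")
```

On these numbers the 10–30% bucket holds roughly 28% of the memory of the > 50% bucket, and the 1–5% bucket roughly 13%.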

Finding 3

54.4% of GPU memory-time is wasted, duration-weighted

Combining each instance's memory allocation with its runtime duration and compute utilisation gives a duration-weighted waste estimate: the fraction of total GPU memory × time that was allocated to instances making little or no use of the compute resource.

Fleet-wide: summing (avg_gpu_wrk_mem × duration) across the 2,009,243 instances with duration data, then weighting by idle compute fraction, gives 54.4% of all GPU memory-time wasted. That is memory allocated, held, and unavailable to other workloads, in a fleet whose mean compute utilisation is 10.5%.

Utilisation bucket	Instances	% of total	Memory waste %	Share of fleet waste
= 0%	394,903	19.7%	100.0%	0.8%
1–5%	687,521	34.2%	97.4%	31.3%
5–10%	288,338	14.3%	92.8%	18.4%
10–30%	374,720	18.6%	80.7%	30.8%
30–50%	147,513	7.3%	61.0%	11.1%
50–80%	89,656	4.5%	36.0%	6.4%
> 80%	21,923	1.1%	9.7%	1.2%

The dominant waste contributors are the 1–5% and 10–30% utilisation tiers, which together account for 62.1% of all fleet memory waste — not the zero-utilisation group. These are jobs that are nominally "doing something" but using a fraction of the GPU they hold, for extended periods.
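One way to read the table is to rank buckets by their share of fleet waste and to compare that share with each bucket's share of instances. The sketch below uses only the published percentages; the ratio it computes is an illustrative measure, not one defined by the analysis.

```python
# (bucket, % of instances, share of fleet waste %) copied from the table above.
rows = [
    ("= 0%", 19.7, 0.8),
    ("1-5%", 34.2, 31.3),
    ("5-10%", 14.3, 18.4),
    ("10-30%", 18.6, 30.8),
    ("30-50%", 7.3, 11.1),
    ("50-80%", 4.5, 6.4),
    ("> 80%", 1.1, 1.2),
]


def top_contributors(table, n=2):
    """Return the n buckets with the largest share of fleet waste."""
    return sorted(table, key=lambda row: row[2], reverse=True)[:n]


def waste_per_instance_share(table):
    """Ratio of waste share to instance share; below 1 means under-represented."""
    return {bucket: waste / inst for bucket, inst, waste in table}


top = top_contributors(rows)
combined = sum(row[2] for row in top)
intensity = waste_per_instance_share(rows)

print([row[0] for row in top], f"{combined:.1f}%")
for bucket, ratio in intensity.items():
    print(f"{bucket:>7}: {ratio:.2f}")
```

The zero-utilisation bucket comes out near 0.04: it holds almost a fifth of the instances but under 1% of the waste, which is consistent with the 0.12 GB mean memory those instances hold.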

Finding 4

97.9% of instances use less GPU than they requested

The gap between planned and actual GPU utilisation is nearly universal. Joining the sensor table to the task table on job and task name reveals that 97.9% of instances achieve lower average GPU compute than their plan_gpu reservation. The median shortfall is 25 percentage points.

46.8 pp

mean gap between planned GPU allocation and actual GPU utilisation. Jobs request about 57% of a GPU on average; they use an average of 10.5%.

97.9%

of instances are over-provisioned: they reserve more GPU capacity than they actually use, locking up the excess as idle, reserved capacity.

GPU type	Instances	Mean planned GPU %	Mean actual util %	Mean memory held (GB)
MISC	1,887,183	47.7%	6.7%	1.22 GB
T4	562,367	55.4%	14.9%	1.23 GB
P100	466,273	68.9%	11.8%	2.44 GB
V100	63,847	138.6%	22.0%	3.78 GB
V100M32	43,075	246.8%	82.8%	18.75 GB

V100M32 is the exception: jobs on the 32 GB V100 average 82.8% utilisation and 18.75 GB of memory. These are the large training jobs the cluster was built for, but they represent only 1.4% of all instances. The other 98.6% run on GPU types where actual utilisation averages 7–22%, against mean planned allocations of 48–139%.
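As a cross-check, the per-type rows can be combined into instance-weighted fleet means. This is a sketch using only the rounded values in the table above, so small differences from the headline figures are expected; the function name `weighted_mean` is illustrative.

```python
# (gpu_type, instances, mean planned GPU %, mean actual util %) from the table above.
gpu_rows = [
    ("MISC", 1_887_183, 47.7, 6.7),
    ("T4", 562_367, 55.4, 14.9),
    ("P100", 466_273, 68.9, 11.8),
    ("V100", 63_847, 138.6, 22.0),
    ("V100M32", 43_075, 246.8, 82.8),
]


def weighted_mean(rows, column):
    """Instance-weighted mean of the given column (2 = planned, 3 = actual)."""
    total_instances = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_instances


planned = weighted_mean(gpu_rows, 2)
actual = weighted_mean(gpu_rows, 3)
gap = planned - actual

print(f"planned {planned:.1f}%  actual {actual:.1f}%  gap {gap:.1f} pp")
for gpu_type, _, plan, util in gpu_rows:
    # Actual utilisation expressed as a fraction of what was requested.
    print(f"{gpu_type:>8}: uses {util / plan:.0%} of planned allocation")
```

On the rounded table values this gives a planned mean of about 57%, an actual mean of about 10.4%, and a gap of about 46.7 pp.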

Methodology

How GPU memory waste is measured

The PAI v2020 sensor table provides lifetime-averaged metrics per instance — not a time series. The waste estimate is therefore per-instance, not per-second: for each instance, the "wasted" memory is the fraction of its held VRAM that corresponded to idle compute time.

        -- Per-instance wasted GPU memory (GB)

        wasted_mem_gb = avg_gpu_wrk_mem × (1 − gpu_wrk_util / 100)

        -- Duration-weighted fleet waste fraction

        waste_fraction = Σ(wasted_mem_gb × duration) / Σ(avg_gpu_wrk_mem × duration)

        -- Result: 54.4% of GPU memory-time wasted

Conservative definition: This formula uses compute utilisation as a proxy for memory utilisation, so a job at 30% GPU compute is counted as wasting only 70% of its VRAM. In practice the full model weights stay resident the whole time, and that job pins all of its VRAM regardless of compute level. The true waste is therefore likely higher, and the 54.4% figure should be read as a lower bound.
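The two formulas above translate almost line for line into Python. The following is a minimal sketch, not the analysis script itself: the `Instance` record type and the toy values are hypothetical, and the clip at zero is included because a lifetime-averaged utilisation above 100% would otherwise produce negative waste.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    avg_gpu_wrk_mem: float  # average GPU memory used, GB
    gpu_wrk_util: float     # lifetime-averaged GPU compute utilisation, %
    duration: float         # runtime, in any consistent time unit


def wasted_mem_gb(inst):
    """Per-instance wasted memory: held memory scaled by idle compute fraction."""
    return max(inst.avg_gpu_wrk_mem * (1 - inst.gpu_wrk_util / 100), 0.0)


def waste_fraction(instances):
    """Duration-weighted share of GPU memory-time counted as wasted."""
    wasted = sum(wasted_mem_gb(i) * i.duration for i in instances)
    held = sum(i.avg_gpu_wrk_mem * i.duration for i in instances)
    return wasted / held if held else 0.0


# Hypothetical toy records, not taken from the trace.
toy = [
    Instance(avg_gpu_wrk_mem=2.0, gpu_wrk_util=0.0, duration=100.0),
    Instance(avg_gpu_wrk_mem=4.0, gpu_wrk_util=50.0, duration=100.0),
    Instance(avg_gpu_wrk_mem=10.0, gpu_wrk_util=90.0, duration=200.0),
]
fraction = waste_fraction(toy)
print(f"waste fraction: {fraction:.3f}")
```

In the toy data each instance wastes 200 GB-units of memory-time against 2,600 held in total, so the long, busy instance dominates the denominator and pulls the fraction down to about 0.23.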

Pipeline

Download pai_sensor_table, pai_task_table, pai_machine_spec

Files are hosted on Aliyun OSS (~442 MB compressed total). Headers are stored separately in the GitHub repo as .header files and injected at load time.

Bucket instances by gpu_wrk_util

Seven buckets: 0%, 1–5%, 5–10%, 10–30%, 30–50%, 50–80%, > 80%. Compute per-bucket instance count, mean memory, and total memory held.

Compute per-instance wasted memory

wasted_mem_gb = avg_gpu_wrk_mem × (1 − gpu_wrk_util/100), clipped at zero. Aggregated across all instances for the fleet-wide number.

Join to task table for duration-weighting and plan_gpu comparison

Merge on (job_name, task_name). Duration = end_time − start_time for completed tasks. Compute waste_fraction = Σ(wasted_mem_gb × duration) / Σ(avg_gpu_wrk_mem × duration).
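The bucketing and join steps above can be sketched with plain dictionaries. This is an illustration under assumptions, not the original pipeline: the bucket boundaries are treated as half-open (a value of exactly 5 falls in 5–10%), anything above zero and below 5 is placed in 1–5%, and tasks with a missing start or end time are skipped.

```python
# Upper edges for the seven buckets; boundary handling is an assumption.
BUCKET_EDGES = [(5, "1-5%"), (10, "5-10%"), (30, "10-30%"),
                (50, "30-50%"), (80, "50-80%")]


def util_bucket(util):
    """Map a lifetime-averaged gpu_wrk_util (%) to one of the seven buckets."""
    if util <= 0:
        return "= 0%"
    for upper, label in BUCKET_EDGES:
        if util < upper:
            return label
    return "> 80%"


def join_with_tasks(sensor_rows, task_rows):
    """Attach duration, plan_gpu and bucket to sensor rows via (job_name, task_name)."""
    tasks = {(t["job_name"], t["task_name"]): t for t in task_rows}
    joined = []
    for s in sensor_rows:
        t = tasks.get((s["job_name"], s["task_name"]))
        if t is None or t["start_time"] is None or t["end_time"] is None:
            continue  # no duration available for this instance
        joined.append({
            **s,
            "plan_gpu": t["plan_gpu"],
            "duration": t["end_time"] - t["start_time"],
            "bucket": util_bucket(s["gpu_wrk_util"]),
        })
    return joined


# Hypothetical toy rows, not taken from the trace.
sensor = [
    {"job_name": "j1", "task_name": "worker", "gpu_wrk_util": 3.0, "avg_gpu_wrk_mem": 1.2},
    {"job_name": "j2", "task_name": "worker", "gpu_wrk_util": 85.0, "avg_gpu_wrk_mem": 9.0},
]
task = [
    {"job_name": "j1", "task_name": "worker", "plan_gpu": 50.0, "start_time": 100, "end_time": 400},
    {"job_name": "j2", "task_name": "worker", "plan_gpu": 100.0, "start_time": 100, "end_time": None},
]
joined = join_with_tasks(sensor, task)
print(joined)
```

Dropping rows without a duration is one plausible reason the duration-weighted table covers fewer instances than the full sensor table.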

Implications

What 10.5% mean utilisation means for your cluster

💰

Paying for 10×, using 1×

At 10.5% mean utilisation, every GPU is effectively doing the work of 0.1 GPUs on average. At $3/hr per A100, a 100-GPU cluster wastes ~$2.4M/year in reserved-but-idle compute.

🧠

Memory stays pinned regardless

Whether a job runs at 1% or 80% compute, its GPU memory allocation does not change. The 54% of memory-time wasted represents real VRAM that could serve other workloads — but is inaccessible.

📊

Peer-reviewed, large-scale evidence

3,033,232 instances from a two-month production window. Not a sample, not a benchmark — the actual workload of a real multi-thousand-GPU ML platform, published at NSDI'22.
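The cost estimate under "Paying for 10×, using 1×" can be reproduced with a few lines of arithmetic. The sketch below assumes the GPUs are reserved around the clock for a full year and that the idle share of spend is simply one minus mean utilisation; both are simplifications, and the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming round-the-clock reservation


def idle_cost_per_year(num_gpus, usd_per_gpu_hour, mean_util_pct):
    """Annual spend attributable to the idle share of reserved GPU time."""
    total_spend = num_gpus * usd_per_gpu_hour * HOURS_PER_YEAR
    return total_spend * (1 - mean_util_pct / 100)


idle = idle_cost_per_year(num_gpus=100, usd_per_gpu_hour=3.0, mean_util_pct=10.5)
print(f"idle spend: ${idle:,.0f} per year")
```

With 100 GPUs at $3/hr and 10.5% mean utilisation, total spend is $2,628,000 a year and the idle share is about $2.35M, which rounds to the ~$2.4M quoted above.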

Reproducibility

Run it yourself

Data files are hosted on Aliyun OSS (~442 MB). The download script is in the GitHub repository. The analysis runs in under 3 minutes on a laptop.

# Download data (from Alibaba's open-trace OSS bucket)

curl -O https://aliopentrace.oss-cn-beijing.aliyuncs.com/v2020GPUTraces/pai_sensor_table.tar.gz

curl -O https://aliopentrace.oss-cn-beijing.aliyuncs.com/v2020GPUTraces/pai_task_table.tar.gz

curl -O https://aliopentrace.oss-cn-beijing.aliyuncs.com/v2020GPUTraces/pai_machine_spec.tar.gz

# Run analysis

pip install pandas numpy

python3 analyze.py
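Because the data files ship without a header row, column names have to be attached when loading. The sketch below shows one way to do that with the standard library. It assumes each .header file holds a single comma-separated line of column names and that empty fields mean missing values; the function names are hypothetical, and analyze.py may do this differently.

```python
import csv
import io


def read_with_header(header_file, data_file):
    """Yield one dict per data row, taking column names from a separate header file."""
    columns = next(csv.reader(header_file))
    for row in csv.DictReader(data_file, fieldnames=columns):
        yield row


def to_float(value):
    """Convert a CSV field to float, treating an empty string as missing (None)."""
    return float(value) if value != "" else None


# In-memory stand-ins for a .header file and its headerless data file.
header = io.StringIO("job_name,task_name,gpu_wrk_util,avg_gpu_wrk_mem\n")
data = io.StringIO("j1,worker,12.5,1.5\nj2,worker,,0.0\n")

records = list(read_with_header(header, data))
utils = [to_float(r["gpu_wrk_util"]) for r in records]
print(utils)
```

With the real trace, the two `io.StringIO` objects would be replaced by `open(...)` calls on the extracted data file and its matching .header file.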

Dataset: alibaba/clusterdata · PAI GPU v2020 · Weng et al., USENIX NSDI'22 · Analysis script and methodology available at hello@affinode.io.
