Case Study · Alibaba PAI Cluster Traces · ATC'23
Direct analysis of Alibaba's production PAI GPU cluster — 6,212 GPUs, 8,152 jobs — shows that 46% of total GPU compute capacity sits unused at any snapshot while 12.2% of GPU jobs never get scheduled. The culprit is fragmentation: fractional GPU allocation leaves residual GPU capacity that no subsequent job can fill.
Background
The Alibaba PAI GPU v2023 traces were released alongside the USENIX ATC'23 paper "Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent". They capture a real-world snapshot of Alibaba's production PAI (Platform for AI) GPU cluster: node configurations, submitted jobs, GPU allocation requests, and scheduling outcomes.
Unlike utilisation time-series datasets, the PAI traces record the scheduling state: which jobs were submitted, what GPU resources they requested (down to milli-GPU precision), and whether they were scheduled, failed, or left waiting. This makes them well suited to measuring fragmentation-driven waste, meaning capacity that exists in the cluster but is structurally unreachable by queued jobs.
Per-job GPU request: num_gpu (whole GPUs) + gpu_milli (milli-GPU fraction)
Finding 1
At the time of the trace snapshot, 4,149 GPU jobs are in Running state. Summing their allocated gpu_milli gives 3.36 million milli-GPU in use, against a total cluster capacity of 6.21 million milli-GPU (6,212 GPUs × 1,000). The cluster is running at 54% utilisation, leaving 46% of GPU compute capacity effectively idle.
Source: openb_pod_list_default.csv · 4,149 Running GPU pods × mean 809 milli-GPU ÷ 6,212,000 total milli-GPU
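The utilisation figure can be checked with a few lines of arithmetic. The sketch below uses only the three numbers quoted in the source line above; the variable names are illustrative and are not taken from any analysis script.

```python
# Reproduce the headline utilisation figure from the numbers quoted above.
running_gpu_pods = 4_149   # GPU pods in Running state
mean_gpu_milli = 809       # mean milli-GPU per pod, as quoted in the source line
total_gpus = 6_212         # GPUs in the cluster

capacity_milli = total_gpus * 1000            # 1 GPU = 1000 milli-GPU
allocated_milli = running_gpu_pods * mean_gpu_milli

utilisation = allocated_milli / capacity_milli
idle_fraction = 1 - utilisation

print(f"allocated:   {allocated_milli:,} milli-GPU")
print(f"capacity:    {capacity_milli:,} milli-GPU")
print(f"utilisation: {utilisation:.1%}  idle: {idle_fraction:.1%}")
```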
46% of GPU capacity is idle despite 861 jobs waiting in Pending state. The cluster is not out of GPUs — it has significant raw capacity remaining. The gap between queued demand and idle supply is structural: fragmentation makes the residual capacity unreachable by the jobs that need it.
Finding 2
GPU-sharing jobs (gpu_milli < 1000) are the source of fragmentation. When a job occupies 327 milli-GPU on a card that has 1,000, the remaining 673 milli-GPU is "stranded": available in principle but often unschedulable, because a queued job fits only if it requests 673 milli-GPU or less, and no whole-GPU job can use the remainder.
| GPU allocation tier | Jobs | % of GPU jobs | Avg GPU fraction | Residual wasted / card (1 − avg fraction) |
|---|---|---|---|---|
| Full GPU (1000m) | 5,064 | 71.7% | 100% | 0% |
| Fractional GPU (501–999m) | 415 | 5.9% | ~75% | ~25% |
| Fractional GPU (251–500m) | 1,189 | 16.8% | ~40% | ~60% |
| Fractional GPU (≤ 250m) | 396 | 5.6% | ~16% | ~84% |
2,000 jobs (28.3% of all GPU jobs) request fractional GPU allocation: they share a physical GPU card with other workloads but rarely fill it completely.
1,585 jobs request ≤ 50% of a GPU (gpu_milli ≤ 500), leaving at least half the GPU's compute and proportional VRAM as stranded, unreachable residual.
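The tier boundaries and the notion of residual capacity can be made concrete with a small sketch. The function names `allocation_tier` and `residual_milli` are hypothetical helpers introduced here for illustration; the boundaries are the ones used in the table above.

```python
def allocation_tier(gpu_milli: int) -> str:
    """Map a per-card request in milli-GPU to the tiers used in the table."""
    if not 0 < gpu_milli <= 1000:
        raise ValueError("gpu_milli must be between 1 and 1000")
    if gpu_milli == 1000:
        return "full"
    if gpu_milli > 500:
        return "501-999m"
    if gpu_milli > 250:
        return "251-500m"
    return "<=250m"


def residual_milli(gpu_milli: int) -> int:
    """Capacity left on a 1000m card after a single job of this size."""
    return 1000 - gpu_milli


# Example requests (illustrative values, not rows from the trace).
for request in (1000, 750, 327, 160):
    print(request, allocation_tier(request), residual_milli(request))
```

A 327m request, the mean fractional size reported in this case study, falls in the 251–500m tier and leaves 673m on the card.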
Finding 3
The cluster holds 145,280 GB (~145 TB) of GPU VRAM across its 6,212 cards. With 46% of GPU compute capacity idle at snapshot time, and assuming VRAM is stranded in proportion to compute, approximately 67 TB of VRAM sits on GPUs that are either running no workload or running workloads that occupy only a fraction of the card's memory capacity.
| Metric | Value | Source |
|---|---|---|
| Total cluster VRAM | 145,280 GB | openb_node_list_gpu_node.csv |
| GPU compute idle fraction (snapshot) | 46.0% | Running pods × gpu_milli ÷ total capacity |
| VRAM on idle / fragmented GPUs | ~66,800 GB (67 TB) | computed: 46% × 145,280 GB |
| Mean VRAM per GPU node | 119.8 GB | openb_node_list_gpu_node.csv |
| Mean gpu_milli per GPU job | 809 milli (80.9% of GPU) | openb_pod_list_default.csv |
| Mean gpu_milli for fractional jobs | 327 milli (32.7% of GPU) | jobs with gpu_milli < 1000 |
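The VRAM estimate in the table is a single multiplication, sketched below. It rests on the assumption, implicit in the "computed" row, that VRAM is stranded in the same proportion as compute; the trace does not measure per-job memory use directly, so treat the result as an estimate.

```python
total_vram_gb = 145_280        # total cluster VRAM, from the table
idle_compute_fraction = 0.46   # idle compute fraction at snapshot, from the table

# Assumption: VRAM idles in proportion to compute (not measured in the trace).
stranded_vram_gb = idle_compute_fraction * total_vram_gb

print(f"stranded VRAM: ~{stranded_vram_gb:,.0f} GB "
      f"(~{stranded_vram_gb / 1000:.0f} TB, decimal units)")
```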
Finding 4
The most direct evidence that fragmentation creates real harm is the Pending pod count: 861 GPU jobs were submitted to the cluster and never scheduled at all. An additional 1,870 GPU jobs failed, for a combined 38.7% of all GPU jobs that did not complete successfully.
n = 7,064 GPU pods · source: openb_pod_list_default.csv (pod_phase column)
38.7% of GPU jobs — 2,731 out of 7,064 — never completed successfully. Meanwhile, 46% of cluster GPU capacity sat idle. This is the defining symptom of fragmentation: a cluster that looks full (GPU nodes allocated) but is structurally unable to serve demand because the remaining capacity is in the wrong shape.
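The two rates quoted in this finding follow directly from the pod counts. A minimal check, using only the figures stated above:

```python
gpu_pods = 7_064   # all GPU pods in the trace
pending = 861      # never scheduled
failed = 1_870     # scheduled but failed

not_completed = pending + failed
pending_rate = pending / gpu_pods
not_completed_rate = not_completed / gpu_pods

print(f"pending:       {pending_rate:.1%}")
print(f"not completed: {not_completed} pods, {not_completed_rate:.1%}")
```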
Methodology
Unlike telemetry-based datasets (per-minute GPU utilisation from nvidia-smi), the PAI traces record scheduling state. GPU capacity utilisation is therefore derived from the sum of allocated gpu_milli across all running pods, divided by total cluster capacity.
Pipeline
Node file provides per-node GPU count and model (e.g., G2, T4, V100M32). Pod file provides per-job GPU request (num_gpu + gpu_milli), QoS tier, and scheduling outcome (pod_phase).
Sum gpu × 1000 across all GPU nodes → 6,212,000 milli-GPU total.
Full-GPU jobs contribute num_gpu × 1000; fractional jobs contribute their gpu_milli value directly. Sum across 4,149 Running GPU pods.
idle_fraction = 1 − (allocated / capacity). Fractional-job waste = average of (1 − gpu_milli/1000) across all fractional GPU jobs.
Directly from pod_phase column: 861 Pending + 1,870 Failed GPU pods confirm that idle capacity does not translate into served demand.
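The pipeline steps above can be sketched end to end with the standard library. This is an illustration under stated assumptions, not the analysis script itself: the CSV rows are tiny synthetic stand-ins rather than real trace data, and the `sn` and `name` columns are assumed identifiers, while `gpu`, `model`, `num_gpu`, `gpu_milli` and `pod_phase` are the fields named in the text.

```python
import csv
import io

# Synthetic stand-ins for the node and pod files (NOT real trace rows).
NODES_CSV = """sn,gpu,model
node-a,8,V100M32
node-b,2,T4
node-c,0,
"""
PODS_CSV = """name,num_gpu,gpu_milli,pod_phase
pod-1,4,1000,Running
pod-2,1,327,Running
pod-3,1,500,Pending
pod-4,1,1000,Failed
pod-5,0,0,Running
"""


def load(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def allocated_milli(pod):
    """Full-GPU pods count num_gpu x 1000; fractional pods count gpu_milli."""
    num_gpu, gpu_milli = int(pod["num_gpu"]), int(pod["gpu_milli"])
    if num_gpu == 0:
        return 0
    return num_gpu * 1000 if gpu_milli >= 1000 else gpu_milli


nodes, pods = load(NODES_CSV), load(PODS_CSV)
gpu_pods = [p for p in pods if int(p["num_gpu"]) > 0]

# Cluster capacity, then allocation summed over Running GPU pods.
capacity = sum(int(n["gpu"]) * 1000 for n in nodes)
allocated = sum(allocated_milli(p) for p in gpu_pods
                if p["pod_phase"] == "Running")
idle_fraction = 1 - allocated / capacity

# Fractional-job waste: mean of (1 - gpu_milli / 1000) over fractional jobs.
fractional = [int(p["gpu_milli"]) for p in gpu_pods
              if int(p["gpu_milli"]) < 1000]
fractional_waste = sum(1 - m / 1000 for m in fractional) / len(fractional)

# Unserved demand: Pending plus Failed GPU pods.
unserved = sum(p["pod_phase"] in ("Pending", "Failed") for p in gpu_pods)

print(capacity, allocated, f"{idle_fraction:.1%}",
      f"{fractional_waste:.1%}", unserved)
```

To run this against the real files, replace the two `load(...)` calls with reads of the node and pod CSVs, after confirming the actual column names in the downloaded headers.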
Implications
The scheduler reports GPUs as allocated. Queued jobs see no available nodes. But 46% of GPU compute sits unused inside those "allocated" cards — in mismatched fractions that no new job requests.
GPU memory does not fragment cleanly. A fractional-compute job often pins disproportionate VRAM — locking 67 TB of device memory to workloads that aren't using it fully.
This is first-party data from Alibaba's production cluster, published at USENIX ATC'23. It independently validates that fragmentation is a real, measurable, cluster-scale problem — not a theoretical concern.
The Affinode Approach
Fragmentation accumulates because there is no mechanism to temporarily suspend a fractional-GPU job, return its GPU slot to the pool, and restore the job when a full-GPU-equivalent slot becomes available. Affinode introduces exactly that primitive via CUDA checkpoint/restore.
Monitor gpu_milli allocation per node. Nodes with residual capacity below a threshold (e.g., < 200m remaining) are fragmented candidates: they hold jobs but cannot absorb new requests.
Serialise the fractional job's GPU state to host DRAM or NVMe. The entire GPU is now free — not a residual fraction of it. The freed card can be assigned to any queued job regardless of size.
One of the 861 queued jobs that could not fit due to fragmentation now has a full, clean GPU slot available. No requeue, no delay, no changes to the submitted job spec.
When the newly-scheduled job completes or another GPU frees up, Affinode restores the checkpointed fractional job transparently. The job resumes exactly where it paused — no lost progress.
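The four steps above can be illustrated with a toy bookkeeping model. This is a hypothetical sketch of the scheduling logic only, not Affinode's implementation: `Gpu`, `fragmented`, `suspend_all` and `place` are invented names, "checkpointing" is modelled as moving an entry into a dictionary, and no real CUDA state is saved or restored.

```python
from dataclasses import dataclass, field

FRAGMENT_THRESHOLD_MILLI = 200  # example threshold from the text


@dataclass
class Gpu:
    name: str
    jobs: dict = field(default_factory=dict)  # job name -> gpu_milli

    @property
    def residual(self) -> int:
        return 1000 - sum(self.jobs.values())


def fragmented(gpus):
    """Cards that hold jobs but have less residual than the threshold."""
    return [g for g in gpus
            if g.jobs and g.residual < FRAGMENT_THRESHOLD_MILLI]


def suspend_all(gpu, store):
    """Stand-in for checkpointing: move every job off the card into store."""
    store.update(gpu.jobs)
    gpu.jobs.clear()


def place(gpu, job, milli):
    """Schedule a job onto a card if it fits in the residual capacity."""
    if milli > gpu.residual:
        raise RuntimeError(f"{job} does not fit on {gpu.name}")
    gpu.jobs[job] = milli


gpus = [Gpu("gpu-0", {"frac-a": 850}), Gpu("gpu-1", {"frac-b": 400})]
checkpoints = {}

candidate = fragmented(gpus)[0]         # step 1: detect (gpu-0, 150m left)
suspend_all(candidate, checkpoints)     # step 2: suspend, whole card is free
place(candidate, "pending-full", 1000)  # step 3: schedule a queued full-GPU job

del candidate.jobs["pending-full"]      # later: the full-GPU job completes
for job, milli in list(checkpoints.items()):
    place(candidate, job, milli)        # step 4: restore the suspended job
    del checkpoints[job]

print(candidate.name, candidate.jobs, checkpoints)
```

In this model gpu-1 is left alone because its 600m residual is above the threshold; whether a real scheduler should also consolidate such cards is a policy choice the text does not specify.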
Applied to this cluster: checkpoint-and-restore on even a fraction of the fragmented capacity would unlock GPUs for the 861 Pending jobs — reducing the failed/stuck job rate from 38.7% toward zero, without adding a single GPU or changing any job submission workflow.
Reproducibility
All data is publicly available from the Alibaba GitHub repository. The two core CSV files are under 650 KB combined. The analysis runs in under 30 seconds on a laptop.
Dataset: alibaba/clusterdata · PAI GPU v2023 · Weng et al., USENIX ATC'23 · Analysis script and methodology available on request from hello@affinode.io.