
Case Study · Alibaba PAI Cluster Traces · ATC'23

46% of GPU capacity idle while 861 jobs queue: fragmentation is why

Direct analysis of Alibaba's production PAI GPU cluster (6,212 GPUs, 8,152 jobs) shows that 46% of total GPU compute capacity sits unused at the trace snapshot, while 12.2% of GPU jobs never get scheduled. The culprit is fragmentation: fractional GPU allocation leaves residual GPU capacity that queued jobs often cannot use.

Dataset: Alibaba PAI GPU v2023 (ATC'23)
GPU nodes: 1,213 (6,212 GPUs)
Jobs analysed: 8,152 (7,064 GPU jobs)
VRAM capacity: 145,280 GB total

  • 46% of GPU compute capacity idle at snapshot
  • 861 GPU jobs stuck in Pending (never ran)
  • 67% residual GPU capacity wasted per fractional job
  • 67 TB of VRAM on idle / fragmented GPUs

The dataset

The Alibaba PAI GPU v2023 traces were released alongside the USENIX ATC'23 paper "Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent". They capture a real-world snapshot of Alibaba's production PAI (Platform for AI) GPU cluster: node configurations, submitted jobs, GPU allocation requests, and scheduling outcomes.

Unlike utilisation time-series datasets, the PAI traces record the scheduling state: which jobs were submitted, what GPU resources they requested (down to milli-GPU precision), and whether they were scheduled, failed, or left waiting. This makes them well suited to measuring fragmentation-driven waste: capacity that exists in the cluster but is structurally unreachable by queued jobs.

Cluster at a glance

  • 1,213 GPU nodes across 7 GPU model families
  • 6,212 total GPUs — 145,280 GB total VRAM
  • Models: G2 (4,392), T4 (842), G3 (312), V100M32 (204), V100M16 (195), P100 (265), A10 (2)
  • 86.7% of submitted jobs are GPU jobs

GPU allocation model

  • Jobs specify num_gpu (whole GPUs) + gpu_milli (milli-GPU fraction)
  • gpu_milli = 1000 means a full GPU; values below 1000 indicate GPU sharing
  • 28.3% of GPU jobs request fractional GPU (gpu_milli < 1000)
  • Fractional jobs average only 32.7% of a GPU's compute capacity
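The allocation model above can be sketched in a few lines. This is a minimal illustration, not the trace's official semantics: the helper names are hypothetical, and the rule used (fractional jobs count their gpu_milli, full-GPU jobs count num_gpu × 1000) follows the methodology described later in this case study.

```python
FULL_GPU_MILLI = 1000  # gpu_milli value that denotes a whole GPU


def is_fractional(gpu_milli: int) -> bool:
    """True when the job shares a GPU (0 < gpu_milli < 1000)."""
    return 0 < gpu_milli < FULL_GPU_MILLI


def requested_milli(num_gpu: int, gpu_milli: int) -> int:
    """Total milli-GPU requested by one job (hypothetical helper)."""
    if is_fractional(gpu_milli):
        return gpu_milli               # a slice of a single card
    return num_gpu * FULL_GPU_MILLI    # one or more whole cards


print(requested_milli(num_gpu=1, gpu_milli=327))   # fractional job
print(requested_milli(num_gpu=4, gpu_milli=1000))  # four whole GPUs
```

A CPU-only job (num_gpu = 0, gpu_milli = 0) evaluates to 0 milli-GPU under this rule.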

46% of cluster GPU capacity is idle at snapshot time

At the time of the trace snapshot, 4,149 GPU jobs are in the Running state. Summing their allocated gpu_milli gives 3.36 million milli-GPU in use, against a total cluster capacity of 6.21 million milli-GPU (6,212 GPUs × 1,000). The cluster is running at 54% utilisation, leaving 46% of GPU compute capacity effectively idle.

Cluster GPU compute capacity: snapshot utilisation

  • Total capacity (6,212 GPUs): 54% allocated to 4,149 running jobs; 46% idle / fragmented
  • Full-GPU jobs (2,975 jobs, 71.7% of running GPU jobs): 1,000 milli each (full GPU)
  • Fractional jobs (1,174 jobs): 32.7% of the card used on average; 67.3% residual wasted per card

Source: openb_pod_list_default.csv · 4,149 Running GPU pods × mean 809 milli-GPU ÷ 6,212,000 total milli-GPU
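The arithmetic in the source line can be reproduced directly. The sketch below only re-derives the headline percentages from the figures quoted in this section; the variable names are illustrative.

```python
running_pods = 4_149        # Running GPU pods at the snapshot
mean_milli_per_pod = 809    # mean gpu_milli per GPU job, as reported
total_gpus = 6_212

allocated_milli = running_pods * mean_milli_per_pod  # about 3.36 million
capacity_milli = total_gpus * 1000                   # 6,212,000

utilisation = allocated_milli / capacity_milli
idle_fraction = 1 - utilisation

print(f"allocated:     {allocated_milli:,} milli-GPU")
print(f"utilisation:   {utilisation:.1%}")
print(f"idle fraction: {idle_fraction:.1%}")
```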

46% of GPU capacity is idle despite 861 jobs waiting in Pending state. The cluster is not out of GPUs — it has significant raw capacity remaining. The gap between queued demand and idle supply is structural: fragmentation makes the residual capacity unreachable by the jobs that need it.

Fractional GPU allocation leaves 67% residual per card

GPU-sharing jobs (gpu_milli < 1000) are the source of fragmentation. When a job occupies 327 milli-GPU on a 1,000-milli card, the remaining 673 milli-GPU is "stranded": available in principle but often unschedulable, because a queued job fits only if its request is no larger than that remainder, and a full-GPU request never is.

GPU allocation tier | Jobs | % of GPU jobs | Avg GPU fraction | Residual wasted / card
Full GPU (1000m) | 5,064 | 71.7% | 100% | 0%
Fractional GPU (501–999m) | 415 | 5.9% | ~75% | ~25%
Fractional GPU (251–500m) | 1,189 | 16.8% | ~40% | ~60%
Fractional GPU (≤ 250m) | 396 | 5.6% | ~16% | ~84%
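To make the tiers concrete, here is a small sketch that bins a gpu_milli request into the tiers of the table and computes the residual it leaves on a card. The function names are hypothetical; the job counts are copied from the table.

```python
def allocation_tier(gpu_milli: int) -> str:
    """Map a per-GPU request to one of the tiers used in the table."""
    if gpu_milli >= 1000:
        return "Full GPU (1000m)"
    if gpu_milli > 500:
        return "Fractional GPU (501–999m)"
    if gpu_milli > 250:
        return "Fractional GPU (251–500m)"
    return "Fractional GPU (≤ 250m)"


def residual_milli(gpu_milli: int) -> int:
    """Milli-GPU left on a 1,000-milli card after this job lands alone."""
    return max(0, 1000 - gpu_milli)


# Job counts per tier, as reported in the table.
tier_jobs = {
    "Full GPU (1000m)": 5_064,
    "Fractional GPU (501–999m)": 415,
    "Fractional GPU (251–500m)": 1_189,
    "Fractional GPU (≤ 250m)": 396,
}
total_jobs = sum(tier_jobs.values())
fractional_jobs = total_jobs - tier_jobs["Full GPU (1000m)"]
half_or_less = (tier_jobs["Fractional GPU (251–500m)"]
                + tier_jobs["Fractional GPU (≤ 250m)"])

print(allocation_tier(327), residual_milli(327))
print(f"fractional share: {fractional_jobs / total_jobs:.1%}")
print(f"jobs requesting at most half a GPU: {half_or_less}")
```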

2,000 jobs (28.3% of all GPU jobs) request fractional GPU allocation: they share a physical GPU card with other workloads but rarely fill it completely.

1,585 jobs (the 1,189 + 396 in the two smallest tiers) request ≤ 50% of a GPU (gpu_milli ≤ 500), leaving at least half the GPU's compute and proportional VRAM as potentially stranded residual.

Why fragmentation compounds over time: Each fractional job leaves an odd-sized slot on its card (e.g., 680m, 550m, 330m). As more fractional jobs land, the cluster accumulates a growing set of non-standard residual slots that fail to match new jobs' requests. The ATC'23 paper quantifies this effect with a fragmentation measure and proposes Fragmentation Gradient Descent, a scheduler that places each job where fragmentation grows least. Left unmanaged, fragmentation grows silently, one partial allocation at a time.
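A toy example shows why summed residual capacity overstates what is schedulable. The residual slots are the ones named above; the queued requests are hypothetical, and the sketch assumes a request must fit on a single card (it cannot be stitched together from several residuals).

```python
def fits_anywhere(request_milli: int, residual_slots: list[int]) -> bool:
    """True if at least one card has enough leftover capacity."""
    return any(slot >= request_milli for slot in residual_slots)


residual_slots = [680, 550, 330]   # leftover milli-GPU on three cards
queued_requests = [1000, 700, 500]  # hypothetical pending jobs

total_residual = sum(residual_slots)
print(f"total residual: {total_residual} milli-GPU")

for request in queued_requests:
    verdict = "fits" if fits_anywhere(request, residual_slots) else "stranded"
    print(f"request {request:>4}m: {verdict}")
```

The three cards hold more than 1.5 GPUs of idle capacity in aggregate, yet neither the full-GPU request nor the 700m request can be placed.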

67 TB of GPU memory associated with idle capacity

The cluster holds 145,280 GB (~145 TB) of GPU VRAM across its 6,212 cards. With 46% of GPU compute capacity idle at snapshot time, approximately 67 TB of VRAM (46% × 145,280 GB ≈ 66,800 GB) sits on GPUs that are either not running any workload or running workloads that occupy only a fraction of the card's memory capacity. This estimate assumes idle VRAM scales with idle compute.
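The VRAM estimate is a simple proportional calculation, sketched below with the figures quoted in this section. It assumes idle VRAM scales with idle compute, which is a simplification rather than a measured quantity.

```python
total_vram_gb = 145_280   # total cluster VRAM
gpu_nodes = 1_213
idle_fraction = 0.46      # idle GPU compute fraction at the snapshot

idle_vram_gb = total_vram_gb * idle_fraction
mean_vram_per_node_gb = total_vram_gb / gpu_nodes

print(f"VRAM on idle / fragmented GPUs: {idle_vram_gb:,.0f} GB "
      f"(about {idle_vram_gb / 1000:.0f} TB)")
print(f"mean VRAM per GPU node: {mean_vram_per_node_gb:.1f} GB")
```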

Metric | Value | Source
Total cluster VRAM | 145,280 GB | openb_node_list_gpu_node.csv
GPU compute idle fraction (snapshot) | 46.0% | Running pods × gpu_milli ÷ total capacity
VRAM on idle / fragmented GPUs | ~66,800 GB (67 TB) | computed: 46% × 145,280 GB
Mean VRAM per GPU node | 119.8 GB | openb_node_list_gpu_node.csv
Mean gpu_milli per GPU job | 809 milli (80.9% of GPU) | openb_pod_list_default.csv
Mean gpu_milli for fractional jobs | 327 milli (32.7% of GPU) | jobs with gpu_milli < 1000
The memory-compute mismatch: GPU memory allocation in Kubernetes does not automatically follow gpu_milli proportionally — many frameworks allocate device-level VRAM pages in blocks. A job requesting 200m GPU compute may pin far more than 20% of the card's VRAM, because CUDA contexts, model weights, and framework buffers are not fractional. The residual VRAM that isn't consumed by the job's allocation is unavailable to other workloads: it's locked to the device, but neither used by the job nor returnable to a shared pool.

861 jobs never ran — starvation at full-capacity appearance

The most direct evidence that fragmentation creates real harm is the Pending pod count: 861 GPU jobs were submitted to the cluster and never scheduled at all. An additional 1,870 GPU jobs failed, for a combined 38.7% of all GPU jobs (2,731 of 7,064) that did not complete successfully.

GPU job scheduling outcomes: all 7,064 GPU pods

  • Running (active): 58.7%
  • Failed: 26.5%
  • Pending (never ran): 12.2%
  • Succeeded: 2.6%

n = 7,064 GPU pods · source: openb_pod_list_default.csv (pod_phase column)
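The outcome shares follow from the pod counts quoted in this section. In the sketch below the Running, Failed and Pending counts are taken from the text; the Succeeded count is not stated directly, so it is derived here as the remainder, which is an assumption.

```python
gpu_pods = 7_064
phase_counts = {"Running": 4_149, "Failed": 1_870, "Pending": 861}

# Assumption: every remaining GPU pod is in the Succeeded phase.
phase_counts["Succeeded"] = gpu_pods - sum(phase_counts.values())

shares = {phase: count / gpu_pods for phase, count in phase_counts.items()}
not_completed = phase_counts["Failed"] + phase_counts["Pending"]

for phase, share in shares.items():
    print(f"{phase:<10} {phase_counts[phase]:>5}  {share:.1%}")
print(f"Failed + Pending: {not_completed} ({not_completed / gpu_pods:.1%})")
```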

38.7% of GPU jobs — 2,731 out of 7,064 — never completed successfully. Meanwhile, 46% of cluster GPU capacity sat idle. This is the defining symptom of fragmentation: a cluster that looks full (GPU nodes allocated) but is structurally unable to serve demand because the remaining capacity is in the wrong shape.

How utilisation and fragmentation are measured

Unlike telemetry-based datasets (per-minute GPU utilisation from nvidia-smi), the PAI traces record scheduling state. GPU capacity utilisation is therefore derived from the sum of allocated gpu_milli across all running pods, divided by total cluster capacity.

-- Cluster-level GPU utilisation from scheduling state
allocated_milli = SUM(num_gpu × 1000) FOR full-GPU jobs
                    + SUM(gpu_milli) FOR fractional jobs
                    WHERE pod_phase = 'Running'

capacity_milli = SUM(gpu) × 1000 FROM node_list

utilisation = allocated_milli / capacity_milli -- 54.0%
idle_fraction = 1 − utilisation -- 46.0%
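The pseudocode above translates fairly directly into Python. The sketch below uses only the standard library and runs on tiny made-up CSV samples; the column names gpu, num_gpu, gpu_milli and pod_phase come from the text, while the node and name columns and all sample values are hypothetical. It is an illustration of the calculation, not the original analysis script.

```python
import csv
import io

# Made-up samples shaped like the node and pod files described in the text.
NODES_CSV = """node,gpu,model
node-a,8,G2
node-b,2,T4
"""
PODS_CSV = """name,num_gpu,gpu_milli,pod_phase
pod-1,4,1000,Running
pod-2,1,327,Running
pod-3,1,500,Pending
pod-4,1,1000,Failed
"""


def load_rows(text: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def capacity_milli(nodes: list[dict]) -> int:
    """Total cluster capacity: GPUs per node x 1000 milli-GPU."""
    return sum(int(node["gpu"]) * 1000 for node in nodes)


def allocated_milli(pods: list[dict]) -> int:
    """Milli-GPU held by Running pods, following the pseudocode's rule."""
    total = 0
    for pod in pods:
        if pod["pod_phase"] != "Running":
            continue
        num_gpu, gpu_milli = int(pod["num_gpu"]), int(pod["gpu_milli"])
        if num_gpu == 0:
            continue  # CPU-only pod, no GPU allocation
        # Fractional jobs count their slice; full-GPU jobs count whole cards.
        total += gpu_milli if gpu_milli < 1000 else num_gpu * 1000
    return total


nodes, pods = load_rows(NODES_CSV), load_rows(PODS_CSV)
utilisation = allocated_milli(pods) / capacity_milli(nodes)
print(f"utilisation: {utilisation:.1%}, idle: {1 - utilisation:.1%}")
```

To try it on the real traces, the sample strings can be replaced with the contents of the downloaded files; the real files may contain columns or empty values that this sketch does not handle.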
1. Load node and pod CSV files

Node file provides per-node GPU count and model (e.g., G2, T4, V100M32). Pod file provides per-job GPU request (num_gpu + gpu_milli), QoS tier, and scheduling outcome (pod_phase).

2. Compute total cluster capacity (milli-GPU)

Sum gpu × 1000 across all GPU nodes → 6,212,000 milli-GPU total.

3. Sum allocated milli-GPU for Running pods

Full-GPU jobs contribute num_gpu × 1000; fractional jobs contribute their gpu_milli value directly. Sum across 4,149 Running GPU pods.

4. Derive idle fraction and fragmentation metrics

idle_fraction = 1 − (allocated / capacity). Fractional-job waste = average of (1 − gpu_milli/1000) across all fractional GPU jobs.

5. Count Pending and Failed pods

Directly from pod_phase column: 861 Pending + 1,870 Failed GPU pods confirm that idle capacity does not translate into served demand.
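Steps 4 and 5 can be sketched as two small functions. The pod records below are made up for illustration, and the function names are hypothetical; only the formulas (mean of 1 − gpu_milli/1000 over fractional jobs, and counting by pod_phase) come from the steps above.

```python
from collections import Counter


def fractional_waste(pods: list[dict]) -> float:
    """Mean of (1 - gpu_milli / 1000) across fractional GPU jobs."""
    residuals = [1 - pod["gpu_milli"] / 1000
                 for pod in pods
                 if 0 < pod["gpu_milli"] < 1000]
    return sum(residuals) / len(residuals) if residuals else 0.0


def gpu_phase_counts(pods: list[dict]) -> Counter:
    """Count GPU pods (num_gpu > 0) by scheduling outcome."""
    return Counter(pod["pod_phase"] for pod in pods if pod["num_gpu"] > 0)


# Hypothetical sample records, already parsed to integers.
sample_pods = [
    {"num_gpu": 1, "gpu_milli": 1000, "pod_phase": "Running"},
    {"num_gpu": 1, "gpu_milli": 250, "pod_phase": "Running"},
    {"num_gpu": 1, "gpu_milli": 500, "pod_phase": "Pending"},
    {"num_gpu": 0, "gpu_milli": 0, "pod_phase": "Running"},  # CPU-only pod
    {"num_gpu": 1, "gpu_milli": 1000, "pod_phase": "Failed"},
]

print(f"fractional-job waste: {fractional_waste(sample_pods):.1%}")
print(dict(gpu_phase_counts(sample_pods)))
```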

What fragmentation means for your cluster

🧩

False scarcity from fragmentation

The scheduler reports GPUs as allocated. Queued jobs see no available nodes. But 46% of GPU compute sits unused inside those "allocated" cards — in mismatched fractions that no new job requests.

🧠

VRAM locked without proportional use

GPU memory does not fragment cleanly. A fractional-compute job often pins disproportionate VRAM — locking 67 TB of device memory to workloads that aren't using it fully.

📊

Peer-reviewed production evidence

This is first-party data from Alibaba's production cluster, published at USENIX ATC'23. It independently validates that fragmentation is a real, measurable, cluster-scale problem — not a theoretical concern.

Reclaim the fragments without changing job code

Fragmentation accumulates because there is no mechanism to temporarily suspend a fractional-GPU job, return its GPU slot to the pool, and restore the job when a full-GPU-equivalent slot becomes available. Affinode introduces exactly that primitive via CUDA checkpoint/restore.

1. Identify fragmented GPU slots in real time

Monitor gpu_milli allocation per node. Nodes with residual capacity below a threshold (e.g., < 200m remaining) are fragmented candidates: they hold jobs but cannot absorb new requests.

2. Checkpoint one fractional job to free the full slot

Serialise the fractional job's GPU state to host DRAM or NVMe. The entire GPU is now free, not just a residual fraction of it. The freed card can be assigned to any queued job regardless of size.

3. Schedule a Pending job onto the reclaimed GPU

One of the 861 queued jobs that could not fit due to fragmentation now has a full, clean GPU slot available. No requeue, no delay, no changes to the submitted job spec.

4. Restore the checkpointed job when a slot re-opens

When the newly-scheduled job completes or another GPU frees up, Affinode restores the checkpointed fractional job transparently. The job resumes exactly where it paused, with no lost progress.
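The selection logic behind steps 1 and 2 can be illustrated with a toy model. This is not Affinode's implementation or API, and it performs no checkpointing: the GpuCard class, both functions and the "smallest job first" heuristic are assumptions made for the sake of the example.

```python
from dataclasses import dataclass, field


@dataclass
class GpuCard:
    """Hypothetical view of one GPU: job name -> allocated milli-GPU."""
    gpu_id: str
    jobs: dict = field(default_factory=dict)

    @property
    def residual(self) -> int:
        return 1000 - sum(self.jobs.values())


def fragmented_cards(cards, threshold=200):
    """Step 1: cards that hold jobs but have less than `threshold` left."""
    return [c for c in cards if c.jobs and 0 < c.residual < threshold]


def reclaim_plan(cards, pending_milli):
    """Step 2: pick a (gpu_id, job) to suspend so the pending job fits.

    Returns None when the job already fits somewhere or nothing qualifies.
    Only cards whose sole tenant is a fractional job are considered, so
    suspending that one job frees the whole card.
    """
    if any(c.residual >= pending_milli for c in cards):
        return None
    candidates = [c for c in cards
                  if len(c.jobs) == 1 and 0 < c.residual < 1000]
    if not candidates:
        return None
    # Assumed heuristic: suspend the smallest job first.
    best = min(candidates, key=lambda c: next(iter(c.jobs.values())))
    return best.gpu_id, next(iter(best.jobs))


cards = [
    GpuCard("gpu-0", {"job-a": 850}),
    GpuCard("gpu-1", {"job-b": 327}),
    GpuCard("gpu-2", {"job-c": 1000}),
]
print([c.gpu_id for c in fragmented_cards(cards)])
print(reclaim_plan(cards, pending_milli=1000))
```

In this toy cluster no card can host a full-GPU request, so the plan suspends the single small job on gpu-1 to free that card entirely.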

Applied to this cluster: checkpoint-and-restore on even a fraction of the fragmented capacity would unlock GPUs for the 861 Pending jobs — reducing the failed/stuck job rate from 38.7% toward zero, without adding a single GPU or changing any job submission workflow.

Run it yourself

All data is publicly available from the Alibaba GitHub repository. The two core CSV files are under 650 KB combined. The analysis runs in under 30 seconds on a laptop.

# Download CSVs directly (no LFS, no authentication)
curl -L https://raw.githubusercontent.com/alibaba/clusterdata/master/cluster-trace-gpu-v2023/csv/openb_node_list_gpu_node.csv \
    -o nodes.csv
curl -L https://raw.githubusercontent.com/alibaba/clusterdata/master/cluster-trace-gpu-v2023/csv/openb_pod_list_default.csv \
    -o pods.csv

# Run analysis
pip install pandas numpy
python3 analyze.py

Dataset: alibaba/clusterdata · PAI GPU v2023 · Weng et al., USENIX ATC'23 · Analysis script and methodology available at hello@affinode.io.