Case Study · Alibaba PAI Cluster Traces · ATC'23

46% of GPU capacity idle while 861 jobs queue: fragmentation is why

Direct analysis of Alibaba's production PAI GPU cluster (6,212 GPUs, 8,152 jobs) shows that 46% of total GPU compute capacity sits unused at the trace snapshot while 12.2% of GPU jobs are never scheduled. The culprit is fragmentation: fractional GPU allocation leaves residual GPU capacity that queued jobs cannot use.

Dataset Alibaba PAI GPU v2023 (ATC'23)

GPU nodes 1,213 (6,212 GPUs)

Jobs analysed 8,152 (7,064 GPU jobs)

VRAM capacity 145,280 GB total

Background

The dataset

The Alibaba PAI GPU v2023 traces were released alongside the USENIX ATC'23 paper "Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent". They capture a real-world snapshot of Alibaba's production PAI (Platform for AI) GPU cluster: node configurations, submitted jobs, GPU allocation requests, and scheduling outcomes.

Unlike utilisation-time-series datasets, the PAI traces record the scheduling state: what jobs were submitted, what GPU resources they requested (down to milli-GPU precision), and whether they were scheduled, failed, or left waiting. This makes them well suited to measuring fragmentation-driven waste, meaning capacity that exists in the cluster but is structurally unreachable by queued jobs.

Cluster at a glance

1,213 GPU nodes across 7 GPU model families
6,212 total GPUs — 145,280 GB total VRAM
Models: G2 (4,392), T4 (842), G3 (312), V100M32 (204), V100M16 (195), P100 (265), A10 (2)
86.7% of submitted jobs are GPU jobs

GPU allocation model

Jobs specify num_gpu (whole GPUs) + gpu_milli (milli-GPU fraction)
gpu_milli = 1000 means a full GPU; values below 1000 indicate GPU sharing
28.3% of GPU jobs request fractional GPU (gpu_milli < 1000)
Fractional jobs average only 32.7% of a GPU's compute capacity
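The two request fields can be read as in the minimal sketch below. It assumes a fractional request always targets a single card, and the function name and returned keys are illustrative rather than taken from the dataset or the paper.

```python
def describe_request(num_gpu: int, gpu_milli: int) -> dict:
    """Classify one job's GPU request from its num_gpu and gpu_milli fields."""
    if num_gpu == 0:
        # No GPU requested at all (CPU-only job).
        return {"kind": "cpu-only", "requested_milli": 0, "residual_milli": 0}
    if gpu_milli < 1000:
        # GPU sharing: the job takes part of one card (assumed single-card).
        return {
            "kind": "fractional",
            "requested_milli": gpu_milli,
            "residual_milli": 1000 - gpu_milli,
        }
    # Whole-GPU request: num_gpu cards, each fully occupied.
    return {"kind": "full", "requested_milli": num_gpu * 1000, "residual_milli": 0}


print(describe_request(1, 327))   # fractional request leaving 673 milli-GPU on the card
print(describe_request(2, 1000))  # two whole GPUs, no residual
```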

Finding 1

46% of cluster GPU capacity is idle at snapshot time

At the time of the trace snapshot, 4,149 GPU jobs are in Running state. Summing their allocated gpu_milli gives 3.36 million milli-GPU in use, against a total cluster capacity of 6.21 million milli-GPU (6,212 GPUs × 1,000). The cluster is running at 54% utilisation, leaving 46% of GPU compute capacity effectively idle.
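The headline ratio can be re-derived from three numbers reported in this case study: the Running pod count, the mean allocation per pod, and the GPU count. The variable names are illustrative.

```python
running_pods = 4_149        # GPU pods in Running state
mean_milli_per_pod = 809    # mean allocated milli-GPU per pod, as reported here
total_gpus = 6_212

allocated_milli = running_pods * mean_milli_per_pod   # 3,356,541 milli-GPU
capacity_milli = total_gpus * 1000                    # 6,212,000 milli-GPU

utilisation = allocated_milli / capacity_milli
idle_fraction = 1 - utilisation

print(f"{utilisation:.1%} used, {idle_fraction:.1%} idle")  # 54.0% used, 46.0% idle
```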

Cluster GPU compute capacity: snapshot utilisation

Bar	Breakdown	Count
Total capacity	54% allocated to 4,149 running jobs; 46% idle / fragmented	6,212 GPUs
Full-GPU jobs	71.7% of running GPU jobs; 1,000 milli (full GPU) each	2,975 jobs
Fractional jobs	32.7% of the card used on average; 67.3% residual wasted per card	1,174 jobs

Source: openb_pod_list_default.csv · 4,149 Running GPU pods × mean 809 milli-GPU ÷ 6,212,000 total milli-GPU

46% of GPU capacity is idle despite 861 jobs waiting in Pending state. The cluster is not out of GPUs — it has significant raw capacity remaining. The gap between queued demand and idle supply is structural: fragmentation makes the residual capacity unreachable by the jobs that need it.

Finding 2

Fractional GPU allocation leaves 67% residual per card

GPU-sharing jobs (gpu_milli < 1000) are the source of fragmentation. When a job occupies 327 milli-GPU on a card that has 1,000, the remaining 673 milli-GPU is "stranded": available in principle but often unschedulable because no queued job's request fits within that remaining fraction.

GPU allocation tier	Jobs	% of GPU jobs	Avg GPU fraction	Residual wasted / card
Full GPU (1000m)	5,064	71.7%	100%	0%
Fractional GPU (501–999m)	415	5.9%	~75%	~25%
Fractional GPU (251–500m)	1,189	16.8%	~40%	~60%
Fractional GPU (≤ 250m)	396	5.6%	~16%	~84%
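The tier boundaries in the table can be expressed as a small bucketing function. This is a sketch only: the sample gpu_milli values are hypothetical, not rows from the trace, and the tier labels simply mirror the table.

```python
from collections import Counter


def allocation_tier(gpu_milli: int) -> str:
    """Map a gpu_milli request to the tier labels used in the table above."""
    if gpu_milli >= 1000:
        return "Full GPU (1000m)"
    if gpu_milli > 500:
        return "Fractional GPU (501-999m)"
    if gpu_milli > 250:
        return "Fractional GPU (251-500m)"
    return "Fractional GPU (<= 250m)"


# Hypothetical requests, for illustration only.
sample_requests = [1000, 750, 400, 160, 1000, 500, 250]

tier_counts = Counter(allocation_tier(m) for m in sample_requests)
# Residual left on the card by each request, as a fraction of one GPU.
residuals = [(m, 1 - min(m, 1000) / 1000) for m in sample_requests]

for tier, count in tier_counts.items():
    print(f"{tier}: {count}")
```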

2,000 jobs (28.3% of all GPU jobs) request fractional GPU allocation: they share a physical GPU card with other workloads but rarely fill it completely.

1,585 jobs request ≤ 50% of a GPU (gpu_milli ≤ 500), leaving at least half of the GPU's compute and proportional VRAM as potentially stranded residual.

Why fragmentation compounds over time: Each fractional job leaves an odd-sized slot on its card (e.g., 680m, 550m, 330m). As more fractional jobs land, the cluster accumulates a growing set of non-standard residual slots that are too small for many new requests. The ATC'23 paper quantifies this effect and proposes Fragmentation Gradient Descent, a scheduling policy that places each job where it increases fragmentation the least. Left unmanaged, fragmentation grows silently, one partial allocation at a time.
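A toy example makes the "wrong shape" problem concrete. It reuses the three residual sizes mentioned above and assumes a job must fit on a single card; the helper name is illustrative.

```python
def can_place(request_milli: int, free_per_gpu: list[int]) -> bool:
    """A request fits only if ONE card has enough free milli-GPU on its own."""
    return any(free >= request_milli for free in free_per_gpu)


free_per_gpu = [680, 550, 330]    # residual slots left behind by fractional jobs
total_free = sum(free_per_gpu)    # 1,560 milli-GPU in aggregate

# More than one GPU's worth of capacity is free, yet a full-GPU job cannot run.
print(total_free >= 1000, can_place(1000, free_per_gpu))  # True False

# A smaller request still fits on one of the cards.
print(can_place(500, free_per_gpu))  # True
```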

Finding 3

67 TB of GPU memory associated with idle capacity

The cluster holds 145,280 GB (~145 TB) of GPU VRAM across its 6,212 cards. With 46% of GPU compute capacity idle at snapshot time, a proportional estimate puts approximately 67 TB of VRAM on GPUs that are either not running any workload or running workloads that occupy only a fraction of the card's memory capacity.

Metric	Value	Source
Total cluster VRAM	145,280 GB	openb_node_list_gpu_node.csv
GPU compute idle fraction (snapshot)	46.0%	Running pods × gpu_milli ÷ total capacity
VRAM on idle / fragmented GPUs	~66,800 GB (67 TB)	computed: 46% × 145,280 GB
Mean VRAM per GPU node	119.8 GB	openb_node_list_gpu_node.csv
Mean gpu_milli per GPU job	809 milli (80.9% of GPU)	openb_pod_list_default.csv
Mean gpu_milli for fractional jobs	327 milli (32.7% of GPU)	jobs with gpu_milli < 1000
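The two derived rows in the table follow from simple arithmetic on the reported totals. The sketch below reproduces them; note that the VRAM figure is a proportional estimate (idle compute share × total VRAM), not a per-card measurement.

```python
total_vram_gb = 145_280
gpu_nodes = 1_213
idle_fraction = 0.46   # idle share of GPU compute at the snapshot

# Proportional estimate: assumes VRAM is idle in the same ratio as compute.
idle_vram_gb = idle_fraction * total_vram_gb
mean_vram_per_node_gb = total_vram_gb / gpu_nodes

print(f"VRAM on idle / fragmented GPUs: ~{idle_vram_gb:,.0f} GB")
print(f"Mean VRAM per GPU node: {mean_vram_per_node_gb:.1f} GB")
```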

The memory-compute mismatch: GPU memory allocation in Kubernetes does not automatically follow gpu_milli proportionally — many frameworks allocate device-level VRAM pages in blocks. A job requesting 200m GPU compute may pin far more than 20% of the card's VRAM, because CUDA contexts, model weights, and framework buffers are not fractional. The residual VRAM that isn't consumed by the job's allocation is unavailable to other workloads: it's locked to the device, but neither used by the job nor returnable to a shared pool.

Finding 4

861 jobs never ran: starvation in a cluster that appears full

The most direct evidence that fragmentation creates real harm is the Pending pod count: 861 GPU jobs were submitted to the cluster and never scheduled at all. An additional 1,870 GPU jobs failed, so a combined 38.7% of all GPU jobs (2,731 of 7,064) were either never scheduled or failed.

GPU job scheduling outcomes: all 7,064 GPU pods

Outcome	Share of GPU pods
Running (active)	58.7%
Failed	26.5%
Pending (never ran)	12.2%
Succeeded	2.6%

n = 7,064 GPU pods · source: openb_pod_list_default.csv (pod_phase column)

38.7% of GPU jobs (2,731 out of 7,064) were either never scheduled or failed. Meanwhile, 46% of cluster GPU capacity sat idle. This is the defining symptom of fragmentation: a cluster that looks full (GPU nodes allocated) but is structurally unable to serve demand because the remaining capacity is in the wrong shape.
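The outcome shares can be checked from the pod counts. Running, Failed, and Pending counts are reported in this case study; the Succeeded count of 184 is not stated directly and is derived here as the remainder of the 7,064 GPU pods.

```python
total_gpu_pods = 7_064
outcomes = {"Running": 4_149, "Failed": 1_870, "Pending": 861}
# Derived, not reported: whatever is left after the three stated phases.
outcomes["Succeeded"] = total_gpu_pods - sum(outcomes.values())

shares = {phase: count / total_gpu_pods for phase, count in outcomes.items()}
pending_or_failed = outcomes["Pending"] + outcomes["Failed"]

for phase, share in shares.items():
    print(f"{phase}: {share:.1%}")
print(f"Pending or Failed: {pending_or_failed} ({pending_or_failed / total_gpu_pods:.1%})")
```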

Methodology

How utilisation and fragmentation are measured

Unlike telemetry-based datasets (per-minute GPU utilisation from nvidia-smi), the PAI traces record scheduling state. GPU capacity utilisation is therefore derived from the sum of allocated gpu_milli across all running pods, divided by total cluster capacity.

        -- Cluster-level GPU utilisation from scheduling state
        allocated_milli = SUM(num_gpu × 1000) FOR full-GPU jobs
                        + SUM(gpu_milli)      FOR fractional jobs
                          WHERE pod_phase = 'Running'
        capacity_milli  = SUM(gpu) × 1000 FROM node_list
        utilisation     = allocated_milli / capacity_milli      -- 54.0%
        idle_fraction   = 1 − utilisation                       -- 46.0%

Pipeline

Load node and pod CSV files

Node file provides per-node GPU count and model (e.g., G2, T4, V100M32). Pod file provides per-job GPU request (num_gpu + gpu_milli), QoS tier, and scheduling outcome (pod_phase).

Compute total cluster capacity (milli-GPU)

Sum gpu × 1000 across all GPU nodes → 6,212,000 milli-GPU total.

Sum allocated milli-GPU for Running pods

Full-GPU jobs contribute num_gpu × 1000; fractional jobs contribute their gpu_milli value directly. Sum across 4,149 Running GPU pods.

Derive idle fraction and fragmentation metrics

idle_fraction = 1 − (allocated / capacity). Fractional-job waste = average of (1 − gpu_milli/1000) across all fractional GPU jobs.

Count Pending and Failed pods

Directly from pod_phase column: 861 Pending + 1,870 Failed GPU pods confirm that idle capacity does not translate into served demand.
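The pipeline steps above can be sketched with the standard library alone. This is not the authors' analyze.py: the inline CSV text is a tiny hypothetical sample standing in for the real files, the sn and name columns are placeholders, and only gpu, model, num_gpu, gpu_milli, and pod_phase are column names mentioned in this case study.

```python
import csv
import io

# Hypothetical sample rows; swap in the downloaded nodes.csv / pods.csv contents.
NODES_CSV = """sn,gpu,model
node-a,8,G2
node-b,4,T4
"""
PODS_CSV = """name,num_gpu,gpu_milli,pod_phase
pod-1,1,1000,Running
pod-2,2,1000,Running
pod-3,1,327,Running
pod-4,1,500,Pending
pod-5,1,1000,Failed
pod-6,0,0,Running
"""


def load(text: str) -> list[dict]:
    return list(csv.DictReader(io.StringIO(text)))


def summarise(nodes: list[dict], pods: list[dict]) -> dict:
    capacity_milli = sum(int(n["gpu"]) for n in nodes) * 1000
    gpu_pods = [p for p in pods if int(p["num_gpu"]) > 0]
    running = [p for p in gpu_pods if p["pod_phase"] == "Running"]

    # Full-GPU pods contribute num_gpu x 1000; fractional pods their gpu_milli.
    allocated_milli = sum(
        int(p["gpu_milli"]) if int(p["gpu_milli"]) < 1000 else int(p["num_gpu"]) * 1000
        for p in running
    )
    fractional = [int(p["gpu_milli"]) for p in gpu_pods if int(p["gpu_milli"]) < 1000]
    waste = sum(1 - m / 1000 for m in fractional) / len(fractional) if fractional else 0.0

    return {
        "utilisation": allocated_milli / capacity_milli,
        "idle_fraction": 1 - allocated_milli / capacity_milli,
        "fractional_waste": waste,
        "pending": sum(p["pod_phase"] == "Pending" for p in gpu_pods),
        "failed": sum(p["pod_phase"] == "Failed" for p in gpu_pods),
    }


result = summarise(load(NODES_CSV), load(PODS_CSV))
print(result)
```

To run it against the real trace, replace the two inline strings with the text of the downloaded files, for example via `open("nodes.csv").read()`, and confirm the column names match.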

Implications

What fragmentation means for your cluster

🧩

False scarcity from fragmentation

The scheduler reports GPUs as allocated. Queued jobs see no available nodes. But 46% of GPU compute sits unused inside those "allocated" cards — in mismatched fractions that no new job requests.

🧠

VRAM locked without proportional use

GPU memory does not fragment cleanly. A fractional-compute job often pins disproportionate VRAM — locking 67 TB of device memory to workloads that aren't using it fully.

📊

Peer-reviewed production evidence

This is first-party data from Alibaba's production cluster, published at USENIX ATC'23. It independently validates that fragmentation is a real, measurable, cluster-scale problem — not a theoretical concern.

The Affinode Approach

Reclaim the fragments without changing job code

Fragmentation accumulates because there is no mechanism to temporarily suspend a fractional-GPU job, return its GPU slot to the pool, and restore the job when a full-GPU-equivalent slot becomes available. Affinode introduces exactly that primitive via CUDA checkpoint/restore.

Identify fragmented GPU slots in real time

Monitor gpu_milli allocation per node. Nodes with residual capacity below a threshold (e.g., < 200m remaining) are fragmented candidates: they hold jobs but cannot absorb new requests.

Checkpoint one fractional job to free the full slot

Serialise the fractional job's GPU state to host DRAM or NVMe. The entire GPU is now free — not a residual fraction of it. The freed card can be assigned to any queued job regardless of size.

Schedule a Pending job onto the reclaimed GPU

One of the 861 queued jobs that could not fit due to fragmentation now has a full, clean GPU slot available. No requeue, no delay, no changes to the submitted job spec.

Restore the checkpointed job when a slot re-opens

When the newly-scheduled job completes or another GPU frees up, Affinode restores the checkpointed fractional job transparently. The job resumes exactly where it paused — no lost progress.
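The four steps can be illustrated with a toy in-memory model. It is purely hypothetical and is not Affinode's API or implementation: the Gpu class, reclaim_for, restore_parked, and the parked dictionary are all invented here, and "checkpointing" is represented only by moving a job's bookkeeping entry off the card.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    name: str
    jobs: dict = field(default_factory=dict)   # job name -> milli-GPU held

    @property
    def free_milli(self) -> int:
        return 1000 - sum(self.jobs.values())


def reclaim_for(request_milli: int, gpus: list, parked: dict) -> Gpu:
    """Find a card for the request, parking a card's fractional jobs if nothing fits."""
    for gpu in gpus:
        if gpu.free_milli >= request_milli:
            return gpu                         # fits without any checkpoint
    # Otherwise empty the least-occupied card (stand-in for checkpointing its jobs).
    victim = min(gpus, key=lambda g: sum(g.jobs.values()))
    parked.update(victim.jobs)
    victim.jobs.clear()
    return victim


def restore_parked(gpus: list, parked: dict) -> None:
    """Put parked jobs back on any card that has room for them."""
    for job, milli in list(parked.items()):
        for gpu in gpus:
            if gpu.free_milli >= milli:
                gpu.jobs[job] = milli
                del parked[job]
                break


gpus = [Gpu("gpu-0", {"frac-a": 327}), Gpu("gpu-1", {"frac-b": 450})]
parked = {}

target = reclaim_for(1000, gpus, parked)   # no card has 1000m free, so one is emptied
target.jobs["pending-1"] = 1000            # the queued full-GPU job is placed
restore_parked(gpus, parked)               # the parked job lands on the other card

print(target.name, parked, {g.name: g.free_milli for g in gpus})
```

In this toy run the parked job is restored next to another fractional job, so the two residual slots are consolidated onto one card. A real system would also have to account for VRAM, checkpoint cost, and job priorities, none of which are modelled here.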

Applied to this cluster: checkpoint-and-restore on even a fraction of the fragmented capacity could unlock GPUs for the 861 Pending jobs, cutting into the 12.2% of GPU jobs that never ran and, with it, the combined 38.7% Pending-or-Failed rate, without adding a single GPU or changing any job submission workflow.

Reproducibility

Run it yourself

All data is publicly available from the Alibaba GitHub repository. The two core CSV files are under 650 KB combined. The analysis runs in under 30 seconds on a laptop.

# Download CSVs directly (no LFS, no authentication)
curl -L https://raw.githubusercontent.com/alibaba/clusterdata/master/cluster-trace-gpu-v2023/csv/openb_node_list_gpu_node.csv -o nodes.csv
curl -L https://raw.githubusercontent.com/alibaba/clusterdata/master/cluster-trace-gpu-v2023/csv/openb_pod_list_default.csv -o pods.csv

# Run analysis
pip install pandas numpy
python3 analyze.py

Dataset: alibaba/clusterdata · PAI GPU v2023 · Weng et al., USENIX ATC'23 · Analysis script and methodology available on request from hello@affinode.io.
