Case Study · Philly Cluster Dataset
An analysis of 93,451 DNN training jobs on Microsoft's internal Philly cluster reveals that exclusive GPU reservations waste between one quarter and half of all compute time — and that the idle periods are genuine mid-job pauses, not measurement artifacts.
Background
The Philly traces are a sanitised snapshot of first-party DNN training workloads on Microsoft's internal GPU cluster, published alongside the ATC'19 paper "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads". The trace covers 117,325 jobs submitted over roughly five months and is one of the most comprehensive public records of real-world ML infrastructure behaviour.
Each job is recorded with per-attempt machine and GPU assignments. A separate file captures per-minute GPU utilisation (0–100 %) from nvidia-smi for every machine in the cluster over the same period.
Methodology
We define idle conservatively: a (machine, minute) slot is idle only when all GPUs the job was assigned on that machine report exactly 0 % utilisation for that minute. A single active GPU saves the entire minute. Minutes without telemetry are excluded from both numerator and denominator.
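The idle test above can be sketched as a predicate over one (machine, minute) slot. This is an illustrative sketch, not the analysis code itself; it assumes utilisation readings arrive as a list of per-GPU percentages, with None for missing telemetry, and judges slots with partial telemetry on the readings that do exist.

```python
from typing import Optional, Sequence

def slot_state(gpu_utils: Sequence[Optional[float]]) -> Optional[str]:
    """Classify one (machine, minute) slot for a job's assigned GPUs.

    Returns "idle" only when every available reading is exactly 0 %,
    "active" when any GPU reported > 0 %, and None when no telemetry
    exists (the minute is excluded from numerator and denominator).
    """
    readings = [u for u in gpu_utils if u is not None]
    if not readings:
        return None          # no telemetry: drop the minute entirely
    if all(u == 0.0 for u in readings):
        return "idle"        # all observed GPUs at exactly 0 %
    return "active"          # a single active GPU saves the minute
```

Counting "idle" and non-None slots per job then yields the numerator and denominator used throughout.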
Pipeline
Jobs may have multiple scheduling attempts due to preemption or node failure. Only the chronologically last attempt (by end time) is used — capturing the job's final, successful execution window.
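Selecting the final attempt per job reduces to a max-by-end-time pass. A minimal sketch, with illustrative field names `job_id` and `end_time`:

```python
def last_attempts(attempts):
    """Keep only each job's chronologically last attempt by end time.

    `attempts` is an iterable of dicts keyed by the illustrative names
    "job_id" and "end_time"; on ties, the later-seen row wins.
    """
    best = {}
    for a in attempts:
        cur = best.get(a["job_id"])
        if cur is None or a["end_time"] >= cur["end_time"]:
            best[a["job_id"]] = a
    return list(best.values())
```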
The 2.8 GB raw CSV is converted once to a 426 MB ZSTD-compressed Parquet file for columnar access. Subsequent runs reuse the cache.
Each (job, machine) attempt row is joined to GPU-utilisation rows where `util.machine = attempt.machine AND util.ts ∈ [start, end)`. The join runs in DuckDB with 8 threads.
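The join's semantics can be replicated in pure Python as a naive O(n·m) sketch (the DuckDB version is what actually runs; field names mirror the condition above, and timestamps are any comparable values):

```python
def join_util(attempts, util_rows):
    """Attach GPU-util rows to each (job, machine) attempt window.

    A row matches when its machine equals the attempt's machine and
    its timestamp falls in the half-open window [start, end).
    """
    out = []
    for a in attempts:
        for r in util_rows:
            if r["machine"] == a["machine"] and a["start"] <= r["ts"] < a["end"]:
                out.append((a["job_id"], r))
    return out
```

The half-open interval prevents a minute on an attempt boundary from being counted twice.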
Idle and total machine-minutes are summed per job to produce `idle_fraction ∈ [0, 1]`, then grouped across six wall-clock duration buckets.
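The per-job aggregation and bucketing can be sketched as follows; the bucket edges correspond to the six duration buckets in the results table, and the exact boundary handling here is an assumption:

```python
from bisect import bisect_right

# Bucket edges in minutes: <1 min, 1-10 min, 10-60 min, 1-6 h, 6-24 h, >24 h
EDGES = [1, 10, 60, 360, 1440]
LABELS = ["<1 min", "1-10 min", "10-60 min", "1-6 h", "6-24 h", ">24 h"]

def idle_fraction(idle_minutes: int, total_minutes: int) -> float:
    """Share of observed machine-minutes that were idle, in [0, 1]."""
    return idle_minutes / total_minutes if total_minutes else 0.0

def bucket(duration_min: float) -> str:
    """Map a job's wall-clock duration to one of the six buckets."""
    return LABELS[bisect_right(EDGES, duration_min)]
```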
A separate verification script independently re-derives idle counts for a random sample using plain Python and direct Parquet queries; all 20 sampled jobs matched the pipeline's counts exactly.
Results
Across all 93,451 jobs with GPU telemetry in their last attempt window, idle fractions range from 25 % (the longest jobs) to over 51 % (medium-length jobs). Short and medium jobs — the most common workload category — waste roughly half of their allocated GPU time.
| Duration bucket | Jobs | Total GPU-hours | Idle machine-min | Idle % |
|---|---|---|---|---|
| <1 min | 6,348 | 116 | 3,420 | |
| 1 – 10 min | 24,537 | 1,487 | 43,118 | |
| 10 – 60 min | 32,352 | 15,855 | 477,097 | |
| 1 – 6 h | 18,732 | 46,572 | 1,154,317 | |
| 6 – 24 h | 5,564 | 71,797 | 1,533,275 | |
| >24 h | 5,918 | 808,917 | 7,292,278 | |
Short and medium jobs (< 6 hours) account for 81,969 jobs — 88 % of the workload — and are idle roughly 49–51 % of their observed machine-minutes. Even the longest jobs (> 24 h), which represent the most sustained training runs, waste 1-in-4 GPU-minutes.
Key Finding
A natural concern is that "idle" minutes simply reflect jobs that have finished computing but haven't released their GPU reservation yet — a cleanup delay, not real waste. We tested this directly.
The majority of job+machine pairs with mixed idle/non-idle minutes show at least one idle period followed by active minutes: the job genuinely paused, then resumed computation. The remainder fit the "ran then went idle" pattern: the job finished computing but held its GPU allocation until the scheduler reclaimed it. That tail is also recoverable, but through a different mechanism.
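Distinguishing the two patterns amounts to scanning each pair's per-minute state sequence. A sketch, assuming states are the strings "idle"/"active" from the methodology's slot classification:

```python
def classify_pair(states):
    """Classify a (job, machine) pair's per-minute activity sequence.

    "paused-and-resumed": some idle minute is later followed by an
    active minute, i.e. genuine mid-job inactivity.
    "ran-then-idle": every idle minute comes after the last active
    one, i.e. the job finished but held its allocation.
    "uniform": no mix of idle and active minutes at all.
    """
    if "idle" not in states or "active" not in states:
        return "uniform"
    last_active = max(i for i, s in enumerate(states) if s == "active")
    if "idle" in states[:last_active]:
        return "paused-and-resumed"   # idle period followed by activity
    return "ran-then-idle"            # trailing idle tail only
```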
The dominant pattern is genuine mid-job inactivity: data loading between epochs, evaluation passes, checkpointing I/O, waiting on external data pipelines, or researchers stepping away from interactive sessions while keeping their allocation. These are exactly the windows Affinode's yield-and-resume targets.
Implications
When jobs hold idle GPUs, other jobs queue. The cluster appears full while roughly half its compute sits unused. Yield-and-resume eliminates this contradiction.
At cloud GPU rates ($2–6/hr per A100), a 49 % idle rate on a 100-GPU cluster represents $1–3 M/year in reserved-but-idle compute — recoverable without changing a single line of user code.
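The dollar range follows from straightforward arithmetic, using the cluster size, idle rate, and hourly rates stated above and 8,760 hours per year:

```python
GPUS = 100
HOURS_PER_YEAR = 24 * 365          # 8,760
IDLE_RATE = 0.49                   # short/medium-job idle fraction
RATE_LOW, RATE_HIGH = 2.0, 6.0     # $/GPU-hour range for an A100

idle_gpu_hours = GPUS * HOURS_PER_YEAR * IDLE_RATE   # 429,240 GPU-hours
low = idle_gpu_hours * RATE_LOW                      # $858,480/year
high = idle_gpu_hours * RATE_HIGH                    # $2,575,440/year
```

Rounding the endpoints gives the quoted $1–3 M/year band.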
This is peer-reviewed, third-party data (ATC'19) from a real production cluster. It independently validates Affinode's core claim that ML workloads average 30–50 % GPU utilisation.
Reproducibility
The full analysis is open source. Download the Philly traces from the msr-fiddle/philly-traces repository and run the analysis scripts against the extracted trace files.
Source and methodology notes: github.com/msr-fiddle/philly-traces · Jeon et al., ATC'19 · Analysis scripts available on request at hello@affinode.io.