
Case Study · Philly Cluster Dataset

GPUs sit idle 25–51% of the time in production ML clusters

An analysis of 93,451 DNN training jobs on Microsoft's internal Philly cluster reveals that exclusive GPU reservations waste between one quarter and half of all compute time — and that the idle periods are genuine mid-job pauses, not measurement artifacts.

Dataset: Microsoft Philly Cluster (ATC'19)
Period: Aug–Dec 2017
Jobs analysed: 93,451
GPU-minutes observed: 36.4 M

  • 49% average idle for short jobs (<6 h)
  • 25% idle even for long jobs (>24 h)
  • 65% of idle periods are mid-job
  • 7.3 M idle GPU-minutes in >24 h jobs alone

The dataset

The Philly traces are a sanitised snapshot of first-party DNN training workloads on Microsoft's internal GPU cluster, published alongside the ATC'19 paper "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads". The trace covers 117,325 jobs submitted over roughly five months and is one of the most comprehensive public records of real-world ML infrastructure behaviour.

Each job is recorded with per-attempt machine and GPU assignments. A separate file captures per-minute GPU utilisation (0–100 %) from nvidia-smi for every machine in the cluster over the same period.

Cluster at a glance

  • 552 machines, up to 8 GPUs each
  • 117,325 total submitted jobs
  • 44.7 M GPU-util rows in the telemetry
  • Aug 2017 – Dec 2017 (5 months)

Scheduling model

  • Exclusive GPU assignment per job
  • Jobs may be preempted and rescheduled (multiple attempts)
  • No GPU time-sharing or MIG
  • Typical of academic and corporate ML clusters today

How idle time is measured

We define idle conservatively: a (machine, minute) slot counts as idle only when every GPU the job was assigned on that machine reports exactly 0 % utilisation for that minute. A single active GPU marks the whole minute as busy. Minutes without telemetry are excluded from both numerator and denominator.

Why this matters: This is a stricter definition than most prior work. Reporting idle at the per-GPU level (any one GPU = 0 %) would yield higher numbers. Our metric asks: was the job making zero progress on any of its assigned hardware this minute? That is the recoverable window Affinode targets.
-- A minute is idle only when every assigned GPU is simultaneously at 0 %
idle = (NOT has_gpu0 OR gpu0 = 0)
     AND (NOT has_gpu1 OR gpu1 = 0)
     AND ...
     AND (NOT has_gpu7 OR gpu7 = 0)

idle_fraction = idle_machine_minutes / total_observed_machine_minutes
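As a sketch, the definition above can be written in plain Python. The data shapes here (lists of per-GPU readings) are illustrative, not the analysis script's actual schema:

```python
def minute_is_idle(utils):
    """utils: per-GPU utilisation readings for one (machine, minute) slot;
    None means no telemetry for that GPU."""
    observed = [u for u in utils if u is not None]
    if not observed:
        return None  # no telemetry: excluded from numerator and denominator
    return all(u == 0 for u in observed)  # idle only if ALL GPUs read 0

def idle_fraction(minutes):
    """minutes: per-minute utilisation lists for one job+machine pair."""
    flags = [minute_is_idle(m) for m in minutes]
    observed = [f for f in flags if f is not None]
    return sum(observed) / len(observed) if observed else 0.0

# A single active GPU saves the whole minute:
print(minute_is_idle([0, 87]))  # False
print(minute_is_idle([0, 0]))   # True
# 2 of 3 observed minutes idle; the no-telemetry minute is dropped entirely
print(idle_fraction([[0, 0], [0, 95], [None, None], [0, 0]]))
```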
1. Select the last attempt per job

Jobs may have multiple scheduling attempts due to preemption or node failure. Only the chronologically last attempt (by end time) is used — capturing the job's final, successful execution window.
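A minimal sketch of this selection in plain Python, with illustrative field names (the real trace schema differs):

```python
# Keep, for each job, the attempt with the latest end time.
attempts = [
    {"job": "j1", "machine": "m07", "start": 100, "end": 140},  # preempted
    {"job": "j1", "machine": "m12", "start": 150, "end": 400},  # final attempt
    {"job": "j2", "machine": "m03", "start": 90,  "end": 300},
]

last = {}
for a in attempts:
    if a["job"] not in last or a["end"] > last[a["job"]]["end"]:
        last[a["job"]] = a

print(sorted((j, a["machine"]) for j, a in last.items()))
# [('j1', 'm12'), ('j2', 'm03')]
```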

2. Convert GPU telemetry to Parquet

The 2.8 GB raw CSV is converted once to a 426 MB ZSTD-compressed Parquet file for columnar access. Subsequent runs reuse the cache.

3. Range join attempts × telemetry

Each (job, machine) attempt row is joined to the GPU-utilisation rows for the same machine whose timestamp falls in [start, end). The join runs in DuckDB with 8 threads.
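The same join logic can be sketched in pure Python (the real analysis runs it in DuckDB; timestamps and utilisation values below are made up):

```python
from bisect import bisect_left

telemetry = {  # machine -> rows sorted by timestamp: (ts, per-GPU utils)
    "m12": [(100, [0, 0]), (101, [80, 75]), (102, [0, 0]), (103, [0, 90])],
}

def rows_for_attempt(machine, start, end):
    """Telemetry rows on `machine` with ts in the half-open window [start, end)."""
    rows = telemetry.get(machine, [])
    ts = [r[0] for r in rows]
    return rows[bisect_left(ts, start):bisect_left(ts, end)]

print(rows_for_attempt("m12", 101, 103))
# [(101, [80, 75]), (102, [0, 0])]
```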

4. Aggregate per job, summarise by duration bucket

Idle and total machine-minutes are summed per job to produce idle_fraction ∈ [0, 1], then grouped across six wall-clock duration buckets.
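The bucketing can be sketched as follows (bucket edges taken from the results table; the code is illustrative, not the actual script):

```python
# Map a job's wall-clock duration (in hours) to its duration bucket.
BUCKETS = [
    (1 / 60, "<1 min"),
    (10 / 60, "1-10 min"),
    (1.0, "10-60 min"),
    (6.0, "1-6 h"),
    (24.0, "6-24 h"),
]

def bucket(duration_hours):
    for upper_edge, label in BUCKETS:
        if duration_hours < upper_edge:
            return label
    return ">24 h"

print(bucket(0.5))   # 10-60 min
print(bucket(30.0))  # >24 h
```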

5. Cross-examination

A separate verification script independently re-derives idle counts for a random sample of jobs using plain Python and direct Parquet queries. All 20 sampled jobs matched the pipeline's counts exactly.
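A toy version of the cross-check, with illustrative data shapes and values standing in for the real trace:

```python
import random

raw_minutes = {            # job -> per-minute readings for its assigned GPUs
    "j1": [[0, 0], [30, 0], [0, 0]],
    "j2": [[10, 10], [50, 60]],
    "j3": [[0], [0], [0]],
}
pipeline_counts = {"j1": 2, "j2": 0, "j3": 3}  # from the main DuckDB run

def recount(job):
    # strict definition: a minute is idle only if every GPU reads exactly 0
    return sum(all(u == 0 for u in minute) for minute in raw_minutes[job])

rng = random.Random(0)
for job in rng.sample(sorted(raw_minutes), k=2):
    assert recount(job) == pipeline_counts[job]
print("all sampled jobs matched")
```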

Idle time by job duration

Across all 93,451 jobs with GPU telemetry in their last attempt window, idle fractions range from 25 % (the longest jobs) to over 51 % (medium-length jobs). Short and medium jobs — the most common workload category — waste roughly half of their allocated GPU time.

Duration bucket    Jobs     Total GPU-hours    Idle machine-min    Idle %
<1 min             6,348    116                3,420               49.3 %
1–10 min           24,537   1,487              43,118              48.1 %
10–60 min          32,352   15,855             477,097             51.1 %
1–6 h              18,732   46,572             1,154,317           49.2 %
6–24 h             5,564    71,797             1,533,275           37.0 %
>24 h              5,918    808,917            7,292,278           25.3 %

Short and medium jobs (< 6 hours) account for 81,969 jobs — 88 % of the workload — and are idle roughly 49–51 % of their observed machine-minutes. Even the longest jobs (> 24 h), which represent the most sustained training runs, waste 1-in-4 GPU-minutes.

Idle time is mid-job, not just a trailing artifact

A natural concern is that "idle" minutes simply reflect jobs that have finished computing but haven't released their GPU reservation yet — a cleanup delay, not real waste. We tested this directly.

65% of job+machine pairs with mixed idle/non-idle minutes show at least one idle period followed by active minutes — the job genuinely paused, then resumed computation.

The remaining 35% fit the "ran then went idle" pattern — the job finished computing but held its GPU allocation until the scheduler reclaimed it. Also recoverable, but through a different mechanism.

The dominant pattern is genuine mid-job inactivity: data loading between epochs, evaluation passes, checkpointing I/O, waiting on external data pipelines, or researchers stepping away from interactive sessions while keeping their allocation. These are exactly the windows Affinode's yield-and-resume targets.

What this means operationally: A system that can checkpoint GPU state during idle windows and restore it when the job resumes would recover this capacity without disrupting the user or requiring any code changes — which is precisely what Affinode does via NVIDIA's CUDA checkpoint API.

What this means for your cluster

False scarcity

When jobs hold idle GPUs, other jobs queue. The cluster appears full while roughly half its compute sits unused. Yield-and-resume eliminates this contradiction.

💰 Direct cost waste

At cloud GPU rates ($2–6/hr per A100), a 49 % idle rate on a 100-GPU cluster represents $1–3 M/year in reserved-but-idle compute — recoverable without changing a single line of user code.
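A back-of-envelope version of that arithmetic, under the stated assumptions (100 GPUs reserved year-round at $2–6/hr, 49% idle; these are assumed rates, not measurements from the trace):

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
IDLE_FRACTION = 0.49
N_GPUS = 100

for rate_per_hour in (2, 6):
    wasted = N_GPUS * rate_per_hour * HOURS_PER_YEAR * IDLE_FRACTION
    print(f"${rate_per_hour}/hr: ${wasted / 1e6:.2f}M/year reserved-but-idle")
# $2/hr gives about $0.86M/year and $6/hr about $2.58M/year,
# i.e. roughly the $1-3M range quoted above.
```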

📊 Independent evidence

This is peer-reviewed, third-party data (ATC'19) from a real production cluster. It independently validates Affinode's core claim that ML workloads average 30–50 % GPU utilisation.

Run it yourself

The full analysis is open source. Download the Philly traces from the msr-fiddle/philly-traces repository, then run:

# Install dependency
pip install duckdb

# Run the analysis (first run converts CSV → Parquet, ~2 min)
python3 analysis/idle_time_by_duration.py \
    --data-dir trace-data-extracted \
    --out trace-data-extracted/idle_by_duration.csv \
    --threads 8

# Cross-examine a random sample to verify correctness
python3 analysis/cross_examine.py \
    --data-dir trace-data-extracted \
    --idle-csv trace-data-extracted/idle_by_duration.csv \
    --sample 20

Source and methodology notes: github.com/msr-fiddle/philly-traces · Weng et al., ATC'19 · Analysis scripts available on request at hello@affinode.io.