
Case Study · Philly Cluster Dataset

GPUs sit idle 25–51% of the time in production ML clusters

An analysis of 93,451 DNN training jobs on Microsoft's internal Philly cluster reveals that exclusive GPU reservations waste between one quarter and half of all compute time — and that the idle periods are genuine mid-job pauses, not measurement artifacts.

Dataset: Microsoft Philly Cluster (ATC'19)
Period: Aug–Dec 2017
Jobs analysed: 93,451
GPU-minutes observed: 36.4 M

  • 49% average idle for short jobs (<6 h)
  • 25% idle even for long jobs (>24 h)
  • 65% of idle periods are mid-job
  • 7.3 M idle GPU-minutes in >24 h jobs alone

The dataset

The Philly traces are a sanitised snapshot of first-party DNN training workloads on Microsoft's internal GPU cluster, published alongside the ATC'19 paper "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads". The trace covers 117,325 jobs submitted over roughly five months and is one of the most comprehensive public records of real-world ML infrastructure behaviour.

Each job is recorded with per-attempt machine and GPU assignments. A separate file captures per-minute GPU utilisation (0–100 %) from nvidia-smi for every machine in the cluster over the same period.

Cluster at a glance

  • 552 machines, up to 8 GPUs each
  • 117,325 total submitted jobs
  • 44.7 M GPU-util rows in the telemetry
  • Aug 2017 – Dec 2017 (5 months)

Scheduling model

  • Exclusive GPU assignment per job
  • Jobs may be preempted and rescheduled (multiple attempts)
  • No GPU time-sharing or MIG
  • Typical of academic and corporate ML clusters today

How idle time is measured

We define idle conservatively: a (machine, minute) slot counts as idle only when every GPU the job was assigned on that machine reports exactly 0 % utilisation for that minute. A single active GPU marks the whole minute as busy. Minutes without telemetry are excluded from both numerator and denominator.

Why this matters: This is a stricter definition than most prior work. Reporting idle at the per-GPU level (any one GPU = 0 %) would yield higher numbers. Our metric asks: was the job making zero progress on any of its assigned hardware this minute? That is the recoverable window Affinode targets.
-- A minute is idle only when every assigned GPU is simultaneously at 0 %
idle = (NOT has_gpu0 OR gpu0 = 0)
     AND (NOT has_gpu1 OR gpu1 = 0)
     AND ...
     AND (NOT has_gpu7 OR gpu7 = 0)

idle_fraction = idle_machine_minutes / total_observed_machine_minutes
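As a sketch, the definition above can be written in plain Python. The data shapes here (lists of per-GPU readings) are illustrative, not the analysis script's actual schema:

```python
def minute_is_idle(utils):
    """utils: per-GPU utilisation readings for one (machine, minute) slot;
    None means no telemetry for that GPU."""
    observed = [u for u in utils if u is not None]
    if not observed:
        return None  # no telemetry: excluded from numerator and denominator
    return all(u == 0 for u in observed)  # idle only if ALL GPUs read 0

def idle_fraction(minutes):
    """minutes: per-minute utilisation lists for one job+machine pair."""
    flags = [minute_is_idle(m) for m in minutes]
    observed = [f for f in flags if f is not None]
    return sum(observed) / len(observed) if observed else 0.0

# A single active GPU saves the whole minute:
print(minute_is_idle([0, 87]))  # False
print(minute_is_idle([0, 0]))   # True
# 2 of 3 observed minutes idle; the no-telemetry minute is dropped entirely
print(idle_fraction([[0, 0], [0, 95], [None, None], [0, 0]]))
```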
1. Select the last attempt per job

Jobs may have multiple scheduling attempts due to preemption or node failure. Only the chronologically last attempt (by end time) is used — capturing the job's final, successful execution window.
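A minimal sketch of this selection in plain Python, with illustrative field names (the real trace schema differs):

```python
# Keep, for each job, the attempt with the latest end time.
attempts = [
    {"job": "j1", "machine": "m07", "start": 100, "end": 140},  # preempted
    {"job": "j1", "machine": "m12", "start": 150, "end": 400},  # final attempt
    {"job": "j2", "machine": "m03", "start": 90,  "end": 300},
]

last = {}
for a in attempts:
    if a["job"] not in last or a["end"] > last[a["job"]]["end"]:
        last[a["job"]] = a

print(sorted((j, a["machine"]) for j, a in last.items()))
# [('j1', 'm12'), ('j2', 'm03')]
```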

2. Convert GPU telemetry to Parquet

The 2.8 GB raw CSV is converted once to a 426 MB ZSTD-compressed Parquet file for columnar access. Subsequent runs reuse the cache.

3. Range join attempts × telemetry

Each (job, machine) attempt row is joined to the GPU-utilisation rows for the same machine whose timestamp falls in [start, end). The join runs in DuckDB with 8 threads.
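The same join logic can be sketched in pure Python (the real analysis runs it in DuckDB; timestamps and utilisation values below are made up):

```python
from bisect import bisect_left

telemetry = {  # machine -> rows sorted by timestamp: (ts, per-GPU utils)
    "m12": [(100, [0, 0]), (101, [80, 75]), (102, [0, 0]), (103, [0, 90])],
}

def rows_for_attempt(machine, start, end):
    """Telemetry rows on `machine` with ts in the half-open window [start, end)."""
    rows = telemetry.get(machine, [])
    ts = [r[0] for r in rows]
    return rows[bisect_left(ts, start):bisect_left(ts, end)]

print(rows_for_attempt("m12", 101, 103))
# [(101, [80, 75]), (102, [0, 0])]
```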

4. Aggregate per job, summarise by duration bucket

Idle and total machine-minutes are summed per job to produce idle_fraction ∈ [0, 1], then grouped across six wall-clock duration buckets.
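The bucketing can be sketched as follows (bucket edges taken from the results table; the code is illustrative, not the actual script):

```python
# Map a job's wall-clock duration (in hours) to its duration bucket.
BUCKETS = [
    (1 / 60, "<1 min"),
    (10 / 60, "1-10 min"),
    (1.0, "10-60 min"),
    (6.0, "1-6 h"),
    (24.0, "6-24 h"),
]

def bucket(duration_hours):
    for upper_edge, label in BUCKETS:
        if duration_hours < upper_edge:
            return label
    return ">24 h"

print(bucket(0.5))   # 10-60 min
print(bucket(30.0))  # >24 h
```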

5. Cross-examination

A separate verification script independently re-derives idle counts for a random sample of jobs using plain Python and direct Parquet queries. All 20 sampled jobs matched the pipeline's counts exactly.
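A toy version of the cross-check, with illustrative data shapes and values standing in for the real trace:

```python
import random

raw_minutes = {            # job -> per-minute readings for its assigned GPUs
    "j1": [[0, 0], [30, 0], [0, 0]],
    "j2": [[10, 10], [50, 60]],
    "j3": [[0], [0], [0]],
}
pipeline_counts = {"j1": 2, "j2": 0, "j3": 3}  # from the main DuckDB run

def recount(job):
    # strict definition: a minute is idle only if every GPU reads exactly 0
    return sum(all(u == 0 for u in minute) for minute in raw_minutes[job])

rng = random.Random(0)
for job in rng.sample(sorted(raw_minutes), k=2):
    assert recount(job) == pipeline_counts[job]
print("all sampled jobs matched")
```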

Idle time by job duration

Across all 93,451 jobs with GPU telemetry in their last attempt window, idle fractions range from 25 % (the longest jobs) to over 51 % (medium-length jobs). Short and medium jobs — the most common workload category — waste roughly half of their allocated GPU time.

Duration bucket    Jobs     Total GPU-hours    Idle machine-min    Idle %
<1 min             6,348    116                3,420               49.3 %
1–10 min           24,537   1,487              43,118              48.1 %
10–60 min          32,352   15,855             477,097             51.1 %
1–6 h              18,732   46,572             1,154,317           49.2 %
6–24 h             5,564    71,797             1,533,275           37.0 %
>24 h              5,918    808,917            7,292,278           25.3 %

Short and medium jobs (< 6 hours) account for 81,969 jobs — 88 % of the workload — and are idle roughly 49–51 % of their observed machine-minutes. Even the longest jobs (> 24 h), which represent the most sustained training runs, waste 1-in-4 GPU-minutes.

Idle time is mid-job, not just a trailing artifact

A natural concern is that "idle" minutes simply reflect jobs that have finished computing but haven't released their GPU reservation yet — a cleanup delay, not real waste. We tested this directly.

65% of job+machine pairs with mixed idle/non-idle minutes show at least one idle period followed by active minutes — the job genuinely paused, then resumed computation.

The remaining 35% fit the "ran then went idle" pattern — the job finished computing but held its GPU allocation until the scheduler reclaimed it. Also recoverable, but through a different mechanism.

The dominant pattern is genuine mid-job inactivity: data loading between epochs, evaluation passes, checkpointing I/O, waiting on external data pipelines, or researchers stepping away from interactive sessions while keeping their allocation. These are exactly the windows Affinode's yield-and-resume targets.

What this means operationally: A system that can checkpoint GPU state during idle windows and restore it when the job resumes would recover this capacity without disrupting the user or requiring any code changes — which is precisely what Affinode does via NVIDIA's CUDA checkpoint API.

What this means for your cluster

False scarcity

When jobs hold idle GPUs, other jobs queue. The cluster appears full while roughly half its compute sits unused. Yield-and-resume eliminates this contradiction.

💰 Direct cost waste

At cloud GPU rates ($2–6/hr per A100), a 49 % idle rate on a 100-GPU cluster represents $1–3 M/year in reserved-but-idle compute — recoverable without changing a single line of user code.
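A back-of-envelope version of that arithmetic, under the stated assumptions (100 GPUs reserved year-round at $2–6/hr, 49% idle; these are assumed rates, not measurements from the trace):

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
IDLE_FRACTION = 0.49
N_GPUS = 100

for rate_per_hour in (2, 6):
    wasted = N_GPUS * rate_per_hour * HOURS_PER_YEAR * IDLE_FRACTION
    print(f"${rate_per_hour}/hr: ${wasted / 1e6:.2f}M/year reserved-but-idle")
# $2/hr gives about $0.86M/year and $6/hr about $2.58M/year,
# i.e. roughly the $1-3M range quoted above.
```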

📊 Independent evidence

This is peer-reviewed, third-party data (ATC'19) from a real production cluster. It independently validates Affinode's core claim that ML workloads average 30–50 % GPU utilisation.

Run it yourself

The full analysis is open source. Download the Philly traces from the msr-fiddle/philly-traces repository, then run:

# Install dependency
pip install duckdb

# Run the analysis (first run converts CSV → Parquet, ~2 min)
python3 analysis/idle_time_by_duration.py \
    --data-dir trace-data-extracted \
    --out trace-data-extracted/idle_by_duration.csv \
    --threads 8

# Cross-examine a random sample to verify correctness
python3 analysis/cross_examine.py \
    --data-dir trace-data-extracted \
    --idle-csv trace-data-extracted/idle_by_duration.csv \
    --sample 20

Source and methodology notes: github.com/msr-fiddle/philly-traces · Weng et al., ATC'19 · Analysis scripts available on request at hello@affinode.io.