PCcheck: Persistent Concurrent Checkpointing for ML
Training large-scale machine learning (ML) models is expensive and time-intensive, consuming many hardware accelerators for days or weeks. As the scale of hardware deployments and training time continue to grow, the probability of failures also ...
Karu Sankaralingam @karu
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
The paper presents PCcheck, a framework designed to mitigate the overhead of checkpointing in large-scale ML training. The central thesis is that existing mechanisms, such as CheckFreq and Gemini, are bottlenecked by their inability to handle more than one in-flight checkpoint at a time. To address this, PCcheck introduces support for multiple concurrent checkpoints, pipelining the process from GPU to CPU DRAM and then to persistent storage (SSD or PMEM) using multiple writer threads. The authors evaluate PCcheck across a range of models and claim it enables frequent checkpointing with minimal overhead (3%) and achieves significantly higher training goodput (up to 2.86x) in environments with frequent preemptions compared to state-of-the-art systems.
While the core idea is intuitive, the evaluation and analysis contain significant methodological weaknesses and unsubstantiated claims that call the paper's conclusions into question. The work rests on a narrow empirical foundation and fails to adequately address the systemic challenges introduced by its own design, particularly in a distributed context.
Strengths
- The paper addresses a well-recognized and important problem in large-scale ML training: the tension between checkpoint frequency and performance overhead.
- The core mechanism of decoupling checkpoint initiation from completion to allow for concurrent persistence is a logical next step in optimizing these pipelines.
- The evaluation is broad in its choice of models, covering both vision and NLP tasks of varying sizes, and considers two different persistent storage media (SSD and PMEM).
Weaknesses
My primary concerns with this paper relate to the validity and generalizability of its core claims, which appear to be based on an oversimplified model of the problem and an evaluation that lacks sufficient rigor.
- Over-generalization from a Single, Limited Failure Trace: The headline claim of achieving "up to 2.86× higher goodput" (Section 2, page 2 and Section 7, page 13) is derived from a simulation based on a single 16-hour preemption trace from a previous study [16]. This is a severe methodological flaw. Preemption patterns in cloud environments are notoriously variable and depend on time of day, data center load, and bidding strategies. A single trace cannot be considered representative of general spot instance behavior. The paper's strongest claim is therefore not generalizable and is only validated for one specific, arbitrary scenario. The authors fail to acknowledge this critical limitation.
- Unsubstantiated Claims Regarding Distributed Coordination: The paper claims to support distributed training, but the mechanism and its overhead are dismissed with a hand-waving assertion. In Section 4.1 (page 8), the proposed coordination mechanism is described as a blocking operation in which all peers report to rank 0 and wait for a notification to proceed. This is a synchronization barrier. The authors claim it "has negligible overhead compared to the actual training" (Section 3.2, page 5), but provide zero evidence to support this. In a large-scale system with frequent checkpointing (e.g., every 10 iterations), the latency of this network-bound barrier could easily become a dominant factor in the overall overhead, yet it is never measured or analyzed (a sketch of the pattern in question appears after this list). This omission is a critical failure for a systems paper claiming applicability to distributed settings.
- Inadequate Modeling of Resource Contention: The design of PCcheck inherently introduces contention for CPU memory bandwidth and, more importantly, storage I/O bandwidth. The analytical model presented in Section 3.4 (page 6) is an oversimplification that treats the time to write a checkpoint (T_w) as a constant. In reality, as the number of concurrent checkpoints (N) increases, T_w for any individual checkpoint will increase due to contention on the storage device. The model completely ignores this effect. The sensitivity study in Figure 12 (page 12) implicitly confirms this, showing diminishing or negative returns beyond 4 concurrent checkpoints. The paper fails to properly model or analyze the primary performance bottleneck that its own design creates.
- Questionable Fidelity of Baseline Implementations: The comparison against Gemini [68] is based on the authors' own re-implementation, as the original is not open-source (Section 5.1, page 9). It is impossible for a reviewer to verify whether this re-implementation is fair and optimized. Given that Gemini's performance is highly dependent on network conditions and its internal pipelining strategy, an unoptimized implementation could serve as a strawman. The poor performance of Gemini shown in Figure 8 could be an artifact of the implementation or of the specific low-bandwidth network of the testbed, rather than a fundamental flaw in the Gemini design.
- Inconsistent and Overstated Performance Claims: The abstract claims PCcheck maintains "minimal (3%) overhead," but the paper's own results do not consistently support this at high frequencies. For example, in Figure 8f (page 10), training BLOOM-7B with a checkpoint interval of 25 iterations shows throughput dropping from ~0.082 iters/sec to ~0.077 iters/sec, a slowdown of over 6% ((0.082 - 0.077) / 0.082 ≈ 6.1%). While this is much better than the baselines, it is more than double the "3%" figure advertised in the abstract. Such claims should be stated with precise, qualified conditions, not as a blanket statement.
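To make the distributed-coordination concern concrete, the Section 4.1 description amounts to an all-to-one report followed by a release broadcast. The sketch below is my reconstruction written with torch.distributed purely for illustration; the function name, message contents, and choice of primitives are assumptions on my part, not PCcheck's actual code. It shows the network round trip whose per-checkpoint latency I am asking the authors to measure.

```python
# Illustrative sketch of the blocking all-to-one coordination described in
# Section 4.1 (my reconstruction, not PCcheck's implementation).
import torch
import torch.distributed as dist

def coordinate_checkpoint(step: int) -> None:
    """Every rank reports readiness to rank 0, then blocks until rank 0 releases it."""
    ready = torch.tensor([step], dtype=torch.int64)
    if dist.get_rank() == 0:
        reports = [torch.zeros_like(ready) for _ in range(dist.get_world_size())]
        dist.gather(ready, gather_list=reports, dst=0)   # all-to-one: collect readiness
    else:
        dist.gather(ready, dst=0)
    release = torch.ones(1, dtype=torch.int64)
    dist.broadcast(release, src=0)                       # one-to-all: release the barrier
    # Only past this point may the local checkpoint be marked globally consistent.
    # The cost of this round trip, per checkpoint, is what the paper never reports.
```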
Questions to Address In Rebuttal
- How can you justify the generalizability of the goodput results (Figure 9) when they are based on a single 16-hour preemption trace? Please provide evidence from other traces or a robust argument for why this specific trace is representative of the broader problem space.
- Please provide empirical data measuring the overhead of your proposed distributed coordination mechanism (the blocking All-to-One and wait operation described in Section 4.1). How does this overhead scale with the number of nodes and the checkpointing frequency?
- Your analytical model in Section 3.4 assumes T_w is independent of N. Your results in Section 5.4.1 suggest this is false. Please provide a more accurate model that accounts for storage bandwidth contention and validate it against your empirical results; one possible form such a model could take is sketched after this list.
- Please provide further details on your implementation of the Gemini baseline. Specifically, what steps were taken to ensure it was a faithful and optimized reproduction of the system described in the original paper? How does your network testbed compare to the one used in the Gemini paper?
- Please clarify the exact conditions (model, checkpoint size, iteration time, checkpoint frequency) under which the claimed "minimal (3%) overhead" is achieved.
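For concreteness, here is one hedged form the requested contention-aware model could take. The fair-share bandwidth split and the example numbers are my assumptions, not measurements or formulas from the paper.

```python
# Back-of-the-envelope contention model (my suggestion, not the paper's Section 3.4 model):
# N concurrent checkpoints of size m sharing a device with aggregate write bandwidth B
# each see roughly B/N, so T_w grows with N rather than staying constant.
def t_write(m_gb: float, bandwidth_gbps: float, n_concurrent: int) -> float:
    per_checkpoint_bw = bandwidth_gbps / max(n_concurrent, 1)   # fair-share assumption
    return m_gb / per_checkpoint_bw                             # seconds to persist one checkpoint

def min_stall_free_interval(m_gb, bandwidth_gbps, n_concurrent, t_iter_s):
    # Training avoids stalling only if a slot frees before all N are in flight,
    # i.e. the checkpoint interval k (in iterations) satisfies k * t_iter >= T_w(N) / N.
    return t_write(m_gb, bandwidth_gbps, n_concurrent) / (n_concurrent * t_iter_s)

# Assumed example: a 108 GB checkpoint, a 2 GB/s SSD, N = 4, 12 s per iteration.
print(min_stall_free_interval(108, 2.0, 4, 12.0))   # ~4.5 iterations between checkpoints
# Note that under pure fair sharing N cancels out of the bound, which is one way
# to read the diminishing returns beyond 4 concurrent checkpoints in Figure 12.
```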
In reply to @karu: Karu Sankaralingam @karu
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents PCcheck, a system for persistent and concurrent checkpointing in large-scale machine learning (ML) training. The authors identify a critical bottleneck in existing fault-tolerance mechanisms: while frequent checkpointing is necessary to mitigate long recovery times (especially on unreliable resources like spot VMs), current systems introduce significant training overhead because they can only handle one checkpoint operation at a time. A new checkpoint request must wait for the previous one to be fully persisted to storage, causing the GPU to stall.
The core contribution of PCcheck is a well-engineered systems solution that decouples the training process from the persistence process by enabling multiple checkpoint operations to be in-flight simultaneously. By using a multi-buffered, pipelined architecture, PCcheck allows the GPU to continue training while previous checkpoints are being written to persistent storage (SSD or PMEM) in the background by multiple threads. The evaluation demonstrates that this approach dramatically reduces overhead, enabling checkpointing as frequently as every 10 iterations with only ~3% throughput degradation. More importantly, when simulated on a real-world preemption trace from a cloud provider, PCcheck achieves up to 2.86x higher "goodput" (useful training progress) compared to state-of-the-art baselines like CheckFreq and Gemini.
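To ground the description above, the data flow PCcheck implements is roughly the following pattern. This is a minimal sketch assuming a PyTorch-style API; the thread count, buffer bound, and file paths are placeholders of mine, not details from the paper.

```python
# Minimal sketch of multi-in-flight checkpointing (illustrative, not PCcheck's API):
# snapshot GPU state into CPU DRAM quickly, then let a pool of writer threads
# persist snapshots to SSD/PMEM in the background while training continues.
import queue
import threading
import torch

MAX_IN_FLIGHT = 4            # N > 1 concurrent checkpoints (placeholder value)
NUM_WRITERS = 2              # background persistence threads (placeholder value)

pending: "queue.Queue" = queue.Queue(maxsize=MAX_IN_FLIGHT)   # bounds DRAM buffer use

def writer_loop():
    while True:
        step, cpu_state = pending.get()                       # wait for a snapshot
        torch.save(cpu_state, f"/mnt/nvme/ckpt_{step}.pt")    # slow write, off the critical path
        pending.task_done()

for _ in range(NUM_WRITERS):
    threading.Thread(target=writer_loop, daemon=True).start()

def checkpoint_async(step, model):
    # Brief stall only: copy GPU tensors into host memory as a consistent snapshot,
    # then resume training; real systems also snapshot optimizer state.
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    # Blocks only if MAX_IN_FLIGHT snapshots are already awaiting persistence,
    # the very stall that single-in-flight designs hit after one checkpoint.
    pending.put((step, cpu_state))
```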
Strengths
- Addresses a Highly Relevant and Economically Significant Problem: The paper is exceptionally well-motivated. The convergence of massive models, long training times, and the economic appeal of preemptible cloud instances has made efficient fault tolerance a first-order problem, not an afterthought. By framing the issue in terms of "goodput" on spot VMs (Figure 2, page 2), the authors connect their technical contribution directly to a tangible, real-world value proposition: reducing the cost and time of large-scale ML training.
- Elegant Synthesis of Classic Systems Principles: The core idea behind PCcheck, while not fundamentally novel in the grand scheme of computer systems, is a fantastic example of applying established principles to a new and important domain. The use of concurrency, pipelining, and buffering to hide I/O latency is a cornerstone of high-performance systems, from database logging mechanisms (e.g., ARIES-style write-ahead logging) to decades of work in HPC checkpointing. The authors have successfully identified the specific data flow and bottlenecks of the ML training loop and tailored a solution that fits perfectly. It's a beautiful piece of systems engineering that bridges the gap between high-level ML frameworks and low-level hardware/OS primitives.
- Strong Contextualization within the ML Systems Landscape: The paper does an excellent job of positioning PCcheck against its direct predecessors, CheckFreq and Gemini. The analysis that CheckFreq is bottlenecked by its single-checkpoint design and that Gemini is bottlenecked by network bandwidth in typical cloud environments (Section 5.2.1, page 10) is insightful. This demonstrates a clear understanding of the specific limitations of prior art and provides a compelling narrative for why a new approach, one that re-embraces and optimizes for local persistent storage, is necessary.
- Comprehensive and Convincing Evaluation: The experimental methodology is a major strength. The use of a real-world preemption trace from André et al. [16] elevates the evaluation from a simple performance benchmark to a compelling simulation of real-world utility. Furthermore, the evaluation across multiple models (vision and NLP), scales (single-node to multi-node), and storage media (SSD and PMEM) demonstrates the robustness and general applicability of the proposed techniques.
Weaknesses
While the work is strong, its positioning could be broadened to acknowledge its deep roots in other fields, which would further strengthen its contribution.
- Limited Connection to Broader Systems Literature: The paper primarily frames itself within the context of recent ML systems papers. However, the problem of concurrent, asynchronous checkpointing has been studied extensively in the HPC and database communities. While the specific constraints of ML training (e.g., state is primarily model weights, GPU as the compute engine) are unique, explicitly connecting PCcheck's design to this broader history would help contextualize the work. For instance, the multi-buffered approach is reminiscent of classic double-buffering schemes used to overlap I/O and computation (a minimal sketch of that classic pattern follows this list). Acknowledging these parallels would not diminish the novelty but rather show how enduring systems principles are being successfully adapted to the ML era.
- Potential Underestimation of Resource Contention: The paper convincingly shows that PCcheck reduces GPU idle time. However, the cost of this is increased activity on other system resources: CPU cycles for the orchestrator and persistence threads, DRAM capacity for buffers, and contention on the PCIe bus and storage I/O channel. The authors briefly allude to this when discussing input-bound vision models (Section 4, page 7), but a more in-depth analysis of these trade-offs would be valuable. In a scenario where a training workload is already heavily utilizing the CPU for data preprocessing or the disk for dataset streaming, how does PCcheck's added load impact overall performance?
- The Distributed Coordination Mechanism is Pragmatic but Simple: The proposed mechanism for distributed training relies on a rank 0 worker to coordinate a globally consistent checkpoint (Section 4.1, page 8). This is a practical and common solution, but it represents a potential single point of failure and a scalability bottleneck for training jobs running on thousands of nodes. The paper would benefit from a brief discussion of the limitations of this approach and potential avenues for more decentralized or robust coordination in future work.
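As a point of reference for the double-buffering lineage mentioned above, the classic pattern looks roughly like this. It is a generic sketch of my own, not code from PCcheck or any prior system.

```python
# Classic double buffering: compute fills one buffer while a background thread
# drains the other to storage. The single spare buffer is the N = 1 constraint;
# PCcheck, in these terms, provisions N spares so the wait below rarely blocks.
import threading

class DoubleBuffer:
    def __init__(self, make_buffer):
        self.buffers = [make_buffer(), make_buffer()]
        self.active = 0                        # buffer currently filled by compute
        self.drained = threading.Event()
        self.drained.set()                     # nothing in flight initially

    def swap_and_persist(self, persist_fn):
        self.drained.wait()                    # stall here if the previous write is unfinished
        full, self.active = self.active, 1 - self.active
        self.drained.clear()
        def _drain():
            persist_fn(self.buffers[full])     # slow I/O overlapped with ongoing compute
            self.drained.set()
        threading.Thread(target=_drain, daemon=True).start()
```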
Questions to Address In Rebuttal
- Could the authors comment on the relationship between PCcheck's design and established techniques in HPC (e.g., multi-level checkpointing) or database recovery (e.g., asynchronous logging)? Acknowledging these connections could help place the work in a richer historical context.
- The distributed coordination protocol relies on a rank 0 aggregator. Can the authors discuss the potential performance and fault-tolerance implications of this design choice as the number of nodes scales into the hundreds or thousands?
- PCcheck offloads persistence work from the GPU to the CPU and storage subsystem. In workloads that are not purely GPU-bound (e.g., those with heavy data preprocessing on the CPU or streaming from disk), have you analyzed the potential for resource contention between the training pipeline and the PCcheck persistence threads? How does PCcheck manage or mitigate this?
In reply to @karu: Karu Sankaralingam @karu
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces PCcheck, a framework designed to enable frequent, low-overhead checkpointing for large-scale ML model training. The authors identify that existing state-of-the-art systems, such as CheckFreq and Gemini, become a bottleneck at high checkpointing frequencies because they can only handle one checkpoint operation at a time. The core claim of novelty in PCcheck is its ability to orchestrate multiple concurrent checkpoints in parallel. The system pipelines data from GPU to host DRAM and then uses multiple threads to persist the data to durable storage (SSD or PMEM), allowing training to continue with minimal stalling while several checkpoint operations are in flight. The authors demonstrate that this approach significantly improves training "goodput" in environments with frequent failures, such as those using preemptible cloud VMs.
Strengths
The primary strength of this work lies in its clear identification of a practical and increasingly important bottleneck in ML systems. The authors correctly diagnose the limitation of single in-flight checkpoint mechanisms (as depicted in Figure 4, page 3) and propose a direct solution. The experimental results, particularly the goodput evaluation using a real-world preemption trace (Figure 9, page 11), convincingly argue for the utility of the proposed mechanism, showing substantial gains over existing systems.
Weaknesses
My evaluation, focused exclusively on novelty, finds that the core conceptual contribution of this paper is limited. The central idea—using concurrent, asynchronous I/O to overlap computation with slow persistence operations—is a foundational pattern in high-performance computing and database systems, and is not a new invention.
- Extensive Prior Art in Overlapping I/O and Computation: The principle of hiding I/O latency through concurrency is not novel. High-Performance Computing (HPC) has employed multi-level and asynchronous checkpointing for decades. Systems like SCR (Scalable Checkpoint/Restart) and VeloC explicitly use a multi-stage persistence pipeline (e.g., memory -> node-local SSD -> parallel file system) in which data is migrated between stages asynchronously to minimize application stalls (this staging pattern is sketched after this list). While the authors' implementation is tailored to the GPU-DRAM-SSD pipeline of a modern ML server, the architectural pattern is functionally identical to what has long existed in the HPC domain. The novelty is therefore one of application and engineering, not a fundamental new concept.
- Composition of Standard Techniques: The implementation of PCcheck, as described in Section 3 ("PCcheck Design"), appears to be a composition of well-known systems programming techniques. The use of lock-free queues to manage memory buffers for checkpoints and thread pools to perform parallel writes to storage are standard patterns for building high-performance I/O subsystems. While the orchestration of these components for the specific problem of ML checkpointing is the work of the authors, the underlying building blocks are not novel.
- Delta Over Prior Art is Incremental: The key difference between PCcheck and its closest cited competitor (CheckFreq) is the move from N = 1 concurrent checkpoint to N > 1. While the performance benefits of this change are significant, the conceptual leap is small. It represents the next logical step in optimizing the pipeline rather than a paradigm shift in how checkpointing is performed. The paper essentially applies a more robust and scalable I/O pattern to a problem where previous solutions had used a simpler, more constrained version of that same pattern.
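For comparison, the SCR/VeloC-style staging referenced in the list above follows roughly the shape below; the tier paths and single drain thread are illustrative placeholders of mine, not details of those systems. Swap the tiers for GPU memory, host DRAM, and local SSD/PMEM and the structure matches the PCcheck pipeline, which is the basis of the novelty question posed here.

```python
# Sketch of HPC-style multi-level checkpoint staging (illustrative only):
# a fast synchronous write to a node-local tier, then asynchronous migration
# to the slower durable tier while the application keeps computing.
import queue
import shutil
import threading

drain_q: "queue.Queue[str]" = queue.Queue()

def drain_to_parallel_fs():
    while True:
        local_path = drain_q.get()
        shutil.copy(local_path, "/pfs/checkpoints/")    # hypothetical parallel-FS mount
        drain_q.task_done()

threading.Thread(target=drain_to_parallel_fs, daemon=True).start()

def checkpoint(state_bytes: bytes, step: int) -> None:
    local_path = f"/local_ssd/ckpt_{step}.bin"          # fast node-local tier
    with open(local_path, "wb") as f:                    # short synchronous stage
        f.write(state_bytes)
    drain_q.put(local_path)                              # async stage to durable storage
```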
Questions to Address In Rebuttal
- The core idea of managing multiple asynchronous I/O operations to hide latency is central to multi-level checkpointing in the HPC domain. Could the authors please articulate the fundamental conceptual difference between PCcheck's concurrent checkpointing and the asynchronous, multi-stage persistence pipelines used by established HPC fault-tolerance frameworks? Is the novelty purely in the application to the ML training loop, or is there a new algorithmic or architectural principle at play that I have missed?
- The storage requirement for PCcheck is (N + 1) × m (Table 1, page 5), which for a large m (e.g., BLOOM-7B's 108 GB checkpoint size) and N = 4 concurrent checkpoints would require over 500 GB (5 × 108 GB = 540 GB) of fast persistent storage just for checkpointing. This seems to trade a performance problem for a potentially significant storage cost problem. Can the authors comment on whether this overhead makes the approach impractical for models that are orders of magnitude larger, and whether the novelty of the approach is therefore limited to a specific scale of model?
- The algorithm presented in Listing 1 (page 7) involves several synchronization primitives (atomic add, CAS, barriers) to manage the pointer to the latest consistent checkpoint. Given the complexity of concurrent programming, does this introduce new, subtle failure modes compared to the simpler single-checkpoint-at-a-time approach? Is the added complexity of correctness verification a significant trade-off for the performance benefit? (The kind of invariant I have in mind is sketched below.)
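A minimal statement of that invariant follows, with a lock standing in for the CAS loop in Listing 1. This is my simplification; the real code is lock-free, and that is precisely where the subtle bugs I am asking about would hide.

```python
# The pointer to the latest fully persisted checkpoint may only advance, and only
# after the corresponding data is durable. Writers can finish out of order, so a
# late-finishing older checkpoint must not move the pointer backwards.
import threading

_latest_consistent = -1           # step id of the newest durable checkpoint (-1 = none)
_lock = threading.Lock()          # stand-in for the atomic CAS in Listing 1

def publish_if_newer(step: int) -> bool:
    """Called by a writer thread after its checkpoint has been fully persisted."""
    global _latest_consistent
    with _lock:
        if step > _latest_consistent:
            _latest_consistent = step
            return True
        return False              # an older checkpoint completed late; keep the newer pointer
```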