
PartIR: Composing SPMD Partitioning Strategies for Machine Learning

By Karu Sankaralingam @karu
    2025-11-02 17:22:04.567Z

    Training modern large neural networks (NNs) requires a combination of parallelization strategies, including data, model, or optimizer sharding. To address the growing complexity of these strategies, we introduce PartIR, a hardware-and-runtime agnostic NN ... [ACM DL Link]

      Karu Sankaralingam @karu
        2025-11-02 17:22:05.233Z

        Paper Title: PartIR: Composing SPMD Partitioning Strategies for Machine Learning
        Reviewer: The Guardian


        Summary

        The paper presents PartIR, a compiler framework for partitioning large-scale neural network computations for SPMD execution. The central thesis is that partitioning strategies should be decoupled from the model's implementation. To this end, PartIR introduces a "schedule-like API" where users define a sequence of "tactics" (e.g., sharding a specific input tensor along a mesh axis). These tactics are incrementally applied as semantics-preserving rewrites on an MLIR-based Intermediate Representation (PartIR:Core), which utilizes functional loop and slice primitives. A core contribution is a propagation pass that extends initial sharding decisions throughout the computation graph, guided by a "Tile-Mapping Registry" (TMR). The system ultimately lowers these representations to device-local programs with explicit communication collectives. The authors claim PartIR is expressive, decoupled, and predictable, and they evaluate its performance against GSPMD on several models, showing comparable results.
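
        To make the decoupling concrete, here is a minimal, hypothetical sketch of what a schedule-like API of this kind could look like; every name in it (Mesh, Shard, partition) is an illustrative assumption for exposition, not PartIR's actual interface.

```python
# Hypothetical sketch of a schedule-like partitioning API. All names
# (Mesh, Shard, partition) are illustrative assumptions, not PartIR's
# actual interface.
from dataclasses import dataclass, field

@dataclass
class Mesh:
    axes: dict = field(default_factory=dict)  # e.g. {"batch": 8, "model": 4}

@dataclass
class Shard:
    value: str  # which function input to shard, by name
    dim: int    # tensor dimension to tile
    axis: str   # mesh axis to tile it over

def model(params, x):
    # Plain model code: no parallelism annotations anywhere.
    return (x @ params["w1"]) @ params["w2"]

mesh = Mesh(axes={"batch": 8, "model": 4})
schedule = [
    Shard(value="x", dim=0, axis="batch"),       # data parallelism first
    Shard(value="params", dim=1, axis="model"),  # then model parallelism
]
# Conceptually, each tactic triggers one semantics-preserving rewrite of
# the IR followed by a propagation pass:
# partitioned = partition(model, mesh, schedule)
```

        The point of the sketch is that the model function never changes when the schedule does.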

        Strengths

        While maintaining a critical stance, I must concede a few well-argued points:

        1. Well-Motivated Problem: The paper correctly identifies a significant pain point in large-scale ML: the entanglement of model logic with complex, hardware-specific parallelism annotations. The goal of decoupling these concerns is a valid and important research direction.

        2. Incremental Rewriting for Conflict Resolution: The most compelling piece of evidence in the paper is presented in Section 7.4 and Figure 7. The comparison between PartIR and its single-tactic variant (PartIR-st) demonstrates that the incremental application and propagation of tactics are critical for resolving sharding conflicts that would otherwise lead to out-of-memory errors or suboptimal performance. This provides strong justification for the core architectural choice of the system. (A minimal sketch of this apply-then-propagate loop follows this list.)

        3. Predictability of Collectives: The analysis in Table 3 (Section 7.3) successfully supports the claim of predictability. By showing a direct correspondence between the high-level strategies (e.g., BP, BP+MP+Z3) and the number of generated communication collectives, the authors demonstrate that their system behaves as a user would analytically expect, which is a notable improvement over opaque, heuristic-driven systems.
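
        A minimal sketch of the incremental driver implied by points 2 and 3, assuming (my assumption, not PartIR's implementation) that a tactic exposes a rewrite method and the compiler exposes a propagate pass:

```python
# Sketch of an incremental schedule driver (assumed structure, not
# PartIR's implementation): the IR can be dumped after every tactic,
# and each propagation starts from the previous, already-consistent
# sharding state, which is what rules out late, hard-to-debug conflicts.
def apply_schedule(module, schedule, propagate, dump_ir=print):
    for i, tactic in enumerate(schedule):
        module = tactic.rewrite(module)  # one semantics-preserving rewrite
        module = propagate(module)       # extend shardings through the graph
        dump_ir(f"--- IR after tactic {i}: {tactic!r} ---")
        dump_ir(module)                  # inspectable, predictable state
    return module
```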

        Weaknesses

        My primary responsibility is to ensure the rigor of published work. The current manuscript contains several significant flaws and overstated claims that undermine its conclusions.

        1. The "Decoupling" Claim is Contradicted by Escape Hatches: The central premise of the paper—a clean separation of the ML implementation from the partitioning strategy—is critically weakened by the admissions in Section 8. The introduction of atomic actions to prevent propagation and the tag primitive to name and force replication of intermediate tensors are, for all intents and purposes, model-internal annotations. The transpose example on page 806 is a clear case where the partitioning strategy fails and requires the user to modify the program's structure. The paper provides no data on how frequently these "escape hatches" are needed in real-world models. If they are common, the primary contribution of "decoupling" is more of an ideal than a reality.

        2. Crucial Limitations Presented as Minor Issues: The discussion in Section 8 dismisses the lack of robust reshape support and the inability to handle uneven sharding (requiring padding) as simple limitations. This is a severe mischaracterization. Reshapes and complex tensor manipulations are ubiquitous in modern architectures, especially Transformers. A partitioning system whose propagation logic "gets blocked" by such a fundamental operation is not general-purpose. This limitation suggests the loop-based rewrite system is too rigid. The failure to address this suggests the system is only proven to work on a curated set of well-behaved models.

        3. Performance Results Lack a Compelling Argument for Adoption: The evaluation in Section 7.2 (Table 2) concludes that PartIR achieves performance that is "on par with that of GSPMD, with negligible differences." While demonstrating parity with a strong baseline is necessary, it is not sufficient. The paper proposes a new, complex compiler stack. For the community to adopt it, there must be a clear advantage. Since the performance is merely equivalent, the argument must pivot to superior ergonomics and programmability. However, the paper presents no user studies, case studies of developer productivity, or other qualitative evidence to substantiate this implicit claim. Without a demonstrated advantage in either performance or usability, the rationale for PartIR's existence is weak.

        4. The Tile-Mapping Registry (TMR) is Opaque and Unassessed: The entire propagation engine (Section 5.2.2) relies on the TMR. This registry is presented as a set of algebraic rules that define how sharding propagates through operations. However, the paper provides only trivial examples (matmul, add). The complexity, scalability, and maintainability of this TMR for the full set of StableHLO operations are never discussed. How are new or custom ops handled? Is this registry manually curated? An incomplete TMR would lead to propagation failure or, worse, silently suboptimal partitioning. The system's robustness is entirely dependent on this unexamined component. (A sketch of what even the trivial matmul rule amounts to follows this list.)

        5. Unsubstantiated Claim of Formal Verification: The paper states, "The critical transformation from PartIR:Core to PartIR:HLO is formally verified but omitted for brevity" (page 795). In a peer-reviewed academic publication, such a strong claim cannot be made without evidence. A formal proof is a significant contribution in itself. Without a proof sketch, a summary of the formal model, or at the very least a reference to an appendix or technical report, this claim is baseless and must be disregarded by the reader.
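
        To make Weakness 4 concrete, here is a sketch of what a single TMR entry amounts to for matmul. The registry format and names are my assumptions; only the underlying sharding algebra is standard.

```python
# Sketch of a tile-mapping rule for C[m, n] = A[m, k] @ B[k, n]. The
# registry format is assumed for illustration; the algebra is standard:
# tiling a non-contracting dim tiles the result on the same dim, while
# tiling the contracting dim k yields per-device partial sums that must
# be combined with an all-reduce.
def matmul_tile_rule(a_tiled_dim, b_tiled_dim):
    if a_tiled_dim == 0 and b_tiled_dim is None:
        return {"result_dim": 0, "collective": None}    # row-tiled result
    if a_tiled_dim is None and b_tiled_dim == 1:
        return {"result_dim": 1, "collective": None}    # column-tiled result
    if a_tiled_dim == 1 and b_tiled_dim == 0:
        return {"result_dim": None, "collective": "all_reduce"}  # partials
    return None  # no rule applies: propagation stops at this op

TMR = {"matmul": matmul_tile_rule}
```

        Even this trivial entry has four cases; the open question is what the analogous table looks like for convolutions, gathers, and fused kernels.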

        Questions to Address In Rebuttal

        The authors must address the following points directly in their rebuttal:

        1. On Decoupling: Please quantify the necessity of the tag and atomic primitives. For the models benchmarked in this paper (U-Net, T32, T48, GNS), how many instances of these model-internal modifications were required to achieve the reported results? (A hypothetical illustration of such annotations follows this list.)

        2. On Generality: Given the fundamental limitations regarding reshape operations, can you precisely define the class of models for which PartIR's propagation is guaranteed to succeed? How does your approach compare to GSPMD's ability to handle reshapes by manipulating logical device ID mappings, and why was a more limited approach chosen?

        3. On TMR Scalability: What is the engineering effort required to define TMR rules for the entire StableHLO op-set? Provide an example of the TMR entry for a non-trivial operation, such as a convolution with complex padding attributes or a fused attention kernel, and discuss the challenges in defining its propagation rules.

        4. On Justification for Adoption: If performance is on-par with GSPMD, what is the precise, evidence-backed argument for adopting PartIR? If the argument is superior programmability, why was a formal user study not conducted to validate this claim?
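
        For context on Question 1, a hypothetical illustration of what such model-internal escape hatches look like. The tag and atomic stand-ins below are based on the paper's description, not verified PartIR signatures:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def tag(x, name):   # stand-in: names an intermediate for the schedule
    return x

def atomic(fn):     # stand-in: marks fn as opaque to propagation
    return fn

def attention_block(q, k, v):
    scores = q @ k.T
    # Naming an intermediate so a schedule can force it replicated means
    # a partitioning decision now lives inside the model code.
    scores = tag(scores, "attn_scores")
    # Likewise, preventing propagation through an op requires editing
    # the model, not just the schedule.
    weights = atomic(softmax)(scores)
    return weights @ v

q = k = v = np.ones((2, 4))
print(attention_block(q, k, v).shape)  # (2, 4)
```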

          In reply to @karu:
          Karu Sankaralingam @karu
            2025-11-02 17:22:15.884Z

            Reviewer: The Synthesizer (Contextual Analyst)

            Summary

            This paper introduces PartIR, a system for partitioning large neural network models for distributed training. The core contribution is not merely another partitioning tool, but a fundamental reframing of the problem itself. The authors propose decoupling the parallelization strategy from the model's implementation, drawing inspiration from schedule-based compilers in high-performance computing and image processing (e.g., Halide).

            This decoupling is achieved via a "schedule," a sequence of composable "tactics" that incrementally rewrite the program's Intermediate Representation (IR). The system is built on MLIR and uses a series of well-defined dialects (PartIR:Core, PartIR:HLO) to abstract parallelism first through functional loops and later through concrete SPMD collectives. This principled, rewriting-based approach aims to be more expressive, predictable, and maintainable than existing methods that rely on in-code annotations (like GSPMD) or opaque, monolithic automatic search. The paper provides a strong evaluation showing that this approach achieves performance comparable to the state-of-the-art while offering significant advantages in debuggability and modularity.

            Strengths

            The true strength of this paper lies in its elegant conceptual foundation and its connection to a rich history of compiler research.

            1. The "Algorithm/Schedule" Dichotomy for ML Parallelism: The most significant contribution is the successful application of the algorithm/schedule separation, famously pioneered by Halide, to the domain of distributed ML training. By treating the partitioning strategy as a first-class, composable artifact (the schedule), the authors create a powerful separation of concerns. ML engineers can focus on model architecture, while systems performance experts can focus on crafting optimal distribution strategies for different hardware backends without modifying the model code. This is a profound and much-needed shift that addresses the growing problem of model portability and maintainability described in the introduction (Section 1, page 1).

            2. Predictability through Incremental, Semantics-Preserving Rewrites: The system's design eschews complex, heuristic-based conflict resolution in favor of an ordered, incremental application of tactics. As shown in the discussion on conflicts (Section 5.2.3, page 8) and the evaluation in Section 7.4 (Figure 7, page 12), applying strategies sequentially allows for the natural prioritization of decisions (e.g., batch parallelism before parameter sharding), resolving potential conflicts in a predictable manner. This stands in stark contrast to annotation-based systems where conflicting annotations can lead to difficult-to-debug performance issues. The fact that users can inspect the IR after each tactic is a massive leap forward for the debuggability of complex parallel schemes. The results in Table 3 (page 11), which show the expected number of collectives for given strategies, provide strong evidence of this predictability.

            3. A Principled, Multi-Level IR Abstraction: The compiler architecture (Figure 3, page 5) is well-conceived. The initial lowering to PartIR:Core, which represents parallelism as functional loop and slice operations, is particularly insightful. It allows the system to reason about tiling and data distribution algebraically via the Tile-Mapping Registry (TMR), independent of the final SPMD execution model. This formal approach is more robust and extensible than the pattern-matching on low-level collectives that other systems are forced to employ, as discussed in the critique of GSPMD in Section 8 (page 12). (A minimal executable model of the loop/slice semantics follows this list.)
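
            A minimal executable model of the loop/slice semantics described in point 3, under my assumption (not PartIR's actual dialect) that a loop over a mesh axis applies its body to one operand slice per device and concatenates the tiles:

```python
import numpy as np

# Minimal executable model (my assumption of the semantics, not PartIR's
# actual dialect) of functional loop/slice tiling: a loop over a mesh
# axis applies the body to one slice per device and stacks the tiles.
def slice_(x, dim, index, num_tiles):
    tile = x.shape[dim] // num_tiles
    idx = [slice(None)] * x.ndim
    idx[dim] = slice(index * tile, (index + 1) * tile)
    return x[tuple(idx)]

def loop(num_tiles, dim, body):
    # Tiles are independent, so this loop is trivially parallel: each
    # iteration becomes one device's program after SPMD lowering.
    return np.concatenate([body(i) for i in range(num_tiles)], axis=dim)

x = np.arange(16.0).reshape(8, 2)
w = np.ones((2, 2))
# y == x @ w, expressed as 4 row-tiles: semantics are preserved.
y = loop(4, 0, lambda i: slice_(x, 0, i, 4) @ w)
assert np.allclose(y, x @ w)
```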

            Weaknesses

            The paper is strong, and the weaknesses identified are more about the boundaries of the contribution and opportunities for deeper exploration than fundamental flaws.

            1. Scope of Expressiveness and the "Reshape Problem": The paper rightly identifies that its rewriting system based on tiling and propagation faces challenges with complex data layout transformations, particularly reshapes (Section 8, page 12). This is a classic challenge for this style of compiler. While GSPMD's approach of manipulating logical device IDs is acknowledged as more flexible here, it comes at the cost of the brittleness the authors critique. The work would be stronger if it discussed potential paths forward. Could the schedule API be extended to include explicit "mesh reshaping" tactics? Or does this problem point to a fundamental limitation of the loop/slice abstraction for certain classes of computation? (A small illustration of why reshapes resist dimension-aligned tiling follows this list.)

            2. The Complexity Shift: From Annotations to Schedules: While the paper successfully argues for decoupling, one could argue it shifts complexity from writing correct in-code annotations to writing correct schedule programs. The AutomaticPartition tactic is presented as a solution (Section 3, page 5), but its interplay with manual tactics is not fully explored. For a user, it is not immediately clear how one would debug a situation where a manual tactic and an automatic one lead to a suboptimal, combined strategy. The paper would benefit from a more detailed discussion of the "ergonomics" of composing manual and automatic search within the PartIR framework.

            3. Positioning Relative to its Successor: The authors note that the learnings from PartIR have been incorporated into Shardy, a joint open-source project (Section 2.1, page 2). While this transparency is commendable, it leaves the reader wondering about the precise nature of this paper's contribution in the current landscape. A clearer delineation of which core PartIR concepts were "proven" and carried forward, and which were superseded by different ideas in Shardy, would help contextualize the lasting impact of this work.
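
            A small numpy illustration (my construction, not an example from the paper) of why reshapes resist rules of the form "input tiled on dim i implies output tiled on dim j":

```python
import numpy as np

# Why reshape blocks dimension-aligned propagation (my construction,
# not an example from the paper). Shard x column-wise across 2 devices:
x = np.arange(8).reshape(2, 4)
shards = np.split(x, 2, axis=1)              # each shard is (2, 2)

# After reshaping the full tensor to (4, 2), every candidate output
# tile mixes elements from both input shards, so no rule of the form
# "input tiled on dim i => output tiled on dim j" exists; a purely
# tiling-based propagation must stop here (or insert resharding).
full = x.reshape(4, 2)
tile0_by_rows = np.split(full, 2, axis=0)[0]
print(sorted(shards[0].ravel()))             # [0, 1, 4, 5]
print(sorted(tile0_by_rows.ravel()))         # [0, 1, 2, 3] - crosses shards
```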

            Questions to Address In Rebuttal

            1. Regarding the limitations with reshapes, could the authors elaborate on whether they see a path to supporting these transformations within the PartIR philosophy of semantics-preserving rewrites, or if this class of problem inherently requires a different abstraction?

            2. Could you provide more insight into the composition of manual and automatic tactics? Specifically, how does the system handle propagation and potential conflicts when an automatic tactic is introduced into a schedule? For instance, what happens if AutomaticPartition on axis "M" proposes a sharding that conflicts with a user's prior manual sharding on axis "B"? (A sketch of this scenario follows this list.)

            3. Could you please clarify the intellectual lineage from PartIR to Shardy? What are the one or two most critical design principles from PartIR that were validated and adopted by Shardy, and what was the primary limitation in PartIR that motivated a different approach in Shardy's design? This would greatly help the committee assess the impact of this specific paper.
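
            To make Question 2 concrete, a hypothetical rendering of the scenario; the tactic names and any conflict behavior are assumptions, not documented PartIR semantics:

```python
# Hypothetical rendering of the scenario in Question 2; the tactic names
# and any conflict behavior are assumptions, not documented PartIR
# semantics.
from dataclasses import dataclass

@dataclass
class Shard:
    value: str
    dim: int
    axis: str

@dataclass
class AutomaticPartition:
    axis: str

schedule = [
    Shard(value="x", dim=0, axis="B"),   # manual: batch parallelism first
    AutomaticPartition(axis="M"),        # then a search over axis "M"
]
# The open question: if the search proposes tiling a dimension already
# tiled along "B", does the earlier manual tactic constrain the search
# space, or can the automatic tactic silently reshard it?
```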

              In reply to @karu:
              Karu Sankaralingam @karu
                2025-11-02 17:22:26.402Z

                Reviewer: The Innovator (Novelty Specialist)

                Summary

                The paper presents PartIR, an MLIR-based compiler framework for partitioning large machine learning models for Single-Program, Multiple-Data (SPMD) execution. The central claim is that by decoupling the partitioning strategy from the model's implementation, PartIR offers a more expressive, predictable, and debuggable system. The core mechanism involves expressing partitioning strategies as a "schedule" of "tactics." Each tactic in the schedule triggers an incremental rewrite of the program's intermediate representation (IR), followed by a rule-based propagation pass that distributes sharding decisions throughout the computation graph. This approach is positioned as an alternative to monolithic, annotation-based systems like GSPMD. The authors use a layered set of MLIR dialects (PartIR:Core, PartIR:HLO) to formalize this process, translating high-level tiling loops into low-level SPMD communication collectives.

                Strengths

                The primary novel contribution of this work lies in the specific application of a well-known paradigm—schedule-based compilation—to the problem of whole-program SPMD partitioning for ML models, and the use of incrementality as a conflict resolution mechanism.

                1. A Novel Mechanism for Conflict Resolution: The most significant innovation is the use of a sequential, incremental schedule to predictably resolve sharding conflicts. In monolithic propagation systems, conflicting sharding decisions (e.g., sharding a tensor on the same dimension along multiple mesh axes) must be resolved with heuristics or user-provided annotations, which can be opaque and difficult to debug. By processing tactics sequentially and propagating their effects incrementally, PartIR provides an explicit ordering that resolves these conflicts by construction. The evaluation in Section 7.4 (page 11, Figure 7) provides compelling evidence that this incremental approach successfully partitions models where a monolithic version (PartIR-st) fails due to memory exhaustion, demonstrating a clear benefit of this design. (A toy model of this ordering behavior follows this list.)

                2. Domain-Specific Adaptation of a Known Paradigm: While schedule-based compilation is not a new idea (see Weaknesses), its adaptation from single-device kernel generation (e.g., Halide) to multi-device, whole-program SPMD parallelism is non-trivial and represents a novel application. The introduction of PartIR:Core with its functional loop and slice operations provides a clean abstraction for representing parallel semantics over device meshes before committing to specific SPMD collectives.
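
                A toy model (my construction, not PartIR's algorithm) of "conflict resolution by construction": each tensor dimension may be tiled along at most one mesh axis, and tactics apply in schedule order, so the earlier decision stands:

```python
# Toy model (my construction) of conflict resolution by construction:
# each tensor dimension can be tiled along at most one mesh axis, and
# tactics are applied in schedule order, so the earlier tactic wins.
def apply_in_order(tactics):
    sharding = {}                      # dim -> mesh axis
    for dim, axis in tactics:
        if dim in sharding:
            # A later proposal conflicts with an established decision:
            # it is simply not applied; no heuristic arbitration needed.
            continue
        sharding[dim] = axis
    return sharding

print(apply_in_order([(0, "batch"), (0, "model")]))  # {0: 'batch'}
print(apply_in_order([(0, "model"), (0, "batch")]))  # {0: 'model'}
```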

                Weaknesses

                While the engineering is impressive, the work's core conceptual pillars are adaptations of long-established ideas from the programming languages and compilers literature. The novelty is more in the combination and application than in fundamental new principles.

                1. The Core Abstraction is Not New: The central idea of separating an algorithm's definition from its optimization "schedule" is the foundational principle of systems like Halide [48], TVM [9], and others, as acknowledged in the related work section (Section 9, page 13). The paper's framing in the Abstract and Introduction could more clearly state that the novelty is not the paradigm itself, but its specific application and the benefits derived therefrom in the SPMD context. As it stands, the claims of "decoupling" might be misconstrued as a fundamental innovation of this work.

                2. Rule-Based Propagation is Standard Compiler Practice: The propagation pass, based on the Tile-Mapping Registry (TMR) described in Section 5.2.1 (page 7), is a well-engineered implementation of semantics-preserving program rewriting. However, using algebraic properties of operations to propagate transformations is a cornerstone of compiler optimization. The TMR is essentially a manually curated database of rewrite rules for propagating tiling information. While effective, this is an evolutionary application of existing compiler technology rather than a revolutionary new technique. (A structural sketch of such a rule-registry propagation pass follows this list.)

                3. Novelty at the Expense of Generality: The paper admits in Section 8 (pages 12-13) that the proposed abstraction has significant limitations, particularly with reshape operations. The loop-based tiling and propagation model struggles where the rank or layout of a tensor changes dramatically. In contrast, prior art like GSPMD [69] handles this by directly manipulating the mapping of logical device IDs to data shards, a more flexible if more complex approach. This suggests the novel abstraction in PartIR achieves predictability by sacrificing some of the generality found in existing systems.
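
                A structural sketch (assumed shape, not PartIR's code) of rule-registry propagation as a fixpoint computation; note how an op kind missing from the registry silently stops propagation along that path, which is exactly the extensibility bottleneck raised in Question 3 below:

```python
# Sketch (assumed structure, not PartIR's code) of propagation as a
# fixpoint over a rule registry: each op kind has a rule mapping known
# operand shardings to an output sharding; iterate until no rule fires.
# An op kind absent from the registry silently blocks propagation.
def propagate(ops, registry, shardings):
    changed = True
    while changed:
        changed = False
        for kind, inputs, output in ops:
            rule = registry.get(kind)
            if rule is None or output in shardings:
                continue
            inferred = rule([shardings.get(i) for i in inputs])
            if inferred is not None:
                shardings[output] = inferred
                changed = True
    return shardings

# Example: elementwise add propagates either operand's sharding.
registry = {"add": lambda ins: next((s for s in ins if s), None)}
ops = [("add", ["a", "b"], "c"), ("add", ["c", "d"], "e")]
print(propagate(ops, registry, {"a": ("dim0", "batch")}))
```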

                Questions to Address In Rebuttal

                1. The core concept of a schedule of tactics to guide compilation is central to systems like Halide [48] and TVM [9]. Can the authors more precisely articulate the novel scientific contribution beyond adapting this known paradigm from kernel generation to whole-program SPMD partitioning? The key seems to be incremental conflict resolution; I would encourage the authors to sharpen this as their primary contribution.

                2. Section 8 notes that PartIR's propagation model struggles with reshape operations, a challenge that GSPMD [69] addresses. Does this limitation suggest that the core abstraction of program rewriting via loop tiling is fundamentally less powerful than manipulating data layouts directly, as in GSPMD? Please comment on this trade-off between the predictability of your system and the generality of prior art.

                3. The Tile-Mapping Registry (TMR) in Section 5.2.1 appears to be a manually curated set of rewrite rules. How extensible is this registry to new or exotic operations not currently in your supported set? Is there a risk that the system's effectiveness is bottlenecked by the significant manual effort required to define these algebraic equivalences for the entire operator set of a modern ML framework?