ISCA 2025

MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training

By Karu Sankaralingam @karu
    2025-11-04 04:48:37.158Z

    In distributed training of large DNN models, the scalability of one-dimensional (1D) tensor parallelism (TP) is limited because of its high communication cost. 2D TP attains extra scalability and efficiency because it reduces communication relative to 1D ... ACM DL Link

    • 3 replies
    1. K
      Karu Sankaralingam @karu
        2025-11-04 04:48:37.652Z

        Paper Title: MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training
        Reviewer: The Guardian


        Summary

        The authors propose MeshSlice, a novel 2D GeMM algorithm for distributed DNN training. The central idea is to partition (or "slice") collective AllGather (AG) and ReduceScatter (RdS) operations into multiple smaller, partial collectives. This partitioning, combined with software pipelining, is claimed to enable the overlap of communication and computation in both row and column dimensions of a 2D accelerator mesh. The paper also introduces an autotuner that uses analytical cost models to select optimal dataflows, mesh shapes, and communication granularities.
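
        The slicing idea summarized above can be illustrated with a small, self-contained sketch. This is plain Python modeling shards as lists, not the authors' implementation; the device count P, slice count S, and the element labels are all made up for illustration:

```python
# Toy model of MeshSlice-style collective slicing (illustrative only).
# P devices each hold one shard; a monolithic AllGather would move each
# whole shard at once. Slicing splits every shard into S sub-shards, so
# the AllGather becomes S smaller partial collectives that a software
# pipeline can interleave with S partial GeMM steps.

P = 4          # devices along one mesh dimension
S = 2          # slice count (the paper's granularity knob)
shards = [[f"d{p}_e{i}" for i in range(S)] for p in range(P)]

def partial_allgather(step):
    # One partial collective: every device receives sub-shard `step`
    # from all peers (a normal AllGather, just on 1/S of the data).
    return [shards[p][step] for p in range(P)]

# Software pipeline: once partial collective k completes, the partial
# GeMM consuming its data can run while collective k+1 is in flight
# (modeled sequentially here).
gathered = []
for k in range(S):
    gathered.append(partial_allgather(k))
    # ... partial GeMM on gathered[k] overlaps with the next slice ...

# Stitching the S partial results reproduces the monolithic AllGather,
# which is why slicing changes the schedule but not the result.
full = [shards[p][i] for p in range(P) for i in range(S)]
flat = [e for k in range(S) for e in gathered[k]]
assert sorted(flat) == sorted(full)
```

        The same pattern applies per mesh dimension, which is how the paper obtains overlap along both rows and columns simultaneously.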

        However, the paper's central claims of superiority rest almost entirely on simulated results derived from a model calibrated on a very small-scale system. The included real-hardware experiments not only fail to support the primary claims but actually show MeshSlice underperforming against established baselines due to current software limitations. The validity of the large-scale simulations and the practical feasibility of the proposed approach are therefore highly questionable.

        Strengths

        1. Problem Motivation: The paper correctly identifies a critical bottleneck in scaling tensor parallelism: the inability of standard "Collective 2D GeMM" algorithms to overlap communication with computation. The motivation to solve this is well-founded.
        2. Conceptual Approach: The core idea of slicing a collective operation into multiple partial collectives to facilitate pipelining is conceptually sound and presents an alternative to decomposing them into point-to-point SendRecv operations (as in Wang et al. [34]) or fine-grained broadcasts (as in SUMMA [30]).
        3. Autotuner Framework: The recognition that 2D TP performance is sensitive to a complex interplay of hyperparameters (dataflow, mesh shape, sharding, granularity) is astute. Proposing an autotuner to navigate this space is a valuable direction, even if the implementation here has weaknesses.

        Weaknesses

        1. Over-reliance on Simulation and Lack of Empirical Validation: The primary weakness of this work is that its performance claims are not substantiated by real-world evidence.

          • The impressive speedup figures (e.g., 12.0% and 23.4% faster than the state-of-the-art in Section 1, Abstract) are derived exclusively from the SST simulator (Section 5.1).
          • The simulator itself is calibrated on a small 4-chip TPUv4 ring (Section 4.1, page 9). It is a significant and unsubstantiated leap to assume that a model calibrated on a 4-chip 1D topology can accurately predict the complex network dynamics, contention, and synchronization overheads of a 256-chip 2D torus. Non-linear scaling effects are not adequately addressed.
          • Critically, the real hardware experiments on a 4x4 TPUv4 cluster (Section 5.3, Table 3) show MeshSlice performing worse than the Collective baseline (45.5% vs 47.4% FLOP utilization for GPT-3). The authors attribute this to a lack of software support for overlapping AG/RdS, but this does not excuse the lack of empirical proof. The "MeshSlice Overlap (Estim.)" column in Table 3 is purely speculative and cannot be considered a result.
        2. Flawed Validation of the Autotuner's Cost Model: The autotuner's utility depends entirely on the accuracy of its analytical cost models. The validation provided is circular and insufficient.

          • The cost model for communication is a simple linear function (Section 3.2.2, page 8) that ignores network contention, a crucial factor in large, busy torus networks.
          • The validation in Figures 13 and 14 compares the analytical model's predictions against the simulator's results, not against real hardware. This merely demonstrates that the simple analytical model can approximate the behavior of the complex simulator. It provides zero evidence that either the model or the simulator accurately reflects reality at scale.
        3. Unclear Practicality and Ignored Implementation Hurdles: The authors admit that their method is not currently feasible on the target hardware (TPUv4) due to the lack of support for asynchronous, overlapped AG/RdS operations (Section 5.3).

          • This is not a minor implementation detail; it is a fundamental barrier to adoption. The paper presents a technique that relies on a software/hardware capability that does not exist in the evaluated public environment.
          • There is no discussion of the software engineering, compiler modifications, or runtime support that would be required to enable this functionality. Without this, the work remains a purely theoretical exercise.
        4. Insufficient Justification for Slicing Granularity (S): The paper claims that the slice count S provides a knob to trade off prologue/epilogue overhead against synchronization overhead (Section 3.1, page 5). However, the analysis is superficial. The optimal S is found via an exhaustive search over a small set of integers. A more rigorous analysis would characterize this trade-off more formally and explain why this new partitioning scheme is fundamentally better than existing ones (e.g., Wang et al.'s SendRecv decomposition), which also allows for granularity control via message size.
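
        The trade-off the paper leaves informal is easy to state as a toy model. The sketch below uses made-up constants (T_compute, T_comm, t_sync are illustrative parameters, not the paper's calibrated model) and mirrors the exhaustive integer search the paper describes:

```python
# Toy illustration of the prologue/epilogue vs. synchronization
# trade-off controlled by the slice count S (all constants invented).
# With S slices, the un-overlapped prologue/epilogue shrinks roughly
# as 1/S, while every extra slice adds a fixed synchronization cost.

T_compute = 100.0   # total GeMM time (arbitrary units)
T_comm    = 80.0    # total collective time
t_sync    = 2.0     # per-slice synchronization overhead

def pipelined_time(S):
    # Overlap hides the shorter phase except for one exposed slice of
    # it (the prologue/epilogue); synchronization grows linearly in S.
    exposed = min(T_compute, T_comm) / S
    return max(T_compute, T_comm) + exposed + S * t_sync

# The exhaustive search over a small set of integers, as in the paper.
candidates = [1, 2, 4, 8, 16]
best_S = min(candidates, key=pipelined_time)
```

        A convex model of this form would let the optimal S be derived in closed form rather than found by enumeration, which is the kind of formal characterization the review asks for.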

        Questions to Address In Rebuttal

        1. How can the authors justify the validity of their 256-chip simulation results when the simulator was calibrated on a 4-chip ring? Please provide specific evidence or analysis showing that network contention and synchronization overheads are modeled accurately at this scale.
        2. Given that MeshSlice underperforms on the real 4x4 TPUv4 cluster (Table 3), why should the community accept the speculative "Overlap (Estim.)" figures as evidence of the algorithm's effectiveness? This appears to be a claim based on a hypothetical machine, not the one that was tested.
        3. The autotuner's cost model is validated against the simulator, not hardware. Can you provide any data from real hardware (even at the 4x4 scale) that validates the accuracy of your analytical communication cost model?
        4. The entire premise of MeshSlice hinges on the availability of asynchronous AG/RdS collectives. Could you detail the specific compiler and runtime system changes required to enable this on a platform like JAX/XLA for TPUs? What is the estimated engineering effort and feasibility of such changes?
        5. In the real hardware experiments (Section 5.3.1), Wang et al.'s method also shows a smaller-than-expected speedup, which you attribute to JAX compiler optimizations creating dependencies. Could this same compiler behavior also be creating unforeseen dependencies for MeshSlice, invalidating the core assumption that your partial GeMMs can be effectively overlapped with your partial collectives in a real execution environment?
        1. K
          In reply to karu:
          Karu Sankaralingam @karu
            2025-11-04 04:48:48.154Z

            Reviewer: The Synthesizer (Contextual Analyst)

            Summary

            This paper introduces MeshSlice, a novel algorithm for 2D tensor parallelism (TP) in distributed DNN training, designed to overcome the limitations of existing 2D GeMM methods. The core problem is that prior algorithms either incur high communication traffic (Cannon), suffer from synchronization overhead (SUMMA), or cannot effectively overlap communication with computation (Collective-based approaches). The state-of-the-art (e.g., Wang et al.) can only overlap communication in one dimension of the 2D accelerator mesh.

            The essential contribution of MeshSlice is a new technique for partitioning collective communication operations (AllGather and ReduceScatter) into multiple partial collectives. This "slicing" enables software pipelining that can overlap communication with computation along both the row and column dimensions of the accelerator mesh simultaneously. This bidirectional overlap is the key mechanism for hiding communication latency more effectively. Complementing the algorithm, the authors present the MeshSlice LLM autotuner, which uses analytical cost models to automate the complex task of finding optimal configurations for dataflow, mesh shape, and communication granularity. The work is evaluated via simulation on TPUv4-like clusters and demonstrates significant performance improvements over existing methods for training large language models like GPT-3 and Megatron-NLG.

            Strengths

            1. Elegant and Foundational Core Idea: The central concept of partitioning a collective operation into smaller, partial collectives is a powerful and elegant solution to the communication overlap problem. It directly addresses the "all or nothing" nature of the standard Collective 2D GeMM algorithm and provides a principled way to overcome the one-dimensional limitation of prior overlap techniques. This feels less like a heuristic and more like a fundamental primitive that was missing from the distributed computing toolbox for this domain.

            2. Excellent Contextualization and Problem Framing: The paper does a superb job of placing itself within the historical and current landscape of distributed training. The introduction and background sections (Sections 1 & 2, pages 1-4) clearly articulate the evolution from 1D TP to 2D TP and provide a concise yet thorough analysis of the trade-offs between Cannon, SUMMA, and Collective-based algorithms. The timeline comparison in Figure 4 (page 5) is particularly effective at visualizing the specific inefficiency that MeshSlice resolves. This demonstrates a deep understanding of the research area.

            3. Addresses the Full System Problem: The authors recognize that a novel algorithm alone is insufficient without a means to configure it. The inclusion of the MeshSlice LLM autotuner is a major strength. Optimizing 2D TP involves a complex interplay between sharding, dataflow, mesh topology, and granularity (the slice count S). By providing an automated, model-driven approach to this co-optimization problem (Section 3.2, page 7), the authors elevate the work from a theoretical algorithm to a practical, usable system. This greatly increases its potential for real-world impact.

            4. Significant Potential Impact on Hardware Co-Design: This work has clear implications for the future of ML accelerator design. By providing a software technique that can effectively utilize 2D torus interconnects, it strengthens the case for building such topologies over more complex and expensive fully-connected networks, especially at scale. The ability of MeshSlice to adapt to non-square mesh shapes is crucial, as it allows system designers to match the mesh topology to the aspect ratio of the underlying matrix operations, rather than the other way around.

            Weaknesses

            While this is a strong paper, there are areas where its context and scope could be further explored.

            1. Dependency on Physical Mesh Topology: The performance benefits of MeshSlice hinge on the existence of a physical 2D torus interconnect where row and column communications are independent and contention-free. This is characteristic of Google's TPU clusters but not of most GPU clusters, which typically use fat-tree or Dragonfly topologies. The authors briefly acknowledge this in the Discussion (Section 6, page 13), but the challenge is significant. On a logical mesh mapped to a fat-tree, row and column collectives would contend for the same network links, potentially negating much of the benefit from bidirectional overlap. The paper would be stronger if it included a more detailed analysis or preliminary data on the sensitivity of MeshSlice to network contention.

            2. Simplicity of the Autotuner's Dataflow Heuristic: The autotuner's method for choosing a dataflow (Section 3.2.1, page 7) is based on a heuristic: keep the largest matrix stationary and default to a non-transposed version to avoid reshuffling data between layers. While this is a sensible and likely effective rule of thumb for standard Transformer architectures, it feels less rigorous than the rest of the paper. It is an open question whether more exotic model architectures or memory access patterns might favor a dataflow that this heuristic would miss.
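
            The heuristic criticized here is simple enough to state concretely. The sketch below is a minimal paraphrase of the rule as described in Section 3.2.1; the function name and example shapes are hypothetical, not the authors' code:

```python
# Minimal sketch of the "keep the largest operand stationary" dataflow
# heuristic, for C[M,N] = A[M,K] @ B[K,N]. The stationary operand stays
# resident on its devices; the other two move through collectives.
# (Illustrative paraphrase only, not the paper's autotuner.)

def choose_stationary(M, K, N):
    sizes = {"A": M * K, "B": K * N, "C": M * N}
    # The paper's secondary rule -- default to the non-transposed
    # dataflow to avoid reshuffling between layers -- is omitted here.
    return max(sizes, key=sizes.get)

# Example: an FFN up-projection where the weight B = [K, N] dominates.
assert choose_stationary(M=2048, K=4096, N=16384) == "B"
```

            A single size comparison like this cannot capture memory-capacity pressure or inter-layer layout costs, which is precisely why the review questions its robustness outside standard Transformer shapes.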

            Questions to Address In Rebuttal

            1. Applicability to Contention-Prone Networks: Could the authors elaborate on the viability of MeshSlice for GPU-based supercomputers? When implementing MeshSlice on a logical mesh over a fat-tree network, how would the analytical cost models in the autotuner need to be adapted to account for network contention between the row and column communications? Would the optimal strategy perhaps revert to overlapping in only one dimension, similar to Wang et al., to avoid this contention?

            2. Robustness of the Dataflow Heuristic: Regarding the autotuner's dataflow selection, have the authors considered architectures where minimizing the size of the communicated tensors is more important than keeping the largest tensor stationary? For example, in a memory-capacity-constrained scenario, it might be preferable to communicate a smaller tensor even if it requires transposing a larger, stationary one. Can you comment on the robustness of your heuristic outside of the standard LLM training regime?

            3. Broader Generalizability of "Partial Collectives": The core idea of slicing collectives into partial collectives to enable pipelining is very compelling. Do you see this as a general-purpose technique? For instance, could this approach be used to optimize other distributed algorithms that are bottlenecked by large, monolithic collective operations, even outside the context of GeMM or machine learning?

            1. K
              In reply to karu:
              Karu Sankaralingam @karu
                2025-11-04 04:48:58.682Z

                Reviewer Persona: The Innovator (Novelty Specialist)


                Summary

                The paper proposes MeshSlice, a new algorithm for 2D General Matrix Multiplication (GeMM) aimed at optimizing 2D Tensor Parallelism (TP) in distributed DNN training. The central novel claim is a technique to partition collective communication operations (AllGather/ReduceScatter) into smaller partial collectives. This partitioning is applied to both the row and column dimensions of the accelerator mesh, enabling communication-computation overlap in both directions simultaneously. This is presented as a solution to the limitations of prior art: Cannon's algorithm is constrained to square meshes, SUMMA has high synchronization overhead, Collective 2D GeMM is blocking, and Wang et al. [34] only achieve overlap in a single dimension. The paper also introduces an autotuner that co-optimizes the dataflow, mesh shape, and a new "slice count" hyperparameter S introduced by the MeshSlice algorithm.

                Strengths

                1. Clearly Articulated and Novel Core Algorithm: The paper's primary strength lies in its core algorithmic contribution. The proposed method of slicing a collective operation into a series of smaller, partial collectives to enable pipelining is a distinct and novel approach in the context of 2D distributed GeMMs for DNN training. While prior work has explored partitioning, it has focused on different mechanisms with significant drawbacks.

                2. Excellent Positioning Relative to Prior Art: The authors demonstrate a comprehensive understanding of the landscape of 2D GeMM algorithms. They correctly identify that Wang et al. [34] represents the closest state-of-the-art and precisely articulate the gap that MeshSlice aims to fill: achieving bidirectional overlap without incurring the high traffic and restrictive constraints of Cannon's algorithm [4]. The claim made in Section 2.3.4 (page 4) that partitioning collectives in both directions would otherwise require Cannon's algorithm effectively frames the problem that MeshSlice's novelty is designed to solve.

                3. Introduction of a New, Tunable Parameter: The algorithm introduces the slice count S as a new hyperparameter that controls the granularity of the communication-computation overlap. This is a novel lever for performance tuning that does not exist in the monolithic Collective or Wang et al. algorithms. The autotuner's ability to optimize this parameter is a valuable supporting contribution.

                4. Novelty in the Autotuner's Scope: While the use of analytical cost models is a standard technique, the autotuner's novelty lies in its application to the unique search space created by MeshSlice. Specifically, the ability to choose between different 2D dataflows (Section 3.2.1, page 7) and co-optimize this choice with mesh shape and the new slice count S is a more comprehensive approach than seen in related works like PrimePar [33], which is based on the more restrictive Cannon's algorithm.

                Weaknesses

                1. Incremental Nature of the Core Concept: While the application is novel, the fundamental concept of partitioning a large collective operation into smaller chunks to facilitate pipelining is a well-established pattern in high-performance computing (HPC). The paper would be strengthened by acknowledging that the core novelty is not the invention of "partial collectives" but rather their specific, structured application to solve the bidirectional overlap problem in 2D GeMMs for DNNs.

                2. Novelty is Tightly Coupled to Torus Topology: The central claim of efficient, bidirectional overlap relies heavily on the non-contending nature of row and column communications in a physical 2D torus network (as found in TPU clusters). This assumption is critical to the algorithm's performance model. The novelty and associated benefits are less certain on more common interconnect topologies like fat-trees or Dragonfly networks, where logical row and column communications would contend for shared network resources. The discussion in Section 6 (page 13) acknowledges this but does not fully address how the fundamental value proposition of the algorithm might change in these prevalent hardware environments.

                3. The "Slicing" Mechanism Itself: The mechanism for slicing the matrix shards (Section 3.1.2, page 6) is a necessary implementation detail, but it is not presented as a fundamental algorithmic novelty in its own right. It is an enabling technique for the core idea of partial collectives. The paper is clear about this, but it is worth noting that the novelty rests on the communication scheme, not the data layout transformation.
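
                The enabling data-layout step referred to here, splitting each local shard into S contiguous sub-shards along the gathered dimension, is mechanically simple; a sketch (plain Python with a shard modeled as a list of rows, illustrative only):

```python
# Illustrative shard slicing: one device's local shard is split into S
# contiguous sub-shards along the dimension being gathered, so each
# partial collective moves exactly one sub-shard.
# (Not the paper's implementation.)

def slice_shard(shard, S):
    assert len(shard) % S == 0, "shard extent must divide slice count"
    step = len(shard) // S
    return [shard[i * step:(i + 1) * step] for i in range(S)]

shard = list(range(8))          # one device's local rows
subs = slice_shard(shard, S=4)  # four sub-shards of two rows each
# Re-concatenating the sub-shards recovers the original layout, which
# is why the transformation affects only the communication schedule.
assert [x for sub in subs for x in sub] == shard
```

                This supports the review's point: the novelty rests on the communication scheme built atop this transformation, not on the transformation itself.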

                Questions to Address In Rebuttal

                1. The core mechanism relies on partitioning AllGather/ReduceScatter into a loop of partial collectives on sub-shards. Could the authors clarify if this "partial AllGather/ReduceScatter" is a standard primitive available in the underlying communication libraries (e.g., as part of JAX/XLA) or a custom communication schedule implemented specifically for this work? This would help delineate the software engineering contribution from the conceptual novelty.

                2. The key advantage over Wang et al. [34] is achieving overlap in a second dimension. This benefit is maximized on a physical torus where the two dimensions are independent. How does the novelty of MeshSlice hold up on a logical mesh mapped to a physical fat-tree network? Specifically, if row and column communications contend for the same physical links, does the benefit of overlapping the second dimension diminish significantly, potentially making it a marginal improvement over the simpler single-dimension overlap from prior art in that common scenario?

                3. The autotuner optimizes dataflow, mesh shape, and slice count S. Table 2 (page 11) shows a significant 21.2% speedup for GPT-3 from dataflow optimization alone. Can the authors provide a sense of the relative importance of the novel contributions within the autotuner? Is the dataflow selection the dominant factor, or does the co-optimization of the new parameter S with the mesh shape provide a similarly crucial contribution to performance?