Composing Distributed Computations Through Task and Kernel Fusion
We introduce Diffuse, a system that dynamically performs task and kernel fusion in distributed, task-based runtime systems. The key component of Diffuse is an intermediate representation of distributed computation that enables the necessary analyses for ...
- Karu Sankaralingam @karu
Paper Title: Composing Distributed Computations Through Task and Kernel Fusion
Reviewer: The Guardian
Summary
The authors present Diffuse, a system that sits between high-level distributed libraries (cuPyNumeric, Legate Sparse) and a low-level task-based runtime (Legion). The core contribution is a "scale-free" intermediate representation (IR) designed to enable scalable analysis for task fusion in a distributed memory setting. This task fusion is paired with an MLIR-based JIT compiler to perform kernel fusion on the fused task bodies. The authors claim that this approach can accelerate unmodified applications by a geometric mean of 1.86x on up to 128 GPUs, and in some cases, match or exceed the performance of a manually-tuned MPI library, PETSc.
Strengths
- The fundamental concept of a middle-layer, domain-agnostic IR for identifying fusion opportunities across library boundaries is a sound and valuable direction for research in composable software.
- The IR's "scale-free" design, which makes the cost of analysis independent of the machine size, is an elegant solution to a known scalability challenge in distributed program analysis.
- The integration with MLIR for the kernel fusion backend is a practical and powerful choice, allowing the system to leverage a robust, community-driven compiler infrastructure.
- The system is demonstrated on two distinct, real-world libraries, providing some evidence for the generality of the proposed IR and analyses.
Weaknesses
My primary responsibility is to ensure the technical and empirical soundness of papers accepted by this conference. While the ideas presented are interesting, I have significant concerns about the methodology and the strength of the evidence supporting the paper's core claims.
- Conflation of Contributions due to Missing Ablation Study: The paper claims end-to-end speedups but fails to disentangle the sources of performance gain. The total speedup comes from a combination of: (1) task fusion reducing runtime scheduling overhead, (2) temporary store elimination reducing memory traffic and allocation, and (3) kernel fusion improving data locality and arithmetic intensity. The latter two are well-known, powerful optimizations. The novel contribution is the scale-free IR enabling task fusion. Without an ablation study that evaluates "task fusion only" vs. "task fusion + kernel fusion", it is impossible to assess the actual benefit of the paper's core idea. It is plausible that most of the performance gain stems from standard MLIR-based loop fusion (a known technique), with the task fusion component providing only marginal benefit by enabling it. The authors explicitly state, "We do not ablate on the optimizations in Section 5" (page 10), which is a critical methodological flaw.
- Questionable Baseline Comparisons: The comparison against PETSc is presented as a major result, but the experimental setup is unsound. A footnote on page 10 reveals that Legate Sparse was modified to use 32-bit integers for coordinates to match a specific behavior of PETSc. This is not a fair comparison. A rigorous evaluation would require ensuring both systems are configured optimally and use identical data types and precision for the computation. As presented, the authors are comparing a version of their system specifically tuned for the benchmark against what may be a default, unoptimized, or misconfigured baseline. This undermines the claim of "exceeding the performance of PETSc."
- Limited Expressiveness of the IR: The paper's claims of generality hinge on its IR. However, Section 3 only formally describes None and Tiling partitions. These structured, affine partitions are well-suited for dense computations but are insufficient for a wide range of scientific applications involving irregular data structures, such as unstructured meshes or graph analytics, which require more complex, indirect partitioning schemes. The authors state their "implementation supports more partition kinds" (page 4) but provide no details. Without this information, the reader cannot evaluate the true generality of the approach. The constant-time alias checking, a key enabler of the scalable analysis, is a direct consequence of this restricted representational power.
- Benchmark Suite Bias Skews Aggregate Results: The reported geometric mean speedup of 1.86x is heavily distorted by the Black-Scholes benchmark, which achieves a 10.7x speedup (Figure 10a). This application is a textbook example of an embarrassingly parallel computation with a long chain of fusible operations—a "hero" benchmark that represents the absolute best-case scenario for this system. In contrast, the Dense Jacobi benchmark shows no meaningful improvement (0.93x-1.08x), demonstrating the system's limitations. Presenting a single geometric mean obscures this reality and overstates the typical performance gain a user should expect.
- Insufficient Proof of Correctness: The proof sketch for the fusion algorithm's correctness (Section 4.3) is cursory. It argues that the fusion constraints are sufficient to guarantee point-wise dependencies but does not provide a formal argument. Furthermore, the paper acknowledges the constraints are "sound, but not complete" (page 6). There is no discussion of the practical implications of this incompleteness. What types of valid fusion opportunities are missed by this conservative analysis? A more thorough treatment is necessary to establish confidence in the algorithm's robustness and to understand its limitations.
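To make the aggregation concern concrete, the arithmetic of the outlier effect can be sketched as follows. Only the 10.7x (Black-Scholes) and ~1.0x (Dense Jacobi) figures come from the review; the remaining per-benchmark speedups are hypothetical stand-ins, chosen so that the aggregate reproduces a 1.86x geometric mean.

```python
from math import prod

# Hypothetical per-benchmark speedups. Only 10.7 (Black-Scholes) and
# 0.99 (Dense Jacobi) are values discussed in the review; the rest are
# illustrative placeholders.
speedups = [10.7, 0.99, 1.4, 1.6, 1.2, 1.8, 1.5]

def geomean(xs):
    """Geometric mean: the n-th root of the product of n values."""
    return prod(xs) ** (1.0 / len(xs))

with_outlier = geomean(speedups)                              # ~1.86x
without_outlier = geomean([s for s in speedups if s != 10.7])  # ~1.39x
print(f"{with_outlier:.2f}x vs {without_outlier:.2f}x")
```

A single 10.7x entry in a seven-benchmark suite moves the geometric mean by almost 0.5x, which is why the review asks for the mean with Black-Scholes excluded.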
Questions to Address In Rebuttal
The authors must address the following points to convince me of the paper's validity:
- Can you provide an ablation study that isolates the performance impact of task fusion (i.e., runtime overhead reduction) from the impact of kernel fusion and temporary elimination? This is essential to substantiate the value of your primary contribution.
- Please justify the fairness of the PETSc comparison. Ideally, provide results where both PETSc and Diffuse/Legate Sparse are configured to use the same data types (e.g., both using 64-bit integers for coordinates) and the same floating-point precision, and are otherwise optimally tuned.
- What specific, non-affine partition kinds does your system support beyond the Tiling described in the paper? How does your scale-free alias analysis handle these more complex, potentially irregular partitions?
- Can you provide a performance breakdown that clarifies the impact of outliers? For example, what is the geometric mean speedup if the Black-Scholes benchmark is excluded?
- What is a concrete example of a valid fusion that your constraints (Figure 5) would incorrectly disallow? A discussion of the trade-off between analytical complexity and optimization power is needed.
Paper: Composing Distributed Computations Through Task and Kernel Fusion
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces Diffuse, a system that sits between high-level task-based parallel libraries and a low-level distributed runtime system to perform dynamic optimization. The core contribution is a novel, scale-free intermediate representation (IR) for distributed computations. This IR is the key enabler for the paper's primary goal: to scalably analyze and fuse streams of distributed tasks, even when those tasks originate from different libraries. By identifying fusible sequences of tasks, Diffuse not only reduces runtime overhead but also creates opportunities for a JIT compiler (built on MLIR) to perform aggressive kernel fusion, eliminating temporary distributed allocations and improving data locality. The authors demonstrate Diffuse's effectiveness by integrating it with the cuPyNumeric and Legate Sparse libraries running on the Legion runtime, showing significant performance improvements (1.86x geo-mean) on a variety of scientific applications on up to 128 GPUs, often matching or exceeding the performance of hand-optimized code and established MPI-based libraries like PETSc.
Strengths
This is an excellent paper that makes a strong contribution to the field of high-performance, composable software. Its primary strengths are:
- Elegant Separation of Concerns: The system's architecture is its most powerful feature. By introducing a "middle layer" with a purpose-built IR, the authors cleanly decouple the problem of distributed dependency analysis from the problem of single-node kernel optimization. The scale-free IR (Section 3, Page 3) is designed to make the former tractable and scalable, while the integration with MLIR (Section 6, Page 8) leverages a powerful, existing ecosystem for the latter. This is a very insightful way to structure the problem.
- The Scale-Free IR as a Core Contribution: The central idea of a "scale-free" representation—one whose complexity is independent of the machine size—is critical. It directly addresses the primary challenge of performing program analysis on massively parallel systems, where reasoning about every concrete task instance is infeasible. This approach builds philosophically on concepts like Legion's Index Launches [50], but applies the principle specifically to the problem of dynamic, cross-library fusion. The ability to perform constant-time alias checks on structured partitions is a direct and valuable result of this IR design.
- Tackling the Composability "Holy Grail": Many systems have attempted to optimize across library boundaries, but most, like Weld [44], have focused on shared-memory contexts. A key contribution of this work is demonstrating a practical path forward for composition in the more complex distributed memory setting. By operating on a common task-based abstraction, Diffuse successfully finds optimization opportunities that are invisible to individual libraries, as shown in the evaluation where it improves upon already hand-optimized code (Section 7.1, Page 11, Figure 12c).
- Impressive and Convincing Empirical Results: The evaluation presented in Section 7 is comprehensive and compelling. The authors use a weak-scaling methodology, compare against strong and relevant baselines (including hand-tuned versions and the highly-respected PETSc library), and evaluate a range of applications from micro-benchmarks to full-fledged scientific solvers. The results, showing consistent and significant speedups without any application code changes, strongly validate the system's design and practical utility.
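To illustrate why alias checks on structured partitions can run in constant time, here is a minimal sketch. The 1-D Tiling encoding below (store name, tile size, offset) is an illustrative assumption, not the paper's actual IR; the point is only that disjointness reduces to comparing a few integers rather than enumerating per-processor pieces.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tiling:
    """Hypothetical 1-D structured partition: tile i of `store`
    covers the interval [offset + i*tile_size, offset + (i+1)*tile_size)."""
    store: str       # name of the partitioned Store
    tile_size: int
    offset: int

def same_pieces(a: Tiling, b: Tiling) -> bool:
    """Constant-time symbolic check: do a and b carve the store into
    identical tiles, so equal colors name the same data and unequal
    colors are disjoint?  No per-processor tiles are enumerated."""
    return a.store == b.store and (a.tile_size, a.offset) == (b.tile_size, b.offset)

def may_alias_across_colors(a: Tiling, b: Tiling) -> bool:
    """Conservative answer to 'can tile i of a overlap tile j != i of b?'.
    Identical tilings never do; anything else is conservatively 'maybe'."""
    if a.store != b.store:
        return False            # partitions of different stores never alias
    return not same_pieces(a, b)
```

Point-wise task fusion is legal only when cross-color aliasing is ruled out, which this style of check decides independently of the machine size; the review's question is how such a check survives once partitions stop being affine.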
Weaknesses
The paper is very strong, and the weaknesses are more about the boundaries of the current work and opportunities for future exploration rather than fundamental flaws.
- Uncertainty on Generality of the Data Model: The paper's IR for partitions is presented with a focus on structured kinds like Tiling (Section 3.1, Page 4). While the authors state their implementation supports more, the power of underlying runtimes like Legion comes from their ability to handle complex, irregular, and data-dependent partitions. It is not entirely clear how the scale-free, symbolic analysis proposed by Diffuse would extend to these more unstructured cases, where checking for aliasing may no longer be a simple, constant-time symbolic operation. This might represent a fundamental tension between the analyzability of the IR and the expressiveness of the underlying runtime.
- Limited Scope of Fusion Strategy: The fusion algorithm is greedy and focuses on finding a fusible prefix of tasks in a linear window (Section 4.2.2, Page 6). This is a pragmatic and effective starting point. However, this places the work firmly in the category of local, peephole-style optimizations. More global optimization opportunities, such as reordering independent tasks to create larger fusible blocks or fusing non-contiguous tasks, are not considered. While out of scope for one paper, a discussion of the limitations of the current strategy would be welcome.
- Potential for Adverse Interaction with Runtime Schedulers: Diffuse fundamentally alters the task graph, transforming many small tasks into fewer, larger ones. This has implications for the downstream runtime scheduler. For example, a very long fused task could negatively impact load balancing or increase tail latency. The paper does not explore this interaction. While the performance results suggest this is not a problem for the benchmarks chosen, it remains a potential issue for more dynamic or heterogeneous workloads that could be addressed.
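The greedy window scan described in the weaknesses above can be sketched as follows. Here `can_fuse_with` stands in for the paper's fusion constraints (Figure 5); the toy launch-domain check used in the example is an assumption for illustration only.

```python
def fusible_prefix(window, can_fuse_with):
    """Greedy scan: return the longest prefix of the task window in which
    every task satisfies the fusion constraints against all tasks already
    admitted to the prefix.  Pairwise checks only; no reordering, so an
    unfusible task ends the prefix even if later tasks would qualify."""
    prefix = []
    for task in window:
        if all(can_fuse_with(prev, task) for prev in prefix):
            prefix.append(task)
        else:
            break  # first failure ends the prefix; the rest stays unfused
    return prefix

# Toy constraint: tasks fuse only if they share a launch domain.
tasks = [("t0", "D"), ("t1", "D"), ("t2", "E"), ("t3", "D")]
same_domain = lambda a, b: a[1] == b[1]
print(fusible_prefix(tasks, same_domain))  # t2 breaks the prefix, stranding t3
```

The stranded `t3` in the example is exactly the kind of non-contiguous fusion opportunity the review notes a peephole strategy cannot exploit.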
Questions to Address In Rebuttal
- Regarding the partition model in the IR (Section 3.1): Could you please elaborate on the challenges of extending your scale-free analysis to the more general, unstructured partitions supported by Legion? Specifically, how would the system perform alias analysis on partitions defined by arbitrary collections of points, and what would be the impact on the scalability of the analysis?
- The current fusion algorithm is greedy and limited to contiguous task prefixes. Could you comment on the potential for more sophisticated fusion strategies within the Diffuse framework? For example, what would be the primary obstacles in the IR or analysis to support reordering tasks to enable larger fusions?
- Your system transforms the task stream by creating fewer, larger-grained tasks. How does this transformation affect the scheduling and load-balancing capabilities of the underlying runtime (e.g., Legion)? Have you observed any cases where creating a very large fused task leads to performance degradation due to, for instance, poor work distribution?
Reviewer: The Innovator
Summary
This paper introduces Diffuse, a system for performing dynamic task and kernel fusion for applications built on distributed, task-based runtimes. The central claim of novelty rests on a new intermediate representation (IR) for distributed computation. This IR is described as "scale-free," meaning its size and the complexity of analyses performed upon it are independent of the number of processors in the target system. The authors leverage this IR to perform a scalable, dynamic dependence analysis that identifies fusible sequences of distributed tasks. This task fusion is then coupled with a JIT compiler (built on MLIR) that fuses the kernels within the newly formed tasks. The authors claim this synthesis of scalable distributed task fusion and kernel fusion enables optimization across library boundaries, allowing high-level Python programs to match or exceed the performance of hand-tuned MPI code.
Strengths
The primary strength and novel contribution of this work is the design of the scale-free IR (Section 3, page 3) and the fusion analysis framework built upon it. While the constituent ideas—task fusion, kernel fusion, JIT compilation—are not new in isolation, their synthesis in this specific context is. The key insight is that by creating a higher-level, more structured representation of distributed partitions and computations than what a low-level runtime like Legion provides, certain critical analyses (like aliasing) become tractable at scale.
Specifically, the novel aspects are:
- The Scale-Free IR Abstraction: The formalization of distributed data and computation into Store, structured Partition types (e.g., Tiling), and Index Task constructs is the core technical kernel. This abstraction enables symbolic, constant-time alias checking between structured partitions, which is the cornerstone of the scalable fusion analysis. This is a clear advancement over analyzing the lower-level, unstructured partitions that a runtime like Legion might expose, where such a check would scale with the number of processors.
- A Formal Framework for Distributed Task Fusion: Section 4 (page 5) presents a formal definition of dependencies between distributed task groups via a "dependence map." The paper then develops a set of fusion constraints (Figure 5, page 6) that can prove the non-existence of cross-processor dependencies without needing to materialize this map. While the dependency types (true, anti, reduction) are classic compiler concepts, their formulation and application within this scale-free, distributed IR are novel.
- Synthesis of Two Fusion Levels: The most significant conceptual advance is the coupling of the high-level, distributed task fusion with low-level kernel fusion. Prior work has often focused on one or the other. Diffuse uses the high-level analysis to solve the distributed data-movement and dependency problem, creating a valid fused task. This fused task then presents a traditional, single-node loop fusion problem to the MLIR-based JIT, which can leverage a rich ecosystem of existing compiler techniques. This two-level approach elegantly separates distributed systems concerns from compiler optimization concerns.
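The single-node problem handed to the JIT is classic loop fusion with temporary elimination. A minimal scalar sketch of that transformation (plain Python for exposition, not the paper's MLIR pipeline):

```python
def unfused(a, b):
    """Two separate 'kernels': the intermediate list is fully materialized,
    costing an extra allocation and a second pass over memory."""
    tmp = [x + y for x, y in zip(a, b)]   # temporary the size of the input
    return [t * 2.0 for t in tmp]

def fused(a, b):
    """Fused form: one pass, no temporary; each element flows directly
    from the add into the multiply."""
    return [(x + y) * 2.0 for x, y in zip(a, b)]
```

Both functions compute the same result; the benefit of the fused form is purely in memory traffic and allocation, which is why the review insists on ablating this well-known effect from the novel task-fusion contribution.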
Weaknesses
The main weakness of the paper, from a novelty perspective, is its close relationship to prior art from the same research group, which could be made more explicit.
- Incremental Advance over Prior Art: The paper correctly identifies Sundram et al. [51] as the "most related work." Indeed, that work introduced the core problem of fusing index tasks in a distributed runtime. The current paper is a direct and significant extension of [51]. The delta is the addition of the JIT-based kernel fusion and a more rigorous formalization of the analysis. However, the paper could do a better job of positioning itself by explicitly stating, early on, "Our work builds on Sundram et al. [51] by..." and then enumerating the precise advancements (e.g., the formal model, the proof, and the critical addition of the kernel fusion compiler). As written, the connection is only fully clarified in the related work section, which downplays the incremental nature of the contribution.
- Limited Expressiveness of the IR Primitives Shown: The paper's exposition of the IR's Partition construct focuses on None and Tiling for simplicity (Section 3, page 4). This structure is what enables the efficient, scale-free analysis. However, a key feature of the underlying Legion runtime is its ability to handle complex, irregular, and data-dependent partitions. It is not clear from the paper how the Diffuse IR and its constant-time alias analysis would extend to these more complex partitionings. If the novel technique is only applicable to dense, affine partitions, its overall novelty and impact are somewhat constrained. The authors mention their implementation supports more, but this is not substantiated in the paper.
- Novelty of Constraints vs. their Application: The fusion constraints themselves (launch-domain equivalence, true/anti/reduction dependence checks in Figure 5, page 6) are conceptually direct analogues of classic data-dependency checks. The novelty is not in the invention of these constraints, but in their precise formulation and application to the paper's novel IR. The authors should be careful to claim novelty for the latter, not the former.
Questions to Address In Rebuttal
- Could you please explicitly and concisely list the technical contributions of this paper over Sundram et al. [51]? Specifically, beyond the clear addition of kernel fusion, what aspects of the formal model for task fusion and the fusion constraints are new, and what did they enable that was not possible before?
- The scalability of your fusion analysis hinges on the structured nature of partitions in your IR. How does your framework handle irregular or data-dependent partitions, which are supported by the underlying Legion runtime? If it requires custom partition-kind-specific rules, does this not compromise the generality of the approach? Can you provide an example of how an alias check between two non-affine partitions would be performed in a scale-free manner?
- In Section 4.2.1 (page 6), you state your fusion constraints are "sound, but not complete." Can you provide a compelling, realistic example of a sequence of fusible distributed tasks that Diffuse would fail to fuse? Understanding the limitations of the analysis is key to evaluating the scope of the novel contribution.