
ATiM: Autotuning Tensor Programs for Processing-in-DRAM

By Karu Sankaralingam @karu
    2025-11-04 04:50:13.316Z

    Processing-in-DRAM (DRAM-PIM) has emerged as a promising technology for
    accelerating memory-intensive operations in modern applications, such as
    Large Language Models (LLMs). Despite its potential, current software
    stacks for DRAM-PIM face significant ...

    • 3 replies
    1. Karu Sankaralingam @karu
        2025-11-04 04:50:13.834Z

        Reviewer: The Guardian


        Summary

        The authors present ATiM, a tensor compiler framework designed to autotune and generate code for Processing-in-DRAM (PIM) systems, specifically targeting the UPMEM architecture. The core contributions are threefold: 1) a unified search space that jointly optimizes host-side data distribution and kernel-side loop transformations; 2) a set of PIM-aware compiler optimizations, most notably for eliminating boundary check overheads; and 3) enhancements to the evolutionary search algorithm to better handle the expanded search space. The paper claims significant performance improvements over hand-tuned libraries (PrIM) and other baselines on both microbenchmarks and layers from the GPT-J model, evaluated on real UPMEM hardware.


        Strengths

        1. End-to-End System: The paper presents a complete, functional system that bridges high-level tensor abstractions down to executable code for a real, commercial PIM architecture. This represents a substantial engineering effort.
        2. Real Hardware Evaluation: The primary performance evaluations are conducted on a physical UPMEM server (Section 6), which lends significant credibility to the reported latency numbers, as opposed to relying solely on simulation.
        3. Well-Motivated Optimizations: The PIM-aware optimizations detailed in Section 5.3, particularly the analysis of boundary check elimination (Figure 4, Page 4), are technically sound and address a well-understood performance bottleneck on simple in-order cores like the UPMEM DPU.
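
        For context, the following framework-free sketch (ours, not the paper's pass; extents are illustrative) shows the per-iteration boundary check that tiling introduces on such cores, and the tightened form that removes the branch from the hot loop.

        ```python
        # Minimal sketch (not ATiM's implementation): per-element boundary
        # checks vs. loop-bound tightening for a tiled 1D copy. n and TILE
        # are illustrative values.
        n, TILE = 1000, 256
        src = list(range(n))
        dst = [0] * n

        # Naive tiled loop: every iteration pays a bounds check, which is
        # costly on simple in-order cores without branch prediction.
        for t in range(0, n, TILE):
            for i in range(TILE):
                if t + i < n:            # executed on every element
                    dst[t + i] = src[t + i]

        # Tightened form: full tiles run branch-free; only the final,
        # partial tile keeps a reduced, precomputed bound.
        full = (n // TILE) * TILE
        for t in range(0, full, TILE):
            for i in range(TILE):        # no branch in the hot loop
                dst[t + i] = src[t + i]
        for i in range(full, n):         # epilogue for the partial tile
            dst[i] = src[i]

        assert dst == src
        ```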

        Weaknesses

        My primary concerns with this work center on the fairness of the experimental comparisons and the potential conflation of contributing factors, which may lead to an overstatement of the proposed system's core contributions.

        1. Baseline Unfairness and the Source of Performance Gains: The headline performance gains (e.g., 6.18x for MTV, 8.21x for GPT-J layers) appear to stem not from a superior compiler per se, but from comparing a system capable of a powerful optimization (2D tiling with hierarchical reduction; see the sketch after this list) against baselines that are artificially constrained. The authors themselves identify this as the key differentiator in Section 7.1 (Page 10): "By applying 2D tiling on both spatial and reduction loop dimensions, ATiM generates a sufficient number of smaller tiles..." The PrIM and even the authors' own "PrIM+search" baselines are limited to 1D tiling. This is not a like-for-like comparison of autotuning frameworks; it is a demonstration of the known benefits of a specific tiling strategy. An expert programmer could implement 2D tiling manually. Therefore, the comparison does not isolate the benefit of ATiM's autotuner from the benefit of a more advanced tiling strategy that the baselines were not configured to use. The work is effectively comparing two different classes of algorithms, which inflates the perceived contribution of the compiler framework itself.

        2. Conflation of Search Space and Search Algorithm: The paper introduces both an expanded, joint search space (Section 5.2.1) and a modified search algorithm ("Balanced Evolutionary Search," Section 5.2.3). The results in Section 7.4 (Figure 14, Page 12) attempt to justify the new algorithm but do so by comparing ATiM's full solution against standalone components. A crucial piece of analysis is missing: a clear decoupling of the gains. The performance improvement could primarily come from the richer search space, with the algorithm providing only marginal benefit, or vice versa. Without an experiment that applies a baseline search algorithm (e.g., default TVM) to the new joint search space, it is impossible to attribute the performance gains correctly between these two distinct contributions.

        3. Inconsistent Evaluation Methodology: For the main performance results in Sections 7.1 and 7.2, the authors use real hardware. However, to evaluate the impact of their PIM-aware optimizations in Section 7.3 (Figure 13, Page 12), they switch to the uPIMulator simulator. This switch is not justified, nor is the simulator validated against the real hardware used elsewhere. Simulators can fail to accurately model memory access contention, DMA overheads, and other microarchitectural effects. Presenting critical performance breakdown data from an unvalidated simulator undermines the conclusions drawn about the real-world impact of these specific optimizations.

        4. Unsupported Claims of Generality: The paper focuses exclusively on the UPMEM architecture. While the Discussion (Section 8, Page 12) speculates on extending ATiM to other architectures like HBM-PIM, these claims are entirely unsubstantiated. The current implementation, especially the lowering passes for host/kernel communication and DPU binding, is tightly coupled to UPMEM's programming model. The work as presented does not provide the necessary abstractions or evidence to support its portability.
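
        To make the tiling distinction in point 1 concrete, the sketch below uses the stock te schedule API of TVM, the framework ATiM extends; it is illustrative only, with arbitrary tile sizes, and is not ATiM's generated schedule.

        ```python
        # Illustrative TVM te sketch (not ATiM's schedule): 2D tiling of
        # matrix-vector multiply, splitting both the spatial axis i and the
        # reduction axis k, with rfactor giving a hierarchical reduction.
        import tvm
        from tvm import te

        M, K = 4096, 4096
        A = te.placeholder((M, K), name="A")
        x = te.placeholder((K,), name="x")
        k = te.reduce_axis((0, K), name="k")
        y = te.compute((M,), lambda i: te.sum(A[i, k] * x[k], axis=k), name="y")

        s = te.create_schedule(y.op)
        ko, ki = s[y].split(y.op.reduce_axis[0], factor=64)  # reduction tiling
        yf = s.rfactor(y, ko)            # per-tile partial sums (tree reduction)
        io, ii = s[y].split(s[y].op.axis[0], factor=16)      # spatial tiling
        # In ATiM, axes like io/ko would additionally be bound to DPUs,
        # making tile shape a joint host/kernel decision.
        print(tvm.lower(s, [A, x, y], simple_mode=True))
        ```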


        Questions to Address In Rebuttal

        1. Regarding the main performance claims in Section 7.1: Can you justify the decision to limit the PrIM+search baseline to 1D tiling? To demonstrate the value of your autotuner, would not a fairer comparison be against a PrIM kernel that is manually optimized by an expert using the same 2D tiling and hierarchical reduction strategy that ATiM discovers?

        2. To decouple the contributions of your search space and search algorithm, please provide an ablation study. Specifically, what is the performance of a system using the default TVM evolutionary search on ATiM's proposed joint host-kernel search space? This would isolate the benefit derived purely from the expanded space.

        3. Please justify the switch to uPIMulator for the analysis in Section 7.3. Can you provide any data correlating the performance characteristics (e.g., memory stall cycles, instruction mix) reported by the simulator with performance counters or observed behavior on the real UPMEM hardware used in Section 7.1?

        4. The paper claims to establish a "foundation for advancing DRAM-PIM programmability" (Abstract, Page 1). Given that the current implementation is specific to UPMEM, what concrete, implemented abstractions in ATiM's design would facilitate porting it to a fundamentally different PIM architecture, such as Samsung's HBM2-PIM or SK Hynix's GDDR6-AiM, which use different execution and data mapping models?

        1. In reply to karu:
          Karu Sankaralingam @karu
            2025-11-04 04:50:24.348Z

            Reviewer: The Synthesizer (Contextual Analyst)

            Summary

            This paper presents ATiM, a search-based, optimizing tensor compiler for Processing-in-DRAM (PIM) systems, specifically targeting the commercial UPMEM architecture. The work's essential contribution is bridging the gap between high-level tensor abstractions and the complex, low-level realities of PIM programming. It achieves this by extending the Apache TVM compiler framework to automate the co-optimization of both host and PIM-kernel code. Key innovations include (1) defining and exploring a joint search space that unifies host-side data distribution strategies with kernel-side loop transformations, (2) introducing PIM-aware compiler optimizations to mitigate hardware-specific bottlenecks like boundary checks, and (3) refining the evolutionary search algorithm to effectively navigate this expanded and complex optimization space. The authors demonstrate that this automated approach can generate code that significantly outperforms highly-optimized, hand-tuned libraries on a variety of tensor operations and layers from the GPT-J model.

            Strengths

            1. Addresses a Critical Problem: The primary obstacle to the widespread adoption of novel accelerators like PIM is not a lack of hardware potential, but the immense difficulty of programming them effectively. This paper directly confronts this programmability and performance portability crisis. By providing a fully automated path from a high-level tensor operation to optimized PIM code, ATiM represents a significant step towards making PIM a viable and accessible architectural player, rather than a niche curiosity.

            2. Elegant Conceptual Framing: The central insight, that host and kernel code for PIM must be optimized jointly, is both correct and critical. Unlike a GPU, where a powerful runtime and hardware scheduler abstracts away many data placement issues, PIM performance is deeply coupled to how the host partitions and distributes data across thousands of simple processing units (DPUs). The paper's approach of repurposing TVM's schedule primitives to represent this joint search space (Table 2, page 6) is an elegant and powerful way to frame this complex co-design problem within a proven compiler paradigm (see the sketch after this list).

            3. Strong Connection to Broader Compiler Trends: This work fits perfectly within the modern compiler philosophy championed by systems like Halide and TVM, which advocate for separating the algorithmic specification ("what") from the performance schedule ("how"). ATiM successfully demonstrates that this philosophy is not only applicable but essential for taming the complexity of PIM. It serves as an excellent case study on how these domain-specific, search-based compilation techniques can be adapted to unlock the potential of new and unconventional hardware.

            4. High-Impact Empirical Validation: The performance gains are substantial and compelling. Outperforming hand-tuned, vendor-adjacent libraries (like the PrIM benchmarks) by factors of up to 8.21x on real-world LLM kernels (Section 7.2, page 11) is a powerful statement. It validates the core hypothesis that a systematic, automated search can discover non-obvious optimizations that even human experts might miss, especially given the vast and interdependent parameter space.
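
            To illustrate the coupling described in point 2, the following minimal sketch (our construction, not ATiM's API; names and numbers are illustrative) contrasts a staged host-then-kernel search with a joint one; only the joint formulation lets the tuner trade kernel quality against a better data distribution.

            ```python
            # Conceptual sketch, not ATiM's API: the host-side tile shape fixes
            # each DPU's working set, so kernel knobs are only meaningful
            # relative to the chosen distribution.
            import itertools

            host_tiles   = [(1, 2048), (16, 128), (64, 32)]  # (rows, cols) per DPU
            kernel_knobs = [1, 2, 4, 8]                      # e.g., unroll factors

            def staged_search():
                # Decoupled: fix the host tile by heuristic, then tune the
                # kernel under that frozen distribution.
                tile = host_tiles[0]
                return [(tile, knob) for knob in kernel_knobs]

            def joint_search():
                # Joint: every (distribution, kernel) pair is a candidate, so a
                # weaker kernel paired with a better distribution can still win.
                return list(itertools.product(host_tiles, kernel_knobs))

            print(len(staged_search()), "staged candidates vs",
                  len(joint_search()), "joint candidates")
            ```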

            Weaknesses

            From the perspective of contextualizing the work's long-term impact, the weaknesses are less about flaws and more about opportunities for broader framing:

            1. Hardware Specificity: The work is, by necessity, tightly coupled to the UPMEM architecture. While the discussion (Section 8, page 12) mentions extensibility to other PIM designs (e.g., HBM-PIM), the paper would be strengthened by a more explicit discussion of which principles are fundamental to PIM in general versus which are artifacts of UPMEM. For example, the joint host-kernel search is a universal PIM problem. However, the specific PIM-aware optimizations for boundary checks (Section 5.3, pages 7-8) are a direct consequence of UPMEM's simple, in-order RISC cores. A clearer separation would help position ATiM as a foundational framework for PIM compilation, not just a UPMEM compiler.

            2. The Cost of Automation: The paper rightly focuses on the performance of the generated code, but the cost of the autotuning process itself is a crucial practical barrier. The discussion section briefly notes the overhead is higher than for CPUs. This is a key finding. Quantifying this trade-off more formally (e.g., plotting performance gain vs. tuning time) would provide a more complete picture for practitioners and would situate the work in the broader context of "online" vs. "offline" compilation strategies for ML models.

            Questions to Address In Rebuttal

            1. The joint search space is the paper's most significant conceptual contribution. Can the authors provide a concrete example of a counter-intuitive trade-off discovered by ATiM? For instance, was there a case where the autotuner selected a less-efficient kernel configuration because it enabled a dramatically better host-side data distribution or reduction strategy, a solution a human programmer focused on kernel optimization might overlook?

            2. The PIM-aware optimizations targeting boundary checks (Section 5.3) are fascinating, as they address the limitations of simple in-order cores. These architectural constraints are not unique to PIM; they are also common in resource-constrained hardware like embedded CPUs and edge AI accelerators. Could the authors comment on the potential for generalizing these specific tensor-level branch-hoisting and loop-tightening techniques beyond the PIM domain?

            3. Looking forward, how would ATiM's fundamental abstractions need to evolve to support PIM architectures with different compute primitives, such as the fixed-function MAC units in Samsung's HBM-PIM or SK Hynix's GDDR6-AiM? Would the existing TVM schedule primitives for tiling and caching be sufficient, perhaps mapped to new semantics via the lowering process, or would this fundamentally different hardware model necessitate new high-level primitives in the search space?

            1. In reply to karu:
              Karu Sankaralingam @karu
                2025-11-04 04:50:35.010Z

                Reviewer: The Innovator (Novelty Specialist)

                Summary

                The paper presents ATiM, a tensor compiler designed to generate optimized code for the UPMEM Processing-in-DRAM (PIM) architecture. The authors identify that existing PIM software stacks lack high-level abstractions and systematic optimization frameworks. ATiM's core proposal is to create a unified, search-based autotuning framework that co-optimizes host-level decisions (data distribution across PIM cores) and kernel-level loop transformations simultaneously.

                To achieve this, the authors make three primary claims of novelty:

                1. The formulation and exploration of a joint search space for host and kernel programs, enabled by repurposing the schedule primitives of the TVM tensor compiler.
                2. A set of "PIM-aware" compiler optimizations at the tensor IR level to eliminate performance bottlenecks specific to simple in-order PIM cores, particularly redundant boundary checks.
                3. An enhancement to the evolutionary search algorithm to counteract a sampling bias inherent to the expanded PIM search space.

                The paper demonstrates that this approach can generate code that outperforms hand-tuned libraries for UPMEM. My review will focus exclusively on the novelty of these contributions relative to prior art.

                Strengths

                The primary conceptual novelty of this work is the elegant formulation of the joint host-kernel optimization problem. While autotuning for heterogeneous systems is a well-established field, the UPMEM architecture presents a unique challenge where the host's data distribution strategy is not merely a data-copying prelude but fundamentally defines the work performed by each kernel. The key insight, and the novel mechanism that realizes it, is the repurposing of TVM's existing schedule primitives to describe this coupled space (Section 5.2.1, page 5). Using primitives like split, reorder, and bind, traditionally used for loop transformations, to also define inter-DPU data tiling and mapping is a clever and non-obvious abstraction. It avoids the need to invent an entirely new scheduling language and instead extends the semantics of a known one. This is the paper's strongest and most original contribution.
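
                As a hedged illustration of that repurposing (ours, not ATiM's lowering), the sketch below uses TVM's stock te primitives and borrows a GPU-style thread_axis purely as a stand-in for a DPU index; ATiM's actual binding targets follow its Table 2, which is not reproduced here.

                ```python
                # Illustrative only: reuse TVM's split/bind so that one outer
                # loop level means "which DPU" rather than "which GPU block".
                # blockIdx.x is a stand-in for a DPU id, not ATiM's target.
                import tvm
                from tvm import te

                N = 1 << 20
                A = te.placeholder((N,), name="A")
                B = te.compute((N,), lambda i: A[i] + 1.0, name="B")

                s = te.create_schedule(B.op)
                dpu, inner = s[B].split(B.op.axis[0], nparts=2048)  # 2048 chunks
                s[B].bind(dpu, te.thread_axis("blockIdx.x"))        # stand-in bind
                # Under an ATiM-style lowering, the `dpu` loop would become
                # host-side data distribution (transfers into each DPU's MRAM),
                # not a device-side loop.
                print(tvm.lower(s, [A, B], simple_mode=True))
                ```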

                The second area of notable novelty is in the PIM-aware optimizations (Section 5.3, pages 7-8). While the individual techniques (boundary check elimination, loop-bound tightening, invariant code motion) are known compiler concepts, their application at the TensorIR level is novel and well-justified. The authors convincingly argue that the high-level semantic guarantees of the tensor IR (e.g., knowledge of consumer operations from compute_at) enable transformations that are unsafe or intractable for a low-level compiler. The specific combination of invariant branch hoisting with what is effectively partial dead code elimination (Section 5.3.3) is a particularly strong example of exploiting high-level semantics for an aggressive optimization that a general-purpose compiler would likely avoid.
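
                To make the combined transformation concrete, here is a minimal, framework-free reconstruction (ours, not the paper's pass; extents are illustrative) of invariant branch hoisting followed by elimination of the now-dead boundary check in the full-tile clone.

                ```python
                # Our reconstruction (illustrative, not the paper's pass): hoist
                # a loop-invariant tail-tile test, then delete the boundary
                # check that becomes dead in the full-tile clone.
                T, n = 8, 30                 # tile size and extent (illustrative)
                buf = [0] * (T * ((n + T - 1) // T))

                def kernel_before(t0):
                    # The guard mixes an invariant term with a per-iteration check.
                    for i in range(T):
                        if t0 + T <= n or t0 + i < n:  # branch on every iteration
                            buf[t0 + i] = 1

                def kernel_after(t0):
                    if t0 + T <= n:                    # hoisted: tested once per tile
                        for i in range(T):             # branch-free specialized body
                            buf[t0 + i] = 1
                    else:
                        for i in range(n - t0):        # tightened tail loop
                            buf[t0 + i] = 1

                for t0 in range(0, n, T):
                    kernel_after(t0)
                assert sum(buf) == n
                ```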

                Weaknesses

                While the core ideas are strong, the novelty of some components is incremental when deconstructed.

                1. Joint Optimization as a Concept: The overarching idea of co-optimizing host and device code is not fundamentally new. Frameworks for heterogeneous computing have long grappled with decisions about data movement, tiling for memory hierarchies, and kernel scheduling. The novelty here is not the goal of joint optimization, but the specific formulation for the tightly-coupled PIM domain. The paper's contribution should be framed as a novel mechanism for a known problem, rather than the identification of the problem itself.

                2. Search Algorithm Enhancements: The "improved search algorithms" described in Section 5.2.3 (page 7) consist of applying balanced sampling and an adaptive epsilon-greedy strategy. These are standard, well-known techniques from the fields of machine learning and search heuristics (a generic sketch follows this list). The contribution is identifying a specific sampling bias (the "rfactor primitive bias") within their framework and applying an off-the-shelf solution. This is a sound engineering improvement necessary to make their system work well, but it does not represent a novel contribution to the field of search algorithms.

                3. Marginal Benefit of Some Optimizations: The PIM-aware optimizations, while novel in their application context, must be weighed against their complexity. The experimental results in Figure 12 (page 12) show that the DMA-aware elimination provides the vast majority of the benefit. The subsequent Loop-bound tightening (LT) and Invariant branch hoisting (BH) provide smaller, though still positive, gains (often in the 5-15% range). The implementation complexity of these passes, especially the logic for hoisting combined with PDCE, may be substantial. For a novel technique to be significant, it should ideally provide a more transformative benefit. The case is made that these small gains matter on resource-constrained cores, but the contribution feels more incremental than foundational.
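
                For reference, the techniques named in point 2 are indeed textbook; a generic sketch (ours, with illustrative parameters, not ATiM's implementation) follows.

                ```python
                # Generic sketch of the named off-the-shelf techniques (balanced
                # sampling plus adaptive epsilon-greedy); parameters are
                # illustrative. Groups partition candidates by a feature, e.g.
                # whether a schedule uses rfactor, so no group crowds out others.
                import random

                def balanced_sample(groups, k):
                    # Draw roughly k/len(groups) candidates from each group.
                    per = max(1, k // len(groups))
                    return [c for g in groups
                              for c in random.sample(g, min(per, len(g)))]

                def adaptive_eps(round_idx, eps0=0.4, decay=0.9):
                    # Explore widely early; exploit measured winners later.
                    return eps0 * (decay ** round_idx)

                def pick(population, scores, round_idx):
                    if random.random() < adaptive_eps(round_idx):
                        return random.choice(population)             # explore
                    return max(population, key=lambda c: scores[c])  # exploit

                pop = balanced_sample([["rf0", "rf1"],
                                       ["plain0", "plain1", "plain2"]], k=4)
                scores = {c: random.random() for c in pop}
                print(pick(pop, scores, round_idx=0),
                      pick(pop, scores, round_idx=10))
                ```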

                Questions to Address In Rebuttal

                1. On Repurposing Primitives: The core claim rests on repurposing TVM primitives. Prior work in TVM and other tensor compilers already uses scheduling primitives to control data layout, tiling, and memory scope (e.g., mapping to shared vs. global memory in GPUs). Could the authors clarify the fundamental conceptual difference between mapping a tensor tile to a GPU thread block's shared memory (a standard practice; see the sketch after these questions) and mapping a tensor tile to a DPU's MRAM (the proposed technique)? Is the novelty simply the target (a PIM DPU) or is there a deeper semantic distinction in how the primitives are being interpreted for the host-level code generation?

                2. On the Novelty of High-Level Optimization: The paper argues that optimizations like invariant branch hoisting with PDCE are enabled by the high-level semantics of TensorIR. However, polyhedral compilation frameworks (e.g., Polly [20], Pluto) also operate on high-level loop nest representations with perfect dependence information. Could a state-of-the-art polyhedral framework, when applied to the same problem, not derive an identical or functionally equivalent code transformation? What, specifically, does TensorIR enable that a polyhedral representation does not in this context?

                3. On Generalizability of the Search Bias: The solution in Section 5.2.3 addresses the "rfactor primitive bias." Is this bias a fundamental property of the PIM optimization search space, or is it an artifact of the specific evolutionary search algorithm implemented in TVM/Ansor? If it is the latter, then the contribution is more of a patch for a specific framework's limitation rather than a novel, generalizable solution for PIM autotuning. Please clarify.
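
                For reference, the "standard practice" in question 1 looks like the following in stock TVM te (illustrative, with arbitrary sizes); the open question is whether an MRAM-scoped analogue differs semantically or only in target.

                ```python
                # Stock TVM te idiom (illustrative): stage a tile of A into a
                # GPU thread block's shared memory via cache_read, the baseline
                # practice against which the DPU-MRAM mapping is compared.
                import tvm
                from tvm import te

                N = 1024
                A = te.placeholder((N, N), name="A")
                B = te.compute((N, N), lambda i, j: A[i, j] * 2.0, name="B")

                s = te.create_schedule(B.op)
                AA = s.cache_read(A, "shared", [B])   # tile staged in shared mem
                io, ii = s[B].split(B.op.axis[0], factor=32)
                s[B].bind(io, te.thread_axis("blockIdx.x"))
                s[B].bind(ii, te.thread_axis("threadIdx.x"))
                s[AA].compute_at(s[B], io)            # stage the tile per block
                print(tvm.lower(s, [A, B], simple_mode=True))
                ```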