OptiPIM: Optimizing Processing-in-Memory Acceleration Using Integer Linear Programming
Processing-in-memory (PIM) accelerators provide superior performance and energy efficiency to conventional architectures by minimizing off-chip data movement and exploiting extensive internal memory bandwidth for computation. However, efficient PIM ... (ACM DL Link)
Karu Sankaralingam @karu
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
The authors present OptiPIM, a framework that uses Integer Linear Programming (ILP) to generate what they claim are "optimal" mappings for data-intensive operators on Processing-in-Memory (PIM) accelerators. The central thesis is that existing mapping frameworks designed for ASIC accelerators are inadequate for PIM architectures due to PIM's stringent data layout requirements and the resulting complex cost trade-offs. The paper's primary contributions are a new "PIM-friendly" mapping representation based on a nested loop structure with an expanded set of indexing functions, and a more accurate cost model for estimating data movement that accounts for these layouts. The framework is evaluated on two PIM architectures (HBM-PIM and SIMDRAM) and shows significant speedups over heuristic and adapted ASIC-based mapping methods.
While the problem addressed is both timely and significant, I harbor serious reservations regarding the central claims of optimality, the methodological soundness of the baseline comparisons, and the generalizability of the reported results. The work rests on a foundation that appears strong at first glance but reveals critical weaknesses upon rigorous inspection.
Strengths
- Problem Motivation: The paper does an excellent job in Section 3 (page 3) of motivating why PIM mapping is a distinct and challenging problem. The example in Figure 2 clearly illustrates how different computation schedules impose vastly different data layout, replication, and movement costs, correctly identifying the core deficiencies of layout-agnostic mapping approaches.
- Improved Cost Estimation for Unique Indices: The authors correctly identify a key source of inaccuracy in prior work: the use of loop bound products to estimate data size. The proposed estimation function for unique indices in a linear combination (Equation 11, page 8) is more mathematically principled and demonstrably more accurate (Figure 7) than the naive upper-bound approach; a minimal sketch after this list illustrates the gap. This is a solid, albeit localized, contribution.
- Validation against Exhaustive Search: The claim of optimality is, for the specific workloads tested, substantiated by a comparison against a full exhaustive search (Section 7.5.3, page 12). This provides confidence that, within the confines of their defined search space and cost model, the ILP formulation is indeed finding the best solution.
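To make the estimation gap concrete, the following brute-force sketch (my own illustration; the coefficients and bounds are assumed, and this is not the paper's closed-form Equation 11) compares the product-of-bounds estimate against the exact number of unique indices for an overlapping, convolution-style access:

```python
from itertools import product

def unique_index_count(coeffs, bounds):
    """Exact count of distinct values of sum(c * x) with each loop
    variable x ranging over [0, bound)."""
    return len({sum(c * x for c, x in zip(coeffs, xs))
                for xs in product(*(range(b) for b in bounds))})

# Convolution-style input index: idx = 1*i + 1*j (sliding window).
coeffs, bounds = (1, 1), (16, 3)
naive = bounds[0] * bounds[1]               # product of loop bounds: 48
exact = unique_index_count(coeffs, bounds)  # i + j spans 0..17: 18 values
print(naive, exact)                         # 48 18, roughly a 2.7x over-estimate
```

A closed-form version of this count is what an ILP objective needs; the point here is only how loose the product bound becomes once accesses overlap.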
Weaknesses
- The Claim of "Optimality" is Circular and Unverified: The paper's most prominent claim is its ability to generate optimal mappings. However, this optimality is only with respect to its own analytical cost model. This model's "ground truth" for validation is a custom-modified version of the Ramulator 2 simulator (Section 7.1, page 10 and Artifact Appendix A.1, page 15). The authors provide no evidence that this simulator has been validated against real PIM hardware or a trusted hardware model. The high correlation shown in Figure 9 only proves that the analytical model is a faithful abstraction of their simulator; it does not prove that either accurately reflects reality. Without validation against a genuine hardware ground truth, the "optimality" is merely a theoretical construct within a self-contained system.
- Fundamentally Flawed Baseline Comparison: The comparison against the ASIC-based mapping method (COSA [23]) is a classic strawman argument. The authors state they "adopt COSA’s ILP optimization... without considering the fine-grained data layout-related costs" (Section 6.4.3, page 10). This is an inherently unfair comparison. The entire premise of the paper is that these costs are critical. By disabling the very capability that would make the baseline competitive, the authors are not comparing OptiPIM to a state-of-the-art ASIC mapper adapted for PIM; they are comparing it to a deliberately crippled baseline. A scientifically valid comparison would have involved attempting to integrate the new PIM-specific cost functions into the COSA framework to see if it could then find a comparable mapping. The current results only prove that ignoring PIM costs leads to poor PIM performance, which is a trivial conclusion.
- Scalability of the ILP Formulation is Not Proven: The authors report a maximum optimization time of 4 minutes (Abstract, page 1), suggesting the ILP approach is highly efficient. This claim is based on a limited set of operators from common ML models. ILP problem complexity can scale exponentially with the number of variables and constraints. The paper provides no analysis or empirical data on how the solver runtime scales with key problem parameters, such as the number of loop nests, the magnitude of loop bounds, or the number of tensors. A single, more complex operator from a different domain could potentially cause the solver time to explode, rendering the approach impractical. The claim of efficiency is therefore not robustly supported.
- Novelty and Expressiveness of the Mapping Representation is Overstated: The proposed mapping representation is described as a key innovation. However, the hierarchical nested loop structure for PU, column, and computation scheduling (Section 4.2, page 5) is conceptually equivalent to standard loop tiling and vectorization, common in many compiler frameworks. The main novelty lies in the exploration of 6 coefficient combinations for the indexing function (Section 4.3, page 6). While this is an improvement over a fixed stride-1 indexing, the paper fails to discuss why this fixed-structure approach is superior to more general and powerful representations like the polyhedral model, which can represent a much broader class of transformations and data layouts. It is not clear that the proposed representation can capture all potentially useful mappings.
Questions to Address In Rebuttal
- Regarding the ASIC Baseline: Please justify the decision to evaluate the COSA framework without incorporating the PIM-specific cost terms you identified. Could the authors instead augment the ASIC baseline's cost model with these terms and re-evaluate? If this is not possible, please provide a detailed explanation of the fundamental limitations in the baseline's representation that prevent such an integration.
- Regarding the Ground Truth for Optimality: How was the modified Ramulator 2 simulator itself validated? Can you provide any data correlating your simulator's performance predictions with either real commercial PIM hardware (e.g., Samsung's HBM-PIM or AXDIMM) or a hardware vendor's internal, validated simulation models?
- Regarding ILP Scalability: Beyond the reported runtimes, can the authors provide data on how the ILP solver time scales as operator complexity (e.g., number of loops, size of loop bounds, complexity of indexing functions) increases? Please characterize the point at which the "within 4 minutes" claim no longer holds.
- Regarding the Mapping Representation: Please clarify why the proposed nested loop structure with 6 indexing combinations is fundamentally more expressive or efficient for PIM mapping than existing, more general compiler representations like polyhedral models, which are also amenable to ILP-based scheduling. What specific, valid mappings can your representation express that a polyhedral model cannot, or vice versa?
Karu Sankaralingam @karu (in reply to @karu)
Review Form: The Synthesizer
Summary
This paper presents OptiPIM, a framework for optimizing the mapping of data-intensive workloads onto digital Processing-in-Memory (PIM) architectures. The core contribution is the formulation of this complex mapping problem as an Integer Linear Programming (ILP) problem. The authors rightly argue that existing mapping techniques, designed for conventional ASIC accelerators, are insufficient for PIM due to its strict data layout requirements, which introduce significant costs related to data movement and replication that prior models neglect.
To enable this ILP formulation, the paper introduces two key technical innovations: 1) a novel, layout-aware nested loop representation that explicitly models the hierarchical structure of PIM systems (PU allocation, column allocation, and computation scheduling), and 2) more accurate cost models, particularly for estimating the number of unique memory accesses, which prove to be far more precise than the upper-bound estimates used in previous work. The framework is implemented in MLIR and evaluated on two representative digital PIM architectures (HBM-PIM and SIMDRAM). The results demonstrate that OptiPIM can find provably optimal mappings in minutes, delivering significant performance improvements (at least 1.9x, and much higher in many cases) over established heuristic and search-based methods.
Strengths
This is an excellent and highly significant piece of work. Its primary strength lies in bringing formal optimization to a problem space that has, until now, been dominated by heuristics and ad-hoc methods.
- Clear and Compelling Problem Definition: The authors do a masterful job in Section 3 ("Motivation", page 3) of articulating why the PIM mapping problem is fundamentally different from the well-studied ASIC accelerator mapping problem. The example in Figure 2 is particularly illustrative, clearly showing how different mapping choices lead to vastly different data layouts, memory footprints, and data movement costs. This precise problem definition elevates the paper beyond a simple application of a known technique (ILP) to a new domain.
- Foundational Contribution to PIM Compilation: By successfully formulating the PIM mapping problem for an ILP solver, OptiPIM provides an "oracle" for the community. It establishes a theoretical performance ceiling for a given operator on a given PIM architecture. This is invaluable; it allows researchers to quantify the performance gap of any new, faster heuristic they might develop. It moves the field from "this mapping seems fast" to "this mapping is X% of optimal."
- Technically Sound and Novel Representation: The proposed layout-aware representation (Section 4, page 5) is well-suited to the problem. The explicit separation of PU allocation, column allocation, and scheduling maps cleanly to the physical realities of PIM hardware (see the schematic loop nest after this list). Furthermore, the exploration of different data indexing functions (Section 4.3, page 6) is a subtle but powerful enhancement that expands the design space beyond what previous rigid mapping representations allowed. The accurate cost function for unique inputs (Section 5.3, page 8 and Figure 7) is a concrete improvement over prior art and is critical for the ILP solver to find a truly optimal solution.
- Rigorous and Insightful Evaluation: The experimental methodology is comprehensive. The validation against a cycle-accurate simulator (Figure 9, page 10) builds confidence in the analytical model. The choice of baselines—including a state-of-the-art PIM heuristic, a search-based method, and an adapted ASIC mapping framework (COSA)—is strong and allows for a nuanced comparison. The performance breakdown plots (Figure 10, page 10) are particularly insightful, clearly showing that OptiPIM wins by minimizing the layout-dependent costs (like input loading) that other methods ignore or model poorly.
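For concreteness, here is how I picture that three-level structure as a loop nest; this is a minimal sketch under my own assumed names and sizes, not the paper's notation:

```python
# Schematic of the three-level hierarchy as I read Section 4.2; names,
# sizes, and the stride-1 linearization below are mine, not the paper's.
NUM_PUS, COLS_PER_PU, STEPS = 4, 8, 16  # assumed toy sizes

def enumerate_mapping():
    mapping = []
    for pu in range(NUM_PUS):           # level 1: PU allocation (spatial)
        for col in range(COLS_PER_PU):  # level 2: column allocation (spatial)
            for t in range(STEPS):      # level 3: computation scheduling (temporal)
                # A fixed stride-1 linearization of the iteration space;
                # Section 4.3's contribution is to also explore which
                # coefficient attaches to which loop level.
                idx = pu * (COLS_PER_PU * STEPS) + col * STEPS + t
                mapping.append((idx, pu, col, t))
    return mapping

assert len(enumerate_mapping()) == NUM_PUS * COLS_PER_PU * STEPS
```

Each choice of how an operator's loops are tiled into these three levels, combined with the indexing coefficients, is one point in the space the ILP searches.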
Weaknesses
The weaknesses of this paper are largely related to the inherent limitations of its chosen approach (ILP) and the scope of the problem addressed, rather than flaws in the execution of the research itself.
- Scalability of the ILP Formulation: The classic Achilles' heel of ILP is its computational complexity. While the authors demonstrate impressive optimization times, under 4 minutes per operator, these results are for single operators only. The complexity of the ILP problem can grow exponentially with the number of variables and constraints. It is unclear how this approach would scale to much larger fused operators, or how it would fare if applied to an entire computational graph simultaneously rather than operator by operator. The multi-operator optimization discussed in Section 5.6 is a good start but is not explored in depth.
- Static Workload Assumption: The framework operates at compile-time and assumes that the dimensions of all tensors are known statically. This is a reasonable assumption for many vision models but is increasingly challenged by modern workloads, such as Transformers with variable sequence lengths or dynamic graph structures. The discussion in Section 7.7 acknowledges this, but it remains a significant practical barrier to applying OptiPIM directly in more dynamic execution environments.
Questions to Address In Rebuttal
The paper is excellent and I am strongly in favor of acceptance. The following questions are intended to help the authors strengthen the final version and contextualize the work's limitations.
- On the Nature of Optimality: The paper focuses on optimizing individual operators. Could the authors comment on how the local optimality for each operator translates to global optimality for a full network? Is it possible that the optimal mapping for layer N results in a data layout that is highly inefficient for consumption by layer N+1, thereby requiring a costly data re-layout that negates the gains? Does the multi-operator formulation in Section 5.6 account for these inter-operator layout transformation costs?
- On the Practicality for Compilers: Given the potential scalability concerns of ILP, do the authors envision OptiPIM being used directly in a production compiler, or do they see its primary role as an offline tool to generate a library of optimal mappings for common operator shapes? Alternatively, could the insights from the ILP solutions be used to train a machine learning model that could predict near-optimal mappings much more quickly?
- On Architectural Generalization: The framework models two digital PIM architectures. How adaptable is the proposed representation and cost modeling to other classes of PIM, for instance, analog compute-in-memory (CIM) systems? What new constraints or cost functions (e.g., related to ADC/DAC precision, noise, or non-ideal device behavior) would need to be incorporated, and would the problem still be tractable within an ILP framework?
Karu Sankaralingam @karu (in reply to @karu)
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present OptiPIM, a framework for generating optimal software-hardware mappings for Processing-in-Memory (PIM) architectures using Integer Linear Programming (ILP). The central claim to novelty is not the use of ILP for mapping—a technique well-established in the ASIC accelerator domain—but rather the development of a novel, PIM-specific mapping representation and associated cost models that make the ILP formulation both feasible and accurate for the unique constraints of PIM. Specifically, the authors' claimed novel contributions are: (1) a layout-aware, multi-level nested loop representation that explicitly models PU allocation, column allocation, and computation scheduling; (2) a "Comprehensive Data Indexing" method that explores a bounded set of linear transformations for data indexing, moving beyond the rigid stride-1 indexing of prior work; and (3) a more accurate cost function for estimating data sizes, which is critical for modeling PIM's layout-dependent costs.
Strengths
The primary strength of this work, from a novelty perspective, lies in its precise and well-reasoned departure from prior art. The authors correctly identify that existing ILP-based mapping frameworks for ASICs (e.g., COSA [23]) are fundamentally unsuitable for PIM due to their assumptions about data layout and movement. The novel contributions directly address these identified gaps:
- Comprehensive Data Indexing (Section 4.3, Page 6): This is the paper's most significant and genuinely novel contribution. Prior mapping frameworks typically assume a fixed, tiled indexing scheme (e.g., x = x_outer * tile_size + x_inner). The authors' formalization that multiple bijective linear transformations are possible, and that for a 3-level tiling there are exactly 6 such combinations (Figure 6b), is a novel insight. Incorporating the choice between these functions as a variable within the ILP formulation is a powerful and original technique that significantly expands the explorable design space to find better data layouts (a small enumeration follows this list).
- Accurate Cost Modeling for Unique Indices (Section 5.3.3, Page 8): The second key novel element is the proposed cost estimation function for the number of unique inputs generated by a linear combination of loop variables (Equation 11). The paper provides a clear empirical demonstration in Figure 7 that the prior art's method (product of loop bounds) is highly inaccurate for this task, whereas their proposed function is very precise. This is not merely an incremental improvement; it is a necessary correction to make the ILP objective function meaningful for PIM, where data movement costs stemming from layout are dominant.
- Structured PIM Mapping Representation (Section 4.2, Page 5): While hierarchical loop representations are not new in themselves, the specific three-level structure proposed (PU Allocation, Column Allocation, Computation Scheduling) provides a clean and effective abstraction for the PIM mapping problem. It correctly separates the different granularities of parallelism and data locality inherent to PIM architectures, making it amenable to a concise ILP formulation. This structured approach is a novel formalization tailored specifically for the PIM domain.
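To ground the 3! = 6 count, here is a small enumeration, assuming (as I read Figure 6b) that the six indexing functions correspond to the mixed-radix linearizations induced by the six orderings of the three tiling levels; the factor values are arbitrary:

```python
from itertools import permutations

f2, f1, f0 = 2, 3, 4  # tiling factors for the three loop levels (assumed values)
N = f2 * f1 * f0
levels = {"x2": f2, "x1": f1, "x0": f0}

for order in permutations(levels):  # 3! = 6 orderings, fastest-varying level first
    coeff, coeffs = 1, {}
    for name in order:
        coeffs[name] = coeff        # mixed-radix: a level's coefficient is the
        coeff *= levels[name]       # product of all faster levels' factors
    # Each ordering yields a bijection from (x2, x1, x0) onto [0, N).
    values = {coeffs["x2"] * x2 + coeffs["x1"] * x1 + coeffs["x0"] * x0
              for x2 in range(f2) for x1 in range(f1) for x0 in range(f0)}
    assert values == set(range(N))
    print(order, coeffs)
```

The l! growth flagged in the questions below falls directly out of this enumeration: each additional tiling level multiplies the number of valid orderings.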
Weaknesses
The paper's claims of novelty should be carefully contextualized. While the specific formulation is new, the foundational components are built upon extensive prior work.
- Overstated Novelty of the ILP Approach: The use of ILP to solve the accelerator mapping problem is not novel. This has been the core technique in works like COSA [23] and LEMON [52]. The authors' novelty lies entirely in the adaptation and formulation of the problem for PIM, not in the choice of ILP as a solver. The abstract and introduction could be clearer on this distinction, emphasizing that the novelty is the representation and modeling that enables an effective ILP solution for PIM.
- Insufficient Differentiation from PIM-Specific Mapping Frameworks: The paper primarily contrasts its work against ASIC mapping frameworks. However, other frameworks exist for PIM mapping, such as ARES [7], which also proposes a general representation. While the authors correctly state that ARES uses search-based algorithms, the paper would be stronger if it included a more direct technical comparison of its representation to that of ARES in the main body. A brief discussion on why the ARES representation is unsuitable for an exact ILP formulation, and how OptiPIM's novel representation overcomes this, would more firmly establish its unique contribution within the PIM-specific literature.
Questions to Address In Rebuttal
- Scalability of Comprehensive Indexing: Section 4.3 and Figure 6 show that for 3 loop levels, there are 3! = 6 valid coefficient combinations to explore. This seems tractable. However, how does this novel aspect of the formulation scale as the number of tiled loop levels (l) increases? The number of combinations would grow as l!. At what point does this enumeration of indexing choices become a bottleneck that makes the ILP problem intractable to formulate or solve?
- Novelty vs. ARES Representation: The related work section mentions ARES [7], which proposes a "general representation" for mapping operators onto diverse PIM architectures. Could the authors please elaborate on the fundamental limitations of the ARES representation that make it ill-suited for an ILP-based optimization approach? What specific, novel features of the OptiPIM representation enable the concise and efficient ILP formulation presented in this work, which would not be possible if starting from the ARES abstractions?
- Justification for Linearity of Transformation: The novel indexing in Section 4.3 explores linear transformations of the form d*x2 + e*x1 + g*x0. Is there a theoretical or architectural justification for limiting the search to only linear functions? Have the authors considered whether non-linear indexing functions could provide further benefits, and if so, why were they excluded from the formulation?