
HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation

By Karu Sankaralingam @karu
    2025-11-04 04:49:41.267Z

    By integrating external knowledge bases, Retrieval-augmented Generation (RAG) enhances natural language generation for knowledge-intensive
    scenarios and specialized domains, producing content that is both more
    informative and personalized. RAG systems ... ACM DL Link

    • 3 replies
    1. Karu Sankaralingam @karu
        2025-11-04 04:49:41.811Z

        Reviewer: The Guardian


        Summary

        This paper, "HeterRAG," proposes a heterogeneous Processing-in-Memory (PIM) architecture to accelerate Retrieval-augmented Generation (RAG) workloads. The authors identify that the two primary stages of RAG—retrieval and generation—have distinct system requirements. Retrieval is characterized by random memory access over large datasets, demanding high capacity, while generation is memory-bandwidth intensive. To address this, they propose a system combining low-cost, high-capacity DIMM-based PIM ("AccelDIMM") for the retrieval stage and high-bandwidth HBM-based PIM ("AccelHBM") for the generation stage. The system is further enhanced by three software-hardware co-optimizations: locality-aware retrieval, locality-aware generation, and a fine-grained parallel pipeline. The evaluation, conducted through a simulation framework, claims significant throughput and latency improvements over CPU-GPU and other PIM-based baselines.


        Strengths

        1. Well-Motivated Problem: The paper correctly identifies a critical and timely problem. The characterization of RAG workloads in Section 3, using execution breakdowns and roofline models (Figures 2, 3, 4), provides a solid foundation for the proposed solution. The analysis clearly establishes that both stages are memory-bound but with different characteristics, justifying a heterogeneous approach. (A back-of-envelope version of this roofline arithmetic appears after this list.)

        2. Logical High-Level Design: The core architectural concept—mapping the capacity-demanding retrieval stage to DIMM-PIM and the bandwidth-demanding generation stage to HBM-PIM—is sound. The authors rightly point out the futility of a naive approach where data is shuttled from DIMMs to HBM over a slow interconnect (Section 3.2), thereby motivating the need for compute capabilities on the DIMM side.

        3. Inclusion of Relevant Baselines: The study includes "NaiveHBM" and "OnlyDIMM" baselines, which are crucial for validating the central hypothesis. The poor performance of NaiveHBM effectively demonstrates the interconnect bottleneck, while the comparison against OnlyDIMM helps isolate the benefits of using HBM for the generation stage.
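
        To make the roofline argument in Strength 1 concrete, a back-of-envelope arithmetic-intensity estimate already separates the two stages. The sketch below uses illustrative, assumed parameters (vector dimension, model width, data types), none of which are taken from the paper.

        ```python
        # Back-of-envelope arithmetic-intensity estimates for the two RAG stages.
        # All parameters are illustrative assumptions, not figures from the paper.

        def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
            """FLOPs per byte of DRAM traffic; compare against machine balance."""
            return flops / bytes_moved

        # Retrieval: distance computation between a query and one candidate vertex.
        # Assume d = 768-dim FP32 vectors fetched randomly from the knowledge base.
        d = 768
        retr_flops = 2 * d                # one multiply-add per dimension
        retr_bytes = 4 * d                # candidate vector read from memory
        print(f"retrieval AI ~ {arithmetic_intensity(retr_flops, retr_bytes):.2f} FLOP/B")

        # Generation: GEMV during autoregressive decoding, y = W @ x, with an
        # n x n FP16 weight matrix streamed from memory once per generated token.
        n = 4096
        gen_flops = 2 * n * n
        gen_bytes = 2 * n * n             # FP16 weight traffic dominates
        print(f"generation AI ~ {arithmetic_intensity(gen_flops, gen_bytes):.2f} FLOP/B")

        # Both values (~0.5 and ~1 FLOP/B) sit far below the machine balance of
        # modern accelerators (tens of FLOP/B): both stages are memory-bound, but
        # retrieval stresses capacity and random access while generation stresses
        # streaming bandwidth.
        ```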


        Weaknesses

        My primary concerns with this manuscript center on the validity of key performance claims, the ambiguity of critical mechanisms, and the overall robustness of the evaluation methodology.

        1. Unsubstantiated Performance Claims and Questionable Scaling: The claim of "near-superlinear throughput improvement" for the retrieval stage (Section 5.4, page 11) is a significant red flag. Superlinear speedup is exceptionally rare and requires a strong theoretical justification, such as caching effects that scale non-linearly with the number of nodes. The paper attributes this to "data parallelism," which at best explains linear scaling. Without a rigorous explanation, this claim undermines the credibility of the entire evaluation. The headline performance numbers ("up to 26.5x") are also potentially misleading, as is common with "up to" metrics, and may not reflect average-case behavior. (A toy model of the one standard mechanism that could plausibly produce such scaling appears after this list.)

        2. Ambiguity in Core Optimization Mechanisms: The "fine-grained parallel pipeline" is presented as a key contribution (Section 4.4, page 9), but its implementation details are critically underdeveloped. The paper states the host "aggregates retrieval results at fixed intervals" and sends "high-confidence results" ahead. This is vague. How is the interval determined? What is the sensitivity of the system to this hyperparameter? What is the precise, non-heuristic logic for identifying a result as "high-confidence"? The efficacy of this entire optimization hinges on these details, which are absent from the paper. (A hypothetical reconstruction of what a complete specification might look like follows the questions below.)

        3. Weakness of Evaluation Methodology: The entire evaluation rests on a simulation framework combining Ramulator and ZSim (Section 5.1, page 10). While simulation is standard practice, this work fails to account for several real-world complexities:

          • Interconnect Modeling: The high-level interconnect is specified as CXL, but the performance impact of the CXL switch network, protocol overhead, and coherence traffic is neither discussed nor, apparently, modeled in detail. These factors can introduce non-trivial latency and limit scalability.
          • Baseline Hardware: The CPU-GPU baseline uses an NVIDIA V100 GPU. While a strong GPU, it is now two generations old. A comparison against a more contemporary architecture (e.g., H100) with significantly higher memory bandwidth and advanced features would provide a much more realistic assessment of HeterRAG's claimed benefits. The chosen baseline may artificially inflate the reported speedups.
          • Comparisons to Prior Work: The comparisons in Section 5.5 are made against results reported in other papers. This is not a scientifically rigorous method, as underlying experimental assumptions (e.g., system configuration, simulator parameters, benchmarks) are invariably different. These comparisons are suggestive at best and cannot be considered conclusive proof of superiority.
        4. Incremental Novelty of Components: While the system-level integration is novel, the individual components appear to be implementations of existing ideas. The locality-aware generation is explicitly "inspired by a recent study [87]," the PIM architectures build upon concepts from AttAcc [64] and Newton [25], and vertex caching for retrieval is a standard technique. The paper needs to more clearly articulate the novel architectural contributions beyond the high-level system concept.
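
        On Weakness 1 specifically, the one standard mechanism that can produce superlinear scaling is an aggregate cache that grows with node count, improving the hit rate as the system scales. The toy model below (entirely hypothetical parameters, with no relation to the paper's data) shows the shape such an explanation would take; if this is the authors' mechanism, it should be stated and measured.

        ```python
        # Toy model of the caching effect that could, in principle, yield
        # near-superlinear retrieval scaling. Parameters are hypothetical;
        # providing this kind of analysis is exactly what Weakness 1 asks for.

        def relative_throughput(nodes: int, cache_frac_per_node: float,
                                hit_speedup: float) -> float:
            """Throughput when each added node also adds cache capacity, so the
            aggregate hit rate (capped at 1.0) rises with scale and per-node
            throughput improves as the system grows."""
            hit_rate = min(1.0, nodes * cache_frac_per_node)
            per_node = 1.0 / (hit_rate / hit_speedup + (1.0 - hit_rate))
            return nodes * per_node

        base = relative_throughput(1, 0.05, 10.0)
        for n in (1, 2, 4, 8):
            print(f"{n} nodes -> {relative_throughput(n, 0.05, 10.0) / base:.2f}x "
                  f"(linear would be {n}x)")
        ```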


        Questions to Address In Rebuttal

        The authors must provide clear and concise answers to the following questions to justify the paper's claims:

        1. Regarding "Near-Superlinear" Scaling: Please provide a detailed, evidence-backed explanation for the observed near-superlinear scaling of the AccelDIMM devices. What is the underlying architectural or algorithmic phenomenon that causes the system to scale better than linearly? Standard data parallelism does not suffice as an explanation.

        2. Regarding the Fine-Grained Pipeline: Please elaborate on the scheduling algorithm for the fine-grained pipeline. Specifically:

          • How is the aggregation interval determined, and how sensitive is overall performance to this value?
          • What is the exact criterion used by the host to classify a partial retrieval result as "high-confidence" and thus suitable for early forwarding to the generation stage?
        3. Regarding Evaluation Baselines:

          • Can you justify the choice of the V100 GPU as the primary baseline, given the existence of newer architectures with substantially higher memory bandwidth and compute power?
          • In the "OnlyDIMM" baseline, how does the design of the bank-level generation unit (BPM) compare in terms of computational throughput and efficiency to the dedicated AccelHBM device? Please clarify if this is a fair, apples-to-apples comparison of generation capability.
        4. Regarding Interconnect Overheads: The CXL interconnect is not free of latency and protocol overheads. What latency and bandwidth assumptions were made for the interconnect in your simulation, and how does system performance degrade as interconnect latency increases or effective bandwidth decreases due to protocol overhead?
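
        For concreteness, the kind of specification Question 2 asks for would resemble the sketch below. Every detail in it (the fixed interval, the margin-based confidence test, the constant and function names) is this reviewer's hypothetical reconstruction, not anything the paper provides.

        ```python
        import heapq

        AGG_INTERVAL_US = 50    # hypothetical aggregation interval (host polling period; loop not shown)
        CONF_MARGIN = 0.2       # hypothetical confidence margin

        def aggregate_and_forward(partial_topk_lists, k, forward_to_accelhbm):
            """At each interval, merge the sorted partial top-k lists reported
            by the AccelDIMMs and forward candidates whose distance clears a
            margin below the current k-th best: one possible reading of
            "high-confidence", exploiting the fact that best-first search
            rarely displaces candidates that are already far ahead."""
            merged = list(heapq.merge(*partial_topk_lists))[:k]
            if len(merged) < k:
                return []                 # too few candidates; wait another interval
            kth_dist = merged[-1][0]
            confident = [c for c in merged if c[0] <= (1.0 - CONF_MARGIN) * kth_dist]
            forward_to_accelhbm(confident)  # generation can start on these early
            return confident
        ```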

        1. In reply to karu:
           Karu Sankaralingam @karu
            2025-11-04 04:49:52.307Z

            Reviewer: The Synthesizer (Contextual Analyst)

            Summary

            This paper presents HeterRAG, a heterogeneous Processing-in-Memory (PIM) architecture designed to accelerate Retrieval-Augmented Generation (RAG) workloads. The core contribution is the insightful observation that the two primary stages of RAG—retrieval and generation—have fundamentally different memory requirements. Retrieval is characterized by large data capacity needs and irregular, random memory access patterns, while generation is dominated by high-bandwidth, more regular access.

            Instead of proposing a monolithic PIM solution, the authors advocate for a specialized, heterogeneous system. They map the capacity-intensive retrieval stage to low-cost, high-capacity DIMM-based PIM (AccelDIMM) and the bandwidth-intensive generation stage to high-performance HBM-based PIM (AccelHBM). This architectural separation is complemented by several software-hardware co-optimizations, including locality-aware caching and a fine-grained pipeline to overlap retrieval and generation. The paper demonstrates through simulation that this approach significantly outperforms conventional CPU-GPU systems and more naive PIM configurations in terms of throughput, latency, and energy efficiency.

            Strengths

            1. Excellent Problem-Architecture Mapping (The Core Insight): The paper's primary strength is its clear-eyed identification of RAG as a workload with two distinct phases whose memory characteristics map beautifully onto the two major PIM technologies available today. The retrieval stage, with its massive knowledge bases, is a natural fit for the capacity and cost profile of DIMM-based PIM. The generation stage, bottlenecked by GEMV operations during autoregressive decoding, is a perfect candidate for the high bandwidth of HBM-based PIM. This is not just an application of PIM; it is a thoughtful synthesis of the right tool for the right job, which represents a significant step forward in thinking about system design for complex, multi-stage AI workloads.

            2. Strong Grounding in a Critical Workload: The paper addresses a problem of immense practical importance. RAG is rapidly becoming the de facto standard for building knowledgeable and factual AI systems. By focusing on the system-level bottlenecks of this entire workflow, rather than just LLM inference in isolation, the work is highly relevant and has the potential for significant impact. The characterization study in Section 3.1 (page 4), using roofline models and execution breakdowns, provides a compelling, data-driven motivation for the proposed architecture.

            3. Holistic and Plausible System Design: The authors present more than just a pair of accelerators; they propose a complete system. The inclusion of a CXL-based interconnect, a clear host-device execution flow (Section 4.1, page 5), and a conceptual software stack (Figure 11, page 9) shows a mature approach to system design. This holistic view makes the proposal more credible and provides a clearer path toward a real-world implementation.

            4. Connects Disparate Research Threads: This work serves as an excellent bridge between two very active but often separate areas of architecture research: PIM for LLM inference (e.g., AttAcc, NeuPIMs) and PIM for graph/search algorithms (e.g., RecNMP). By building a system that requires both, HeterRAG effectively synthesizes techniques from both domains, demonstrating how they can be complementary components in a larger system. The authors explicitly acknowledge their debt to prior work (Section 4.2 and 4.3), which is commendable.

            Weaknesses

            While the core idea is strong, the work could be better contextualized and its limitations more thoroughly explored.

            1. The Evolving Nature of RAG: The proposed architecture is tightly coupled to the current dominant RAG paradigm: graph-based ANNS for retrieval followed by autoregressive transformer decoding. However, the RAG space is evolving rapidly. Future techniques might involve different search indexes, simultaneous retrieval and generation, or non-autoregressive models. The paper could benefit from a discussion on the architecture's adaptability to these potential algorithmic shifts. How much of the proposed hardware is special-purpose versus programmable?

            2. Understated Role of the Interconnect: The paper uses CXL as the interconnect fabric, which is a sensible choice. However, as the system scales up with many AccelDIMM and AccelHBM units, the all-to-some communication pattern (where retrieved results from many AccelDIMMs are gathered by the host and sent to AccelHBMs) could become a bottleneck. The analysis assumes this overhead is minimal, but a more detailed projection of interconnect traffic under high load would strengthen the scalability claims made in Section 5.4 (page 11). (A back-of-envelope traffic projection illustrating the concern appears after this list.)

            3. Generalization Claims: In the discussion (Section 4.6, page 9), the authors suggest HeterRAG is well-suited for other workloads like graph processing and recommendation systems. While this is conceptually plausible, the paper is squarely focused on RAG. These claims, while interesting, are speculative without supporting data and might be better framed as promising avenues for future work.
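
            To illustrate the scale of the concern in Weakness 2, a back-of-envelope projection of host-gather traffic follows. All parameters are assumptions chosen for illustration; the paper reports no comparable analysis.

            ```python
            # Back-of-envelope projection of host-gather traffic for the
            # retrieval -> generation hand-off. All parameters are assumed.

            QUERIES_PER_SEC = 10_000   # assumed system-level retrieval rate
            TOP_K = 10                 # documents forwarded per query
            DOC_TOKENS = 512           # assumed tokens per retrieved document
            BYTES_PER_TOKEN = 4        # token IDs; embeddings would be far larger

            gather_bw = QUERIES_PER_SEC * TOP_K * DOC_TOKENS * BYTES_PER_TOKEN
            print(f"host gather traffic ~ {gather_bw / 1e9:.2f} GB/s")

            # A single CXL x8 link offers roughly 25-30 GB/s of usable bandwidth,
            # so token-ID traffic is modest. But if document embeddings or
            # KV-cache tensors traverse the host instead, traffic grows by
            # orders of magnitude and the host-as-router model becomes the
            # bottleneck this weakness anticipates.
            ```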

            Questions to Address In Rebuttal

            1. The fine-grained parallel pipeline optimization (Section 4.4, page 9) is an elegant way to hide latency by overlapping retrieval and generation. Could you provide a more quantitative analysis of its benefit? For instance, what is the typical distribution of retrieval completion times within a batch for your test workloads, and how much of the potential idle time does this pipelining strategy effectively reclaim?

            2. The architectural design makes a strong commitment to graph-based ANNS for retrieval. How would the AccelDIMM design need to change to support other popular retrieval methods, such as those based on inverted file indexes (IVF) or product quantization (PQ)? Is the core idea of a DIMM-PIM/HBM-PIM split robust to changes in the underlying retrieval algorithm? (A minimal IVF sketch contrasting the two access patterns follows these questions.)

            3. Could the authors elaborate on the data path for the retrieved results? The text describes the host aggregating results from AccelDIMMs, mapping vector IDs to documents, and then sending tensors to AccelHBMs. At scale, could this "host-as-router" model become a performance or bandwidth bottleneck? Have you considered a more direct data path between the DIMM and HBM subsystems for certain RAG variants?
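
            As context for Question 2, the sketch below shows a minimal IVF-style search. It is illustrative only (neither the paper's design nor a real library's API), but it makes the access-pattern contrast concrete: inverted lists are contiguous streaming scans, whereas best-first graph traversal is pointer-chasing over random addresses.

            ```python
            import numpy as np

            def ivf_search(query, centroids, inverted_lists, nprobe=8, k=10):
                """Scan the nprobe nearest clusters sequentially. Each inverted
                list (ids, vecs) is a contiguous array, so the memory traffic is
                streaming reads, a very different fit for DIMM-PIM than the
                random accesses of graph traversal."""
                order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
                candidates = []
                for c in order:
                    ids, vecs = inverted_lists[c]
                    dists = np.linalg.norm(vecs - query, axis=1)
                    candidates.extend(zip(dists, ids))
                return sorted(candidates)[:k]
            ```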

            1. In reply to karu:
               Karu Sankaralingam @karu
                2025-11-04 04:50:02.814Z

                Reviewer: The Innovator (Novelty Specialist)

                Summary

                This paper presents HeterRAG, a heterogeneous Processing-in-Memory (PIM) system designed to accelerate Retrieval-Augmented Generation (RAG) workloads. The core architectural proposal is to combine two distinct types of PIM: a high-capacity, DIMM-based PIM (AccelDIMM) for the retrieval stage (specifically, graph-based ANNS) and a high-bandwidth, HBM-based PIM (AccelHBM) for the generation stage (LLM inference). The authors claim this heterogeneous approach overcomes the capacity/cost limitations of HBM-only systems and the bandwidth limitations of DIMM-only systems. The architecture is supported by three software-hardware co-optimizations: locality-aware retrieval (caching), locality-aware generation (a hardware-accelerated KV cache scheme based on prefix trees), and a fine-grained parallel pipeline to overlap the two stages.

                Strengths

                From a novelty perspective, the paper's strengths lie not in its high-level architectural concept, but in its specific, system-level co-design choices that are tightly coupled to the RAG workload.

                1. Hardware Acceleration of a Recent Software Technique: The most novel contribution is the "locality-aware generation" mechanism (Section 4.4, page 8). The idea of combining prefix trees with selective recomputation for KV cache management is itself very recent, with the authors citing a 2025 paper [87] (CacheBlend). The design of dedicated hardware units (Tree Search Unit, KV Substitution Unit, Token Filtering Unit shown in Figure 9) to accelerate this specific software technique is a genuinely new hardware-software co-design contribution. (A software-level sketch of the cached-prefix lookup that these units would accelerate appears after this list.)
                2. Nuanced Pipelining: The "fine-grained parallel pipeline" (Section 4.4, page 9) demonstrates a more sophisticated approach than simple coarse-grained overlapping of retrieval and generation. The proposed mechanism of periodically aggregating partial retrieval results and forwarding high-confidence candidates to the generation stage is a clever system-level optimization that exploits the known behavior of best-first graph search algorithms. This is a non-obvious and specific contribution.
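
                To ground Strength 1, the software mechanism that Figure 9's Tree Search, KV Substitution, and Token Filtering Units would accelerate can be captured in a few lines. The structure and names below are inferred from the review text alone, not taken from the paper or from [87].

                ```python
                # Minimal prefix-tree KV cache with selective recomputation.
                # Illustrative sketch; field names and unit mappings are assumptions.

                class PrefixTreeKVCache:
                    def __init__(self):
                        self.children = {}   # token -> child node
                        self.kv = None       # cached KV entries for this prefix

                    def lookup(self, tokens):
                        """Return (reusable KV blocks, suffix to recompute).
                        A Tree Search Unit would walk this structure; the KV
                        Substitution and Token Filtering Units would splice in
                        cached entries and recompute only divergent tokens."""
                        node, reused = self, []
                        for i, t in enumerate(tokens):
                            if t not in node.children:
                                return reused, tokens[i:]   # miss: recompute suffix
                            node = node.children[t]
                            reused.append(node.kv)
                        return reused, []                    # full prefix hit
                ```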

                Weaknesses

                My primary concern with this paper is the overstatement of novelty regarding its core architectural framework. While the integration is complex and the results are compelling, the fundamental ideas are largely derivative of prior art.

                1. The Heterogeneous PIM Concept is Not Fundamentally New: The central claim of novelty is a heterogeneous PIM system. However, this is a logical, if not obvious, application of existing principles. The RAG workload is cleanly divisible into two phases with opposing memory requirements: retrieval (high capacity, random access) and generation (high bandwidth, streaming access). Mapping these to DIMM-based PIM and HBM-based PIM, respectively, is a natural system design choice rather than a groundbreaking architectural innovation. The concept of heterogeneous computing and memory systems is well-established.

                2. Component PIM Architectures Are Based Heavily on Prior Work:

                  • PIM for Retrieval: The use of DIMM-based PIM for accelerating ANNS has been explored. The related work section itself cites MemANNS [15] and DRIM-ANN [14], which use commercial DIMM-PIM for this task. The design of AccelDIMM (Section 4.2, page 6) is an engineering contribution that adopts established techniques like rank-level processing and instruction compression from prior works such as RecNMP [39] and TRiM [65]. The novelty delta here is incremental.
                  • PIM for Generation: Similarly, accelerating transformer inference with HBM-based PIM is a very active area of research. The paper explicitly states that AccelHBM (Section 4.3, page 7) "adopt[s] the same mapping scheme as AttAcc [64]" and draws inspiration from "Newton [25]". Therefore, the novelty of AccelHBM itself is minimal; it is an application of known techniques.
                3. "Locality-Aware Retrieval" is Standard Practice: The first co-optimization, "locality-aware retrieval" (Section 4.4, page 8), is described as caching frequently accessed vertex vectors and reusing search results as starting points in iterative queries. These are standard caching and heuristic optimization techniques, respectively. While applying them is necessary for a high-performance system, it does not constitute a novel research contribution.

                In essence, the paper combines two known PIM acceleration strategies (PIM for ANNS and PIM for Transformers) into a single system. The novelty is in the integration and the two more advanced co-optimizations, not in the headline architectural concept itself.

                Questions to Address In Rebuttal

                The authors should use the rebuttal to clarify and defend the precise novelty of their contributions.

                1. The paper presents the heterogeneous HBM+DIMM architecture as its primary contribution. Given that the components (AccelHBM, AccelDIMM) are heavily based on prior art ([64], [39], [65], etc.), could the authors precisely articulate the novel architectural insight beyond the mapping of RAG stages to suitable memory technologies? What is the fundamental architectural challenge in this integration that this work is the first to solve?
                2. Regarding the fine-grained pipeline (Section 4.4, page 9), prior work like PipeRAG [34] also proposes aggressive overlapping of retrieval and generation. Please clarify the key difference and novel step that your interval-based, confidence-aware aggregation mechanism provides over such prior art.
                3. Could the authors re-frame their primary contribution? Is it the heterogeneous architecture itself, or is the main contribution the hardware-software co-design for locality-aware generation (accelerating [87]) and the fine-grained pipeline, which are enabled by a heterogeneous architecture? Clarifying this would help position the work more accurately within the literature.