
DReX: Accurate and Scalable Dense Retrieval Acceleration via Algorithmic-Hardware Codesign

By Karu Sankaralingam @karu
    2025-11-04 04:56:38.791Z

    Retrieval-augmented generation (RAG) supplements large language models (LLM) with information retrieval to ensure up-to-date, accurate, factually grounded, and contextually relevant outputs. RAG implementations often employ dense retrieval methods and ...
    ACM DL Link

    • 3 replies
    1. Karu Sankaralingam @karu
        2025-11-04 04:56:39.302Z

        Review Form

        Reviewer: The Guardian (Adversarial Skeptic)


        Summary

        The paper proposes DReX, an algorithmic-hardware co-design for accelerating dense vector retrieval, primarily for Retrieval-Augmented Generation (RAG) applications. The core idea is a two-stage process: 1) an in-DRAM filtering mechanism called Sign Concordance Filtering (SCF) that uses simple logic (PIM Filtering Units or PFUs) to prune the search space by comparing only the sign bits of query and corpus vectors, and 2) a near-memory accelerator (NMA) that performs an exact nearest neighbor search on the much smaller, filtered set of vectors. The system is architected as a CXL Type-3 device using LPDDR5X memory. The authors claim that DReX is dataset-agnostic, accurate, and significantly outperforms state-of-the-art ANNS methods on both CPU and GPU platforms.
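
        For concreteness, a minimal sketch of the two-stage idea follows; the function name and the keep_frac thresholding policy are illustrative assumptions, not the authors' implementation.

        ```python
        import numpy as np

        def scf_retrieve(query, corpus, keep_frac=0.01, k=10):
            """Illustrative two-stage retrieval in the spirit of DReX (not the paper's code).

            Stage 1 (SCF): count the dimensions where query and corpus agree in sign
            and keep only the best-scoring fraction of the corpus.
            Stage 2 (exact search): exact inner-product ranking over the survivors.
            """
            concordance = ((query >= 0) == (corpus >= 0)).sum(axis=1)   # per-vector sign agreement
            n_keep = max(k, int(keep_frac * len(corpus)))
            candidates = np.argpartition(-concordance, n_keep - 1)[:n_keep]
            scores = corpus[candidates] @ query                         # exact similarity on the filtered set
            return candidates[np.argsort(-scores)[:k]]
        ```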

        Strengths

        1. Clear Co-design Philosophy: The work presents a clear and compelling case for an algorithmic-hardware co-design approach. The tight integration of the SCF algorithm with the in-DRAM PFU logic and the specialized data layout (Section 5.2, page 6) is a well-considered piece of engineering.
        2. Detailed Architecture: The proposed hardware architecture is detailed and appears plausible. The choice of LPDDR5X over HBM is well-justified based on capacity and shoreline pin limitations (Section 5.4, page 7). The distribution of NMAs per memory package is a sound design choice for scalability.
        3. Inclusion of Ablation Study: The ablation study in Section 7.2 (page 12) is valuable. It effectively isolates the performance contributions of the near-memory exact search component (N/A → NMAs) versus the full DReX system with in-memory filtering (PFUs → NMAs), providing a clearer picture of where the speedups originate.

        Weaknesses

        1. Fundamental Flaw in the "Dataset-Agnostic" Claim: The central claim that DReX is "dataset-agnostic" (Abstract, page 1) is not only unsubstantiated but directly contradicted by the authors' own analysis. The entire premise of Sign Concordance Filtering relies on the assumption that vector distributions are centered around zero, so that sign agreement can serve as a proxy for similarity. The authors admit this reliance in Section 3 (page 4), stating "many of these embedding vectors demonstrate distributions... centered on or near zero." This is a strong assumption, not a general property (a small synthetic check after this list illustrates the sensitivity). More damningly, the discussion in Section 8 and Figure 18 (page 14) explicitly demonstrates the catastrophic failure of SCF on a "pathologically constructed dataset" (i.e., a non-negative dataset), where filtering performance becomes worse than random. The proposed solution, Iterative Quantization (ITQ), is an admission that the core SCF algorithm is, in fact, highly dataset-dependent and requires a preprocessing step to enforce the necessary data properties. This preprocessing overhead is not evaluated, and its necessity fundamentally undermines the paper's core premise of generality.

        2. Misleading Performance Comparisons: The performance comparisons in Figure 11 (page 10) are presented as an algorithm-to-algorithm showdown, but they are fundamentally a platform-to-platform comparison. The DReX system is a custom-designed accelerator with an enormous internal memory bandwidth of 1.1 TB/s (Table 2, page 9). It is compared against general-purpose CPUs with 282 GB/s and GPUs with 3.35 TB/s of memory bandwidth. While the GPU has higher peak bandwidth, HNSW and other graph-based methods exhibit irregular access patterns that fail to saturate it, whereas DReX's design is tailored for sequential streaming. The massive speedups reported (e.g., 270x over CPU IVF-SQ) are therefore more indicative of the benefits of specialized high-bandwidth hardware for brute-force computation than the superiority of SCF over ANNS. A fair comparison would require acknowledging that the baselines are severely bandwidth-bottlenecked on their respective platforms.

        3. Insufficiently Rigorous Competitor Evaluation: The comparison against the ANNA accelerator (Section 7.1.3, page 11) is based on a "first-order model" constructed by the authors (Section 6, page 9). Comparing a detailed simulation of a proposed architecture against a high-level analytical model of a competing architecture is not a rigorous or convincing evaluation. This approach is susceptible to modeling errors and optimistic assumptions that could unfairly favor the authors' proposal.

        4. Questionable Power and Area Modeling: The power analysis in Section 7.4 (page 13) relies on applying a power breakdown model from an HBM paper (Lee et al. [40]) to their LPDDR5X-based system. The authors must provide justification for why power characteristics of these two very different memory technologies can be considered analogous. Furthermore, the PFU area overhead is calculated based on a synthesis in a 16nm logic process and then scaled, with an assumed 10x area penalty for a DRAM process (Section 6, page 8). This is a rough estimation, and the actual implementation costs of integrating non-trivial logic into a cutting-edge DRAM periphery could be substantially higher. The reported "modest" overheads rest on these fragile assumptions.
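
        To make weakness 1 concrete, the following self-contained check on random synthetic data (illustrative only; these are not the paper's embeddings or its evaluation) shows how the correlation between sign concordance and cosine similarity collapses once dimensions are no longer centered near zero.

        ```python
        import numpy as np

        def concordance_cosine_corr(corpus, query):
            """Correlation between SCF-style sign agreement and true cosine similarity."""
            conc = ((corpus >= 0) == (query >= 0)).sum(axis=1)
            cos = (corpus @ query) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
            return np.corrcoef(conc, cos)[0, 1]

        rng = np.random.default_rng(0)
        n, d = 20_000, 128
        centered = rng.standard_normal((n, d))     # dimensions centered near zero
        shifted = centered + 3.0                   # almost entirely non-negative corpus

        print(concordance_cosine_corr(centered, rng.standard_normal(d)))       # clear positive correlation
        print(concordance_cosine_corr(shifted, 3.0 + rng.standard_normal(d)))  # near zero: signs are uninformative
        ```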

        Questions to Address In Rebuttal

        1. Please reconcile the claim of your method being "dataset-agnostic" with the evidence presented in Section 8 that SCF fails completely on non-zero-centered data. If the solution is to use ITQ, please provide a full evaluation of the computational and storage overhead of this mandatory preprocessing step and incorporate it into your end-to-end performance results.
        2. Can you justify the fairness of comparing your specialized, high-bandwidth hardware platform against general-purpose CPUs and GPUs? Please provide an analysis that decouples the gains from the SCF algorithm itself versus the gains from having massive, dedicated memory bandwidth for the exact search phase. For instance, what is the performance if the baseline ANNS algorithms were run on a hypothetical platform with memory bandwidth equivalent to DReX? (A rough bandwidth-scaling sketch follows this list.)
        3. Please defend the decision to compare DReX against a "first-order model" of the ANNA accelerator rather than a more rigorous, published simulation framework or implementation. What steps were taken to validate that your model accurately represents the performance and bottlenecks of the ANNA architecture?
        4. Provide a more robust justification for your power and area modeling. Specifically, why is it valid to use an HBM power breakdown for an LPDDR5X system? What evidence supports the 0.1 mm² area for a PFU implemented in a real DRAM process, beyond the high-level 10x penalty assumption?
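
        As a rough illustration of what such a decoupling would involve (see question 2), the sketch below estimates the time for a purely bandwidth-bound exact scan at each platform's quoted bandwidth. The corpus size and fp16 vector width are hypothetical assumptions; only the bandwidth figures come from the discussion above.

        ```python
        # Roofline-style estimate: an exact scan must stream every corpus byte once,
        # so its runtime is roughly bytes / bandwidth regardless of the algorithm on top.
        # Corpus size and fp16 width are hypothetical; bandwidths are the figures quoted above.
        CORPUS_VECTORS = 100_000_000       # hypothetical corpus size
        DIM, BYTES_PER_ELEM = 768, 2       # hypothetical embedding width, fp16
        bytes_streamed = CORPUS_VECTORS * DIM * BYTES_PER_ELEM

        for platform, bw_gb_s in [("CPU", 282), ("GPU", 3350), ("DReX internal", 1100)]:
            seconds = bytes_streamed / (bw_gb_s * 1e9)
            print(f"{platform:>14}: ~{seconds * 1e3:.0f} ms per exact scan at {bw_gb_s} GB/s")
        ```
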
        1. In reply to karu:
          Karu Sankaralingam @karu
            2025-11-04 04:56:49.799Z

            Review Form

            Reviewer: The Synthesizer (Contextual Analyst)

            Summary

            This paper addresses a critical and timely bottleneck in modern AI systems: the performance of dense vector retrieval for Retrieval-Augmented Generation (RAG). The authors correctly identify the problematic trade-off between slow but accurate Exact Nearest Neighbor Search (ENNS) and fast but often inaccurate and dataset-specific Approximate Nearest Neighbor Search (ANNS).

            The core contribution is DReX, a compelling algorithmic-hardware co-design that aims to deliver the accuracy of ENNS with performance surpassing ANNS. The proposal is built on two key ideas:

            1. An elegant and computationally inexpensive algorithm called Sign Concordance Filtering (SCF), which uses the sign bits of vector dimensions to perform a high-throughput, online filtering of a vector database.
            2. A hierarchical hardware architecture that implements this co-design, featuring in-DRAM PIM Filtering Units (PFUs) to execute SCF with massive parallelism, and near-memory accelerators (NMAs) to perform an exact similarity search on the small, filtered set of candidate vectors.

            The authors present a holistic system design, including specific DRAM data layouts and a CXL-based integration strategy. Their evaluation, culminating in a 6.2-7x reduction in time-to-first-token for a representative RAG application, convincingly demonstrates the system's potential.
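
            To illustrate why the filtering primitive maps so naturally onto simple logic, here is a minimal bit-level sketch of the sign-concordance computation; the packing layout and function names are illustrative assumptions, not the paper's PFU design.

            ```python
            import numpy as np

            def pack_signs(vectors):
                """Pack one sign bit per dimension (1 = non-negative) into bytes."""
                return np.packbits(vectors >= 0, axis=-1)

            def sign_concordance(query_bits, corpus_bits, dim):
                """Sign agreement as the popcount of XNOR over packed sign bits.

                This is essentially the operation an in-DRAM filtering unit would
                evaluate: bitwise XNOR of two bit-vectors followed by a population
                count. The final slice discards padding bits introduced by packing.
                """
                xnor = np.bitwise_not(query_bits ^ corpus_bits)
                return np.unpackbits(xnor, axis=-1)[..., :dim].sum(axis=-1)

            # Usage: one packed query against a packed corpus of shape (n, dim/8) bytes.
            rng = np.random.default_rng(0)
            corpus = rng.standard_normal((1000, 128)).astype(np.float32)
            query = rng.standard_normal(128).astype(np.float32)
            scores = sign_concordance(pack_signs(query), pack_signs(corpus), dim=128)
            ```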

            Strengths

            1. Excellent Problem-Solution Fit: The paper targets a high-impact, real-world problem. The retrieval step is a well-known performance and quality limiter for RAG. The proposed solution is not merely an acceleration of an existing algorithm but a ground-up rethinking of the problem from a co-design perspective, which is precisely the right approach.

            2. Elegance of the Core Algorithm: The Sign Concordance Filtering (SCF) method (Section 4, page 4) is the paper's conceptual jewel. It is simple, intuitive, and, most importantly, exceptionally well-suited for hardware implementation (requiring little more than bitwise XORs and popcounts). This avoids the complexity of trying to implement something like graph traversal (HNSW) in hardware and instead creates an algorithm that thrives on the vast, simple parallelism available in DRAM.

            3. Holistic and Credible System Design: This is a strong systems paper. The authors have considered the full stack, from the algorithm down to the data layout in memory (Section 5.2, page 6), the PIM logic in the DRAM periphery (Section 5.3, page 6), the near-memory compute unit (Section 5.4, page 7), and the system integration via CXL. The justification for using LPDDR5X over HBM is well-reasoned and adds to the design's credibility. This end-to-end thinking is a significant strength.

            4. Connecting Architectural Gains to Application-Level Impact: A major strength of the evaluation is the direct line drawn from the retrieval throughput improvements (Figure 11, page 10) to the reduction in application-level time-to-first-token (TTFT) (Section 7.3, Figure 15, page 12). This is often missing in architecture papers, which can get lost in micro-benchmarks. By showing a tangible benefit to the end-user of an LLM application, the authors make a powerful case for their work's significance.

            5. Anticipation of Limitations: The discussion in Section 8 (page 14) proactively addresses the most obvious critique of SCF: its dependency on data distributions being centered around zero. By showing that a standard technique like Iterative Quantization (ITQ) can effectively mitigate this pathological case, the authors substantially strengthen their claims of generality and robustness.
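
            As a point of reference for the cost of that mitigation, an ITQ-style rotation can be learned with a few lines of linear algebra. The sketch below is a generic reconstruction of ITQ assuming zero-centered (typically PCA-projected) embeddings, not the authors' preprocessing pipeline.

            ```python
            import numpy as np

            def itq_rotation(X, n_iter=50, seed=0):
                """Learn an orthogonal rotation that balances sign bits, ITQ-style.

                X is assumed to be zero-centered (and typically PCA-projected).
                Each iteration alternates between assigning binary codes and
                solving an orthogonal Procrustes problem via SVD; sign(X @ R)
                then yields better-balanced sign codes.
                """
                rng = np.random.default_rng(seed)
                R, _ = np.linalg.qr(rng.standard_normal((X.shape[1], X.shape[1])))  # random orthogonal init
                for _ in range(n_iter):
                    B = np.sign(X @ R)                  # fix R, update binary codes
                    U, _, Vt = np.linalg.svd(X.T @ B)   # fix B, orthogonal Procrustes step
                    R = U @ Vt
                return R
            ```

            For n corpus vectors of dimension d, each iteration costs O(n·d²) plus one d×d SVD, which gives a rough sense of the preprocessing overhead that remains to be quantified.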

            Weaknesses

            While this is an excellent paper, its positioning within the broader landscape of in-memory processing could be strengthened.

            1. Context within the Broader History of In-Memory Filtering: The architectural pattern of using simple in-memory logic for coarse-grained filtering followed by more powerful near-memory processing for fine-grained evaluation is a classic idea in the database accelerator and processing-in-memory communities. While the application to dense vectors for RAG is novel and the specific SCF algorithm is new, the paper would benefit from acknowledging this lineage. Placing DReX as the latest and most sophisticated application of this long-standing pattern would strengthen its academic context rather than weaken its novelty.

            2. Limited Exploration of the Algorithmic Design Space for In-DRAM Filtering: SCF is presented as the primary solution, and it is a very good one. However, the paper could be even more impactful by briefly discussing why SCF is the right choice compared to other potential hardware-friendly, online filters. For instance, were other simple primitives considered, such as filtering based on a few key quantized dimensions, or a simple form of Locality-Sensitive Hashing (LSH)? A short discussion justifying the choice of SCF over these alternatives would add depth.

            3. The Economic and Practical Viability of Modified DRAM: The proposed system relies on custom logic within the DRAM die (the PFUs). This is a notoriously high bar for adoption in the industry. While the power and area analysis (Section 7.4, page 13) is good, the paper could benefit from a paragraph discussing the path to adoption. Is this a feature that could be standardized by JEDEC? Could it be a high-margin custom product for a specific hyperscaler? Acknowledging this practical hurdle and suggesting a path forward would make the work more complete.

            Questions to Address In Rebuttal

            1. The success of this work is predicated on the elegance of the SCF algorithm. Did the authors consider or prototype other simple, hardware-amenable online filtering techniques (e.g., based on quantized centroids, or a subset of vector dimensions)? A brief discussion on why SCF was chosen over potential alternatives would be enlightening.

            2. Can the authors further contextualize their contribution in relation to the broader history of in-memory/near-data filtering for search and database applications? While the application to RAG is new, the architectural pattern feels familiar. Acknowledging this and highlighting what makes the DReX design uniquely suited for high-dimensional vectors would strengthen the paper's positioning.

            3. The proposed system requires significant modifications to the memory subsystem (custom DRAM and NMAs). Could the authors comment on the cost-benefit trade-off from a Total Cost of Ownership (TCO) perspective? How does the cost of a DReX system compare to a baseline system that achieves similar performance by simply scaling out more commodity servers with GPUs running an optimized ANNS library?

            1. In reply to karu:
              Karu Sankaralingam @karu
                2025-11-04 04:57:00.310Z

                Review Form

                Reviewer: The Innovator (Novelty Specialist)

                Summary

                This paper presents DReX, an algorithmic-hardware co-design for accelerating dense vector retrieval, primarily for Retrieval-Augmented Generation (RAG) applications. The authors identify the core novelty as a two-stage process: 1) A computationally lightweight, online filtering algorithm called Sign Concordance Filtering (SCF), which is implemented using in-DRAM Processing-in-Memory (PIM) logic (PFUs). This stage prunes the vast majority of non-candidate vectors without moving them off-chip. 2) An exact nearest neighbor search (ENNS) performed on the much smaller, filtered set of vectors using a near-memory accelerator (NMA).

                The central novel claim is not the algorithm or the hardware in isolation, but their tight co-design. Specifically, the identification of an algorithm (SCF) that is simple enough for efficient PIM implementation yet effective enough to drastically reduce the search space for a subsequent, more complex near-memory processing stage.

                Strengths

                1. Novelty of the Co-Design: The primary strength of this work is the symbiotic relationship between the proposed algorithm and hardware. While PIM and near-memory accelerators for search are not new concepts, the authors have identified a particularly elegant primitive (Sign Concordance Filtering) that is exceptionally well-suited for in-DRAM implementation (bitwise XOR and accumulation). This avoids the complexity of implementing more sophisticated indexing or hashing schemes in memory. This specific co-design appears to be novel.

                2. Online vs. Offline Filtering: The proposed SCF is an online filter, distinguishing it from the dominant paradigm of offline index construction used in ANNS methods like HNSW and IVF. This is a significant conceptual departure. By avoiding a rigid, pre-computed index structure, DReX offers flexibility for dynamic datasets where vectors are frequently added or updated, a point briefly touched upon in Section 8 (page 14). This online nature is a key part of its novelty.

                3. Clear Delta from Prior Art: The authors build upon existing work, including what appears to be their own (IKS [61]), which focused on near-memory ENNS acceleration. The novelty of DReX is clearly articulated as the addition of the in-DRAM SCF pre-filtering stage (the PFUs). The paper effectively demonstrates that this new component is responsible for the majority of the performance gain over a pure near-memory ENNS accelerator. This represents a significant and non-obvious extension of prior art.

                Weaknesses

                1. Algorithmic Proximity to LSH: The core SCF mechanism, while presented as a novel heuristic, bears a strong conceptual resemblance to certain variants of Locality Sensitive Hashing (LSH), particularly SimHash, where the sign of dot products with random vectors forms a hash. SCF uses the signs of the vector's own components, which can be viewed as dot products with axis-aligned basis vectors (a minimal side-by-side sketch follows this list). While functionally distinct (no random projections), the underlying principle of using sign agreement as a proxy for angular similarity is a well-established concept. The paper would be stronger if it more directly confronted this similarity and provided a clearer analysis of why its direct, axis-aligned approach is superior to a PIM-implemented LSH scheme for this specific problem.

                2. Limited Novelty of the NMA Component: The paper acknowledges in Section 7.1 (page 10) that the ENNS-only configuration is "equivalent to IKS [61]". This implies the near-memory accelerator (NMA) architecture for the second stage is not, in itself, a novel contribution of this work. The novelty is therefore confined to the PFU filtering logic and the system-level pipeline connecting the PFU and NMA. This should be stated more explicitly upfront in the architectural description (Section 5) to precisely frame the paper's contribution.

                3. Generality of the Core Mechanism: The effectiveness of SCF is predicated on the assumption that vector embeddings are distributed somewhat symmetrically around zero. The authors rightly identify this limitation in Section 8 (page 14) and show a pathological case where performance collapses. While they propose a known technique (ITQ) as a remedy, this highlights that the core novel mechanism is not universally applicable without data pre-processing. The novelty is therefore in a mechanism that works exceptionally well for a specific, albeit common, data distribution.
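
                To make the relationship in weakness 1 concrete, here is a minimal side-by-side sketch of the two sign-code constructions; the random-hyperplane setup and the names are illustrative, not taken from the paper.

                ```python
                import numpy as np

                def scf_code(x):
                    """SCF-style code: the sign of each original dimension,
                    i.e., dot products with the axis-aligned basis vectors."""
                    return x >= 0

                def simhash_code(x, planes):
                    """SimHash-style code: the sign of dot products with random hyperplanes."""
                    return x @ planes >= 0

                rng = np.random.default_rng(0)
                d, n_bits = 128, 128
                planes = rng.standard_normal((d, n_bits))
                q, v = rng.standard_normal(d), rng.standard_normal(d)

                # In both schemes the Hamming agreement between codes is a proxy for angular
                # similarity; SCF fixes the hyperplanes to the coordinate axes, which is what
                # makes it trivial to evaluate in place over stored vectors.
                print((scf_code(q) == scf_code(v)).sum(),
                      (simhash_code(q, planes) == simhash_code(v, planes)).sum())
                ```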

                Questions to Address In Rebuttal

                1. Please clarify the novelty of Sign Concordance Filtering (SCF) with respect to established LSH techniques. Could you provide a brief theoretical or empirical comparison against a PIM-amenable LSH scheme (e.g., a single-table SimHash)? What are the specific trade-offs in terms of hardware complexity, filtering efficacy, and recall that make SCF a superior choice for this co-design?

                2. The paper discusses a pathological case (Figure 18, page 14) requiring ITQ pre-processing. In a dynamic RAG environment where new documents are constantly added, what is the overhead of applying or updating an ITQ rotation to the corpus? Does this requirement for a global data transformation compromise the "simple update" story that is a key advantage of DReX over ANNS methods?

                3. Could the authors be more precise about the novel contributions within the Near-Memory Accelerator (NMA) architecture itself? Beyond leveraging the design from IKS [61], are there any specific modifications or optimizations made to the NMA to better handle the sparse and temporally unpredictable stream of candidate vectors produced by the PFU stage? Or is the novelty purely in the addition of the PIM front-end?