
ANSMET: Approximate Nearest Neighbor Search with Near-Memory Processing and Hybrid Early Termination

By Karu Sankaralingam @karu
    2025-11-04 04:56:06.795Z

    Approximate nearest neighbor search (ANNS) is a fundamental operation in modern vector databases to efficiently retrieve nearby vectors to a given query. On general-purpose computing platforms, ANNS is found not only to be highly memory-bound due to the ...

    ACM DL Link

    • 3 replies
  1. Karu Sankaralingam @karu
        2025-11-04 04:56:07.324Z

        Excellent. Here is a peer review of the paper from the perspective of "The Guardian."


        Review Form

        Reviewer: The Guardian (Adversarial Skeptic)

        Summary

        The authors present ANSMET, a DIMM-based near-data processing (NDP) system for accelerating Approximate Nearest Neighbor Search (ANNS). The core proposal is a "hybrid partial-dimension/bit early termination" (ET) algorithm, which aims to reduce memory traffic by calculating a lower-bound distance on partially fetched vectors and terminating further accesses if this bound exceeds a threshold. This ET scheme is supported by a sampling-based offline process to optimize the data layout for fetch efficiency. While the paper addresses a relevant problem, the central claims rest on a heuristic methodology whose robustness is not sufficiently established. Furthermore, the headline claim of "no accuracy loss" is conditional and potentially misleading, obscuring significant performance and storage overheads required to maintain it.
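
        For concreteness, here is a minimal sketch of the partial-fetch lower bound as this review understands it: because every squared-difference term is non-negative, the running sum over the dimensions fetched so far is already a valid lower bound on the full distance, so a candidate can be rejected as soon as that sum crosses the threshold. The chunk size, threshold, and function names below are illustrative, not the authors' implementation.

```python
# Minimal sketch of partial-dimension early termination for squared L2
# distance. Chunk size, threshold, and names are illustrative only.
import numpy as np

def partial_l2_with_et(query, vector, threshold, chunk_dims=16):
    """Accumulate squared L2 distance chunk by chunk; stop as soon as the
    running partial sum (a valid lower bound, since every term is >= 0)
    exceeds the rejection threshold."""
    partial = 0.0
    for start in range(0, len(query), chunk_dims):
        diff = query[start:start + chunk_dims] - vector[start:start + chunk_dims]
        partial += float(np.dot(diff, diff))
        if partial > threshold:
            return None, start + chunk_dims   # rejected; dimensions actually touched
    return partial, len(query)                # survived; exact distance

# Example: a far-away candidate is rejected after the first chunk.
rng = np.random.default_rng(0)
q = rng.standard_normal(128).astype(np.float32)
far = q + 10.0
print(partial_l2_with_et(q, far, threshold=50.0))   # (None, 16)
```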

        Strengths

        1. The paper correctly identifies that a large fraction of distance computations in ANNS are for "rejected" vectors, providing a clear motivation for an early termination strategy.
        2. The evaluation is conducted across a reasonable set of seven diverse, large-scale datasets, which provides breadth to the experimental results.
        3. The proposed hardware architecture is a pragmatic extension of existing DIMM-based NDP concepts, and the breakdown of system components is clearly described.

        Weaknesses

        1. The "No Accuracy Loss" Claim is Misleading. The central premise of the early termination algorithm is that it guarantees no accuracy loss. However, a deep reading of Section 4.2 ("Offline common prefix elimination") and Table 5 reveals this is not strictly true under all configurations.

          • To achieve no accuracy loss for outlier vectors (those not matching the common prefix), the system must store the original, non-compressed vector separately and perform a "re-check." This incurs both storage overhead and, more critically, additional memory accesses (Table 5a shows up to 1.4% extra accesses for the default 0.1% outlier threshold). This performance penalty is not adequately factored into the main performance results and undermines the efficiency claims.
          • The alternative, which is implied to be more efficient, involves dropping the least significant bits of outlier elements (Figure 4c), which by the authors' own admission leads to significant accuracy loss (Table 5b shows a catastrophic -34.7% accuracy drop). The paper cannot simultaneously claim "no accuracy loss" as a general feature while also presenting and depending on configurations that either lose accuracy or incur non-trivial overhead to preserve it.
        2. The Sampling-Based Data Layout Optimization is Heuristic and Lacks Rigor. The entire data layout, which is critical to the performance of the hybrid ET, is determined by an offline sampling of just 100 vectors from billion-scale datasets (Section 4.2). This methodology is highly suspect.

          • The choice of parameters (nc, nF, TC) is entirely dependent on this minuscule sample being perfectly representative of both the global data distribution and, implicitly, the future query distribution. There is no analysis to demonstrate that this holds. What is the performance degradation if the sample is not representative?
          • The use of the 90th percentile distance from the sample as the reference threshold is arbitrary. Figure 11b shows this percentile yields the lowest KL divergence for the DEEP dataset, but there is no theoretical justification or evidence that this specific value is optimal or even suitable for other datasets. The methodology appears to be curve-fitting on a small sample; a small sensitivity probe of the kind sketched after this list would help settle the point.
        3. The Absolute Improvement in Fetch Utilization is Underwhelming. Figure 10 is presented as a success, showing that ANSMET (NDP-ETOpt) improves effectual data fetch utilization to 11.1% from a baseline of 6.0%. While this is a relative improvement, an 11.1% utilization rate is still exceptionally low. This suggests that even with the proposed complex optimizations, nearly 90% of all data fetched from DRAM remains useless. This fundamentally questions the overall efficacy of the approach; it makes an extremely inefficient process slightly less inefficient, rather than solving the core problem.

        4. Baseline Comparisons May Not Be Robust. The paper compares against NDP-DimET as its representation of prior art. However, it is not clear if this baseline is an optimized implementation of dimension-level ET or a strawman. State-of-the-art dimension-level ET techniques can be more sophisticated than what is implied here. Without a more detailed specification of this baseline, the claimed 1.52x speedup of the hybrid ET is not verifiable.
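
        To make the sampling concern in weakness 2 concrete, here is a minimal sketch of the kind of sensitivity probe requested below: draw a small sample, compute its pairwise distances, and compare candidate percentile thresholds. The sample size, percentile set, and synthetic data are placeholders chosen for this review, not the paper's methodology.

```python
# Sketch of a percentile-threshold sensitivity probe on a small sample.
# Sample size, percentiles, and data are illustrative placeholders.
import numpy as np

def percentile_thresholds(dataset, sample_size=100, percentiles=(70, 80, 90, 95), seed=0):
    """Sample `sample_size` vectors and report each candidate percentile of
    their pairwise squared L2 distances as a possible reference threshold."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(dataset), size=sample_size, replace=False)
    sample = dataset[idx]
    diffs = sample[:, None, :] - sample[None, :, :]
    d2 = np.einsum('ijk,ijk->ij', diffs, diffs)
    d2 = d2[np.triu_indices(sample_size, k=1)]       # unique pairs only
    return {p: float(np.percentile(d2, p)) for p in percentiles}

# Example: see how far apart the 70th/80th/90th/95th percentile thresholds land.
data = np.random.default_rng(1).standard_normal((10000, 96)).astype(np.float32)
print(percentile_thresholds(data))
```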

        Questions to Address In Rebuttal

        1. Please clarify the "no accuracy loss" claim. For the mode that guarantees accuracy, provide a rigorous analysis of the performance and storage overhead of the re-check mechanism across all datasets, not just SPACEV. The main results in Figure 6 should be updated to reflect this overhead.

        2. Justify the sampling methodology. Provide a sensitivity analysis showing how the final system performance (QPS) changes with (a) the number of vectors in the sample (e.g., 10, 100, 1000, 10000) and (b) the chosen distance threshold percentile (e.g., 70%, 80%, 90%, 95%). This is critical to prove the method is robust and not just a fragile heuristic.

        3. Provide a more detailed specification of the NDP-DimET baseline. What specific algorithm from prior work ([25, 69, 86]) does it implement? How was it optimized to ensure a fair comparison against your proposed hybrid technique?

        4. The final fetch utilization rate is still very low (~11%). Can the authors comment on the fundamental reasons for this remaining inefficiency? Does this suggest a ceiling on the effectiveness of any early termination strategy that operates at this level of granularity?

        1. Karu Sankaralingam @karu
          In reply to karu:
            2025-11-04 04:56:17.799Z

            Of course. Here is a peer review of the paper from the perspective of "The Synthesizer."


            Review Form

            Reviewer: The Synthesizer (Contextual Analyst)

            Summary

            This paper presents ANSMET, a hardware-software co-design that accelerates Approximate Nearest Neighbor Search (ANNS) by tackling its primary bottlenecks: memory bandwidth and wasted computation. The core contribution is the synthesis of two key ideas: (1) a practical DIMM-based Near-Data Processing (NDP) architecture to offload distance calculations, and (2) a novel, lossless "hybrid early termination" algorithm. This algorithm computes a conservative lower bound of the distance to a query vector by incrementally fetching and processing partial data—either a subset of dimensions or, more uniquely, a subset of the most significant bits within each dimension. This enables the system to terminate work on unpromising vectors much earlier, often before the full data has even been fetched from DRAM. The authors support this co-design with a systematic, sampling-based methodology for optimizing the in-memory data layout to maximize the effectiveness of early termination. The work is evaluated in a cycle-accurate simulation environment, demonstrating significant speedups on billion-scale datasets.
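
            To make the bit-level half of the hybrid concrete, the sketch below shows one way a most-significant-bit prefix of each uint8 element can yield a conservative lower bound: the unfetched low bits only constrain the element to an interval, so each dimension contributes the smallest squared difference consistent with that interval. Bit widths and names are this reviewer's illustration, not the paper's code.

```python
# Sketch of an MSB-prefix lower bound on squared L2 distance over uint8
# vectors. Bit widths and names are illustrative, not the authors' design.
import numpy as np

def msb_lower_bound(query, elem, kept_bits, total_bits=8):
    """Lower-bound the squared L2 distance to `elem`, pretending only the
    top `kept_bits` bits of each uint8 element have been fetched."""
    dropped = total_bits - kept_bits
    lo = ((elem >> dropped) << dropped).astype(np.float32)  # interval floor implied by the prefix
    hi = lo + (1 << dropped) - 1                            # interval ceiling
    q = query.astype(np.float32)
    below = np.maximum(lo - q, 0.0)    # query sits below the interval
    above = np.maximum(q - hi, 0.0)    # query sits above the interval
    gap = np.maximum(below, above)     # zero when the query falls inside the interval
    return float(np.sum(gap * gap))

# Example: the bound never exceeds the true distance and tightens as more bits arrive.
rng = np.random.default_rng(2)
q = rng.integers(0, 256, 128, dtype=np.uint8)
x = rng.integers(0, 256, 128, dtype=np.uint8)
true_d2 = float(np.sum((q.astype(np.float32) - x.astype(np.float32)) ** 2))
for bits in (2, 4, 6, 8):
    print(bits, msb_lower_bound(q, x, bits), true_d2)
```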

            Strengths

            The true strength of this paper lies in its holistic, problem-driven approach and its excellent positioning within the broader landscape of systems and algorithm research.

            1. Elegant Problem-Solution Coupling: The authors correctly identify the fundamental inefficiency in ANNS workloads: a large fraction of memory accesses and computations are for vectors that are ultimately rejected (as shown clearly in Figure 1, page 3). Their proposed hybrid early termination is not just a clever trick; it is an elegant solution directly tailored to this specific problem. By stopping data movement and computation at the earliest possible moment, it addresses the inefficiency at its root.

            2. A Novel Synthesis of Algorithmic Primitives: The concept of early termination is not new, nor is the idea of bit-serial processing. However, the authors' contribution is to create a powerful synthesis. They combine dimension-level termination (a known technique) with a novel bit-level termination, creating a hybrid approach. More importantly, they build the necessary system infrastructure around it, particularly the data layout optimization (Section 4.2, page 5), which transforms the raw algorithm into a practical systems solution. The use of prefix entropy to guide a dual-granularity fetch strategy is particularly insightful.

            3. Pragmatic Hardware Choices and Contextual Awareness: The choice of a DIMM-based NDP architecture, as opposed to a more exotic 3D-stacked memory approach, is a sign of mature systems thinking. The authors understand that billion-scale vector databases require capacity above all else, a domain where DIMMs excel. They correctly place their work in conversation with the rich body of literature on DIMM-based NDP for recommendation systems (e.g., RecNMP, TensorDIMM), noting the architectural similarities but highlighting the unique algorithmic opportunities that ANNS presents.

            4. Connecting Algorithm Theory to Hardware Reality: This work serves as an excellent case study in how abstract algorithms can be re-imagined in light of concrete hardware characteristics. The 64-byte granularity of a DRAM access is no longer just a parameter to be tolerated, but a constraint to be exploited. The entire data layout optimization scheme is built around making every 64-byte transfer as informative as possible for the early termination heuristic. This is a wonderful example of true hardware-software co-design.
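
            As a back-of-the-envelope illustration of that 64-byte constraint, the snippet below counts how many elements' bit prefixes a single DRAM burst can carry at different kept-bit widths, i.e., how many dimensions one fetch can contribute to the lower bound. This is purely illustrative arithmetic, not the paper's layout.

```python
# Illustrative arithmetic only: elements per 64-byte DRAM burst at different
# kept-bit widths (ignores any per-burst metadata the real layout may need).
BURST_BYTES = 64

def elements_per_burst(kept_bits):
    """How many `kept_bits`-bit element prefixes fit in one 64 B burst."""
    return (BURST_BYTES * 8) // kept_bits

for kept_bits in (2, 4, 8, 16, 32):
    print(f"{kept_bits:2d}-bit prefixes -> {elements_per_burst(kept_bits):3d} elements per 64 B fetch")
```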

            Weaknesses

            The weaknesses identified are less about fundamental flaws and more about the scope and positioning of the work, which could be expanded to clarify its broader applicability.

            1. Focus on a Single Indexing Family: The evaluation and discussion are heavily centered on HNSW, a graph-based index. While HNSW is state-of-the-art and highly representative, ANNS is a diverse field. It would strengthen the paper to discuss in more detail how the ANSMET approach would apply to other major index families, such as inverted file (IVF) based methods. The access patterns are different (more sequential scans within inverted lists), which might change the dynamics of early termination and load balancing across NDP units.

            2. Implications for Dynamic Databases: The proposed data layout transformation is an offline preprocessing step. This is perfectly reasonable for static datasets. However, a key feature of modern vector databases is support for incremental insertions and updates. A discussion on the practicality and overhead of applying these transformations in a dynamic environment would be valuable. Does a new vector need to be fully transformed before it's searchable? How does this impact ingestion latency?

            3. Under-explored Interaction with Vector Quantization: The paper briefly mentions compatibility with vector quantization (Section 4.3, page 7), but this interaction is critical. Product Quantization (PQ) and its variants are the industry standard for reducing the memory footprint of billion-scale indexes. Quantization fundamentally alters the bit-level representation and error profile of the data. It is not immediately clear how much benefit the bit-level termination provides on top of data that has already been compressed to 8-bit integers. A more detailed analysis or evaluation of this interplay would significantly increase the paper's practical impact.

            Questions to Address In Rebuttal

            1. Beyond HNSW, could the authors elaborate on the expected performance and potential challenges of applying ANSMET to an IVF-based index? Specifically, how would the more structured memory access patterns of scanning an inverted list interact with the proposed hybrid partitioning and adaptive polling schemes?

            2. Regarding the offline data layout optimization: What is the authors' vision for integrating this approach into a dynamic vector database that handles frequent insertions? Would the overhead of the bit-level transformations on newly inserted vectors negate some of the search-time benefits?

            3. Could the authors provide more intuition or preliminary data on the effectiveness of hybrid early termination when applied to vectors that have already undergone Product Quantization (PQ)? Does the common prefix elimination technique still yield significant benefits on the quantized codes, or does the benefit shift primarily to the partial dimension aspect of the technique?

            4. The current design cleanly separates index traversal (CPU) from distance comparison (NDP). This is a sensible division of labor. Looking forward, do the authors see any potential in offloading parts of the index traversal itself, such as fetching the neighbor lists in HNSW, to the NDP units to further reduce CPU-memory traffic?

            1. Karu Sankaralingam @karu
              In reply to karu:
                2025-11-04 04:56:28.315Z

                Of course. Here is the peer review from the perspective of "The Innovator."


                Review Form

                Reviewer: The Innovator (Novelty Specialist)

                Summary

                The paper, "ANSMET," presents a hardware-software co-design to accelerate Approximate Nearest Neighbor Search (ANNS). The authors propose integrating a DIMM-based Near-Data Processing (NDP) architecture with a novel early termination (ET) strategy. The core claim of novelty resides in this ET strategy, which is a hybrid of partial-dimension and partial-bit evaluation. By fetching and processing a vector piece-by-piece (in terms of both dimensions and the bits within each dimension’s representation), the system computes a conservative lower bound on the distance. If this bound exceeds the current threshold, subsequent memory accesses and computations for that vector are terminated, saving bandwidth and compute cycles without accuracy loss. This core idea is supported by a sampling-based offline method to optimize data layouts for the ET strategy and a heterogeneous execution model where the host CPU handles index traversal and the NDP units handle distance calculations.

                Strengths

                1. Genuinely Novel Algorithmic Contribution: The central contribution—the hybrid partial-dimension/bit early termination—is a novel concept in the ANNS acceleration space. Prior work on ET has largely focused on either the vector level [52] or the dimension level [25, 69]. While bit-level ET has been explored in BitNN [32] for low-dimensional point clouds, that work was confined to bit-serial processing. ANSMET proposes a more generalized and flexible framework that combines dimension- and bit-level termination and, crucially, operates on variable-sized chunks of bits rather than serially. The "dual-granularity fetch" mechanism detailed in Section 4.2 (page 6) is a direct and novel outcome of this core idea.

                2. Strong Co-Design Principle: The paper successfully demonstrates a tight coupling between the novel algorithm and the system design. The algorithmic need for non-contiguous, prioritized data access (e.g., most significant bits first) directly motivates the non-trivial data re-layout scheme. The proposed sampling-based offline optimization to determine the fetch granularity (nc, nF) and common prefix length is a clever and novel engineering solution that makes the core algorithmic idea practical. This full-stack thinking, from algorithm to data layout to hardware, is a significant strength.

                3. Novel Application of Existing Concepts: The use of "offline common prefix elimination" (Section 4.2, page 6) is a creative application of a well-known compression technique to the problem of ANNS. Identifying and omitting low-entropy prefixes in vector elements to reduce data movement is a novel approach in this specific context.
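
                To illustrate the idea, the toy sketch below scans for the longest leading-bit prefix shared by (nearly) all elements and reports the outliers that would need separate handling; the outlier treatment and exact layout here are guesses for illustration, not the authors' scheme.

```python
# Toy illustration of common prefix elimination: find the longest leading-bit
# prefix shared by at least (1 - outlier_frac) of the uint8 elements.
# Outlier handling and layout details are illustrative guesses.
import numpy as np

def common_prefix_bits(values, total_bits=8, outlier_frac=0.001):
    """Return the longest shared leading-bit prefix length and the indices of
    the elements that do not match it (the outliers)."""
    best_bits, outliers = 0, np.array([], dtype=np.int64)
    for bits in range(1, total_bits + 1):
        prefixes = (values >> (total_bits - bits)).astype(np.int64)
        majority = np.bincount(prefixes).argmax()
        mismatch = np.flatnonzero(prefixes != majority)
        if len(mismatch) > outlier_frac * len(values):
            break
        best_bits, outliers = bits, mismatch
    return best_bits, outliers

# Example: values concentrated in [96, 128) all share the 3-bit prefix 011.
rng = np.random.default_rng(3)
vals = rng.integers(96, 128, 100000, dtype=np.uint8)
bits, out = common_prefix_bits(vals)
print(bits, len(out))   # 3 shared bits, no outliers in this toy case
```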

                Weaknesses

                1. Architectural Foundation is Not Fundamentally New: While the paper claims to be the first DIMM-based NDP system for ANNS, the underlying hardware architecture is an adaptation of prior art. DIMM-based NDP systems with buffer-chip logic for accelerating high-dimensional vector operations are well-established in the context of recommendation systems [42, 47, 82]. The hardware described in Section 5 (page 7), including the QSHRs and parallel compute units, is conceptually very similar to these previous designs. The novelty is therefore not in the base architecture, but in its specific tailoring and co-design for the proposed ET algorithm. The paper should be more explicit about this distinction.

                2. Complexity Introduced by Offline Analysis: The entire system's effectiveness hinges on a complex offline preprocessing and data transformation stage. This stage requires statistical analysis of a data sample (e.g., prefix entropy, early termination frequency as shown in Figure 3, page 5) to determine optimal layout parameters. While the authors argue the cost is amortized, this dependency introduces a significant novelty-vs-practicality tradeoff. For highly dynamic vector databases with frequent insertions and updates, the novelty of this static optimization may become a practical bottleneck, as the learned layout could become stale.

                3. The "Delta" Over Prior Art May Be Data-Dependent: The performance gains from the most novel components (dual-granularity fetch and prefix elimination) appear to be highly dependent on the statistical properties of the dataset. For instance, if a dataset's vector elements have high-entropy prefixes, the benefit of common prefix elimination would be negligible. Similarly, if the early termination probability is not concentrated in a specific bit-range, the dual-granularity fetch may offer little advantage over a simpler, uniform-chunk approach. The novelty is sound, but its significance might not be universal across all possible high-dimensional vector distributions.

                Questions to Address In Rebuttal

                1. Regarding the hardware architecture, can the authors precisely delineate what aspects of their NDP unit design (Figure 5, page 7) are fundamentally novel, beyond the parameterization required for the hybrid ET algorithm? How does the required logic differ in principle from what was proposed in NDP systems for recommendation models like RecNMP [42] or TensorDIMM [47]?

                2. The novelty of the system relies heavily on offline analysis and a static data layout. How does the proposed ANSMET framework handle dynamic vector databases where new vectors are continuously inserted? Would this require periodic, system-wide re-analysis and data re-shuffling, potentially negating the online performance benefits?

                3. The core idea of prioritizing more significant bits is intuitive. However, could the authors comment on the robustness of their sampling-based parameter tuning? What is the performance degradation if the sampling set is not representative of the true query distribution, leading to suboptimal choices for the prefix length or the coarse/fine-grained bit steps (nc, nF)? This would help assess the true "delta" of this novel contribution in real-world, unpredictable scenarios.