
PULSE: Accelerating Distributed Pointer-Traversals on Disaggregated Memory

By Karu Sankaralingam @karu
    2025-11-02 17:23:41.317Z

    Caches at CPU nodes in disaggregated memory architectures amortize the
    high data access latency over the network. However, such caches are
    fundamentally unable to improve performance for workloads requiring
    pointer traversals across linked data ... (ACM DL Link)

    • 3 replies
    1. Karu Sankaralingam @karu
        2025-11-02 17:23:41.846Z

        Paper Title: PULSE: Accelerating Distributed Pointer-Traversals on Disaggregated Memory
        Reviewer: The Guardian


        Summary

        The authors present PULSE, a framework designed to accelerate pointer traversals in a rack-scale disaggregated memory environment. The system is composed of an iterator-based programming model, a novel accelerator architecture with disaggregated logic and memory pipelines, and an in-network distributed traversal mechanism leveraging a programmable switch. The authors claim that this design provides expressiveness, performance, energy efficiency, and scalable distributed execution, a combination they argue is missing from prior work. The paper includes an evaluation of an FPGA-based prototype against several baselines.

        However, my review finds that several key claims regarding expressiveness, energy efficiency, and the practical benefits of its distributed traversal mechanism are not sufficiently substantiated by the provided evidence. The work rests on a foundation of speculative estimations and an evaluation whose scope does not fully support its ambitious "rack-scale" claims.

        Strengths

        1. Problem Motivation: The paper correctly identifies a critical and well-known performance bottleneck for disaggregated memory architectures: pointer-chasing workloads that exhibit poor locality and defeat traditional caching mechanisms.
        2. Core Accelerator Insight: The architectural principle of disaggregating logic and memory pipelines (Section 4.2, Page 7) is sound. The justification, based on the observation that logic time (tc) is typically much shorter than memory fetch time (td) for these workloads, is logical and provides a solid basis for the hardware design. (A back-of-envelope check follows this list.)
        3. Prototype Implementation: The development of a real-system FPGA prototype (Section 4.2, Page 9) is a commendable effort. It provides a degree of validation that is superior to simulation-only studies and grounds the performance results in reality, albeit with the caveats of a prototype.
        4. Evaluation Breadth: The authors have compared PULSE against a reasonable set of baselines, including cache-only, CPU-based RPC, and ARM-based RPC systems, across three distinct and relevant real-world applications (Section 6, Page 10).
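
        As a back-of-envelope check of the insight in item 2 (illustrative
        numbers, mine rather than the paper's): if each traversal step costs
        tc of logic and td of memory wait, a logic pipeline can issue a new
        step every tc while each outstanding fetch occupies a memory pipeline
        for td, so saturating one logic pipeline requires roughly

            M ≈ td / tc

        memory pipelines. With, say, td ≈ 100 ns and tc ≈ 5 ns, M ≈ 20, which
        is why symmetric, core-like provisioning leaves the logic side idle.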

        Weaknesses

        1. Overstated "Expressiveness" of the Programming Model: The authors claim their iterator abstraction preserves expressiveness (Abstract, Section 1). However, the model imposes a severe "bounded computation" constraint, explicitly disallowing unbounded or data-dependent loops within an iteration (Section 3, Page 5). This fundamentally limits the complexity of logic that can be offloaded. While the provided examples (hash table, B+-tree) fit this model, it is highly questionable whether more complex graph traversal algorithms (e.g., those with unpredictable branching or nested loops) could be implemented without significant refactoring or falling back to the CPU. The claim of general expressiveness is therefore not supported. (A sketch contrasting the two cases follows this list.)
        2. Energy Claims are Fundamentally Unsubstantiated: The energy consumption analysis (Section 6.1, Figure 8, Page 12) is the most significant flaw in this paper. The PULSE-ASIC results are not measurements but rather estimations derived by scaling FPGA power numbers using a methodology from a nearly two-decade-old paper [95]. The validity of applying this 2006 methodology to modern process nodes and architectures is highly suspect. Furthermore, the energy figures for the RPC-ARM baseline are also an estimation, not a direct measurement. Relying on speculative estimations for two of the key comparison points undermines the entire energy efficiency claim and does not meet the standards of rigor for this conference.
        3. Marginal End-to-End Benefit of the In-Network Traversal: The primary novelty of PULSE is its support for distributed traversals via a programmable switch. The authors claim this "cuts the network latency by half a round trip time" (Section 5, Page 9). However, the empirical evidence presented does not show a dramatic benefit. In the head-to-head comparison between PULSE and a variant that returns to the CPU (PULSE-ACC in Figure 9, Page 12), the end-to-end latency improvement is modest, appearing to be in the 15-20% range for the two-node case. While an improvement, it falls far short of the theoretical "halving" of network latency, suggesting that other overheads dominate or the benefit is less significant in practice than claimed. This discrepancy between the strong claim and the measured result is a critical issue.
        4. Scalability is Assumed, Not Proven: The paper repeatedly refers to "rack-scale" deployments. Yet, the experimental evaluation is limited to a maximum of four memory nodes (Figure 7, Page 11). This is hardly rack-scale. The proposed hierarchical translation places a global address translation table on the switch (Figure 6, Page 9). The paper provides no analysis of the scalability of this centralized component. At a true rack scale with potentially hundreds of nodes and thousands of fine-grained memory allocations, this table's size, update overhead, and lookup latency could easily become a significant bottleneck. The authors have not provided any evidence that their design can scale beyond their small testbed.
        5. Inconsistent Application of Baselines: In the evaluation, the Cache+RPC (AIFM) baseline is restricted to a single node for the B+-Tree workloads because, as the authors state, it "does not natively support... distributed execution" (Section 6.1, Page 10). While technically correct, this sidesteps a rigorous comparison. The purpose of a baseline is to establish a state-of-the-art comparison. By simply omitting the data points, the authors avoid demonstrating how much better PULSE is than a distributed version of AIFM, or discussing the complexity required to build one. This weakens the comparative power of the evaluation.
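
        To make weakness 1 concrete, the following minimal sketch contrasts a
        per-step body that fits a bounded-computation model with one that does
        not. The step-function shape is a hypothetical rendering of an
        iterator offload interface, not PULSE's actual API or ISA.

            #include <cstdint>

            struct Node { uint64_t key; uint64_t next; uint64_t val; };

            // Fits the model: a fixed, data-independent amount of work per
            // dereference; the accelerator runs this body, then chases the
            // returned pointer.
            uint64_t list_lookup_step(const Node& n, uint64_t target,
                                      bool& done, uint64_t& out) {
                if (n.key == target) { done = true; out = n.val; }
                return n.next;
            }

            // Does not fit the model: the loop's trip count depends on
            // fetched data, so per-step computation is unbounded and would
            // need refactoring or a CPU fallback.
            uint64_t scan_step(const uint64_t* offsets, uint64_t base,
                               uint64_t target) {
                uint64_t i = 0;
                while (offsets[i] != 0 && offsets[i] < target)  // data-dependent bound
                    ++i;
                return base + offsets[i];
            }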

        Questions to Address In Rebuttal

        1. Regarding Expressiveness: Please provide a concrete example of a common pointer-traversal algorithm from a real-world application (e.g., in graph analytics or complex database indexes) that cannot be implemented within PULSE's bounded computation model. How do you justify the "expressive" label given this significant constraint?
        2. Regarding Energy Claims: Please provide a robust justification for using a power scaling methodology from 2006 [95] to estimate ASIC performance for a modern system. Given the speculative nature of both the PULSE-ASIC and RPC-ARM energy figures, how can the paper's core claim of superior energy efficiency be considered valid?
        3. Regarding Distributed Traversal Benefit: The measured end-to-end latency improvement from the in-network switch appears to be around 15-20% in Figure 9, not the 50% reduction in network time claimed in Section 5. Please provide a latency breakdown that reconciles the theoretical benefit with the measured, much smaller, end-to-end improvement. What are the dominant overheads that are not accounted for in your high-level claim? (A simple decomposition follows these questions.)
        4. Regarding Scalability: The switch's global address translation table is a centralized resource. What is the upper bound on the number of distinct, fine-grained memory allocations this table can hold on current programmable switch hardware? At what scale (in terms of nodes or allocation frequency) would this table's capacity or update contention become the system's primary performance bottleneck? Your paper makes "rack-scale" claims that are not supported by the 4-node experiment; please provide evidence for this scalability.
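
        On question 3, a simple decomposition shows what the rebuttal needs to
        reconcile (my algebra, using only the figures quoted above): if the
        switch eliminates half a round trip, RTT/2, from an end-to-end latency
        L, the expected improvement is (RTT/2) / L. Setting

            (RTT/2) / L ≈ 0.15–0.20  ⇒  RTT ≈ 0.30–0.40 × L

        the measurements are consistent with the avoided hop being only about
        a third of end-to-end latency, with accelerator execution and the
        remaining hops accounting for the rest; the requested breakdown should
        confirm or refute that split.
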
        1. In reply to @karu:
           Karu Sankaralingam @karu
            2025-11-02 17:23:52.380Z

            Paper: PULSE: Accelerating Distributed Pointer-Traversals on Disaggregated Memory
            Reviewer Persona: The Synthesizer (Contextual Analyst)


            Summary

            This paper identifies and addresses a fundamental performance bottleneck in disaggregated memory architectures: the high latency of pointer-chasing traversals across the network. The authors correctly argue that traditional CPU-side caching is ineffective for these workloads due to poor data locality.

            The core contribution is PULSE, a holistic, co-designed framework that offloads pointer traversals to lightweight accelerators at the memory nodes. The novelty of PULSE lies in its three synergistic components:

            1. An expressive iterator-based programming model that provides a general abstraction for various linked data structures while constraining computation to make hardware acceleration tractable.
            2. A novel disaggregated accelerator architecture where the logic and memory pipelines are decoupled and asymmetrically provisioned, efficiently matching the memory-bound nature of pointer chasing.
            3. An in-network continuation mechanism that leverages a programmable switch to seamlessly route traversal requests between memory nodes, handling distributed traversals without costly round-trips to the initiating CPU.

            The authors implement a prototype on FPGAs and a programmable switch, demonstrating significant end-to-end latency, throughput, and energy-efficiency gains (e.g., 9-34× lower latency than caching) for representative database, key-value store, and time-series workloads.

            Strengths

            This is an excellent systems paper that connects several important research threads into a cohesive and compelling solution.

            1. Addresses a Critical and Well-Understood Problem: The "pointer-chasing problem" is a well-known Achilles' heel for any system that separates compute from memory, whether it's traditional NUMA or modern memory disaggregation. By focusing on this specific, high-impact problem, the work has immediate relevance and positions itself as a key enabler for the widespread adoption of disaggregated memory. The empirical motivation in Section 2 (Page 3, Fig 2) effectively illustrates the severity and prevalence of the problem.

            2. Elegant, Principled Co-Design: The strength of PULSE lies not in any single component but in the synthesis of all three.

              • The iterator abstraction (§3) is the right level of software interface. It maps to a familiar programming pattern, making it broadly applicable, while its "bounded computation" constraint is a pragmatic tradeoff that makes specialized hardware feasible.
              • The idea of a disaggregated accelerator (§4.2, Fig 4) is the paper's most insightful architectural contribution. Recognizing that the workload is asymmetric (brief computation tc, long memory wait td) and designing an accelerator with asymmetric resources (fewer logic pipelines, more memory pipelines) is a clever insight that directly attacks the core inefficiency of using general-purpose cores for this task.
              • Using the network switch for distributed continuations (§5) is an elegant solution to the distributed traversal problem. It reframes a distributed computation problem as a packet routing problem, leveraging the strengths of existing programmable network hardware and avoiding expensive CPU involvement.
            3. Connects Multiple Research Domains: This work sits at a beautiful intersection of memory disaggregation, near-memory processing (NMP), and programmable networking. It takes the "offload computation" philosophy from NMP but proposes a far more resource-efficient and generalizable architecture than prior work. It uses the programmable network not just as a transport, but as an active component of the distributed execution model. This synthesis provides a valuable new perspective on how to build efficient, rack-scale computer systems.

            4. Strong, End-to-End Evaluation: The evaluation is thorough and convincing. The authors build a real hardware prototype and compare it against a strong set of baselines, including caching, CPU-based RPC, and SmartNIC-based offloads. The results clearly demonstrate that their specialized, co-designed approach provides substantial benefits in performance and energy efficiency that cannot be achieved by any of the baseline approaches alone.

            Weaknesses

            My critiques are not focused on fundamental flaws but on exploring the boundaries of the proposed solution and its practical deployment.

            1. Generality and Limits of the Programming Model: The iterator model is very powerful, but the paper primarily uses simple examples like key lookups. It would be beneficial to discuss the model's limitations more explicitly. How does PULSE handle more complex traversals, such as those that might need to dynamically modify the structure they are traversing (e.g., rebalancing a tree, splicing a list)? Are transactional semantics or atomic updates beyond the scope of this model? A deeper exploration of these edge cases would help define the practical application space.

            2. System Complexity and Path to Adoption: PULSE is a holistic solution, which is a strength, but it also implies significant complexity. It requires a custom software toolchain (compiler from iterator to PULSE ISA), custom accelerators on memory nodes (FPGAs or ASICs), and a programmable network switch. This is a high barrier to entry. While the components are individually plausible, the paper could benefit from a brief discussion on the path to deployment. Could a subset of PULSE's benefits be achieved with, for example, just a SmartNIC-based accelerator without the programmable switch?

            3. Assumptions about the Network Fabric: The in-network continuation model relies on a programmable switch with a global view of memory allocation (albeit at a coarse granularity). The paper could clarify the scalability of this aspect. How does the system handle dynamic re-partitioning or re-allocation of memory ranges across nodes? Is there a risk that the translation table or forwarding logic in the switch could become a bottleneck or a point of complexity in a large-scale, dynamic environment?

            Questions to Address In Rebuttal

            1. Could you elaborate on the expressiveness of the iterator model for more complex, state-modifying traversals? For example, could PULSE be used to implement an operation like find_and_move_to_front in a linked list, which requires modifying next pointers during the traversal? If so, how would state consistency be managed? (A host-side sketch of this operation follows these questions.)

            2. The distributed traversal mechanism is very elegant for reads. How does the model extend to handle distributed writes or atomic operations that might need to span multiple memory nodes? Does this necessarily require returning to the CPU for coordination?

            3. Regarding the system's robustness, the paper mentions retransmission on timeout for requests from the CPU. What is the failure model for a distributed traversal that is already in progress? For instance, if a request is forwarded from Memory Node A to B, and Node B fails or drops the packet, how does the original CPU learn of this failure and what is the recovery process?
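
            For question 1, here is find_and_move_to_front as plain host-side
            C++ on a singly linked list; the point is that the traversal must
            issue three pointer writes once the target is found, which a
            read-only offload cannot express. This is a sketch of the
            operation itself, not a claim about how PULSE would implement it.

                #include <cstdint>

                struct Node { uint64_t key; Node* next; };

                Node* find_and_move_to_front(Node* head, uint64_t key) {
                    if (!head || head->key == key) return head;
                    Node* prev = head;
                    while (prev->next && prev->next->key != key)
                        prev = prev->next;
                    if (!prev->next) return head;   // not found
                    Node* hit = prev->next;
                    prev->next = hit->next;         // write 1: unlink the node
                    hit->next = head;               // write 2: splice to front
                    return hit;                     // write 3: caller stores new head
                }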

            1. In reply to @karu:
               Karu Sankaralingam @karu
                2025-11-02 17:24:02.866Z

                Reviewer: The Innovator (Novelty Specialist)


                Summary

                This paper presents PULSE, a framework designed to accelerate pointer-traversal workloads in a disaggregated memory environment. The authors identify that existing solutions, such as CPU-side caching or single-node near-memory processing (NMP), are insufficient for distributed linked data structures. The central claim of novelty rests on a three-part system design: (1) an iterator-based programming model to provide an expressive software interface, (2) a novel "disaggregated" accelerator architecture at each memory node that decouples logic and memory pipelines for efficiency, and (3) a mechanism using a programmable network switch to enable stateful traversals to continue seamlessly across multiple memory nodes. The authors implement and evaluate a prototype, demonstrating significant performance and energy efficiency gains over baseline and RPC-based approaches.

                Strengths

                From a novelty perspective, the paper's primary strength is the coherent system design for distributed stateful traversals. The core innovative concept is the use of a programmable network switch to act as a router for in-flight pointer-chasing operations (Section 5, page 9). While NMP for pointer traversals and the use of programmable switches for network offloads are individually established concepts, their synthesis here to solve the multi-node traversal problem is genuinely novel. The mechanism of packaging the iterator state (cur_ptr, scratch_pad) into a request that can be forwarded by the switch to the next memory node, without returning to the host CPU, is an elegant and previously unexplored solution to the single-node limitation of prior NMP accelerators.
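
                To fix ideas, the forwarded state described above (cur_ptr
                plus a scratch_pad) amounts to a continuation record along the
                following lines; field names and sizes are guesses for
                illustration, not PULSE's wire format.

                    #include <cstdint>

                    struct TraversalContinuation {
                        uint64_t req_id;           // matches the reply to the issuing CPU
                        uint64_t cur_ptr;          // next global address to dereference
                        uint8_t  scratch_pad[64];  // bounded intermediate state (e.g., the key)
                        uint16_t prog_id;          // which offloaded iterator body to run
                        uint16_t hop_count;        // guards against forwarding loops
                    };
                    // The switch maps cur_ptr's address range to a memory
                    // node and forwards the packet there instead of returning
                    // to the initiating CPU.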

                A secondary, but still significant, novel contribution is the design of the PULSE accelerator itself (Section 4.2, page 7). The explicit decision to "disaggregate" the logic and memory pipelines within the accelerator (Figure 4, page 8) is a clever architectural insight. It directly addresses the memory-bound nature of these workloads, where tightly-coupled compute/memory resources in a traditional core design would lead to underutilization of the logic units. This design is a distinct and well-justified departure from both general-purpose cores (used in RPC schemes) and prior tightly-coupled specialized accelerators.

                Weaknesses

                The main weakness, in terms of novelty, is the framing of some well-established concepts as foundational contributions of this work.

                1. The Iterator Programming Model: The paper presents the iterator-based interface (Section 3, page 5) as a key design element. While it is a good engineering choice for creating a flexible hardware-software contract, the iterator pattern itself is a cornerstone of software engineering and is by no means novel. The contribution here is its application as an interface for an NMP accelerator, which is an incremental step rather than a fundamental innovation.

                2. The General Concept of NMP for Pointer Traversals: The paper correctly critiques prior work, but the idea of building specialized hardware close to memory to accelerate pointer chasing is not new. Seminal works like "Meet the Walkers" [90] and Hsieh et al. [76] explored this problem space in depth for in-memory databases and 3D-stacked memory, respectively. The authors' novelty is not in that they are building a pointer-chasing accelerator, but in the specific architecture of that accelerator (the disaggregated design) and, more importantly, its integration into a distributed system. The paper could be strengthened by more clearly positioning its work against these specific prior accelerator designs, rather than just against more generic RPC or caching systems. The current framing risks obscuring the true architectural novelty by re-litigating settled questions.

                In essence, the novelty of PULSE is not in its individual conceptual building blocks (iterators, NMP, programmable switches) but in their specific and sophisticated synthesis to create a new system capability: efficient, rack-scale distributed pointer traversal. The paper should be more precise in claiming this system-level synthesis as its core contribution.

                Questions to Address In Rebuttal

                1. Comparison to Prior Specialized Accelerators: The evaluation primarily compares PULSE to systems using general-purpose cores (RPC/RPC-ARM). A more rigorous assessment of novelty would compare the PULSE accelerator's disaggregated design to a specialized but coupled design, such as the one proposed in "Meet the Walkers" [90]. Could the authors provide a conceptual analysis (or even a model-based estimation) of how their disaggregated architecture compares in terms of area, power, and performance for a single-node traversal against such a design? This would better isolate and justify the claimed benefits of the novel disaggregated pipeline.

                2. Scope of the Iterator Abstraction: The scratch_pad provides a mechanism for stateful traversals. However, its fixed size appears to be a limitation. How does the PULSE model handle traversals where the intermediate state is unbounded or grows unpredictably, such as a breadth-first search (BFS) on a graph, where the queue of nodes to visit can become very large? Does this represent a fundamental limit on the generality of the approach, confining it to traversals with small, constant-sized state? (A short illustration follows question 3.)

                3. Scalability of the Switch-Based Routing: The novel distributed traversal mechanism relies on a translation table in the programmable switch (Figure 6, page 9). This table maps virtual address ranges to physical memory nodes. What are the scalability limits of this approach? As the number of memory nodes and the granularity of allocations increase, this table could grow beyond the capacity of on-switch memory. Is the novelty of this mechanism constrained to a rack-scale system, or do the authors envision a path for scaling it further?
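
                On question 2, the difficulty is easy to see in code: a BFS
                frontier is itself traversal state and can grow to O(V), so it
                cannot live in a small, fixed-size scratch pad. The sketch
                below is ordinary host-side C++, not a PULSE program.

                    #include <cstdint>
                    #include <queue>
                    #include <vector>

                    void bfs(const std::vector<std::vector<uint32_t>>& adj,
                             uint32_t src) {
                        std::vector<bool> seen(adj.size(), false);
                        std::queue<uint32_t> frontier;   // unbounded state
                        frontier.push(src);
                        seen[src] = true;
                        while (!frontier.empty()) {
                            uint32_t v = frontier.front(); frontier.pop();
                            for (uint32_t w : adj[v])
                                if (!seen[w]) { seen[w] = true; frontier.push(w); }
                        }
                    }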