
RAP: Reconfigurable Automata Processor

By Karu Sankaralingam @karu
    2025-11-04 04:57:43.104Z

    Regular pattern matching is essential for applications such as text processing, malware detection, network security, and bioinformatics. Recent in-memory automata processors have significantly advanced the energy and memory efficiency over conventional ...
    ACM DL Link

    • 3 replies
    1. Karu Sankaralingam @karu
        2025-11-04 04:57:43.608Z

        Here is the completed review form from the perspective of "The Guardian."


        Review Form

        Reviewer: The Guardian (Adversarial Skeptic)

        Summary

        The authors present RAP, a reconfigurable in-memory automata processor designed to support three distinct automata models: NFA, NBVA, and LNFA. The central thesis is that real-world regex workloads are diverse (as shown in Figure 1), and a single-mode processor is inherently inefficient. By reusing 8T-SRAM cells for character classes (CCs) and bit vectors (BVs) and implementing specialized logic for different modes, the authors claim significant improvements in energy efficiency and compute density over state-of-the-art (SotA) specialized processors.

        However, the paper's claims of superiority are predicated on a series of unquantified overheads, arbitrary compilation heuristics, and a potentially biased evaluation framework. While the motivation is sound, the presented evidence lacks the rigor required to fully substantiate the claimed benefits of reconfigurability.

        Strengths

        1. Clear Motivation: The paper effectively motivates the need for a reconfigurable architecture. The analysis in Figure 1 (page 1) provides a compelling, data-driven rationale that different workloads benefit from different automata representations, which is the foundational premise of this work.
        2. Core Architectural Concept: The fundamental idea of repurposing the same in-memory fabric (8T-SRAM and local switches) to implement distinct computational models (NFA state matching, NBVA bit vector storage/manipulation, and LNFA Shift-And execution) is a technically sound and resourceful approach.

        Weaknesses

        1. Unquantified Overhead of Reconfigurability: The paper repeatedly claims the reconfigurable design has "little overhead" (Abstract, Section 3) but fails to provide a quantitative analysis to support this. The architecture requires additional components that are not present in a single-mode processor:

          • The NBVA mode requires metadata to identify BV-STEs, an overflow checker, and control logic for the pipelined bit-vector-processing phase (Section 3.1, page 5).
          • The LNFA mode repurposes pipeline buffers and requires a "specialized routing path for the active vector to facilitate bitwise shift" (Section 3.2, page 7).
            These components consume static and dynamic power and silicon area, irrespective of the mode being used. The paper does not provide a breakdown of these overhead costs, making it impossible to assess the true cost-benefit of the proposed reconfigurability. The improvements shown might be entirely consumed by these overheads in workloads that do not heavily leverage all modes.
        2. Arbitrary Compiler Heuristics: The decision-making process in the compiler relies on unsubstantiated heuristics. Specifically, in Section 4.2 (page 9), the compiler will use LNFA mode only if the required state expansion "does not increase the number of states by more than 2x." This "2x" threshold is arbitrary. There is no sensitivity analysis or justification provided for this specific value. A different threshold could dramatically alter the NFA/LNFA workload distribution and, consequently, the final performance results. This undermines the claim that the compiler finds the "best automata model" (Section 2, page 2). An illustrative sweep of the kind we expect is sketched after this list.

        3. Subjective Design Space Exploration (DSE): The selection of "optimal" parameters from the DSE in Section 5.3 (page 10 and Figure 10) appears subjective. The authors state, "We choose the depth that improves energy and area while offering acceptable throughput." The term "acceptable" is not defined quantitatively. For example, in Figure 10(a) for the Snort benchmark, depth=32 offers better area than the chosen depth=16. It appears parameters were selected to present the most favorable narrative rather than by applying a consistent, rigorous optimization function. This potential for post-hoc rationalization casts doubt on the optimality of the final configuration.

        4. Opaque System-Level Aggregation: A critical detail regarding performance normalization is buried in Section 5.5 (page 11). The authors state that if an array's throughput in NBVA mode falls below 2 Gch/s, they "assign another RAP array to work on the same regexes to share the workload." This effectively duplicates hardware resources for certain benchmarks. It is unclear if this duplication is reflected in the final area, power, and compute density figures presented in Figure 12. If a benchmark required two arrays, its area and power should be doubled, and its compute density halved. The paper does not specify which benchmarks required this treatment, making the final summary comparisons in Figure 12 potentially misleading.

        5. Potentially Biased Benchmark Comparison: The evaluation in Tables 2 and 3 compares RAP against SotA ASICs (CAMA, BVAP, CA) using a new set of benchmarks curated by the authors. While claimed to be more "up-to-date," this prevents a direct, apples-to-apples comparison against the published results for those architectures, which were often evaluated on standard suites like ANMLZoo. Without a characterization showing how these new benchmarks differ from established ones, it is difficult to determine if RAP's superior performance stems from its architecture or from a benchmark set that happens to align perfectly with its strengths.
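
        To make Weakness 2 concrete, the sketch below shows the kind of sensitivity sweep we would expect. It is a minimal illustration in Python: choose_mode() and the state counts are hypothetical stand-ins, not the paper's compiler API.

        # Hypothetical sweep over the LNFA state-expansion threshold (Section 4.2).
        # choose_mode() and its inputs are illustrative, not the paper's interface.
        def choose_mode(nfa_states: int, lnfa_states: int, threshold: float) -> str:
            """Mirror the stated heuristic: use LNFA mode only if the rewrite
            does not grow the state count by more than `threshold`x."""
            return "LNFA" if lnfa_states <= threshold * nfa_states else "NFA"

        def sweep(regexes, thresholds=(1.5, 2.0, 2.5, 3.0)):
            """Report how the NFA/LNFA split moves as the threshold moves."""
            for t in thresholds:
                modes = [choose_mode(r["nfa"], r["lnfa"], t) for r in regexes]
                share = modes.count("LNFA") / len(modes)
                print(f"threshold={t}x -> {share:.0%} of regexes mapped to LNFA")

        # Example with made-up state counts:
        sweep([{"nfa": 40, "lnfa": 70}, {"nfa": 40, "lnfa": 95}, {"nfa": 40, "lnfa": 130}])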

        Questions to Address In Rebuttal

        1. Please provide a detailed area and power breakdown of the hardware components required specifically for reconfigurability. This includes the mode-selection logic, control logic for the NBVA pipeline, the specialized LNFA active-vector routing, and any other overheads not present in a baseline single-mode NFA processor like CAMA.
        2. Justify the "2x" state-expansion heuristic for LNFA compilation (Section 4.2). Provide a sensitivity analysis showing how the overall system performance changes if this threshold is set to 1.5x, 2.5x, or 3x.
        3. Define the precise, quantitative cost function used to select the final parameters from your design space exploration (Figure 10). Specifically, how did you weigh throughput penalties against area and energy savings to define an "acceptable" trade-off? One candidate formulation is sketched after this list.
        4. Clarify the system-level mapping policy mentioned in Section 5.5. Which of the evaluated benchmarks required the use of multiple RAP arrays to meet the 2 Gch/s throughput target? How are the duplicated area and power consumption for these benchmarks accounted for in the final aggregated results presented in Figure 12?
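
        For Question 3, one concrete form such a cost function could take is sketched below. The weights, normalization, and throughput floor are our assumptions for illustration, not anything stated in the paper.

        # Hypothetical scalar cost for the DSE of Figure 10; lower is better.
        # The weights w_* and the baseline normalization are assumptions.
        def dse_cost(energy, area, throughput, baseline,
                     w_e=0.4, w_a=0.4, w_t=0.2, min_throughput=2.0):
            """Reject configurations below the throughput floor (in Gch/s)
            outright; otherwise trade off normalized energy, area, throughput."""
            if throughput < min_throughput:
                return float("inf")
            return (w_e * energy / baseline["energy"]
                    + w_a * area / baseline["area"]
                    + w_t * baseline["throughput"] / throughput)

        baseline = {"energy": 1.0, "area": 1.0, "throughput": 4.0}
        candidates = {16: (0.9, 0.95, 3.5), 32: (0.85, 0.90, 1.8)}  # depth: (E, A, T)
        best = min(candidates, key=lambda d: dse_cost(*candidates[d], baseline))
        print("chosen depth:", best)  # depth=32 is rejected by the 2 Gch/s floor
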
        1. In reply to karu:
          Karu Sankaralingam @karu
            2025-11-04 04:57:54.110Z

            Here is a peer review of the paper from the perspective of "The Synthesizer."


            Review Form

            Reviewer: The Synthesizer (Contextual Analyst)

            Summary

            This paper introduces RAP, a novel reconfigurable in-memory automata processor designed to efficiently execute regular expressions across diverse workloads. The central insight motivating this work is that no single automata model (NFA, NBVA, or LNFA) is optimal for all types of regexes commonly found in real-world applications like network security or bioinformatics. The authors compellingly demonstrate this workload diversity with an analysis of seven benchmark suites (Figure 1, page 1).

            RAP addresses this challenge by creating a unified hardware fabric based on 8T-SRAM that can be dynamically configured to operate in three distinct modes, each tailored to a specific automata model. For the NBVA model, it introduces a unified storage scheme for character classes and bit vectors. For the LNFA model, it implements a hardware-accelerated Shift-And algorithm with an energy-saving binning technique. This architectural flexibility is paired with a compiler that analyzes regexes and maps them to the most efficient execution model. The evaluation shows that RAP achieves significant improvements in energy efficiency and compute density compared to prior single-mode specialized accelerators (CAMA, BVAP) and offers orders-of-magnitude gains over state-of-the-art CPU and GPU solutions.
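
            For readers less familiar with the third model: the bit-parallel Shift-And algorithm that RAP's LNFA mode realizes in hardware is compact enough to state in software. The sketch below is the textbook algorithm, not the paper's hardware mapping.

            # Classical Shift-And (Bitap) matching for a linear pattern; LNFA
            # mode implements this bit-parallel recurrence in the array fabric.
            def shift_and(pattern: str, text: str):
                m = len(pattern)
                B = {}  # B[c]: bit i is set iff pattern[i] == c
                for i, c in enumerate(pattern):
                    B[c] = B.get(c, 0) | (1 << i)
                D = 0  # active-position bit vector, one bit per pattern position
                for j, c in enumerate(text):
                    D = ((D << 1) | 1) & B.get(c, 0)
                    if D & (1 << (m - 1)):
                        yield j - m + 1  # start position of a match

            print(list(shift_and("abab", "cababab")))  # [1, 3]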

            Strengths

            1. Excellent Problem Motivation and Framing: The paper's greatest strength is its clear, data-driven motivation. The analysis in Figure 1 (page 1) is the cornerstone of the work, elegantly demonstrating that regex workloads are not monolithic. This immediately establishes a strong justification for a reconfigurable approach and positions RAP not as just another accelerator, but as a practical solution to a well-characterized, real-world problem. This contextualizes the work perfectly within the existing landscape of automata processing.

            2. Novel and Elegant Architectural Synthesis: The core contribution is the architectural unification of three distinct and important automata execution models. While prior work has explored specialized ASICs for NFA (CAMA), NBVA (BVAP), and software solutions for LNFA-like patterns (Hyperscan), RAP is the first to synthesize them into a single, cohesive hardware substrate. The design is clever in its reuse of core components; the repurposing of 8T-SRAM cells for both CAM-based state matching and bit-vector storage, and the reconfiguration of the local switch fabric for state transitions or bit-vector actions, is an elegant solution that minimizes overhead.

            3. Strong Cross-Stack Design: The work is not just a hardware design; it includes a full compilation stack (Section 4, page 8) that bridges the gap between high-level regular expressions and the low-level hardware configuration. The inclusion of a compiler that performs rewriting, mode selection (Figure 9, page 8), and hardware mapping demonstrates a mature and comprehensive approach to the problem. This "full-stack" perspective is crucial for the usability and practical impact of any specialized accelerator.

            4. Comprehensive and Rigorous Evaluation: The evaluation is extensive, comparing RAP against a wide spectrum of contemporary solutions: other specialized ASICs, high-performance CPU and GPU libraries, and FPGA designs. The use of seven diverse, real-world benchmark suites allows the authors to showcase the benefits of reconfigurability across different scenarios, reinforcing the paper's central thesis.

            Weaknesses

            While this is a strong paper, there are areas where its context and implications could be explored further.

            1. Overhead of Reconfigurability: The paper successfully demonstrates the benefits of its approach on mixed workloads. However, it could provide a more direct analysis of the area, power, and control logic overhead incurred by the reconfigurability itself. The results for the NFA-dominant RegexLib benchmark (Figure 12, page 12) show a slight performance degradation compared to baselines, which is attributed to controller overhead. A deeper dive into this trade-off would be valuable. How much more efficient could a non-reconfigurable, "pure" NFA or NBVA processor be if built with the same underlying technology? This would help contextualize the "cost of flexibility."

            2. Complexity of the Compilation and Mapping: The decision graph in Figure 9 (page 8) and the mapping heuristics for binning LNFAs (Section 3.2, page 7) suggest a complex compilation process. The paper could elaborate on the sensitivity of the final results to these heuristics. For instance, the 2x state expansion threshold for choosing the LNFA model is a key parameter. A sensitivity analysis would strengthen the claim that the compiler makes robustly optimal choices.

            3. Future Architectural Directions: The work positions itself as the "first" reconfigurable automata processor. As such, it opens a new design space. The paper could benefit from a brief discussion on the future evolution of this concept. For example, the current design configures modes at deployment time. What would be the challenges and opportunities in enabling more dynamic, runtime reconfiguration in response to changing data streams? This would help connect the present work to the future research it enables.

            Questions to Address In Rebuttal

            1. Could the authors quantify the overhead (in terms of area, power, and control logic complexity) of supporting all three modes compared to a hypothetical, single-mode processor built with the same base technology? For instance, how much larger is a RAP tile than a tile designed only for NFA processing using the CAMA architecture?

            2. The compiler's rewriting rules, particularly for LNFA (Section 4.2, page 9), involve a 2x state expansion threshold. How was this threshold determined, and how sensitive is the overall system performance to this and other compiler heuristics? Is it possible for a suboptimal choice by the compiler to negate the benefits of the underlying hardware?

            3. The paper discusses configuring tiles to different modes at deployment time. Is it feasible to dynamically reconfigure tiles or arrays during runtime as the characteristics of an input data stream change? What would be the performance and energy implications (e.g., latency of reconfiguration, state flushing) of such a dynamic capability?

            1. In reply to karu:
              Karu Sankaralingam @karu
                2025-11-04 04:58:04.636Z

                Here is a review of the paper from the perspective of "The Innovator."


                Review Form

                Reviewer: The Innovator (Novelty Specialist)

                Summary

                The paper presents RAP, an in-memory automata processor architecture claimed to be the first to be reconfigurable to efficiently support three distinct automata models: Nondeterministic Finite Automata (NFA), Nondeterministic Bit Vector Automata (NBVA), and Linear NFA (LNFA). The central thesis is that real-world regex workloads are diverse (as shown in Figure 1), and a single-purpose accelerator is therefore suboptimal. The proposed solution achieves reconfigurability by repurposing its core in-memory fabric—specifically, using 8T-SRAM structures as either Content-Addressable Memory (CAM) for NFA state matching or as storage for bit vectors in NBVA mode, and reconfiguring the local crossbar switches to implement either NFA transitions, NBVA bit-vector actions, or the routing for a bit-parallel Shift-And algorithm for LNFA. The core architectural novelty lies in the tight integration and unification of these processing models onto a single hardware substrate, moving beyond prior art which has focused on specialized, single-mode accelerators.
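
                As background for what follows, the homogeneous-NFA execution model these processors implement (one character-class match plus one transition step per input symbol) can be summarized in a short sketch. The set-based representation below is ours, purely for exposition; in CAM-based designs the class match is the CAM lookup and the predecessor test is the local-switch routing.

                # One symbol step of a homogeneous NFA: a state is active in the
                # next cycle iff its character class matches the current symbol
                # and at least one of its predecessors was active.
                def nfa_step(active, symbol, char_class, preds):
                    return {s for s in char_class
                            if symbol in char_class[s]
                            and any(p in active for p in preds[s])}

                # Tiny example matching "ab": start state S feeds state 0.
                char_class = {0: {"a"}, 1: {"b"}}
                preds = {0: {"S"}, 1: {0}}
                active = {"S"}
                for ch in "ab":
                    active = nfa_step(active, ch, char_class, preds) | {"S"}
                print(1 in active)  # True: the accepting state was reached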

                Strengths

                The primary strength of this paper is its central, novel architectural concept.

                1. Core Architectural Unification: The claim of being the "first reconfigurable in-memory automata processor" of this type appears to hold. Prior art has produced specialized processors. For instance, CAMA [18] focuses on an efficient NFA substrate, while BVAP [52] proposes an architecture specifically for NBVA by adding a dedicated Bit Vector Module (BVM) to an NFA processor. The key novelty of RAP is the elimination of such dedicated, single-purpose modules. Instead of adding a separate BVM, RAP integrates the bit vector storage and manipulation directly into the main processing fabric. This unification is a significant and novel architectural contribution.

                2. Novel Repurposing of In-Memory Fabric: The technical execution of this unification is clever. The paper details how columns of the 8T-SRAM array are dynamically designated as either character class (CC) storage (acting as a CAM) or bit vector (BV) storage (acting as SRAM) (Section 3.1, page 5). Critically, the corresponding columns in the local switch are also repurposed. For NFA mode, they encode state transitions. For NBVA mode, they are reconfigured to encode BV actions like shift, copy, and read (Figure 5, page 5). This dual-use of the switch fabric is a specific, non-obvious, and novel design element. A software approximation of such BV actions is sketched after this list.

                3. Compiler Co-Design for Reconfigurability: The novelty is not confined to hardware. The work includes a compiler that makes strategic decisions about which automata model to use for a given regex (Figure 9, page 8). This demonstrates a full-stack approach where the hardware's novel reconfigurability is made accessible and useful. This is a crucial part of the contribution, as novel hardware without a path to programmability is of limited value.
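
                To make Strength 2 concrete, the kind of bit-vector action the repurposed switch columns encode can be approximated in software. The sketch below shows one common bit-vector encoding of a bounded repetition such as a{3,5}; it follows the general NBVA idea and is not necessarily RAP's exact action encoding.

                # Bit-vector handling of a{3,5}: bit i means "the last i+1
                # symbols all matched". A shift-and-set action advances the
                # count; any set bit in positions 2..4 signals 3-5 matches.
                LO, HI = 3, 5
                ACCEPT_MASK = ((1 << HI) - 1) & ~((1 << (LO - 1)) - 1)  # bits 2..4

                def bv_step(bv: int, matched: bool) -> int:
                    """One BV action: shift in a 1 on a match, clear otherwise."""
                    return ((bv << 1) | 1) & ((1 << HI) - 1) if matched else 0

                bv = 0
                for ch in "aaaa":
                    bv = bv_step(bv, ch == "a")
                print(bool(bv & ACCEPT_MASK))  # True: four 'a's satisfy a{3,5}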

                Weaknesses

                While the overarching concept is novel, a breakdown of the constituent parts reveals that some of the underlying ideas are extensions of existing work. The novelty lies almost entirely in the synthesis.

                1. Incremental Nature of Component Ideas: The individual processing modes are not new. NFA execution on CAM-based in-memory architectures is the core concept of CAMA [18]. The use of bit vectors to handle bounded repetitions is the central idea of NBVA and the BVAP [52] accelerator. The Shift-And algorithm [3] for linear patterns is a classical, well-understood technique. The novelty of RAP is not in inventing these methods, but in creating a single substrate that can morph to execute all three. The contribution must be evaluated on the quality of this synthesis, not on the originality of its parts.

                2. Limited Algorithmic Novelty in LNFA Implementation: The implementation of LNFA mode (Section 3.2, page 6) appears to be the least novel aspect. It is a direct hardware mapping of the Shift-And algorithm. While the repurposing of the active vector and routing paths is a necessary engineering step to fit it onto the RAP substrate, the core bit-parallel mechanism itself is not new. The "Multi-LNFA Binning" optimization (Section 3.2, page 7) is a clever mapping strategy to improve power efficiency, but it is an optimization of a known technique rather than a fundamentally new processing paradigm. A first-fit packing sketch of this mapping step follows this list.

                3. Ambiguity in the Novelty of "Unified Storage": The paper claims novelty in unified storage for CCs and BVs (Section 3.1, page 4). The concept of using a memory array for different purposes is, at a high level, not new. The key is in the specifics. The paper relies on prior work [24] showing 8T-SRAM can be repurposed as a CAM. The novel step is to use this known duality to dynamically partition a single array to hold both data types for a new application (automata processing). The authors should be careful to frame this as a novel application of a known circuit technique, rather than a fundamentally new circuit itself.
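
                On Weakness 2: if, as we read it, Multi-LNFA Binning amounts to packing several short patterns into one array so that unused capacity can be power-gated, the mapping step is essentially classical bin packing. The first-fit-decreasing sketch below is our reconstruction; the paper's actual heuristic may differ.

                # First-fit-decreasing packing of LNFA widths (in states) into
                # arrays of fixed capacity. Our reconstruction of the mapping
                # level of "binning"; the paper's heuristic may differ.
                def bin_lnfas(widths, capacity=256):
                    bins = []  # each inner list shares one physical array
                    for w in sorted(widths, reverse=True):
                        for b in bins:
                            if sum(b) + w <= capacity:
                                b.append(w)
                                break
                        else:
                            bins.append([w])
                    return bins

                print(bin_lnfas([120, 200, 64, 64, 30]))
                # [[200, 30], [120, 64, 64]] -> two arrays instead of five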

                Questions to Address In Rebuttal

                1. The distinction from BVAP [52] is central to the novelty claim. BVAP uses a dedicated, appended BVM, whereas RAP integrates this functionality into the main CAM/switch fabric. Could the authors elaborate on the novelty of the specific circuit-level techniques or architectural reconfigurations that enable this integration? Is the "BV actions Encoding Scheme" (Section 3.1, page 5) the sole enabler, or are there other non-obvious modifications required in the CAM or switch design that differentiate it from simply co-locating a smaller BVM inside a CAMA tile?

                2. The hardware mapping of the Shift-And algorithm is presented as a contribution for LNFA support. Beyond the binning optimization, what is the core novelty in this mapping compared to prior specialized hardware implementations for bit-parallel matching? Is the main contribution simply demonstrating that such an algorithm can be mapped to this specific reconfigurable fabric?

                3. A unified, reconfigurable design inherently involves compromises compared to a specialized one. The paper claims "little overhead" (Abstract, page 1). What specific performance, area, or energy efficiencies were sacrificed in the NFA mode (compared to a pure, non-reconfigurable CAMA-style design) and in the NBVA mode (compared to a pure BVAP-style design) to enable the reconfigurability for the third, LNFA mode? A quantification of this "reconfigurability tax" would help in assessing the significance of the novelty.