FAST:An FHE Accelerator for Scalable-parallelism with Tunable-bit
Link: https://dl.acm.org/doi/10.1145/3695053.3731407
Abstract: Fully Homomorphic Encryption (FHE) enables direct computation on encrypted data, providing substantial security advantages in cloud-based modern society. However, FHE suffers from significant computational overhead compared to plaintext computation, hindering its adoption in real-world applications. While many accelerators have been designed to address performance bottlenecks, most do not fully leverage cryptographic optimization technologies, leaving room for further performance enhancements.
In this work, we propose FAST, an FHE accelerator incorporating recent cryptographic optimizations, including hoisting technology and the gadget decomposition key-switching method (named KLSS method). We analyze ciphertext level consumption throughout application execution and observe that workload requirements vary significantly with different ciphertext levels for both hybrid and KLSS key-switching methods. Additionally, we note the differing computational precision requirements for these key-switching methods. Based on these observations, we designed a versatile framework that supports multiple key-switching methods during a single application execution and integrates hoisting technology. Furthermore, we developed a scalable, precision-tunable multiplier to accommodate the needs of hybrid and KLSS key-switching methods. FAST architecture features a specialized multiplier and novel data organization to exploit cryptographic optimizations effectively. To our knowledge, this is the first accelerator to support hoisting technology and the gadget decomposition key-switching method. Our solution achieves a significant performance improvement, averaging a 1.8 × speedup.
- KKaru Sankaralingam @karu
Review of the paper "FAST: An FHE Accelerator for Scalable-parallelism with Tunable-bit," written from the perspective of "The Guardian."
Review Form
Summary
This paper proposes FAST, a Fully Homomorphic Encryption (FHE) accelerator architecture. The authors' core thesis is that prior accelerators have failed to adequately support modern cryptographic optimizations. To remedy this, FAST is designed to incorporate "hoisting" technology and to support multiple key-switching methods within a single application, including the gadget-decomposition-based KLSS method. The architecture features a "Tunable-bit Multiplier" (TBM) to accommodate the varying precision requirements of these different methods. The authors claim this versatile, optimization-aware approach yields an average performance improvement of 1.8x over prior work.
Strengths
The paper is built on a correct and important observation that is often overlooked in hardware accelerator design.
- Correct Problem Identification: The paper correctly identifies that FHE accelerator performance is not just a matter of raw computational throughput, but is deeply tied to the ability to support the latest cryptographic optimizations (Section 1, Page 1). Recognizing that techniques like hoisting and advanced key-switching methods are critical for performance and designing hardware to support them is a valid and necessary direction for research.
- Thorough Workload Analysis: The analysis of ciphertext level consumption and the corresponding computational requirements for different key-switching methods is detailed and provides a solid motivation for the need for architectural versatility (Section 3, Page 3).
Weaknesses
Despite its valid premise, the paper's conclusions are founded on a logically flawed evaluation methodology, questionable design choices, and an insufficient analysis of critical system parameters.
- Fundamentally Unsound Baseline Comparison: The headline 1.8x speedup claim is invalid because the comparison is inequitable. FAST is specifically designed to leverage the KLSS key-switching method and hoisting technology. The baseline accelerator, however, is not. The paper claims to be the "first accelerator to support" these techniques (Abstract, Page 1). Therefore, the performance improvement is not a demonstration of FAST's architectural superiority, but is merely the predictable result of comparing a system running an advanced, more efficient cryptographic algorithm against a baseline running an older, less efficient one. The speedup comes from the cryptographic method, not the hardware architecture.
- Insufficient Justification for Tunable Precision: The paper proposes a complex "Tunable-bit Multiplier" (TBM) to handle varying precision requirements (Section 5.4, Page 8). However, the justification for this complexity is weak. The paper does not provide a rigorous analysis of the area, power, and timing overhead of this tunable datapath compared to a simpler, fixed-precision datapath designed for the worst-case precision required by any of the supported methods. It is highly plausible that the overhead of the tunable logic negates its benefits, and a fixed-precision design would be more efficient overall. The performance gains attributed to the TBM (Figure 12, Page 12) are not isolated from other architectural changes, making it impossible to assess its true contribution.
- Inadequate Analysis of Precision and Noise: The paper states that it adopts a 36-bit datapath, claiming this allows for "acceptable accuracy loss" by citing a single prior work (Section 5.7.1, Page 9). This is insufficient. FHE computations are notoriously sensitive to noise growth, which is dependent on the chosen parameters and the depth of the computation. A single, fixed bit-width may be insufficient for more complex applications or future FHE schemes that require larger moduli. The paper lacks a rigorous error propagation analysis to prove that 36 bits is a robust choice for a general-purpose FHE accelerator.
- "Versatility" is Asserted, Not Proven: The architecture claims to efficiently support multiple key-switching methods within a single execution. However, the evaluation does not include any workloads that dynamically and frequently switch between methods. The benefit of this hardware versatility is therefore purely theoretical. It is possible that the architecture is implicitly optimized for one method (e.g., KLSS) and would perform poorly on a workload dominated by another (e.g., hybrid), making the claim of general versatility unsubstantiated.
Questions to Address In Rebuttal
- To provide a sound comparison, please evaluate FAST against a baseline accelerator that is also architecturally optimized to support the KLSS key-switching method and hoisting. This is the only way to isolate the performance benefits of your specific microarchitecture from the benefits of the underlying cryptographic algorithm.
- Provide a detailed breakdown of the area, power, and critical path delay of your Tunable-bit Multiplier (TBM). How do these metrics compare to a fixed-precision multiplier designed for the maximum precision required by any of the key-switching methods you support?
- Please provide a detailed noise analysis for a range of FHE applications with varying circuit depths. Demonstrate that a 36-bit datapath is sufficient to maintain correctness for applications requiring larger parameter sets and deeper computations than those presented in your evaluation.
- To substantiate your claim of versatility, please create and evaluate a benchmark application that dynamically and frequently interleaves operations requiring the hybrid key-switching method with operations requiring the KLSS method. This is necessary to prove that your architecture efficiently supports this dynamic switching and that the added complexity is justified.
- KIn reply tokaru⬆:Karu Sankaralingam @karu
Review of the paper "FAST: An FHE Accelerator for Scalable-parallelism with Tunable-bit," written from the perspective of "The Synthesizer."
Review Form
Summary
This paper introduces FAST, a hardware accelerator architecture designed to efficiently execute Fully Homomorphic Encryption (FHE) workloads. The core contribution is an architectural philosophy that explicitly embraces and supports modern cryptographic optimizations, which have often been ignored by prior hardware designs. Specifically, FAST is the first accelerator designed to support "hoisting" technology and multiple, dynamically-chosen key-switching methods (including the advanced gadget-decomposition-based KLSS method) within a single application. To enable this versatility, the architecture features a "Tunable-bit Multiplier" (TBM) that can adapt its datapath precision to the varying requirements of these different cryptographic techniques. The authors claim this co-design approach, which aligns the hardware with the latest software-level optimizations, results in significant performance gains over existing FHE accelerators.
Strengths
This paper represents a crucial and sophisticated step forward in the co-evolution of FHE cryptography and hardware acceleration. Its strength lies in its deep understanding of the software-hardware interface and its commitment to bridging the gap between them.
- Co-designing with Modern Cryptography: The most significant contribution of this work is its recognition that FHE acceleration is not merely a matter of building faster multipliers. The paper correctly identifies that true performance gains come from supporting the latest cryptographic primitives and optimization techniques (Section 1, Page 1). By explicitly designing the architecture to handle advanced methods like hoisting and the KLSS key-switching scheme, this work moves the field beyond naive hardware acceleration and toward a more mature, co-designed system. This is a vital step for making FHE practical. 🤝
- Embracing Algorithmic Diversity: The design of FAST acknowledges a key reality of modern FHE: there is no "one-size-fits-all" cryptographic technique. Different operations have different noise characteristics and computational costs. The decision to support multiple key-switching methods within a single architecture (Section 3.2, Page 4) and to provide a flexible datapath via the Tunable-bit Multiplier (Section 5.4, Page 8) is a forward-looking one. It allows the hardware to adapt to the needs of the application, rather than forcing the application to conform to a rigid hardware design. This is a hallmark of a well-designed, domain-specific accelerator.
- Connecting to the Broader Accelerator Landscape: While focused on FHE, this work taps into a broader trend in computer architecture: the move towards more versatile, reconfigurable, and "application-aware" hardware. The Tunable-bit Multiplier, for example, shares a conceptual lineage with the block floating-point and mixed-precision execution units found in modern AI and graphics hardware. By applying these ideas to the domain of FHE, the paper enriches the conversation in both fields and demonstrates the universal importance of data-path flexibility.
Weaknesses
While the vision of a crypto-aware accelerator is powerful, the paper could be strengthened by exploring the longer-term implications and challenges of its design choices.
- The "Chicken and Egg" Problem of Co-design: The paper designs an accelerator for today's most advanced cryptographic techniques. However, the field of FHE is evolving at a breathtaking pace. New schemes and optimizations are constantly being proposed. A potential weakness is that by optimizing for the current state of the art (like KLSS), the architecture may become suboptimal for the next generation of cryptographic methods. A more in-depth discussion of the architecture's "future-proofing" would be valuable.
- The Compiler and Toolchain Challenge: The paper focuses on the hardware architecture, but the true potential of FAST can only be unlocked by a sophisticated compiler and software toolchain. Such a toolchain would need to be able to analyze an FHE program, automatically choose the optimal key-switching method for each operation, and generate code that effectively utilizes the tunable-bit hardware. The paper acknowledges this need (Section 2, Page 3), but a deeper exploration of the compiler's role and the challenges in building it would provide a more complete picture of the proposed ecosystem.
- Data Movement Bottlenecks: Like all accelerators, FAST is ultimately constrained by the memory system. The paper provides a solid analysis of the on-chip memory hierarchy but spends less time discussing the challenges of off-chip data movement. As FHE applications grow in complexity, the sheer volume of ciphertext data will make the interface to external memory a critical performance bottleneck. A discussion of how FAST's design philosophy could be extended to address the off-chip memory challenge would be a welcome addition.
Questions to Address In Rebuttal
- Your work brilliantly optimizes for the current state of FHE cryptography. How do you envision the FAST architecture evolving to support future, as-yet-unknown FHE schemes or optimizations? What are the key principles of your design that provide this forward compatibility?
- The versatility of FAST places a significant burden on the compiler. What are the most difficult challenges in building a compiler that can automatically and optimally leverage the capabilities of your architecture, particularly the dynamic choice of key-switching methods? 🤓
- Looking beyond a single chip, how would you scale the FAST architecture to a multi-chip or multi-server system for even larger FHE applications? What new challenges and opportunities would arise at that scale?
- The tunable-precision datapath is a key feature. Could this idea be pushed even further? For instance, could future FHE accelerators use a more dynamic, fine-grained precision, perhaps even on a per-operation basis, to further optimize the trade-off between noise, precision, and performance?
- KIn reply tokaru⬆:Karu Sankaralingam @karu
Review of the paper "FAST: An FHE Accelerator for Scalable-parallelism with Tunable-bit," written from the perspective of "The Innovator."
Review Form
Summary
This paper presents FAST, a Fully Homomorphic Encryption (FHE) accelerator. The core claims to novelty are twofold: 1) It is the first FHE accelerator architecture explicitly co-designed to support modern cryptographic software optimizations, namely "hoisting" and multiple key-switching methods within a single application, including the advanced gadget-decomposition-based KLSS method (Abstract, Page 1). 2) To enable this, it proposes a novel "Tunable-bit Multiplier" (TBM), a datapath that can dynamically adjust its precision to match the varying requirements of these different cryptographic techniques (Section 5.4, Page 8). The central thesis is that this co-design philosophy represents a new, more efficient approach to FHE acceleration.
Strengths
From a novelty perspective, this paper's contribution is not the invention of a single new primitive but rather the novel synthesis and co-design of hardware and software concepts.
- A Novel Architectural Philosophy: The most significant "delta" in this work is its departure from the prevailing design philosophy of prior FHE accelerators. While previous works focused on accelerating the raw mathematical kernels (e.g., NTT, polynomial multiplication), FAST is the first to propose an architecture built from the ground up to support the structure of modern, highly-optimized FHE software stacks (Section 1, Page 1). This shift from a purely performance-driven to a crypto-optimization-aware design methodology is a crucial and novel step in the maturation of the field. It reframes the problem from "how to build a faster NTT unit" to "how to build a system that best leverages the latest cryptographic libraries."
- Novel Application of Tunable Precision: While variable-precision hardware is a known concept in other domains (e.g., mixed-precision units in AI accelerators), its application in FAST to solve a specific FHE challenge is a novel adaptation. The Tunable-bit Multiplier (TBM) is not just a generic variable-precision unit; it is specifically designed to handle the distinct precision requirements of different key-switching methods used within a single FHE program (Figure 6, Page 8). This targeted application of a known hardware technique to solve a new, domain-specific problem is a clear and innovative contribution. 🎯
Weaknesses
While the co-design philosophy is novel, the work's claims are diluted because the underlying components and the source of its performance gains are not fundamentally new.
- Performance Gains Stem from Algorithms, Not Novel Hardware Principles: The headline 1.8x speedup is primarily a result of comparing a system running a more advanced cryptographic algorithm (KLSS) to a baseline that is not. The paper claims to be the "first" accelerator to support these methods. Therefore, the performance improvement does not stem from a novel architectural breakthrough but from being the first to implement a better known software algorithm in hardware. The novelty of the architecture itself is conflated with the novelty of the cryptography it supports.
- Component Building Blocks are Not New: The FAST architecture is built from standard components: PE arrays, schedulers, and on-chip memory hierarchies. The dataflow and scheduling mechanisms, while tailored to the FHE workload, do not appear to represent a fundamentally new approach to dataflow architecture. The TBM, as noted, is a clever adaptation but not a ground-up invention. The novelty is purely in the synthesis and specific application, not in the creation of new hardware primitives.
- "First to Support" is a Narrow Claim: The claim of being the "first" accelerator to support hoisting and KLSS is a strong but narrow one. It establishes a new design point but does not, in itself, represent a fundamental new principle of acceleration. The work demonstrates the benefit of supporting these features, but the novelty lies in being the first to do so, not in a paradigm-shifting architectural concept that would be applicable beyond these specific cryptographic techniques.
Questions to Address In Rebuttal
- The Tunable-bit Multiplier (TBM) is a central component. Can you precisely articulate its novel aspects compared to the variable-precision and mixed-precision datapaths that have been extensively studied in the AI and DSP accelerator domains?
- The performance benefit of FAST seems inextricably linked to its use of the superior KLSS algorithm. To isolate the novelty of the architecture, what is the performance of FAST compared to a baseline accelerator that also supports KLSS, but is perhaps architected differently (e.g., with a fixed-precision datapath)? This is necessary to prove that the architectural choices in FAST, and not just the algorithm, are the source of the benefit.
- Beyond supporting specific known optimizations like hoisting, what is the fundamental, generalizable architectural principle in FAST that you believe is novel? If a new FHE optimization were to be invented tomorrow, why would the FAST architecture be inherently better at adapting to it than prior designs?
- Is the novelty of the FAST scheduler (Section 5.3, Page 7) in its ability to handle the specific data dependencies of hoisting, or does it represent a more general, novel approach to scheduling dataflow computations that could be applied to other application domains?