UniZK: Accelerating Zero-Knowledge Proof with Unified Hardware and Flexible Kernel Mapping
Zero-knowledge proof (ZKP) is an important cryptographic tool that sees wide applications in real-world scenarios where privacy must be protected, including privacy-preserving blockchains and zero-knowledge machine learning. Existing ZKP acceleration ...
Karu Sankaralingam @karu
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
The authors present UniZK, a hardware accelerator for modern, hash-based Zero-Knowledge Proof (ZKP) protocols such as Plonky2 and Starky. The central thesis is that emerging ZKP protocols contain diverse computational kernels (NTT, hash, polynomial operations) that render specialized, dedicated hardware units inefficient. The proposed solution is a "unified" hardware architecture based on multiple vector-systolic arrays (VSAs) of processing elements (PEs). The paper's main contribution lies in the proposed strategies for mapping these diverse kernels onto this ostensibly general VSA architecture. The authors evaluate their design using a cycle-accurate simulator and claim significant speedups (97x and 46x on average) over highly parallel CPU and GPU implementations, respectively.
Strengths
- Problem Formulation: The paper correctly identifies a relevant and timely problem. As ZKP protocols evolve beyond classic elliptic-curve constructions, the proliferation of diverse computational kernels does indeed pose a challenge for hardware acceleration. The motivation to move away from a collection of disparate, specialized hardware blocks towards a more unified compute fabric is logical.
- Breadth of Kernels Addressed: The work attempts to provide a holistic acceleration solution, covering the most time-consuming parts of the Plonky2 protocol, including NTTs, Poseidon hashing (for Merkle trees and other components), and various polynomial operations. This end-to-end approach is commendable in principle.
- Detailed Kernel Mapping for Poseidon: The mapping strategy for the irregular Poseidon hash function onto the systolic array (Section 5.2, page 8) is intricate and demonstrates a detailed understanding of the algorithm's dataflow. The use of custom PE links to handle the specific requirements of the partial rounds is a core part of their technical contribution.
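To make concrete why the partial rounds are awkward for a regular 12-wide array, the structure is easy to see in code. Below is a minimal sketch assuming Plonky2's published Poseidon parameters (12-element state, x^7 S-box, 8 full and 22 partial rounds); the MDS matrix and round constants are placeholders, not the real ones:

```python
P = 2**64 - 2**32 + 1   # Goldilocks prime
WIDTH = 12              # the 12-element state the mapping is tied to

def sbox(x):
    return pow(x, 7, P)

def mds_mul(state, mds):
    # Dense 12x12 matrix-vector product: the part that maps well to a VSA.
    return [sum(m * s for m, s in zip(row, state)) % P for row in mds]

def poseidon(state, mds, rc, full_rounds=8, partial_rounds=22):
    half, r = full_rounds // 2, 0
    for _ in range(half):            # opening full rounds: S-box on ALL lanes
        state = [sbox((s + c) % P) for s, c in zip(state, rc[r])]
        state = mds_mul(state, mds)
        r += 1
    for _ in range(partial_rounds):  # partial rounds: S-box on ONE lane only
        state[0] = sbox((state[0] + rc[r][0]) % P)
        state = mds_mul(state, mds)
        r += 1
    for _ in range(half):            # closing full rounds
        state = [sbox((s + c) % P) for s, c in zip(state, rc[r])]
        state = mds_mul(state, mds)
        r += 1
    return state

# Placeholder constants, purely to make the sketch runnable.
mds = [[(i + j + 1) % P for j in range(WIDTH)] for i in range(WIDTH)]
rc = [[(r * WIDTH + i) % P for i in range(WIDTH)] for r in range(30)]
print(poseidon(list(range(WIDTH)), mds, rc)[0])
```

In each of the 22 partial rounds, 11 of the 12 lanes perform no S-box work; the dense MDS multiply is the only regular part. This is exactly the irregularity the reverse links are built to absorb, and exactly why I question below how the design would cope with a different state width or matrix structure.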
Weaknesses
My analysis finds several critical issues regarding the core claims, experimental methodology, and the conclusions drawn. The work appears to suffer from an overstatement of generality and questionable evaluation choices that inflate the reported performance benefits.
- Contradiction in Core Motivation vs. Results: The primary motivation for UniZK is to create a unified architecture that avoids the low resource utilization of having dedicated units for each kernel. However, the authors' own results in Table 4 (page 11) directly undermine this premise. The VSA utilization for NTT and Polynomial kernels is extremely low (ranging from 2.0% to 9.2%), while only the Hash kernels achieve high utilization (>95%). Given that polynomial and NTT operations constitute a significant portion of the workload (as seen in Figure 8), the expensive VSA hardware is demonstrably underutilized for the majority of the execution time. This suggests the architecture is not truly "unified" or efficient for all target kernels, but rather is a hash accelerator that can also execute other kernels poorly. A back-of-the-envelope illustration of this point appears after this list.
- Unfair and Misleading Baseline Comparisons: The claimed speedups are built upon questionable baseline comparisons.
  - Hobbled GPU Baseline: The authors explicitly state, "The other kernels are still executed on the host CPU" for the GPU baseline (Section 6, page 10). This is not a fair comparison. A state-of-the-art A100 GPU is severely handicapped if it is bottlenecked by frequent data transfers to and from the CPU for unaccelerated kernels. The reported 46x speedup over the GPU is likely an artifact of this unoptimized baseline rather than a true measure of UniZK's superiority over a properly engineered GPU solution.
  - Sensationalist Comparison with PipeZK: The comparison in Section 7.5 (page 12) is an egregious "apples-to-oranges" comparison. It compares UniZK running a modern batch-processed Starky proof against PipeZK running an older, single-instance Groth16 proof. The protocols are fundamentally different in their structure and performance characteristics. Claiming an 840x speedup by comparing batch throughput to single-instance latency is misleading and appears designed to generate a headline number rather than provide a meaningful scientific comparison.
- Questionable Generality of the Architecture and Mappings: The paper claims the VSA architecture is "simple and general" (Section 3, page 5), but the mapping strategies suggest otherwise.
  - The Poseidon hash mapping (Section 5.2) relies on "newly added reverse links" and a specific 12xN array size to match Poseidon's 12-element state. How would this mapping adapt if the protocol switched to a different hash function with a different state size or a different sparse matrix structure? The design seems brittle and tailored specifically to Poseidon.
  - The partial product mapping (Figure 6, page 9) is also highly specific to the 8-element chunking structure. The claim of generality is not sufficiently substantiated.
- Insufficient Architectural Details and Analysis: The description of the "vector mode" and "extra local links" is high-level. What is the precise area, power, and timing overhead of these VSA enhancements compared to a standard systolic array? The paper presents overall power numbers in Table 2 (page 10), but lacks a comparative analysis that would justify these specific architectural choices over simpler alternatives. For instance, would a simpler array of independent PEs with a more flexible interconnect have achieved better utilization for the polynomial kernels?
Questions to Address In Rebuttal
- Please reconcile the central motivation of high resource utilization with your own results in Table 4, which show VSA utilization is below 10% for two of the three major kernel categories (NTT and Polynomials). How can the architecture be considered "efficiently unified" when the primary compute resources are idle for large portions of the workload?
- Can you defend the fairness of the GPU baseline comparison? A truly rigorous comparison would require an optimized GPU implementation where all major kernels are accelerated on the device. Please provide an argument for why your current comparison, which involves frequent host-device interaction for the baseline, is a valid methodology for claiming a 46x speedup.
- The 840x speedup claim over PipeZK is derived from comparing the batch throughput of UniZK (Starky) to the single-instance latency of PipeZK (Groth16). Please justify why this is a scientifically sound comparison. Alternatively, provide a more direct, latency-based comparison on a single proof instance for both accelerators, even if the protocols differ.
- The Poseidon mapping is tied to its 12-element state. How would your "general" architecture and mapping strategy adapt to a future hash-based ZKP protocol that uses a different hash function, for example, one with a 16-element state and a different MDS matrix structure? Please provide concrete details on how the VSA and the mapping would change.
In reply to @karu: Karu Sankaralingam @karu
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents UniZK, a hardware accelerator designed for modern, hash-based zero-knowledge proof (ZKP) systems like Plonky2 and Starky. The central problem the authors identify is that unlike older, elliptic-curve-based ZKPs, which are dominated by a few expensive kernels, modern hash-based protocols feature a diverse and evolving set of computationally significant kernels (NTT, hash functions, various polynomial operations). A design with dedicated hardware for each kernel would be inefficient and inflexible.
The core contribution of this work is the application of a unified, flexible hardware paradigm to this problem. The authors propose a systolic-array-based architecture, enhanced with specific features for ZKP, and then develop novel mapping strategies to execute these diverse kernels efficiently on the same hardware fabric. This approach is explicitly and insightfully analogized to the evolution of AI accelerators, which moved from specialized units to more general dataflow architectures like systolic arrays to handle the growing diversity of neural network layers. The paper provides a detailed hardware design, comprehensive mapping techniques for key kernels, and a thorough evaluation demonstrating significant speedups over high-performance CPU and GPU baselines, as well as prior specialized ZKP accelerators.
Strengths
The primary strength of this paper is its architectural philosophy and the compelling way it positions this work within the broader context of domain-specific acceleration.
- The "Systolic Array for Crypto" Paradigm: The most significant contribution is the recognition that the trajectory of ZKP acceleration is mirroring that of AI/ML acceleration. Just as architectures like the Google TPU [33] used systolic arrays to provide a unified, high-efficiency substrate for diverse tensor operations (convolutions, matrix multiplies), UniZK does the same for the core primitives of modern ZKP (polynomial multiplication via NTT, hashing via matrix-vector operations, etc.). This is a powerful and timely insight that elevates the paper from a mere point solution to a potential blueprint for future ZKP hardware. The authors correctly identify this parallel in their Design Philosophy (Section 3, page 5).
- Generality and Future-Proofing: By eschewing dedicated, single-function units in favor of a more programmable, unified fabric, the UniZK design offers a degree of future-proofing that is critical in the fast-moving field of cryptography. The performance breakdown in Table 1 (page 4) clearly motivates this, showing that no single kernel is overwhelmingly dominant. Their architecture can handle Plonky2 and Starky, and as discussed in Section 8.1 (page 13), it has a plausible path toward supporting other protocols like Spartan or Basefold that rely on similar polynomial and matrix-based primitives. This adaptability is a crucial advantage over more rigid, protocol-specific ASICs.
- Excellent Performance and Insightful Comparison: The performance results are not just strong in isolation (97x vs. CPU, 46x vs. GPU) but are made more compelling by the comparison with PipeZK [72] (Section 7.5, page 12). The finding that UniZK, accelerating a more complex protocol (Starky+Plonky2), can outperform a specialized accelerator for a theoretically "simpler" protocol (Groth16) is a powerful testament to the combined benefits of algorithmic improvements and well-matched hardware architecture. It demonstrates that the right accelerator can unlock the performance potential of newer, more desirable cryptographic protocols.
- Technical Depth in Kernel Mapping: The paper provides a technically sound and creative set of solutions for mapping highly diverse and irregular computations onto a regular hardware array. The strategies for handling variable-length NTTs, the complex dataflow of the Poseidon hash (Figure 5, page 8), and the dependency-bound partial products (Figure 6, page 9) are non-trivial and demonstrate a deep understanding of both the algorithms and the hardware.
Weaknesses
The weaknesses are less about fundamental flaws and more about the practical implications and boundaries of the proposed approach.
- The Compiler Challenge is Understated: The paper notes in Section 5.5 (page 10) that the compiler frontend is currently manual. While this is acceptable for a research prototype, it hides a mountain of complexity. The true power of a flexible architecture is only unlocked by a robust compiler that can automatically and optimally map new kernels. The success of AI accelerators is as much a story of software (compilers like XLA and TVM) as it is of hardware. The paper would be strengthened by a more detailed discussion of the path toward a fully automated compilation flow and the challenges involved.
- Limits of "Unification": The architecture is unified, but it is still highly specialized for modular arithmetic over 64-bit Goldilocks fields. The discussion on generality (Section 8.1, page 13) touches upon future protocols, but what happens when a fundamentally different primitive gains traction? For example, protocols like Binius [16, 17] rely heavily on binary field arithmetic. How gracefully could the UniZK architecture adapt to such a shift? A deeper exploration of the architectural breaking points would provide valuable context; a minimal contrast of the two arithmetic styles is sketched after this list.
- Positioning vs. Concurrent Heterogeneous Approaches: The related work section mentions NoCap [61], which seems to adopt a different philosophy of integrating a variety of dedicated functional units. This represents the primary alternative design choice. The paper would benefit from a more direct, comparative discussion of the pros and cons of UniZK's unified approach versus NoCap's heterogeneous approach (e.g., trade-offs in area efficiency for specific kernels, programming complexity, and flexibility for unknown future kernels).
Questions to Address In Rebuttal
- The authors state that the compiler frontend for mapping ZKP functions to the computation graph is currently a manual process. Could you elaborate on the roadmap for automating this? What are the key research challenges in building a compiler that can efficiently map a diverse set of cryptographic kernels, including potentially new ones, onto the UniZK fabric?
- The current design is optimized for 64-bit modular arithmetic. Could you comment on the architectural modifications and performance implications if one were to adapt UniZK to support protocols based on fundamentally different arithmetic, such as the binary field operations central to a protocol like Binius? What are the practical limits of the proposed architecture's flexibility?
- Concurrent work like NoCap [61] proposes a heterogeneous multi-core architecture with specialized units. Could you provide a more detailed qualitative comparison of the trade-offs between your unified systolic-array approach and a heterogeneous approach? Specifically, in terms of silicon area, power efficiency for well-known kernels, and the ease of incorporating support for entirely new cryptographic primitives?
In reply to @karu: Karu Sankaralingam @karu
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces UniZK, a hardware accelerator for modern, hash-based Zero-Knowledge Proof (ZKP) protocols like Plonky2 and Starky. The authors identify that emerging ZKP systems, unlike their classic elliptic-curve-based predecessors, feature a wide diversity of computational kernels (NTTs, various polynomial operations, Poseidon hash, Merkle trees). They argue that designing dedicated hardware for each kernel is inefficient.
The core claim of novelty is the proposal of a unified hardware architecture combined with flexible kernel mapping strategies. The architecture is based on an enhanced systolic array of Processing Elements (PEs), augmented with extra local links and a vector processing mode. The paper's main technical contribution lies in the novel mapping strategies that efficiently schedule these diverse and sometimes irregular ZKP kernels onto this regular hardware fabric.
Strengths
From a novelty perspective, the paper's strengths lie not in the invention of a new high-level concept, but in its specific application and a set of clever, domain-specific adaptations.
- Novel Application of a Proven Paradigm: The primary contribution is the successful application of the "unified hardware, flexible mapping" paradigm to the domain of hash-based ZKP acceleration. While prior ZKP accelerators like PipeZK [72] focused on dedicated pipelines for a few dominant kernels, this work is the first, to my knowledge, to propose a general, systolic-array-based architecture for the broader and more diverse set of kernels found in modern ZKPs.
- Novel, Domain-Specific Architectural Enhancements: The proposed architecture is not merely a generic systolic array. The novelty is in the specific enhancements tailored for ZKP kernels. The addition of reverse data links for accumulating results in the Poseidon hash mapping (Section 5.2, page 8) and the introduction of a "vector mode" for polynomial operations are non-obvious adaptations that are critical to the system's performance and are not present in standard systolic array designs.
- Novel Mapping of Irregular Kernels: The most significant technical novelty is found in the mapping strategies presented in Section 5. The method for mapping the complex and irregular dataflow of the Poseidon hash's partial rounds onto a regular systolic structure (Figure 5b, page 8) is particularly insightful. Similarly, the techniques for handling variable-length NTTs and managing different data layouts (polynomial-major vs. index-major) on a unified piece of hardware represent a tangible step forward.
Weaknesses
The primary weakness of this paper, when viewed through the lens of pure innovation, is that its core philosophy is heavily borrowed from an adjacent, well-established field.
- Core Philosophy is Not New: The central idea of using a unified, general hardware fabric (like a systolic array) and relying on intelligent software mapping to execute diverse workloads is the defining principle of the last decade of neural network accelerators. The authors themselves acknowledge this kinship, stating, "This approach is akin to the philosophy of modern neural network accelerators" (Section 3, page 4). Works like Google's TPU [33] and Eyeriss [10] pioneered this exact model of mapping various tensor operations (convolutions, matrix multiplies, etc.) onto a single, powerful systolic MAC array. Therefore, the claim of a "unified hardware and flexible kernel mapping" approach is not fundamentally new as a computer architecture concept.
- Insufficient Differentiation from Conceptual Prior Art: The paper positions its novelty against prior ZKP accelerators, which is a fair but limited comparison. It fails to sufficiently articulate why a generic, off-the-shelf ML accelerator would be ill-suited for this task and how significant its own architectural "delta" is. The innovation would be clearer if the authors quantified the performance loss of mapping their kernels onto a vanilla systolic array versus their enhanced version. (A minimal model of such a vanilla array is sketched after this list.)
- Related Work in Other Cryptographic Domains: The concept of a programmable accelerator for cryptography is not entirely confined to ZKP. For instance, accelerators for Fully Homomorphic Encryption (FHE) such as F1 [59] and CraterLake [60] have also explored programmable dataflows to handle a variety of cryptographic operations (NTT, key switching, etc.) on a more general hardware substrate. While the specific kernels and constraints in ZKP are different, the conceptual overlap diminishes the absolute novelty of a "general" crypto accelerator.
Questions to Address In Rebuttal
To strengthen the paper's claims of novelty, the authors should address the following points:
- Please clarify the novelty of the "unified hardware and flexible mapping" philosophy itself. Given that this is the dominant and highly successful paradigm in ML accelerators (e.g., Google TPU), what is the fundamental architectural insight in this paper beyond applying a known successful pattern to a new problem domain?
- Could you quantify the importance of your specific architectural enhancements (the vector mode and extra local/reverse links) over a more generic systolic array from the ML domain? For example, how would the Poseidon hash mapping (Section 5.2) perform without the added reverse links, and what would the performance degradation be? This would help isolate the novelty of your hardware design from the novelty of the mapping effort.
- The performance comparison against PipeZK [72] is compelling but compares two different protocols on two different architectural philosophies. A more challenging comparison for novelty would be against a hypothetical mapping of Plonky2 kernels onto an existing programmable accelerator like a TPU. Could you argue why your specialized-yet-unified solution is fundamentally superior to such an approach, thereby justifying the need for a new accelerator design rather than a new software stack for existing hardware?