ASPLOS 2025 V2

CRUSH: A Credit-Based Approach for Functional Unit Sharing in Dynamically Scheduled HLS

By Karu Sankaralingam @karu
    2025-11-02 17:05:56.422Z

    Dynamically scheduled high-level synthesis (HLS) automatically translates software code (e.g., C/C++) to dataflow circuits: networks of compute units that communicate via handshake signals. These signals schedule the circuit during runtime, allowing them ...

    • 3 replies
      Karu Sankaralingam @karu
        2025-11-02 17:05:56.976Z

        Title: CRUSH: A Credit-Based Approach for Functional Unit Sharing in Dynamically Scheduled HLS
        Reviewer: The Guardian


        Summary

        The authors present CRUSH, a resource sharing methodology for dynamically scheduled High-Level Synthesis (HLS). The paper identifies two primary sources of deadlock introduced by naive resource sharing: head-of-line blocking at the shared resource's output and dependency violations from fixed-order arbitration at the input. To address these, CRUSH proposes a credit-based flow control mechanism to prevent output blocking and a priority-based arbiter to resolve input contention without creating cyclic dependencies. The methodology is complemented by a set of heuristics to determine which operations to group for sharing and their access priorities, with the goal of preserving the circuit's Initiation Interval (II). The evaluation, performed using the Dynamatic HLS compiler, compares CRUSH against a baseline without sharing and a prior total-order-based sharing approach [33], demonstrating reductions in resource utilization and, most notably, in optimization runtime.

        Strengths

        1. Clear Problem Formulation: The paper does an excellent job of dissecting the sharing problem in dataflow circuits. The examples in Figure 1 (Page 3) clearly and correctly illustrate the mechanisms of both head-of-line blocking deadlock and fixed-order arbitration deadlock, providing a solid foundation for the proposed solutions.

        2. Sound Core Mechanisms: The two primary mechanisms proposed appear technically sound for the specific problems they target. The credit-based scheme (Section 4.1) is a well-established technique for preventing buffer overflow and backpressure-induced deadlock, and its application here to break the cycle with successor components is logical. Similarly, the use of priority-based arbitration (Section 4.2) over a fixed round-robin scheme is a standard and correct way to avoid deadlocks when processing inputs with inter-dependencies.

        3. Demonstrated Generality: The authors strengthen their claims of applicability by evaluating CRUSH not only on the baseline BB-centric HLS flow from [29] but also on a more recent "fast token" flow from [21], which omits BB structure (Section 6.5, Pages 11-12). The consistent resource savings reported in Table 3 provide compelling evidence that the CRUSH wrapper and its control logic are indeed agnostic to the higher-level organization of the dataflow graph.
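
        The credit-based scheme praised in Strength 2 can be condensed into a toy model. The sketch below is illustrative only, not the paper's implementation; the names (`SharedUnitWrapper`, `try_issue`) and the stand-in doubling computation are invented. The key invariant it demonstrates: an operation may inject a token into the shared unit only while it holds a credit, i.e., only when a slot in its own output buffer is already reserved, so a stalled consumer of one operation can never block others at the shared unit's output.

```python
# Hypothetical sketch of a credit-based sharing wrapper (names invented,
# not the paper's RTL). Each sharing operation gets a private output
# buffer; its credit count equals the free slots in that buffer.
from collections import deque

class SharedUnitWrapper:
    def __init__(self, num_ops, credits_per_op):
        self.credits = [credits_per_op] * num_ops          # credits held per op
        self.out_bufs = [deque() for _ in range(num_ops)]  # per-op output buffers

    def try_issue(self, op, token):
        """Grant access to the shared unit only if `op` holds a credit."""
        if self.credits[op] == 0:
            return False              # refused at the input: no head-of-line blocking
        self.credits[op] -= 1         # reserve an output slot up front
        result = token * 2            # stand-in for the shared unit's computation
        self.out_bufs[op].append(result)  # guaranteed to have space
        return True

    def consume(self, op):
        """Successor drains a result; the freed slot returns as a credit."""
        result = self.out_bufs[op].popleft()
        self.credits[op] += 1
        return result

w = SharedUnitWrapper(num_ops=2, credits_per_op=1)
assert w.try_issue(0, 21) is True    # op 0 wins the unit
assert w.try_issue(0, 5) is False    # op 0 is out of credits and waits, but...
assert w.try_issue(1, 10) is True    # ...op 1 is not blocked behind it
assert w.consume(0) == 42            # draining op 0's buffer returns its credit
assert w.try_issue(0, 5) is True
```

        Because refusal happens at the requester's input rather than at the shared unit's output, the unit's pipeline never holds a token whose destination is full, which is exactly the head-of-line-blocking cycle Figure 1b describes.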

        Weaknesses

        My primary concerns with this work lie in the rigor of its claims, the framing of its contribution relative to prior work, and a mismatch between its motivation and evaluation.

        1. Insufficient Justification for Credit Sizing: The paper's rule for credit allocation, N_CC,op = Φ_op + 1 (Equation 3, Page 8), is a cornerstone of the implementation, yet its sufficiency is justified by an informal, heuristic argument rather than a rigorous proof. The claim that "at most one token staying in the output buffer" in a steady state seems plausible but is not formally substantiated. The argument hinges on the relationship II ≥ |G|, which is enforced by the grouping heuristic (R2). This makes the performance guarantees of the credit mechanism dependent on a heuristic, which is not ideal. A formal analysis is required to prove that this sizing is always sufficient to prevent credit starvation from becoming a performance bottleneck.

        2. Uncharitable Comparison to Prior Work: The central argument for CRUSH's performance advantage over the prior "In-order" method [33] relies on the example in Figure 2 (Page 4). The figure portrays the total-order-based approach as being forced into a pessimistic schedule with II=4. However, it is not established that this is a fundamental limitation of the total-order paradigm. It is conceivable that a more intelligent scheduling algorithm within the total-order framework could find a better static ordering that preserves the II. The paper presents this prior work as a strawman, attacking what appears to be a naive implementation of it rather than addressing its theoretical peak performance.

        3. Mismatch Between Motivation and Evaluation Suite: The abstract and introduction motivate the need for dynamically scheduled circuits by highlighting their ability to handle "irregular control flow or unpredictable memory accesses." However, the evaluation in Section 6 is conducted almost exclusively on kernels from the PolyBench suite. These benchmarks are characterized by regular, statically analyzable loop structures and predictable memory access patterns, the very domain where statically scheduled HLS excels. They are not the workloads that would stress-test the dynamic capabilities CRUSH claims to preserve. While gsum and gsumif are better examples, the overwhelming reliance on PolyBench weakens the validation of the core motivation.

        4. Overstated Claims of Sharing Effectiveness: The results in Table 2 (Page 10) show that for 9 out of the 11 benchmarks, CRUSH achieves the exact same level of functional unit sharing (e.g., "1 fadd 1 fmul") as the In-order method. The ability of CRUSH to find more sharing opportunities is only demonstrated on gsum and gsumif. The most significant and consistent advantage shown is the 90% reduction in optimization time. Therefore, the primary contribution appears to be a much faster heuristic for resource sharing, not necessarily a universally more effective one in terms of raw sharing potential. The abstract's claim of achieving an "average reduction of 12% DSPs" over the prior strategy is misleading, as this average is skewed by two specific benchmarks and is not representative of the general case shown in the results.
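
        The credit-sizing concern in Weakness 1 can be probed empirically with a toy cycle-level model. The sketch below is an assumption-laden reconstruction, not the paper's formal setting: it takes Φ to be the number of an op's tokens simultaneously in flight in the shared pipeline (here ceil(latency/II) = 2) and assumes the consumer drains a buffered result one cycle after it lands. Under those assumptions, Φ + 1 credits sustain the target II while Φ credits do not; this illustrates, but does not prove, the sufficiency claim being questioned.

```python
# Toy cycle-level model of credit sizing for one op sharing a pipelined
# unit (hypothetical, not from the paper). The op wants to issue every
# `ii` cycles; a result lands in the output buffer `latency` cycles
# after issue; the consumer drains it one cycle later, returning the
# credit. Issuing requires holding a credit.
def simulate(credits, ii=2, latency=4, n_tokens=10):
    issue_times = []
    in_flight = []   # cycles at which in-pipeline tokens finish
    buffered = []    # cycles at which buffered results are drained
    t = 0
    next_wanted = 0  # earliest cycle the op wants its next issue
    while len(issue_times) < n_tokens:
        if t in buffered:            # consumer drains: credit returns
            buffered.remove(t)
            credits += 1
        if t in in_flight:           # result lands in the output buffer
            in_flight.remove(t)
            buffered.append(t + 1)   # drained one cycle later
        if t >= next_wanted and credits > 0:   # issue needs a credit
            credits -= 1
            in_flight.append(t + latency)
            issue_times.append(t)
            next_wanted = t + ii
        t += 1
    return issue_times

# With latency=4 and ii=2, two tokens are in flight at once (phi = 2).
on_time = simulate(credits=3)   # phi + 1 credits
starved = simulate(credits=2)   # phi credits only
assert on_time == list(range(0, 20, 2))  # every issue lands on its II beat
assert starved[-1] > on_time[-1]         # credit starvation delays later issues
```

        A worst-case analysis of the kind requested in Question 2 would have to show the equivalent property for every drain-delay and dependency pattern the grouping heuristics admit, not just this steady-state schedule.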

        Questions to Address In Rebuttal

        The authors should address the following points in their rebuttal:

        1. Regarding the comparison to the total-order-based approach [33]: Can you elaborate on why the II=4 scenario in Figure 2a is a fundamental limitation of that approach, rather than a limitation of a specific, simplistic ordering choice? Could a more advanced scheduler operating within the total-order constraint achieve the optimal II=2?

        2. Regarding credit allocation: Can you provide a more formal proof or a worst-case analysis demonstrating that N_CC,op = Φ_op + 1 credits are always sufficient to maintain the target II and that credit starvation will not occur under any condition allowed by your grouping heuristics?

        3. Regarding the evaluation suite: Please justify the choice of the highly regular PolyBench suite for evaluating a technique intended for dynamically scheduled circuits. How can we be confident that the conclusions drawn from these benchmarks generalize to the irregular applications that motivate the work?

        4. Regarding the cost function (Equation 2, Page 5): This function is critical for guiding the grouping heuristic (Algorithm 1). How were the costs for the wrapper components (C_wp) and shared units (C_T) determined? How sensitive are the final sharing decisions and resulting circuit quality to these cost parameters?

          In reply to karu:
          Karu Sankaralingam @karu
            2025-11-02 17:06:07.462Z

            Reviewer: The Synthesizer (Contextual Analyst)


            Summary

            This paper addresses the critical problem of functional unit sharing within the context of dynamically scheduled High-Level Synthesis (HLS). While resource sharing is a standard optimization in traditional static HLS, it introduces unique challenges in dynamic, dataflow circuits, namely the potential for performance degradation and correctness issues like deadlock. The authors identify the limitations of prior work, which relies on a restrictive total-token-order, forcing in-order access to shared resources and thus sacrificing performance opportunities.

            The core contribution is CRUSH, a novel sharing mechanism that elegantly solves this problem by borrowing a well-established concept from the field of interconnection networks: credit-based flow control. By creating a localized sharing wrapper that manages access via credits tied to output buffer availability, CRUSH provably avoids deadlock from head-of-line blocking. Furthermore, it employs a priority-based arbiter guided by scalable heuristics to permit out-of-order access to the shared unit, preserving the performance benefits of dynamic execution. The authors demonstrate that this approach significantly reduces resource usage (especially scarce DSPs) with negligible performance impact and a substantial reduction in optimization time compared to the state-of-the-art.

            Strengths

            1. Elegant Cross-Domain Technology Transfer: The paper's primary strength is its insightful application of a mature solution (credit-based flow control) from an adjacent domain (interconnection networks, NoCs) to a challenging problem in HLS. The authors correctly identify that the deadlock problem in sharing (Figure 1b, page 3) is a manifestation of head-of-line blocking, and that credits are the canonical solution. This is an excellent example of conceptual synthesis that moves the field forward by introducing a more robust and flexible mechanism than domain-specific, ad-hoc solutions.

            2. Clear Problem Formulation and Motivation: The paper does an outstanding job of motivating its contribution. The simple code example and the accompanying diagrams in Figure 1 and Figure 2 (pages 3 and 4) are exceptionally clear. Figure 1 effectively dissects the deadlock problem into two root causes (head-of-line blocking and fixed-order dependencies), while Figure 2 provides a compelling performance argument against the prior total-order approach. This clarity makes the value proposition of CRUSH immediately apparent.

            3. Strong Emphasis on Generality and Practicality: The authors demonstrate that CRUSH is not tied to a specific HLS methodology. The evaluation in Section 6.5 (page 11), where CRUSH is successfully applied to a completely different dataflow HLS strategy ("Fast token"), is a powerful testament to the modularity and generality of the proposed mechanism. This suggests that CRUSH could be adopted as a standard "plug-in" sharing solution in various dataflow HLS frameworks. The staggering 90% reduction in optimization runtime further underscores the practical advantage over prior, MILP-heavy techniques.

            4. Well-Designed and Scalable Heuristics: Rather than relying on computationally expensive exact methods, the paper proposes simple and scalable heuristics based on Strongly Connected Component (SCC) analysis for grouping units and setting access priorities (Section 5.2 and 5.3, page 7). This is a pragmatic choice that directly addresses the compile-time scalability issues that often plague advanced HLS optimizations.
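
            The SCC-based heuristics praised in point 4 above can be illustrated in a few lines. The sketch below is a generic reconstruction, not the paper's Algorithm 1 or 2: it computes SCCs with Tarjan's algorithm, orders them topologically, and gives operations in earlier SCCs higher access priority. The function names and the priority encoding (smaller number means higher priority) are assumptions, as is the arbitrary ordering within one SCC, which matches the relaxation the paper itself states.

```python
# Illustrative SCC-based priority assignment (hypothetical reconstruction,
# not the paper's algorithm). Graph: dict mapping node -> successor list.

def sccs_in_topo_order(graph):
    """Tarjan's algorithm. It emits SCCs in reverse topological order of
    the condensation DAG, so we reverse the result at the end."""
    index, low, on_stack = {}, {}, set()
    stack, result, counter = [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            result.append(comp)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return list(reversed(result))

def access_priorities(graph, sharing_group):
    """Operations in earlier SCCs get higher priority (smaller number);
    order within one SCC is left arbitrary."""
    prio, p = {}, 0
    for comp in sccs_in_topo_order(graph):
        for op in comp:
            if op in sharing_group:
                prio[op] = p
                p += 1
    return prio

# a <-> b form a cycle (one SCC) feeding c; a must outrank c.
g = {'a': ['b'], 'b': ['c', 'a'], 'c': []}
pr = access_priorities(g, {'a', 'c'})
assert pr['a'] < pr['c']
```

            Both passes are linear in the size of the dataflow graph, which is consistent with the compile-time scalability argument made above.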

            Weaknesses

            1. Under-explored Impact on Timing Closure: The paper acknowledges that the sharing wrapper adds combinational logic (muxes, arbiters) that can lengthen the critical path (CP) and thus degrade the maximum clock frequency (Section 6.4, page 11). While the results show this impact is manageable for the chosen benchmarks, the paper dismisses the concern by suggesting it can be "mitigated by enhancing Dynamatic's timing model." This feels incomplete. For designs targeting high frequencies, this wrapper logic could become the limiting factor, and a more in-depth discussion of the timing implications and potential timing-aware heuristics would strengthen the work's practical claims.

            2. Limited Scope of Evaluated Shared Resources: The evaluation exclusively focuses on sharing floating-point arithmetic units. While these are an excellent and highly relevant target due to their high cost in terms of DSP blocks, it leaves open the question of how CRUSH would apply to other important resource types. For example, sharing single-port Block RAMs or stateful custom functional units could introduce new challenges not covered by the current model, which implicitly assumes stateless, pipelined units.

            3. Potential Optimality Gap of Heuristics: The heuristics presented are sensible and scalable, but their limitations are not deeply explored. The paper rightly avoids complex analysis for the sake of speed, but it would be valuable to understand if there are specific classes of dataflow graphs (e.g., those with many parallel, independent operations contending for one resource) where the SCC-based priority assignment might lead to a sub-optimal schedule that unnecessarily increases the initiation interval (II).

            Questions to Address In Rebuttal

            1. Regarding the timing impact, can the authors elaborate on the complexity of the arbitration logic as the number of sharing operations (|G|) increases? Does the critical path scale linearly, or is there a more complex relationship? Could the proposed heuristics be made timing-aware, for instance, by prioritizing operations that are already on the circuit's critical path?

            2. Could the authors comment on the applicability of CRUSH to stateful or non-pipelined shared resources, such as a memory controller or a functional unit with multi-cycle latency but no internal pipeline? Would the credit-based mechanism still be sufficient, or would it require fundamental extensions to handle resource state and reservation?

            3. The core of the access priority heuristic (Algorithm 2, page 8) relies on the topological order of SCCs. In cases where multiple candidate operations reside within the same large SCC, the choice is arbitrary. Have you considered secondary heuristics for such cases to further optimize performance, or do you find in practice that this situation is rare or has minimal performance impact?

              In reply to karu:
              Karu Sankaralingam @karu
                2025-11-02 17:06:17.929Z

                Paper Title: CRUSH: A Credit-Based Approach for Functional Unit Sharing in Dynamically Scheduled HLS
                Reviewer: The Innovator (Novelty Specialist)


                Summary

                The paper addresses the problem of functional unit (FU) sharing in high-level synthesis (HLS) for dynamically scheduled, dataflow circuits. The central challenge is that naive sharing can introduce resource dependencies that lead to circuit deadlock, specifically through head-of-line blocking. The authors propose CRUSH, a sharing architecture that uses a credit-based flow control mechanism to provably avoid such deadlocks. This mechanism decouples the shared resource from its consumers, allowing operations to access the shared FU out-of-order, a capability lacking in the closest prior art. The core idea is to adapt a well-known deadlock avoidance technique from the network-on-chip and interconnect domain to the specific problem of HLS resource sharing.


                Strengths

                1. Novel Application of a Proven Concept: The primary novel contribution is the application of credit-based flow control to the problem of FU sharing in dataflow HLS. While credit-based flow control is a canonical technique for deadlock avoidance in packet-switched networks (e.g., Dally and Towles, "Principles and Practices of Interconnection Networks"), its adaptation to arbitrate access among HLS operations sharing a pipelined functional unit is, to my knowledge, new.

                2. Clear Conceptual Delta from Prior Art: The paper clearly articulates its improvement over the most relevant prior work [33]. The work in [33] enforces a total token order (e.g., based on basic block order) to prevent deadlock. This is a restrictive, global scheduling constraint. CRUSH replaces this with a localized, protocol-based solution. This conceptual shift from a global scheduling constraint to a local flow control protocol is the paper's most significant innovative step.

                3. Enabling Out-of-Order Access: A direct consequence of this novel approach is the ability to handle out-of-order requests to the shared resource (as illustrated in Figure 2b, page 4). This capability was explicitly prohibited by the total-order approach in [33] and is a key enabler for seizing more sharing opportunities without incurring performance penalties, which is a significant advancement for dynamically scheduled systems.


                Weaknesses

                1. Limited Foundational Novelty: The core mechanism—credit-based flow control—is not new. The authors themselves cite prior work from the interconnect domain ([17], [40]). The contribution is one of elegant adaptation and application, not the invention of a fundamentally new deadlock-avoidance algorithm. The paper's framing should be careful to claim novelty in the context and adaptation rather than the mechanism itself.

                2. Heuristics Lack Deep Novelty: The supporting heuristics for grouping FUs and determining access priority (Section 5.2 and 5.3, page 7) are based on standard and well-understood graph analysis techniques (Strongly Connected Components, topological ordering). While practical and effective, these algorithms are not novel contributions in their own right. The novelty of the work rests almost entirely on the architectural wrapper for sharing, not the decision-making process for creating the sharing groups.

                3. Simple Credit Allocation Model: The rule for credit allocation (N_CC,op = Φ_op + 1, described in Section 5.4, page 8) is presented as a simple heuristic. While intuitive, it lacks a formal proof of optimality or sufficiency for all cases beyond the steady-state argument provided. A more rigorous analysis would strengthen the claim that this simple rule is sufficient to maintain the target initiation interval (II) without over-provisioning resources.


                Questions to Address In Rebuttal

                1. The core mechanism is adapted from the interconnect domain. Beyond the obvious mapping of FUs to routers and operations to packets, what were the non-trivial, non-obvious adaptations required to make credit-based flow control work effectively and efficiently in the context of HLS-generated dataflow circuits? For instance, how does the handling of pipelined units and loop-carried dependencies in HLS differ from credit management in a typical NoC router?

                2. The access priority heuristic (Algorithm 2) defaults to an arbitrary ordering for operations within the same Strongly Connected Component (SCC), stating "any priority between op_i, op_j is accepted". Given that a poor choice of priority can degrade performance (as shown in Figure 4c and 4f, page 6), this seems like a potential weakness. Is this relaxation a fundamental limitation, or was a more sophisticated intra-SCC prioritization scheme considered and discarded? Does this not re-introduce a need for performance-driven scheduling, which the protocol was meant to avoid?

                3. Regarding the credit allocation rule (Section 5.4), could you provide a more formal argument or proof that Φop + 1 credits are both necessary and sufficient to prevent performance degradation (i.e., maintain the II) under all conditions, not just in the specific steady-state scenario described? Is it possible for transient behavior or complex dependencies to require more than one "extra" credit?