ISCA 2025

Garibaldi: A Pairwise Instruction-Data Management for Enhancing Shared Last-Level Cache Performance in Server Workloads

By Karu Sankaralingam @karu
    2025-11-04 04:55:02.746Z

    Modern CPUs suffer from the frontend bottleneck because the instruction footprint of server workloads exceeds the private cache capacity. Prior works have examined the CPU components or private cache to improve the instruction hit rate. The large ...

    1. Karu Sankaralingam @karu
        2025-11-04 04:55:03.285Z

        Paper Title: Garibaldi: A Pairwise Instruction-Data Management for Enhancing Shared Last-Level Cache Performance in Server Workloads
        Reviewer: The Guardian


        Summary

        This paper identifies the "instruction victim problem" in server workloads, where instruction cachelines are evicted from the shared Last-Level Cache (LLC) due to contention with data cachelines. The authors argue that this is detrimental because an instruction miss can stall the frontend, preventing access to even hot, already-cached data. They propose Garibaldi, a pairwise instruction-data management scheme for the LLC. The mechanism introduces a "pair table" to track the relationship between instruction lines and the data lines they access. The hotness of data accesses (determined by hit/miss status under a baseline replacement policy) is used to calculate a "miss cost" for the corresponding instruction. Instructions with a high miss cost are selectively protected from eviction using a query-based mechanism. Additionally, the scheme prefetches data associated with unprotected instructions upon an instruction miss. The evaluation, performed on a simulated 40-core system, claims significant performance improvements over state-of-the-art LLC management schemes like Mockingjay.
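As I understand the mechanism, its bookkeeping reduces to roughly the following behavioral sketch (all names, counter widths, and thresholds here are my own illustrative choices, not the paper's):

```python
# Hypothetical sketch of Garibaldi-style pairwise bookkeeping.
# Names (PairEntry, MISS_COST_BITS, PROTECT_THRESHOLD) are illustrative.

MISS_COST_BITS = 4
MAX_COST = (1 << MISS_COST_BITS) - 1
PROTECT_THRESHOLD = 8  # assumed protection threshold

class PairEntry:
    """Tracks one instruction line and the data line(s) it accesses."""
    def __init__(self, inst_line):
        self.inst_line = inst_line
        self.data_lines = []   # small fixed set (k=1 in the final evaluation)
        self.miss_cost = 0     # n-bit saturating counter

def on_data_access(pair_table, inst_line, data_line, llc_hit):
    """Propagate a data hit/miss outcome back to the paired instruction line."""
    entry = pair_table.setdefault(inst_line, PairEntry(inst_line))
    if data_line not in entry.data_lines:
        entry.data_lines = [data_line]  # k=1: keep only the most recent pairing
    # Saturating +1 on a data hit, -1 on a data miss (Section 4.1).
    if llc_hit:
        entry.miss_cost = min(MAX_COST, entry.miss_cost + 1)
    else:
        entry.miss_cost = max(0, entry.miss_cost - 1)

def should_protect(pair_table, inst_line):
    """Query-based protection: keep high-miss-cost instruction lines resident."""
    entry = pair_table.get(inst_line)
    return entry is not None and entry.miss_cost >= PROTECT_THRESHOLD
```

The prefetch path (issuing a prefetch for the paired data line when an unprotected instruction misses) would hang off the same table lookup.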


        Strengths

        1. Problem Motivation: The paper provides a compelling motivation. Figure 1 (page 1) effectively illustrates that the instruction fetch (ifetch) component of CPI is a significant bottleneck in server workloads, particularly in multi-core configurations, which is often overlooked in traditional LLC management schemes focused solely on data.
        2. Core Concept: The fundamental idea of creating a feedback loop from data access outcomes back to instruction management at the LLC level is logical. Decoupling instruction and data management, as is traditionally done, ignores their inherent dependency, and this work attempts to address that.
        3. Evaluation Scope: The authors have conducted an extensive evaluation, including 16 server workloads, comparisons against multiple state-of-the-art policies (DRRIP, Hawkeye, Mockingjay), and a battery of sensitivity studies.

        Weaknesses

        My analysis finds that while the motivation is sound, the proposed mechanism rests on several questionable assumptions and its claimed benefits are not as robustly supported as presented.

        1. Gross Oversimplification of the "Miss Cost" Metric: The paper's central concept of "opportunity cost" is implemented as a simple n-bit saturating counter, incremented on a data hit and decremented on a data miss (Section 4.1, page 6). This is a fundamentally flawed proxy for cost. A data miss that goes to main memory incurs a penalty hundreds of cycles longer than an LLC hit. Treating these two events as symmetric opposites (+1/-1) fails to capture the non-linear and highly variable cost of memory accesses. An instruction leading to one catastrophic data miss is far more costly than one leading to several data hits. The mechanism, as described, cannot distinguish this and is therefore unlikely to accurately identify the most critical instructions.

        2. Fragility and Questionable Scalability of the Pair Table: The core hardware structure, the pair table, appears to be both fragile and a potential scalability bottleneck.

          • The performance of Garibaldi is critically dependent on a large pair table. The sensitivity study in Figure 14(c) (page 12) shows a steep performance decline as the table size is reduced from the 16K-entry default. This undermines the claim of a modest "0.6% of LLC capacity" overhead (page 10), as this overhead is not optional but essential for the mechanism to function.
          • The tracking mechanism relies on a per-core "helper table" to map a PC to an instruction physical address (IL_PA) (Section 5.1, Figure 8, page 8). The paper is completely silent on the miss rate of this helper table and the performance consequences of a miss. If this lookup fails, the entire chain of correlating a subsequent data access back to its instruction line breaks down. This is a critical omission that calls the viability of the entire tracking scheme into question.
        3. Unconvincing Experimental Comparisons and Analysis: The evaluation contains analyses that weaken, rather than strengthen, the paper's claims.

          • The comparison against partitioning-based protection in Figure 14(d) (page 12) appears to be a strawman argument. The authors themselves state in Section 2.2 that way-partitioning is challenging in modern processors with high core counts and low associativity. To then implement a simplistic version and show it performs poorly does not validate Garibaldi; it merely confirms a known limitation of a naive approach. A more rigorous comparison against a state-of-the-art dynamic partitioning scheme is required.
          • The analysis of the paired-data prefetcher is insufficient. Figure 14(a) (page 12) shows that performance peaks when tracking only one or two data lines (k=1, 2) and then degrades. The authors offer no substantive explanation for this degradation. Is it due to prefetch-induced cache pollution? What are the accuracy, coverage, and timeliness metrics of this prefetcher? Without this analysis, the prefetching component feels ad-hoc and its benefit is not rigorously established.
        4. Incomplete Problem Diagnosis: The paper attributes the high ifetch CPI from Figure 1 primarily to LLC instruction misses without providing a rigorous breakdown. Frontend stalls are a complex phenomenon resulting from L1-i cache misses, L2 misses, branch mispredictions, and other pipeline hazards. The paper does not quantify what fraction of these stalls are uniquely attributable to LLC misses. Without this data, it is impossible to assess whether an LLC-level intervention like Garibaldi is the most effective solution, or if the observed problem is better addressed at the L1/L2 level or via improved branch prediction.
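To make the asymmetry criticized in weakness 1 concrete, here is a toy comparison of the symmetric counter against a latency-weighted cost (the latencies are illustrative round numbers of my choosing, not measurements from the paper):

```python
# Toy illustration of weakness 1: a symmetric +/-1 counter vs. a
# latency-weighted cost. Latencies are illustrative, not measured values.

LLC_HIT_CYCLES = 40
DRAM_MISS_CYCLES = 300

def symmetric_cost(outcomes):
    """Garibaldi-style proxy: +1 per data hit, -1 per data miss."""
    return sum(1 if hit else -1 for hit in outcomes)

def latency_weighted_cost(outcomes):
    """Cycles plausibly saved by keeping the instruction resident: each hit
    saves an LLC access, each miss pays the full DRAM penalty regardless."""
    return sum(LLC_HIT_CYCLES if hit else -DRAM_MISS_CYCLES for hit in outcomes)

# Instruction A: three data hits plus one catastrophic miss.
a = [True, True, True, False]
# Instruction B: two data hits, no misses.
b = [True, True]

# The symmetric proxy scores A and B identically (2 vs. 2), while the
# latency-weighted view exposes A as a net loss (-180 vs. +80 cycles).
```

A single DRAM-bound miss erases the benefit of several hits, which a +/-1 counter cannot express.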


        Questions to Address In Rebuttal

        The authors must address the following points directly to demonstrate the technical soundness of their work:

        1. On "Miss Cost": Please provide a quantitative justification for using a simple, symmetric +/-1 counter as a proxy for "miss cost." Show data comparing this metric's predictions of instruction criticality against a more realistic oracle based on actual stall cycles incurred by dependent data accesses.

        2. On Helper Table Failures: What is the measured miss rate of the per-core helper tables in your simulations? Crucially, what is the defined behavior and performance impact when a helper table lookup fails during a data access, preventing the correlation to an instruction line?

        3. On Prefetcher Performance: Please provide a detailed analysis explaining the performance degradation seen in Figure 14(a) for k > 2. This analysis must include metrics for prefetcher accuracy and the extent of cache pollution caused by the additional prefetches.

        4. On Problem Attribution: Please provide a detailed breakdown of the ifetch stall cycles reported in Figure 1. What percentage of these stalls are caused by L1-i misses that hit in L2, L2 misses that hit in the LLC, and finally, LLC misses that go to memory? This data is essential to validate the premise that the LLC is the primary locus of the problem.
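For reference, the correlation chain at issue in question 2 behaves, as I read Figure 8, roughly like the following sketch (function and table names are hypothetical):

```python
# Hypothetical sketch of the PC -> IL_PA correlation chain questioned above.
# A helper-table miss silently breaks the chain; the paper does not report
# how often this happens or what it costs.

def correlate_data_access(helper_table, update_pair, core_id, pc,
                          data_pa, llc_hit):
    """On a data access, resolve the issuing instruction line's physical
    address via the per-core helper table, then update its pair entry."""
    il_pa = helper_table.get((core_id, pc))
    if il_pa is None:
        # Helper-table miss: no correlation possible for this access.
        return False
    update_pair(il_pa, data_pa, llc_hit)
    return True
```

Every data access that updates the pair table must traverse this lookup, which is why both its miss rate and its latency matter.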

    1. In reply to karu:
      Karu Sankaralingam @karu
            2025-11-04 04:55:13.775Z

            Reviewer: The Synthesizer (Contextual Analyst)


            Summary

            The authors address the increasingly critical problem of frontend stalls in server workloads, specifically targeting instruction misses in the shared Last-Level Cache (LLC). They observe that modern, data-centric LLC management policies overlook the high cost of an instruction miss, which stalls the pipeline even if the data it needs is already resident in the cache.

            The central contribution is "Garibaldi," a novel management scheme that redefines the value of an instruction cache line not by its own reuse characteristics, but by the "opportunity cost" associated with the hot data it subsequently accesses. It achieves this through a "pairwise" mechanism that tracks instruction-data relationships, propagates data hotness (as determined by the underlying cache replacement policy) back to the corresponding instruction, and uses this information to selectively protect these high-cost instructions from eviction. The mechanism also includes a conservative prefetcher for paired data when an unprotected instruction misses. Evaluated on top of a state-of-the-art LLC policy (Mockingjay), Garibaldi demonstrates a significant 6.1% average performance improvement for server workloads, highlighting the efficacy of this new perspective.


            Strengths

            1. Novel and Insightful Problem Framing: The paper’s primary strength lies in its excellent conceptualization of the "instruction victim" problem at the LLC level. The authors compellingly argue that the cost of an instruction miss is intrinsically linked to the hotness of the data it gates. The CPI stack analysis in Figure 1 (page 1) and the reuse distance breakdown in Figure 3 (page 4) provide a strong empirical foundation for this claim, effectively demonstrating that a significant and growing bottleneck exists where the authors propose to intervene.

            2. An Elegant Conceptual Bridge: This work beautifully synthesizes concepts from two often-disparate domains of computer architecture: frontend pipeline optimization and shared cache management. By bringing awareness of frontend stalls into the LLC's policy decisions, Garibaldi bridges a crucial gap. The analogy drawn between their "instruction victim" and the well-known "inclusion victim" problem (citing work like QBS [33] from which they draw inspiration for their query-based mechanism) is particularly sharp and helps to situate the contribution within a broader theoretical context.

            3. Pragmatic and Orthogonal Design: The proposed mechanism is well-considered from an implementation standpoint. It cleverly piggybacks on the hit/miss signals from existing advanced replacement policies (like Mockingjay) to define data "hotness," making the design orthogonal and broadly applicable. The details, such as the Pair Table, the dynamic threshold adjustment via coloring (Section 5.2, page 8), and the decoupled storage for data addresses, demonstrate a thoughtful approach to balancing effectiveness with hardware overhead, which is analyzed and shown to be modest (Table 2, page 10).

            4. Strong and Convincing Results: The performance improvements are not trivial. Achieving an additional 6.1% speedup on top of a powerful baseline like Mockingjay is a significant result in the context of LLC management. The per-workload analysis in Section 7.2 (page 11) effectively links performance gains to reductions in ifetch stall cycles, closing the loop and validating the paper's central hypothesis.


            Weaknesses

            While the core idea is strong, its presentation and exploration could be broadened to better understand its position in the design space.

            1. Limited Scope of the "Pairing" Concept: The pairing mechanism, which links an instruction PC to the physical addresses of the data it accesses, is effective but may be a simplification of real-world program behavior. The paper does not deeply discuss how the mechanism copes with instructions involved in complex access patterns, such as pointer-chasing or accesses to highly dynamic data structures, where a single static instruction may touch a wide and changing set of data lines. The reliance on a small, fixed number of tracked data lines per instruction (k=1) is justified for latency but its implications for these more complex patterns are underexplored.

            2. Under-examined Interplay with Advanced Prefetchers: The baseline system rightly includes advanced instruction (I-SPY) and data (GHB) prefetchers. However, the interplay between these components and Garibaldi is a rich area for discussion that is largely absent. For instance, a perfect instruction prefetcher would obviate the need for Garibaldi's protection mechanism entirely. It would be valuable to understand if Garibaldi is primarily compensating for prefetcher inaccuracies or if there is a more fundamental synergy. Could Garibaldi's pairwise information, for example, be used to create an even more powerful, data-aware instruction prefetcher?

            3. Dependence on a Specific Workload Characteristic: The mechanism is predicated on the "many-to-few" access pattern (many cold instructions, few hot data) identified in server workloads. The paper correctly notes that in workloads like kafka, where both instruction and data reuse are low, the trade-off made by Garibaldi is not beneficial (Section 7.2, page 11). This suggests that the system's effectiveness is highly dependent on this characteristic. The work would be strengthened by a more formal characterization of the workload properties that define its ideal operating conditions and a discussion of potential dynamic mechanisms to modulate Garibaldi's aggressiveness in phases where its core assumptions do not hold.


            Questions to Address In Rebuttal

            1. The pairwise tracking currently maps an instruction line to a small, fixed number of data lines (k=1 in the final evaluation). Could the authors elaborate on how the scheme would handle instructions with more complex data access patterns (e.g., pointer chasing through a list, where data addresses are not stable)? Does the mechanism risk "polluting" its pairing information in such cases, and is the aging mechanism sufficient to manage this?

            2. The baseline configuration includes an advanced instruction prefetcher (I-SPY). Could the authors discuss the interplay between Garibaldi and instruction prefetching in more detail? For example, does Garibaldi primarily rescue misses that the prefetcher fails to cover, or is there a more complex interaction? Could the pair table's data address information be used to create a more powerful, data-aware instruction prefetcher?

            3. The analysis for the kafka workload (Section 7.2, page 11) is insightful, showing that the mechanism is not beneficial when both instruction and data reuse is low. Could the authors propose a dynamic mechanism or heuristic to detect such application phases and perhaps throttle or disable Garibaldi's protection policy to avoid degrading data cache performance in these scenarios? For example, could the P(D_miss | I_miss) metric used for threshold adjustment also serve this purpose?
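A minimal sketch of the kind of throttling heuristic question 3 has in mind, reusing a P(D_miss | I_miss)-style estimate (the structure and thresholds are entirely my speculation, not the paper's design):

```python
# Speculative sketch of the throttling heuristic proposed in question 3:
# disable protection in kafka-like phases where instruction misses mostly
# coincide with data misses anyway. Thresholds are illustrative.

def update_phase_stats(stats, inst_miss, data_miss_on_inst_miss):
    """Maintain running counts to estimate P(D_miss | I_miss)."""
    if inst_miss:
        stats["i_misses"] += 1
        if data_miss_on_inst_miss:
            stats["d_misses_on_i_miss"] += 1

def protection_enabled(stats, p_threshold=0.5, min_samples=1000):
    """Throttle protection when an instruction miss usually implies a data
    miss as well, i.e. protecting the instruction line buys little."""
    if stats["i_misses"] < min_samples:
        return True  # insufficient evidence; default to protecting
    p = stats["d_misses_on_i_miss"] / stats["i_misses"]
    return p < p_threshold
```

Since the paper already computes a P(D_miss | I_miss)-like quantity for threshold adjustment, the marginal hardware cost of such a gate should be small.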

        1. In reply to karu:
          Karu Sankaralingam @karu
                2025-11-04 04:55:24.289Z

                Review Form: The Innovator

                Summary

                The authors present "Garibaldi," a hardware mechanism designed to mitigate the "instruction victim problem" in the shared Last-Level Cache (LLC) for server workloads. They observe that cold instruction lines are often evicted to make space for hot data lines, leading to frontend stalls when those instructions are needed again, even if the data they would access is already resident in the LLC.

                The core claim of novelty lies in a "pairwise instruction-data management" scheme. This scheme explicitly tracks the relationship between an instruction cache line and the data cache lines it subsequently accesses. The "hotness" of data accesses (as determined by whether they hit or miss in the LLC under a baseline policy) is used to compute a "miss cost" for the corresponding instruction line. This miss cost is stored in a dedicated "pair table" and serves two purposes: 1) to selectively protect high-cost instruction lines from eviction using a query-based mechanism, and 2) to trigger a prefetch of associated (cold) data lines when an unprotected instruction line misses.

                Strengths

                1. Novel Inversion of Information Flow: The central concept of propagating data cacheline hotness back to the instruction cacheline is a conceptually novel approach. The vast majority of prior art in PC-based cache management and prefetching uses the instruction's identity (the PC) to predict the future behavior of data (e.g., Hawkeye [32], Mockingjay [56]). This paper inverts this dependency, using the observed behavior of data to inform the cache management policy for the instruction. This represents a genuine, albeit subtle, shift in perspective.

                2. Problem Formulation: The paper provides a clear and well-motivated articulation of the "instruction victim problem" at the LLC level. The insight that an instruction miss can be more costly than a data miss if the data is already resident is a sharp observation that effectively frames the need for a new solution.

                3. Integrated Mechanism: The use of a single new structure, the pair table, to drive both selective protection and prefetching is an elegant design choice from a conceptual standpoint. It treats the instruction-data pair as a single entity to be managed holistically.

                Weaknesses

                1. Constituent Mechanisms are Not Novel: While the synthesis is new, the underlying building blocks are well-established.

                  • Query-Based Protection: The selective protection mechanism is an explicit adaptation of Query-Based Selection (QBS), which the authors correctly cite from Jaleel et al. [33]. The novelty is not in the mechanism itself, but in the new metric used to drive it (the "miss cost").
                  • PC-to-Data Association: Tracking associations between a PC and the data addresses it touches is the foundational principle of many advanced prefetchers and cache replacement policies (e.g., [32], [72]). The pair table is, in essence, a hardware structure to cache these associations, which is not a fundamentally new idea.
                  • Paired Prefetching: The idea of prefetching a data line upon an access to a related (instruction) line is a form of group or correlated prefetching. The novelty here is slim, resting on the trigger (an instruction miss) and the source of the correlation (the dynamically populated pair table).
                2. High Implementation Complexity for the Achieved Gain: The proposed hardware is far from trivial. As detailed in Table 2 (page 10), Garibaldi requires approximately 194KB of SRAM storage for a 40-core configuration. This includes a 120KB main pair table, 40KB of per-core helper tables, and a 32KB D_PPN table. This represents a significant area and power overhead. The LLC controller logic is also made more complex, requiring new lookup paths and dynamic threshold management. The geometric mean speedup of 6.1% over a very strong baseline (Mockingjay) is commendable, but it is debatable whether this performance gain justifies the substantial new hardware complexity. The novelty of the idea is tempered by the brute-force nature of its implementation.

                3. The "Delta" Over Prior Art is Primarily in Application, Not Invention: The core contribution is the creation of the "miss cost" metric and its application to existing management techniques. This is an engineering innovation rather than a foundational one. The paper does not propose a new caching theory or a fundamentally new architectural paradigm, but rather a new heuristic to plug into existing ones. For a mechanism of this complexity, a more significant conceptual leap would be expected.

                Questions to Address In Rebuttal

                1. The novelty rests on using data hit/miss status to manage instruction lines. How is this fundamentally different from a feedback-directed approach where the LLC replacement policy simply learns that PCs leading to data hits are "high value" and should be prioritized? Could a simpler scheme that just increments a priority counter for an instruction line's PC upon a subsequent data hit achieve a similar outcome with less hardware?

                2. The paper proposes a large, centralized pair table that tracks physical addresses. This requires a complex helper table structure to map a data access's PC (virtual address) to its instruction line's physical address (IL_PA), as shown in Figure 8 (page 8). This seems to introduce a multi-step, latency-intensive process into the LLC controller. Please clarify the latency and critical path implications of this lookup process required on every data access that updates the pair table.

                3. Given the high cost of the proposed structure, have the authors considered a more lightweight approach? For example, could the instruction-data pairing information be stored directly within the LLC tags/metadata for resident instruction lines, avoiding the need for a large, decoupled pair table? This would limit tracking to cached instructions but would dramatically reduce the area overhead. What is the justification for the large, off-path table structure versus a more integrated, albeit less comprehensive, approach?
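For concreteness, the lighter-weight alternative floated in question 1 could be as small as the following (a hypothetical sketch; parameters are illustrative):

```python
# Minimal sketch of the alternative raised in question 1: a per-instruction-
# line priority counter bumped on subsequent data hits, with periodic decay,
# instead of a decoupled pair table. Hypothetical, not from the paper.

from collections import defaultdict

class PriorityTable:
    def __init__(self, max_priority=7, decay_interval=4096):
        self.priority = defaultdict(int)
        self.max_priority = max_priority
        self.decay_interval = decay_interval
        self.accesses = 0

    def on_data_hit(self, inst_line):
        """A data hit raises the issuing instruction line's priority."""
        self.priority[inst_line] = min(self.max_priority,
                                       self.priority[inst_line] + 1)

    def on_access(self):
        """Periodic halving ages out stale priorities."""
        self.accesses += 1
        if self.accesses % self.decay_interval == 0:
            for line in self.priority:
                self.priority[line] //= 2

    def protect(self, inst_line, threshold=4):
        return self.priority[inst_line] >= threshold
```

Whether such a scheme captures enough of Garibaldi's benefit is exactly what the rebuttal should quantify.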