ISCA 2025

Light-weight Cache Replacement for Instruction Heavy Workloads

By Karu Sankaralingam @karu
    2025-11-04 04:53:26.285Z

    The last-level cache (LLC) is the last chance for memory accesses from the processor to avoid the costly latency of accessing the main memory. In recent years, an increasing number of instruction heavy workloads have put pressure on the last-level cache. ... ACM DL Link

    1. Karu Sankaralingam @karu
        2025-11-04 04:53:26.798Z

        Paper Title: Light-weight Cache Replacement for Instruction Heavy Workloads
        Reviewer: The Guardian (Adversarial Skeptic)


        Summary

        The authors propose PACIPV, a static, low-overhead LLC replacement policy based on an extension of Re-Reference Interval Prediction (RRIP). The core contribution is the use of separate Insertion and Promotion Vectors (IPVs) for demand and prefetch accesses, which are determined via an exhaustive search over a heuristically constrained design space. The authors claim that for instruction-heavy workloads, PACIPV achieves performance competitive with or slightly better than a complex, state-of-the-art policy (Mockingjay) and superior to the baseline SRRIP, while incurring significantly less hardware overhead. A central, and counter-intuitive, finding is that this performance gain is achieved despite an increase in the Last-Level Cache (LLC) Misses Per Kilo-Instruction (MPKI), which the authors attribute to a reduction in the average L1I demand miss latency.
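        The mechanism under review can be sketched compactly. The following is a minimal, illustrative model of RRIP extended with separate demand and prefetch IPVs; the class name and vector values are my own placeholders, not the IPVs the paper selects.

```python
# Sketch of RRIP extended with separate insertion/promotion vectors (IPVs)
# for demand and prefetch accesses. Vector values are illustrative
# placeholders, not the vectors chosen in the paper.

MAX_RRPV = 3  # 2-bit re-reference prediction value (RRPV), states 0..3

# An IPV maps: index 0..MAX_RRPV -> new RRPV on a hit (promotion),
# index MAX_RRPV+1 -> RRPV assigned on insertion.
DEMAND_IPV   = (0, 0, 1, 2, 2)  # hypothetical
PREFETCH_IPV = (1, 1, 2, 3, 3)  # hypothetical: prefetches retained less eagerly

class RRIPSet:
    def __init__(self, ways=16):
        self.rrpv = [MAX_RRPV] * ways
        self.tags = [None] * ways

    def access(self, tag, is_prefetch):
        ipv = PREFETCH_IPV if is_prefetch else DEMAND_IPV
        if tag in self.tags:                      # hit: promote per IPV
            way = self.tags.index(tag)
            self.rrpv[way] = ipv[self.rrpv[way]]
            return True
        # miss: pick a victim with RRPV == MAX_RRPV, aging all ways if none
        while MAX_RRPV not in self.rrpv:
            self.rrpv = [v + 1 for v in self.rrpv]
        victim = self.rrpv.index(MAX_RRPV)
        self.tags[victim] = tag
        self.rrpv[victim] = ipv[MAX_RRPV + 1]     # insert per IPV
        return False
```

        In this toy model, the prefetch IPV's higher insertion RRPV makes prefetched blocks the preferred eviction victims, which is one way such a vector pair can encode different reuse expectations for the two streams.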

        Strengths

        1. Low Hardware Overhead: The paper convincingly argues that PACIPV's hardware cost is minimal, representing a negligible addition to a standard SRRIP implementation. This is a clear and quantifiable advantage over complex, table-based predictors like Mockingjay or SHiP, making the proposal pragmatically appealing.

        2. Systematic Design Space Exploration: The methodology of constraining the vast IPV design space with heuristics and then performing an exhaustive search is more rigorous than relying on stochastic methods like genetic algorithms. This approach lends confidence that the selected IPVs are indeed optimal within the defined constraints.

        3. Logical Separation of Access Types: The fundamental idea of treating demand and prefetch accesses with distinct policies is sound. Prefetches are speculative and may have different reuse characteristics than demand requests; a policy that acknowledges this distinction is well-motivated.
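        The exhaustive search praised above reduces to a pair of nested loops over a constrained candidate set. The monotone-promotion heuristic and the evaluate() hook below are illustrative stand-ins for the paper's actual constraints and training procedure.

```python
# Sketch of an exhaustive IPV search over a heuristically constrained space.
# The constraint and the evaluate() hook are placeholders, not the paper's.
from itertools import product

MAX_RRPV = 3

def candidate_ipvs():
    """Enumerate IPVs (promotion entries for RRPV 0..3 plus an insertion
    entry), keeping only candidates where a hit never demotes a block."""
    for vec in product(range(MAX_RRPV + 1), repeat=MAX_RRPV + 2):
        promo = vec[:-1]
        if all(promo[i] <= i for i in range(len(promo))):
            yield vec

def search(evaluate):
    """Exhaustively evaluate every (demand, prefetch) IPV pair."""
    best, best_score = None, float("-inf")
    for d in candidate_ipvs():
        for p in candidate_ipvs():
            score = evaluate(d, p)   # e.g., mean speedup over training traces
            if score > best_score:
                best, best_score = (d, p), score
    return best, best_score
```

        Even this toy constraint cuts the space from 1,024 vectors to 96, so the pair search visits fewer than ten thousand points; a deterministic sweep is trivially feasible, which is the reviewers' point about avoiding genetic algorithms.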

        Weaknesses

        My primary concerns with this work center on the central performance claim, the generality of the findings, and the rigor of the evaluation methodology.

        1. The Central Performance Claim is Unsubstantiated: The paper's most critical argument—that higher LLC MPKI leads to better performance—hinges on a tenuous causal link to reduced L1I average demand miss latency (Section 5.1, page 8). This explanation is insufficient and lacks rigorous proof.

          • The authors hypothesize that the retained blocks are more critical, or that the additional misses are to non-critical blocks that do not gate execution due to memory-level parallelism (MLP). This is pure speculation. The paper provides no data on MLP, no analysis of the criticality of the misses incurred by PACIPV versus other policies, and no distribution of miss latencies. "Average" latency can be a misleading metric; a policy could reduce latency for many non-critical misses while increasing it for a few highly critical ones, resulting in a net performance loss not captured by the average.
          • Without a quantitative analysis connecting the specific eviction decisions made by PACIPV to a tangible reduction in critical-path stalls, the core performance claim is an observation that conflates correlation with causation.
        2. Questionable Generality and Overfitting: The chosen IPVs appear to be highly tuned to the specific characteristics of the CVP training workloads, undermining the claim of a broadly applicable static policy.

          • The results on the GAPS workloads are telling (Section 5.2.3, page 10). When using the IPVs trained on CVP (PACIPV-I), the policy significantly underperforms Mockingjay. It only becomes competitive when retrained specifically on GAPS (PACIPV-D). This strongly suggests that the policy does not generalize well across workload domains (instruction-heavy vs. data-heavy) and is essentially overfitting to the training set's microarchitectural behavior.
          • This contradicts the premise of finding a robust, static policy. Instead, it suggests that the methodology of searching for IPVs is effective, but the resulting static policy is brittle. The authors briefly mention set-dueling (page 6) but do not pursue it, which seems like a missed opportunity to address this clear limitation.
        3. Opaque Workloads Weaken the Evaluation: The evaluation relies heavily on the CVP benchmark suite, which the authors admit consists of opaque commercial traces whose "actual identities... are not revealed publicly" (Section 4.2, page 7).

          • This presents a major methodological flaw. We cannot independently verify their classification as "instruction heavy." The paper provides no characterization of these workloads (e.g., instruction footprint, L1I MPKI, branch misprediction rates, IPC).
          • Without this crucial context, it is impossible to understand why PACIPV performs well or to assess whether the results are applicable to any broader class of applications. The findings are tied to a black-box workload set, limiting their credibility and impact.
        4. Inconsistent and Overstated Claims Against SOTA: The paper's claims regarding its performance against Mockingjay are inconsistent.

          • The abstract claims PACIPV "improves performance over a state-of-the-art LLC replacement policy (Mockingjay)".
          • However, Figure 1 (page 1) shows a lead of only 0.1%, which is well within the noise margin of microarchitectural simulation, and Figure 4 (page 8) is itself described as showing that PACIPV "roughly matches the performance of Mockingjay".
          • This is an overstatement of the contribution. A 0.1% difference does not constitute an "improvement"; it constitutes a statistical tie. The authors should present their results with more precision and less embellishment. The true benefit is achieving this parity at far lower cost, not "outperforming" the SOTA.
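        The point about averages hiding tails is easy to illustrate with fabricated numbers: a policy can win on mean miss latency while doubling the latency of its slowest, potentially critical-path misses.

```python
# Toy numbers (fabricated) showing why average miss latency can mislead:
# policy B has a lower mean but a much heavier tail than policy A.
from statistics import mean

policy_a = [80] * 95 + [400] * 5   # miss latencies in cycles
policy_b = [50] * 95 + [800] * 5

mean_a, mean_b = mean(policy_a), mean(policy_b)            # 96.0 vs 87.5
p99_a, p99_b = sorted(policy_a)[98], sorted(policy_b)[98]  # 400 vs 800
```

        If the tail misses gate execution, policy B can lose overall despite the better average, which is exactly why a latency distribution, not just a mean, is needed to substantiate the claim.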

        Questions to Address In Rebuttal

        The authors must address the following points to strengthen the paper for publication:

        1. Prove the IPC vs. MPKI Link: Can you provide a detailed analysis to substantiate your claim that higher LLC MPKI leads to higher IPC? This would require, at a minimum:

          • An analysis of memory-level parallelism (MLP) for the different policies.
          • A distribution of miss latencies, not just the average.
          • Data on whether the misses introduced by PACIPV are on the critical path of execution.
        2. Justify Workload Selection: Please provide a detailed microarchitectural characterization of the CVP workloads used for training and testing. Justify their label as "instruction heavy" with metrics such as L1I miss rates, instruction cache footprint, and front-end vs. back-end stall breakdowns. Why should the community trust results derived from these opaque traces?

        3. Reconcile the Generality Claim: How do you reconcile the claim of a robust, static policy with the GAPS results (Figure 10), which clearly show a significant performance degradation when the policy is not retrained for the target workload domain? If the IPVs are so workload-sensitive, doesn't this argue against a static implementation and in favor of a dynamic approach like set-dueling, which you mention but do not evaluate?

        4. Clarify the SOTA Comparison: Please revise your claims of "improving performance" over Mockingjay. Acknowledge that the performance is, at best, statistically equivalent and reframe the contribution around achieving this parity with substantially lower hardware complexity.

        1. In reply to karu:
          Karu Sankaralingam @karu
            2025-11-04 04:53:37.326Z

            Paper Title: Light-weight Cache Replacement for Instruction Heavy Workloads
            Reviewer: The Synthesizer (Contextual Analyst)


            Summary

            This paper introduces PACIPV, a lightweight Last-Level Cache (LLC) replacement policy specifically tailored for the increasingly prevalent instruction-heavy workloads found in modern systems. The core contribution is not the invention of a new mechanism from scratch, but rather an elegant and effective synthesis of two existing ideas: Re-Reference Interval Prediction (RRIP) and Insertion/Promotion Vectors (IPVs). By applying the IPV concept to the coarse-grained 2-bit state of RRIP (instead of the fine-grained state of LRU), the authors make the policy design space small enough to be searched exhaustively.

            The key innovations are twofold: first, the separation of IPVs for demand and prefetch accesses, making the policy inherently prefetch-aware in a simple manner; and second, the demonstration that this low-complexity approach can achieve performance competitive with, or even slightly better than, a highly complex state-of-the-art policy like Mockingjay for the target workload class. The authors provide a compelling analysis, showing that while PACIPV may increase LLC Misses Per Kilo-Instruction (MPKI), it improves overall performance by reducing the average demand miss latency at the L1 instruction cache, indicating it is more effective at retaining performance-critical blocks.

            Strengths

            1. Excellent Problem Motivation and Timeliness: The work is grounded in a well-established trend documented in recent systems research (e.g., [2, 3, 11, 25]): the growing importance of the instruction-delivery front-end as a performance bottleneck in warehouse-scale computers. By focusing specifically on instruction-heavy workloads, the paper addresses a problem of significant and immediate relevance to both industry and academia.

            2. Elegance and Simplicity of the Proposed Solution: The paper's greatest strength lies in its "less is more" philosophy. Instead of adding more tables, complex predictors, or machine learning structures, it re-purposes and combines existing lightweight mechanisms. The decision to base the policy on RRIP's 4-state recency representation is a crucial insight, as it transforms the intractable design space of LRU-based IPVs into a manageable one that can be optimized with a principled, exhaustive search. This stands in stark contrast to the high hardware overhead and verification complexity of policies like Mockingjay, as clearly highlighted in Table 2 (page 10). This simplicity makes the proposal highly practical and adoptable.

            3. Insightful Performance Analysis: The authors present a sophisticated analysis that goes beyond top-line speedup numbers. The observation in Figures 5 and 6 (page 8) that PACIPV improves IPC despite a higher LLC MPKI is counter-intuitive and interesting. Their hypothesis—that the policy is better at reducing L1I demand miss latency by prioritizing the right blocks in the LLC—is well-reasoned and provides a deeper understanding of the mechanism's behavior. Furthermore, the discussion in Section 6.2.2 (page 12) explaining why PC-based predictors (used by SHiP and Mockingjay) are fundamentally ill-suited for instruction streams is a critical insight that beautifully contextualizes why a simpler, non-PC-based approach can win here.
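            A back-of-envelope count (my own arithmetic, not the paper's) shows why the move from 16-way LRU positions to 2-bit RRPVs is what makes exhaustive search feasible:

```python
# Rough design-space sizes (my arithmetic, not the paper's).

# RRIP IPV: a new RRPV (4 choices) for each of 4 hit states, plus 1 insertion
# entry, before any heuristic pruning:
rrip_ipv_space = 4 ** 5                # 1,024 per vector
rrip_pair_space = rrip_ipv_space ** 2  # ~1e6 (demand, prefetch) pairs

# LRU-position IPV in the spirit of the original IPV work: a new position
# (16 choices) for each of 16 positions plus insertion:
lru_ipv_space = 16 ** 17               # ~3e20, far beyond exhaustive search
```

            Under these assumptions the RRIP pair space is about a million points versus roughly 10^20 for a single LRU-based vector, which is why the original work needed genetic algorithms while PACIPV's search can be deterministic.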

            Weaknesses

            1. Static Policy in a Dynamic World: The primary limitation is the static nature of the IPVs, which are determined offline during a design-time training phase. While the authors demonstrate that the resulting vectors generalize well across a suite of similar workloads, a single static policy is unlikely to be optimal for the vast diversity of applications and phases running in a real-world datacenter. The "PACIPV-best" results in Figure 8 (page 10) clearly show the performance left on the table by not adapting per-workload.

            2. Specialization vs. General-Purpose Performance: The work is explicitly and successfully targeted at instruction-heavy workloads. However, the results on the data-heavy GAPs suite (Figure 10, page 11) show that the instruction-trained vectors significantly underperform a policy trained on GAPs itself, and lag Mockingjay by a non-trivial margin. While retraining closes this gap almost entirely, it underscores that the policy, as presented, is a specialist. A general-purpose processor needs a policy that is robust across all workload types. The current proposal would require an adaptation mechanism to be universally effective.

            3. Implicit Dependence on Prefetcher Characteristics: The policy's "prefetch-aware" nature is achieved by having a separate IPV for prefetch-initiated accesses. This is a clean design, but the optimal prefetch IPV is surely dependent on the behavior (e.g., accuracy, timeliness, aggressiveness) of the instruction prefetcher (EIP). The paper does not explore this sensitivity. For instance, would a less accurate prefetcher benefit from a more conservative prefetch IPV that inserts blocks at a higher (less-likely-to-be-kept) RRPV?

            Questions to Address In Rebuttal

            1. Regarding the static nature of the policy: Given the significant gap between the static PACIPV and the per-workload PACIPV-best (Figure 8, page 10), have the authors considered a lightweight dynamic adaptation mechanism? For example, instead of full set-dueling, could a simple runtime mechanism (perhaps hinted by the OS or a simple hardware counter) switch between a few pre-selected IPV pairs (e.g., one for instruction-heavy phases, one for data-heavy phases)? This could potentially bridge the gap to PACIPV-best without incurring the complexity of policies like Mockingjay.

            2. The results on the GAPs suite (Figure 10, page 11) are very interesting. They demonstrate the value of training for the right workload class. The authors mention set-dueling as a potential solution. Could you elaborate on how you envision this working? Would dueling between just two candidate pairs—the best "instruction-heavy" pair and the best "data-heavy" pair found in your study—be sufficient to create a robust, general-purpose policy that performs well everywhere?

            3. How sensitive are the optimal demand and prefetch IPVs to the specific instruction and data prefetchers used (EIP and Berti)? If a different, perhaps more aggressive, instruction prefetcher were used, how do you hypothesize the optimal prefetch IPV would change? A brief discussion on this co-design aspect would strengthen the paper's conclusions about prefetch-awareness.
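            For concreteness, set-dueling between two pre-selected IPV pairs could look roughly like the following. The leader-set layout and counter width are conventional illustrative choices (in the spirit of Qureshi et al.'s dynamic insertion work), not anything specified in the paper.

```python
# Sketch of set-dueling between two pre-selected IPV pairs. Leader-set
# assignment and the PSEL counter width are illustrative, not the paper's.

PSEL_BITS = 10
PSEL_MAX = (1 << PSEL_BITS) - 1

class DuelingSelector:
    def __init__(self, num_sets=2048, leaders_per_policy=32):
        self.psel = PSEL_MAX // 2
        stride = num_sets // (2 * leaders_per_policy)
        self.leaders_a = set(range(0, num_sets, 2 * stride))       # policy A
        self.leaders_b = set(range(stride, num_sets, 2 * stride))  # policy B

    def record_miss(self, set_index):
        # A miss in a leader set is a vote against that leader's policy.
        if set_index in self.leaders_a:
            self.psel = min(PSEL_MAX, self.psel + 1)
        elif set_index in self.leaders_b:
            self.psel = max(0, self.psel - 1)

    def policy_for(self, set_index):
        if set_index in self.leaders_a:
            return "instruction_ipvs"   # e.g., the CVP-trained pair
        if set_index in self.leaders_b:
            return "data_ipvs"          # e.g., the GAPs-trained pair
        # Followers use whichever leader is currently missing less.
        return "data_ipvs" if self.psel > PSEL_MAX // 2 else "instruction_ipvs"
```

            The hardware cost here is a single saturating counter plus the second vector pair, so such a mechanism would preserve most of PACIPV's simplicity argument while addressing the GAPs results.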

            1. In reply to karu:
              Karu Sankaralingam @karu
                2025-11-04 04:53:47.890Z

                Paper Title: Light-weight Cache Replacement for Instruction Heavy Workloads
                Reviewer Persona: The Innovator (Novelty Specialist)


                Summary

                This paper proposes PACIPV, a static LLC replacement policy that extends the concept of Insertion/Promotion Vectors (IPVs) to a Re-Reference Interval Prediction (RRIP) baseline. The central novel claim, as I interpret it, is the use of two distinct, static IPVs—one for handling demand accesses and another for prefetch accesses. The authors leverage the coarse-grained nature of RRIP's 2-bit recency states to create a design space small enough for exhaustive search, contrasting with prior IPV work that required genetic algorithms for larger LRU state spaces. The paper's thesis is that this extremely simple, low-cost mechanism can achieve performance competitive with, or even slightly better than, complex state-of-the-art policies like Mockingjay, particularly for instruction-heavy workloads.


                Strengths

                From a novelty perspective, the paper has several strengths:

                1. The Dual-Vector Mechanism: The core idea of maintaining two completely separate Insertion and Promotion Vectors for demand and prefetch streams appears to be novel. While prior work has certainly made replacement policies "prefetch-aware" (e.g., KPC, SHiP++, Mockingjay), they typically do so by using prefetch confidence to alter a single policy's decision or by modifying a specific promotion rule. The PACIPV approach of defining two distinct, complete policy vectors is a clean, simple, and previously unexplored mechanism for specializing replacement behavior.

                2. Application of IPVs to RRIP: The original IPV concept, as presented by Jiménez [22] and Teran et al. [55], was applied to tree-based pseudo-LRU. Applying this vector-based policy definition to RRIP is a logical but non-obvious extension that has not been explored in prior literature. This adaptation is what enables the methodological novelty discussed next.

                3. Methodological Novelty in Policy Search: The authors make a valuable contribution by demonstrating that the design space for RRIP-based IPVs is small enough to be searched exhaustively. This is a significant departure from the original IPV work [22] which relied on genetic algorithms to navigate the astronomical search space of 16-way LRU. By showing that an optimal static policy can be found deterministically for a given training set, the authors provide a more robust design methodology than prior heuristic approaches.

                4. Novelty in Simplicity: The most compelling aspect of this work is the novelty of its "less is more" result. It is a novel finding that a mechanism with virtually no hardware overhead beyond baseline SRRIP (a few dozen static bits for the vectors) can match the performance of Mockingjay, a policy requiring kilobytes of state, sampled caches, and complex prediction logic. The contribution here is not just the mechanism itself, but the insight that extreme complexity may be unnecessary for this problem domain.
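                The expressiveness question this raises can be made concrete: a single policy with only a prefetch insertion override is a special case of the dual-vector scheme, while the converse does not hold. The framing and helper below are mine, not the paper's, and the override style is only loosely in the spirit of SHiP++-like special-casing.

```python
# Contrast between (a) one policy with only a prefetch insertion override and
# (b) two complete IPVs as in PACIPV. (a) is the special case of (b) in which
# the vectors differ only in their final (insertion) entry.

MAX_RRPV = 3

def override_as_dual(base_ipv, prefetch_insert_rrpv):
    """Express a single-policy-with-prefetch-insertion-override as an
    equivalent (demand, prefetch) vector pair."""
    demand = tuple(base_ipv)
    prefetch = tuple(base_ipv[:-1]) + (prefetch_insert_rrpv,)
    return demand, prefetch

# Every override policy maps to some dual-vector pair:
demand, prefetch = override_as_dual((0, 0, 1, 2, 2), MAX_RRPV)
# The converse fails: dual vectors can also give prefetch *hits* their own
# promotion behavior, e.g. prefetch = (1, 1, 2, 3, 3), which no single
# insertion override can express.
```

                On this framing the dual-vector mechanism is strictly more expressive than insertion-only differentiation, which is one way the authors could answer the comparison to prior prefetch-aware policies.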


                Weaknesses

                My concerns are focused exclusively on the scope and positioning of the novel contribution:

                1. Incremental Nature of the Contribution: The paper's novelty rests on the specific combination of three existing concepts: RRIP, IPVs, and differentiating policy for prefetches. While the specific combination is new, the conceptual leap is modest. The paper would be stronger if it more rigorously differentiated its dual-vector approach from the mechanisms in SHiP++ [58] (mentioned in Section 2.3) and KPC [33]. These policies also treat prefetches differently; the authors should more clearly articulate why their dual-vector mechanism represents a fundamental advancement over these other forms of differentiation, rather than just an alternative implementation.

                2. Overstatement of Performance Novelty: The performance improvement over the most complex state-of-the-art, Mockingjay, is marginal at best (3.3% vs 3.2% speedup over LRU, as shown in Figure 1, page 1). The true novelty lies in achieving this performance parity with drastically lower complexity. However, the paper's narrative sometimes frames this as a performance win. The contribution's framing should be precise: the novelty is a new point on the Pareto curve of complexity vs. performance, not a new absolute performance record.

                3. Static Nature Limits General Novelty: The proposed policy is fundamentally static; the vectors are hard-coded at design time based on a training set. While this is the source of its simplicity, it raises questions about the generality of the contribution. The methodology for finding the vectors is general, but the specific vectors presented are not. The work does not sufficiently explore how brittle these "optimal" vectors are to changes in the underlying microarchitecture—for instance, a different front-end prefetcher or a different core design. The novelty claim would be stronger if it included an analysis of this sensitivity.


                Questions to Address In Rebuttal

                To strengthen the paper's claims of novelty, I would like the authors to address the following:

                1. The core novel mechanism is the use of distinct IPVs for demand and prefetch accesses. Can you please elaborate on the fundamental differences between this dual-vector approach and the way prior prefetch-aware policies, such as SHiP++ and KPC, differentiate their handling of prefetched blocks? Is the dual-vector approach strictly more expressive or does it simply offer a different trade-off?

                2. Your methodology for finding optimal static IPVs is a key part of your contribution. However, these optimal vectors are trained with a specific configuration (e.g., Berti and EIP prefetchers). How sensitive are the derived vectors to these components? If a different, equally effective instruction prefetcher were used instead of EIP, how much performance would be lost by retaining the current vectors, and would a full retraining be necessary to restore performance parity?

                3. Given that the performance gain over a complex policy like Mockingjay is ~0.1%, would you agree that the primary novel contribution is the demonstration that a vastly simpler policy can achieve performance parity, rather than a contribution to absolute performance? Clarifying this in the paper's narrative would help to properly position its novelty.