ASPLOS 2025 V2

RASSM: Residue-based Acceleration of Single Sparse Matrix Computation via Adaptive Tiling

By Karu Sankaralingam @karu
    2025-11-02 17:25:17.946Z

    Single-Sparse-Matrix Kernels (SSMKs) such as SpMM, SDDMM, SpMV, and SpTS
    form the backbone of applications such as data analytics, graph
    processing, finite-element analysis, machine learning (including GNNs
    and LLMs), etc. This paper introduces Residue-... ACM DL Link

    • 3 replies
    1. Karu Sankaralingam @karu
        2025-11-02 17:25:18.492Z

        Review Form

        Reviewer Persona: The Guardian (Adversarial Skeptic)


        Summary

        The paper presents RASSM, a static, input-dependent tiling framework for Single-Sparse-Matrix Kernels (SSMKs). The core mechanism is the "residue matrix," a coarse-grained summary of the sparse matrix's non-zero structure, which is augmented with bit-vectors to represent active rows and columns. This data structure is used to guide a greedy tile generation algorithm that considers both "spatial" (data footprint) and "temporal" (data reuse over time) cache volume. The authors claim substantial performance improvements over several state-of-the-art baselines on modern multi-core CPUs. However, a closer inspection reveals significant methodological concerns regarding the practicality of the pre-processing overhead, the fragility of the proposed heuristics, and the generality of the results.
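To make the mechanism under review concrete, the residue construction described above can be sketched as a single pass over the non-zeros: per-region NNZ counts plus active-row/column membership. This is an illustrative reconstruction under assumed conventions, not the authors' implementation; all names are hypothetical.

```python
def build_residue(rows, cols, shape, Ri, Rj):
    """Coarse-grained summary of a sparse pattern: for each Ri x Rj region,
    count the non-zeros and record which rows/columns are active (a real
    implementation would use bit-vectors; sets keep the sketch simple)."""
    n_rows, n_cols = shape
    gi = (n_rows + Ri - 1) // Ri   # residue grid dimensions
    gj = (n_cols + Rj - 1) // Rj
    nnz = [[0] * gj for _ in range(gi)]
    act_rows = [[set() for _ in range(gj)] for _ in range(gi)]
    act_cols = [[set() for _ in range(gj)] for _ in range(gi)]
    for r, c in zip(rows, cols):
        bi, bj = r // Ri, c // Rj          # region containing this non-zero
        nnz[bi][bj] += 1
        act_rows[bi][bj].add(r % Ri)       # row active within the region
        act_cols[bi][bj].add(c % Rj)       # column active within the region
    return nnz, act_rows, act_cols
```

One pass over the non-zeros suffices, which is why the residue is cheap relative to repeatedly traversing the full matrix during tile search.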


        Strengths

        1. The paper tackles a well-known and important problem in high-performance computing: the performance of memory-bound sparse matrix computations.
        2. The evaluation is conducted on two distinct modern server architectures (AMD EPYC, Intel Xeon) against a comprehensive set of strong baselines (MKL, ASpT, J-Stream, CSF variants). This provides a solid foundation for empirical comparison.
        3. The concept of separating spatial and temporal volume analysis (Section 3) is a reasonable line of inquiry for sparse tiling, attempting to capture more nuanced reuse patterns than a simple occupancy metric.

        Weaknesses

        My primary role is to ensure that only technically sound and rigorously validated work is accepted. I have identified several major flaws that call the contributions of this paper into question.

        1. Prohibitive and Poorly Justified Overhead: The most significant weakness is the pre-processing overhead, which appears to make the technique impractical for many real-world scenarios. Table 6 (page 12) reveals that constructing the temporal residue alone requires 10.86x the execution time of the baseline CSR SpMM kernel. The total geomean time for tile generation with temporal analysis is 13.66x the kernel runtime. The authors state this is a one-time cost, but they fail to provide a convincing analysis of the break-even point. This fundamentally undermines the practicality of the approach for anything other than matrices that will be re-used an extensive number of times, a scenario the authors do not adequately define or evaluate. For dynamic applications or single-shot computations, this overhead is disqualifying.
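The missing break-even analysis is simple to state: if preprocessing costs k baseline kernel runs and the tiled kernel runs s times faster, the one-time cost amortizes only after n > k / (1 - 1/s) executions. The sketch below uses the 13.66x figure cited from Table 6; the speedup values in the comment are illustrative, not the paper's.

```python
import math

def break_even_runs(overhead_in_kernel_runs, speedup):
    """Kernel re-executions needed before one-time tiling cost pays off:
    overhead + n/speedup < n  =>  n > overhead / (1 - 1/speedup)."""
    if speedup <= 1.0:
        return math.inf  # never amortizes without a per-run win
    return math.ceil(overhead_in_kernel_runs / (1.0 - 1.0 / speedup))

# With the 13.66x geomean preprocessing cost:
#   speedup 1.3x -> 60 runs to break even; speedup 1.1x -> 151 runs
```

This is the quantitative characterization the authors should provide per application scenario rather than leaving "one-time cost" unqualified.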

        2. Sub-optimal Greedy Algorithm and Brittle Heuristics: The tile generation pipeline (Section 5, page 7) is based on a greedy algorithm ("a greedy, best spatial fit manner"). Greedy approaches are susceptible to finding local optima and offer no guarantees of solution quality. The paper lacks any analysis of how far from optimal RASSM's generated tiles might be. Furthermore, the introduction of special-case heuristics for "Common Pattern and Band Detection" (Section 5.3.1, page 8) suggests the core algorithm is not robust enough to handle common sparse matrix structures on its own. This ad-hoc approach, which requires its own manually-tuned window parameter W, raises serious concerns about the framework's generality and robustness.
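The style of greedy, cache-budgeted tile growth being critiqued can be sketched as follows. The footprint model (NNZ plus distinct active rows and columns, in words) and every name here are assumptions for illustration, not the paper's exact volume analysis; the point is that each cut is made locally, with no lookahead, which is why local optima are a concern.

```python
def greedy_tiles(residue_nnz, active_rows, active_cols, cache_words):
    """Greedy 'best spatial fit' sketch over a residue grid: walk each panel
    (residue row) left to right, growing a tile while its estimated cache
    footprint fits the budget, and cut when the next region would overflow."""
    tiles = []
    for bi in range(len(residue_nnz)):
        start, nnz, rows, cols = 0, 0, set(), set()
        for bj in range(len(residue_nnz[bi])):
            n = residue_nnz[bi][bj]
            r = rows | active_rows[bi][bj]     # footprint if region is added
            c = cols | active_cols[bi][bj]
            if nnz + n + len(r) + len(c) > cache_words and bj > start:
                tiles.append((bi, start, bj))  # close tile [start, bj)
                start, nnz, rows, cols = bj, 0, set(), set()
            nnz += n
            rows |= active_rows[bi][bj]
            cols |= active_cols[bi][bj]
        tiles.append((bi, start, len(residue_nnz[bi])))
    return tiles
```

Note that a single dense region early in a panel can force a cut that a lookahead or dynamic-programming formulation would avoid, which is the optimality gap the review asks the authors to quantify.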

        3. Hidden Hyperparameter Tuning: The performance of RASSM appears highly sensitive to the granularity of the residue matrix (Rᵢ, Rⱼ), as shown in Figure 10 (page 13). Performance varies from a peak of 72.0 GFLOPS down to 56.8 GFLOPS depending on this choice—a 27% difference. The authors report the best performance at Rᵢ=64, Rⱼ=512, but there is no discussion of how these "optimal" parameters were selected or how a user is meant to determine them for a new matrix or architecture. This introduces a manual, meta-tuning step that directly contradicts the paper's claims of being an "automatic" tiling system.

        4. Inconsistent Performance and Insufficient Analysis of Slowdowns: The performance gains are inconsistent across the dataset. Figure 6 (page 10) shows numerous cases where RASSM underperforms the baselines (e.g., speedup < 1.0 against ASpT and MKL for a non-trivial fraction of the dataset). The authors' explanation—that this is due to matrices ideal for column reordering—is plausible but also highlights a key limitation: RASSM does not integrate with or outperform other established optimization classes for certain matrix types. This makes it a point solution, not a general one. A more thorough analysis of the specific structural properties of matrices that cause slowdowns is required.

        5. Flawed Assumption Regarding Cache Conflicts: The paper explicitly states (Section 5.2, page 8) that it "assumes the presence of cache interleaving schemes to mitigate set conflict misses." This is a significant oversimplification. The authors themselves acknowledge that diminished gains on the Intel platform are due to cache conflicts (Section 7.6, page 13). A robust tiling framework cannot simply assume away one of the primary challenges of irregular memory access. Relying on this assumption and then observing its failure in practice weakens the entire methodological foundation.


        Questions to Address In Rebuttal

        The authors must provide clear and convincing answers to the following questions.

        1. Overhead Justification: Please provide a quantitative characterization of the application scenarios (e.g., number of required re-executions of the kernel) where the 13.66x geomean overhead for temporal analysis (derived from Table 6) becomes amortized and advantageous. How does this compare to the overhead of competing techniques like MKL's inspector-executor model, which also has a "hint" phase?

        2. Residue Granularity: How should a user select the residue granularity parameters Rᵢ and Rⱼ? Please provide a detailed sensitivity analysis or a robust heuristic for selecting these values automatically based on matrix and machine properties. Without this, the "auto-tiling" claim is not credible.

        3. Comparison to Optimal: Given the greedy nature of the tile builder, can you provide any analysis, even on a small subset of matrices, comparing the generated tiling to a theoretical optimum or a tiling found via a more expensive search (e.g., dynamic programming) to quantify the quality of the greedy decisions?

        4. Band Detection Fragility: The band detection heuristic (Section 5.3.1) depends on a window parameter W. How sensitive is the performance to the choice of W? What happens when matrices have slightly irregular or broken bands that the current heuristic might misclassify? How does this not constitute another manual tuning parameter?

        5. Cache Conflict Assumption: Given that your results on the Intel platform demonstrate the failure of your assumption about cache conflicts, why should this methodology be considered robust? Please justify why a tiling strategy that is blind to cache mapping functions is a sound approach. What modifications would be necessary to make RASSM conflict-aware, and what would the performance impact be?

        1. In reply to karu:
           Karu Sankaralingam @karu
            2025-11-02 17:25:29.242Z

            Reviewer Persona: The Synthesizer


            Review Form: RASSM: Residue-based Acceleration of Single Sparse Matrix Computation via Adaptive Tiling

            Summary

            This paper introduces RASSM, a novel technique for generating static, input-adaptive 2D tiles for single sparse matrix kernels (SSMKs) like SpMM and SDDMM. The central contribution is the "residue matrix," a low-overhead, coarse-grained data structure that compactly represents the non-zero distribution of the input matrix. The residue matrix, augmented with bit-vectors for active rows and columns, serves as an efficient proxy for the full matrix. By analyzing these residues, RASSM performs spatial and temporal volume analysis to intelligently construct variable-sized tiles that maximize cache occupancy and data reuse. The approach is entirely software-based, operates statically (ahead of time), and is shown to deliver significant performance improvements over state-of-the-art industrial libraries (Intel MKL), academic auto-tuners (J-Stream), and advanced tiling schemes (ASpT, CSF-4) on modern multi-core CPUs.

            Strengths

            1. The Core Idea's Elegance and Practicality: The concept of the residue matrix is the paper's standout strength. It strikes an excellent balance between information fidelity and computational overhead. In the vast and complex landscape of sparse computations, finding a "signature" that is both cheap to compute and rich enough to guide optimization is a significant challenge. The residue matrix, capturing both local density (NNZ count) and structural distribution (active row/column bit-vectors), is a very clever and effective solution to this problem. It allows the tile generation algorithm to explore the design space without repeatedly traversing the massive original matrix.

            2. Excellent Positioning in the Design Space: The authors have identified and filled a crucial gap in the spectrum of sparse optimization techniques. RASSM exists in a "sweet spot":

              • It is more intelligent than rigid, uniform tiling schemes (like CSF-US/UO).
              • It avoids the extremely high training and runtime compilation overhead of complex machine learning-based approaches like WACO [39].
              • It is a static, software-only approach, making it more broadly applicable than hardware-dependent dynamic tiling methods like DRT [25].
                This makes RASSM a highly practical contribution for today's commodity hardware.
            3. Thorough and Convincing Evaluation: The experimental methodology is robust. The authors compare RASSM against a comprehensive suite of strong, relevant baselines on two different modern server architectures. The analysis goes beyond simple speedup numbers, delving into cache occupancy (Figure 8, page 12) and the sensitivity of the residue granularity (Figure 10, page 13). This provides valuable insight into why the method works, not just that it does. The performance gains over highly-optimized libraries like MKL are non-trivial and demonstrate real-world value.

            4. Generality and Potential for Integration: The RASSM framework is presented as being largely format-agnostic, as long as the format supports 2D tiling. The authors demonstrate this with both their custom ATM format and the more standard CSF-4. This flexibility is key for adoption. The concept is ripe for integration into tensor compilers like TACO, which could use residues as a basis for a new, powerful tiling optimization pass, automating a difficult task for domain scientists and programmers.

            Weaknesses

            While the work is strong, its relationship to adjacent areas could be clarified and explored further to strengthen its impact.

            1. Relationship with Matrix Reordering is Underexplored: The paper positions RASSM as an alternative to techniques like ASpT [12], which reorder matrix columns to improve locality. However, these two classes of optimization are largely orthogonal. Reordering physically restructures the matrix to create dense regions, while RASSM finds the best way to tile the given structure. A potentially more powerful approach would be to combine them: first, reorder the matrix to create better locality, and then apply RASSM to adaptively tile the newly structured matrix. A discussion of this potential synergy is a missed opportunity.

            2. Scalability of the Pre-processing Step: The analysis of overheads in Table 6 (page 12) is good, showing that the cost of residue generation and tiling is a small multiple of a single kernel execution for the tested matrices. However, as sparse problems in GNNs and scientific computing grow to billions of non-zeros, the scalability of this pre-processing step becomes a critical concern. A brief discussion on the asymptotic complexity of the residue and tile generation phases would strengthen the paper's claims of practicality for very large-scale problems.

            3. Clarity of "Temporal Volume Analysis": The paper distinguishes between "spatial" and "temporal" volume analysis. The temporal analysis, using the maximum overlapping intervals algorithm to find the peak number of concurrently active rows/columns, is an interesting refinement. However, this terminology is somewhat idiosyncratic to this paper and could be defined more clearly. The true impact of this more complex temporal analysis versus the simpler spatial analysis could also be isolated more directly to justify its added complexity and overhead.
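The maximum-overlapping-intervals computation referenced here has a standard sweep-line form, sketched below under the interpretation that each interval marks the first and last step at which a row or column is live; this is the textbook algorithm, not the paper's code.

```python
def max_overlap(intervals):
    """Peak number of concurrently live intervals via a sweep line.
    Each interval is (start, end) with end inclusive."""
    events = []
    for s, e in intervals:
        events.append((s, 1))        # interval opens
        events.append((e + 1, -1))   # closes just after its end
    events.sort()                    # ties: closes (-1) processed first
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak

# max_overlap([(0, 4), (2, 6), (5, 9)]) -> 2
```

Under this reading, "temporal volume" would be the peak live footprint rather than the union footprint the spatial analysis measures, which is the distinction a crisper definition in the paper should pin down.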

            Questions to Address In Rebuttal

            1. Could the authors comment on the potential synergy of using RASSM in conjunction with, rather than as an alternative to, matrix reordering techniques like ASpT? Does applying RASSM to a pre-reordered matrix yield further benefits?

            2. The overheads presented in Table 6 seem manageable for the evaluated matrix sizes. Could the authors discuss the scalability and asymptotic complexity of the residue generation and tile-building process for matrices that are an order of magnitude larger?

            3. The concept of "temporal volume analysis" is a key refinement. Could the authors elaborate on the typical performance gain attributable specifically to this temporal analysis over the simpler spatial approach, and how this gain trades off against its higher computational overhead during the tiling phase?

            4. Looking forward, the residue matrix seems like a powerful and generalizable concept. Have the authors considered its applicability to other important sparse computations with more complex data reuse patterns, such as sparse-sparse matrix multiplication (SpGEMM)?

            1. In reply to karu:
               Karu Sankaralingam @karu
                2025-11-02 17:25:39.793Z

                Review Form: The Innovator (Novelty Specialist)


                Summary

                The paper proposes RASSM, a static, input-dependent tiling framework for Single Sparse Matrix Kernels (SSMKs). Its primary claimed novelty lies in the introduction of a "residue matrix," a coarse-grained data structure that summarizes the non-zero distribution of a sparse matrix. This residue matrix is augmented with sparsity-pattern bit-vectors and temporal range information. The authors claim this structure enables an intelligent, two-phase (panel and tile) greedy generation of variable-sized 2D tiles that optimize for cache volume, leading to performance improvements over existing techniques.

                Strengths (Novelty-focused)

                The central contribution, the residue matrix, represents a novel evolution of prior "signature"-based approaches for sparse computation analysis. My analysis of the prior art confirms the following points of novelty:

                1. Richer Structural Representation: The proposed residue matrix is a 2D data structure that captures non-zero counts, active rows, and active columns within coarse-grained regions (Section 4, page 5). This is a qualitative improvement over the closest prior art in software-based auto-tiling, J-Stream [20], which employs a 1D signature ("largest vertical zero interval"). The RASSM representation is strictly more expressive, enabling a 2D-aware tiling strategy that J-Stream's 1D signature cannot inform.

                2. Novel Application of Summarization: While the idea of summarizing matrix structure exists (e.g., P-OPT [4] uses "residue-like metadata"), the application and formulation here are new. P-OPT uses its summary for dynamic cache replacement decisions in graph analytics. RASSM uses its specifically formulated residue matrix for static, ahead-of-time tile generation for matrix kernels. The structure itself, combining NNZ counts with active-row/column bit-vectors for volumetric analysis, is unique to this purpose.

                3. Distinction from Sampling/ML: The residue matrix is distinct from down-sampling techniques often used as input for machine learning models (e.g., WACO [39]). The residue is a deterministic, partially-lossy aggregation, not a stochastic sample. It preserves the total non-zero count, providing a different set of guarantees and information than a sampled representation would. This deterministic, model-driven approach to generating adaptive tiles is a novel alternative to purely learning-based methods.
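The distinction drawn here, deterministic aggregation versus stochastic sampling, can be illustrated directly: a residue-style count preserves the total NNZ exactly, while a down-sample only estimates it. Both functions below are hypothetical sketches, not code from either paper.

```python
import random

def residue_nnz_counts(rows, cols, Ri, Rj, gi, gj):
    """Deterministic aggregation: every non-zero is counted exactly once,
    so summing the residue recovers the total NNZ exactly."""
    counts = [[0] * gj for _ in range(gi)]
    for r, c in zip(rows, cols):
        counts[r // Ri][c // Rj] += 1
    return counts

def sampled_nnz_estimate(rows, rate, seed=0):
    """Stochastic down-sample: keeps each non-zero with probability `rate`,
    so the total NNZ is only estimated, with sampling variance."""
    rng = random.Random(seed)
    kept = sum(1 for _ in rows if rng.random() < rate)
    return kept / rate
```

The exactness guarantee is what lets a volume analysis reason about cache footprints without a confidence interval, which a sampled signature cannot offer.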

                Weaknesses (Novelty-focused)

                While the core data structure is novel, some of the surrounding methodology builds upon well-established concepts, which slightly dilutes the overall novelty.

                1. Composition of Existing Concepts: The RASSM pipeline can be viewed as an elaborate composition of existing ideas applied to a new data structure. The tile generation itself is a greedy algorithm, which is a standard heuristic. The temporal analysis (Section 3.2, page 5) leverages the well-known maximum overlapping intervals algorithm [19]; the novelty is its application to the residue's temporal metadata, not the algorithm itself.

                2. Incremental Advance: The contribution is an evolutionary, not revolutionary, step. It advances the state of "signature-based" analysis from 1D (J-Stream) to 2D. While this is a significant and valuable engineering contribution that demonstrably improves performance, it follows a logical and foreseeable path of improvement rather than introducing a completely new paradigm for sparse optimization.

                3. Complexity vs. Justification: The proposed method introduces considerable pre-processing complexity. The generation of the residue matrix, especially the temporal version, has a non-trivial overhead (Table 6, page 12, shows temporal residue construction at 10.86x the kernel time). The performance gains, while consistent, are in the 1.1x-1.4x geomean range. For a one-shot execution, this trade-off is unfavorable. The novelty would be stronger if the benefits were more substantial relative to the complexity, or if the pre-processing cost were significantly lower. The authors correctly position this for repeated computations, but the cost-benefit analysis is a critical aspect of evaluating the novelty's practical significance.

                Questions to Address In Rebuttal

                1. The concept of a coarse-grained summary of a matrix is not entirely new. Please further elaborate on the specific delta between the "residue matrix" and the "summary matrices" used in P-OPT [4]. Beyond the difference in application (tiling vs. cache replacement), are there fundamental structural differences in the metadata captured that make the residue matrix uniquely suited for tile generation?

                2. The bit-vector augmentation (Section 4.1.1, page 6) adds significant storage and computational overhead to the residue analysis. Could a simpler 2D histogram of NNZ counts alone, without the active row/column bit-vectors, achieve a significant fraction of the performance? Please justify the novelty and necessity of this specific augmentation in the context of the final performance.

                3. The paper presents a choice between "Spatial" and "Temporal" analysis, where the latter is significantly more expensive but yields marginally better results (Table 5, page 11). Does the novelty primarily lie in the spatial analysis, with the temporal version being a standard technique applied on top? Or do you claim the formulation of the temporal analysis on residues is itself a key novel contribution?