
SmoothE: Differentiable E-Graph Extraction

By Karu Sankaralingam @karu
    2025-11-02 17:29:03.040Z

    E-graphs have gained increasing popularity in compiler optimization, program synthesis, and theorem proving tasks. They enable compact representation of many equivalent expressions and facilitate transformations via rewrite rules without phase ordering ...

    ACM DL Link

    • 3 replies
    1. Karu Sankaralingam @karu
        2025-11-02 17:29:03.568Z

        Paper Title: SmoothE: Differentiable E-Graph Extraction
        Review Form: The Guardian


        Summary

        The paper presents SmoothE, a novel method for e-graph extraction that reformulates the discrete combinatorial optimization problem into a continuous, differentiable form. The core idea is to relax the binary selection variables for e-nodes into continuous probabilities and then use gradient descent to optimize an objective function. This objective function combines a user-defined cost model with a continuous penalty term for acyclicity, derived from the NOTEARS method. The authors claim this approach is highly scalable, suitable for GPU acceleration, and uniquely capable of handling complex non-linear cost models. The method is implemented in PyTorch and evaluated against ILP and heuristic baselines on a range of realistic and synthetic e-graphs, demonstrating significant speedups over ILP while reportedly maintaining high solution quality.
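
        To ground the discussion, the sketch below (mine, not the authors' code) shows the kind of relaxation the paper describes: per-e-class logits are softmaxed into e-node selection probabilities, the cost is the probability-weighted sum of e-node costs, and a NOTEARS-style trace-of-matrix-exponential term penalizes cycles in the induced e-class dependency graph. All names, the softmax parameterization, and the fixed penalty weight lam are my assumptions for illustration.

        import torch

        def extraction_loss(logits, node_cost, class_of, edge_src_node, edge_dst_class,
                            num_classes, lam=10.0):
            # logits        : (N,) raw score per e-node
            # node_cost     : (N,) cost per e-node (linear cost model)
            # class_of      : (N,) int64 e-class id of each e-node
            # edge_src_node : (E,) int64 e-node at the source of each child edge
            # edge_dst_class: (E,) int64 child e-class that e-node depends on

            # Softmax within each e-class, so the e-nodes of one e-class get
            # selection probabilities that sum to 1.
            exp_l = torch.exp(logits - logits.max())
            denom = torch.zeros(num_classes).scatter_add(0, class_of, exp_l)
            probs = exp_l / denom[class_of]

            # Expected cost under a linear cost model.
            cost = (probs * node_cost).sum()

            # Soft e-class dependency matrix: A[i, j] accumulates the selection
            # probability of e-nodes in class i that reference class j.
            A = torch.zeros(num_classes, num_classes).index_put(
                (class_of[edge_src_node], edge_dst_class),
                probs[edge_src_node], accumulate=True)

            # NOTEARS-style penalty: tr(exp(A o A)) - d is zero iff A encodes a DAG.
            acyc = torch.trace(torch.matrix_exp(A * A)) - num_classes

            return cost + lam * acyc

        Descending on logits with any PyTorch optimizer and then discretizing per e-class yields an extraction; note that nothing in an objective of this shape guarantees the discretized result is acyclic, which is central to the weaknesses discussed below.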

        Strengths

        1. Novel Formulation: The central idea of casting e-graph extraction as a differentiable optimization problem is innovative. It allows the leveraging of the mature ecosystem of deep learning frameworks (e.g., PyTorch) for a classical compiler optimization problem, which is a commendable direction of research.

        2. Performance and Scalability: The experimental results demonstrate a substantial performance advantage over ILP-based methods. The ability to solve large extraction problems in seconds on a GPU (as shown in Table 2, page 9), where commercial ILP solvers time out or take minutes/hours, addresses a well-known and critical bottleneck in the practical application of equality saturation.

        3. Support for Non-Linear Cost Models: The framework’s ability to accommodate any differentiable cost function is a significant strength. Section 5.5 (page 10) shows a proof-of-concept with an MLP-based cost model where SmoothE outperforms other methods. This capability is a clear advantage over traditional ILP and heuristic methods, which are typically restricted to linear costs.

        Weaknesses

        My primary concerns with this work stem from a series of methodological approximations and a lack of formal guarantees, which undermine the claimed rigor of the approach. The pursuit of differentiability and speed appears to have come at the cost of correctness and clarity.

        1. Acyclicity is Not Guaranteed, Merely Penalized: The most significant flaw is the handling of the acyclicity constraint. The authors adopt the NOTEARS penalty function (Section 3.4, page 6), which provides a continuous, differentiable proxy for acyclicity. However, this transforms a hard, combinatorial constraint into a soft, tunable penalty. The paper provides no guarantee that the final, discretely extracted solution will be acyclic. The claim that a "sufficiently large λ" ensures this is a theoretical hand-wave without practical, verifiable substance. How is λ chosen? How sensitive are the results to it? What happens if the final graph contains a cycle? The framework would produce an invalid program. This is an unacceptable trade-off for a systems problem where correctness is paramount.

        2. Unjustified Assumptions in Probability Propagation: The method for computing marginal probabilities p from conditional probabilities cp (Section 3.3, page 5) relies on strong and likely incorrect assumptions. The options presented are that parent e-nodes are either fully independent (Eq. 6b) or fully positively correlated (Eq. 7). Real-world e-graphs contain complex, overlapping sub-expressions that make both of these assumptions highly suspect. The impact of these flawed assumptions on the accuracy of the computed probabilities, and thus on the final solution quality, is never analyzed. This calls into question whether the gradient descent is optimizing a faithful representation of the problem.

        3. The "Sampling" Step is Not Sampling: The description of the final extraction stage in Section 3.5 (page 7) is misleading. The paper states: "we select e-node nk ∈ mj with the largest probability cpk". This is not sampling; it is a deterministic greedy decoding of the optimized probabilities. True sampling would involve a stochastic process (e.g., drawing from the probability distribution), which would allow for broader exploration but is not what is described. This terminological inaccuracy obscures the deterministic and potentially myopic nature of the final solution selection process (the two decoding strategies are contrasted in a short sketch after this list).

        4. Weak Evaluation of the Primary Non-Linearity Claim: A key selling point is the ability to handle non-linear cost models. However, the evaluation in Section 5.5 (page 10) is unconvincing. The primary baseline, "ILP*", is defined as the oracle solution for the linear cost model, which completely ignores the non-linear objective. This is not a valid baseline; it is a reference point from a different problem. The only other baseline, a genetic algorithm, is a heuristic whose quality is highly implementation-dependent. To validate such a strong claim, the authors should have compared against established methods for mixed-integer non-linear programming or, at minimum, been transparent about the lack of a strong baseline.

        5. Unexamined Impact of Performance Optimizations: In Section 4.3 (page 8), the authors introduce a "Batched Approximation" for the matrix exponential calculation to improve performance. They approximate the sum of the traces of matrix exponentials with the trace of the exponential of a single aggregated matrix, i.e., they compute tr(e^(Σ_i A[i])) in place of Σ_i tr(e^(A[i])), which is not a valid identity in general. This is a non-trivial mathematical shortcut taken for performance reasons. The effect of this approximation on the accuracy of the acyclicity penalty, and consequently on solution quality, is not measured or discussed (a small numerical illustration of the mismatch follows below). This is another instance of sacrificing rigor for speed without justification.
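
        To make point 5 concrete, the snippet below contrasts the exact batched penalty (a sum of per-graph traces) with a collapsed variant that takes a single matrix exponential of the aggregated matrix. The exact form of the paper's batched approximation is assumed here; the point is simply that the two quantities are not equal in general.

        import torch

        def penalty_exact(As):
            # As: (B, d, d) soft adjacency matrices, one per e-graph in the batch.
            d = As.shape[-1]
            return sum(torch.trace(torch.matrix_exp(A * A)) - d for A in As)

        def penalty_collapsed(As):
            # Collapse the batch into one matrix and take a single exponential,
            # in the spirit of the Section 4.3 shortcut (details assumed here).
            d = As.shape[-1]
            M = (As * As).sum(dim=0)
            return torch.trace(torch.matrix_exp(M)) - d

        As = torch.rand(4, 6, 6) * 0.1
        print(penalty_exact(As).item(), penalty_collapsed(As).item())  # generally not equal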
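
        And for point 3, the distinction between what the paper describes and true sampling is the difference between the two lines below (generic PyTorch, not the authors' code):

        import torch

        cp = torch.tensor([0.62, 0.30, 0.08])      # optimized probabilities within one e-class
        greedy_choice = torch.argmax(cp)           # deterministic "largest probability" pick
        sampled_choice = torch.multinomial(cp, 1)  # an actual stochastic sample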

        Questions to Address In Rebuttal

        The authors must address the following points to convince me of the validity and soundness of their work:

        1. Acyclicity Guarantee: Can you provide a formal proof or, failing that, a thorough empirical validation (e.g., by checking for cycles in 100% of the extracted solutions across all experiments) that your method with the NOTEARS penalty always produces an acyclic graph? Please provide a principled method for setting the hyperparameter λ and analyze the sensitivity of the final solution's validity to this choice.

        2. Probability Model Justification: Please justify the strong independence or full-correlation assumptions used for probability propagation in Section 3.3. Can you conduct an ablation study on smaller e-graphs, comparing your propagation method to a ground-truth calculation (e.g., via Junction Tree), to quantify the error introduced by these simplifying assumptions?

        3. Methodological Clarity: Please clarify the terminology in Section 3.5. Is the final extraction step a deterministic greedy selection based on the optimized probabilities, or is it a stochastic sampling process? If the former, please correct the term "sampling" throughout the paper to "greedy decoding" or a similar, more accurate phrase.

        4. Impact of Approximations: Please provide an ablation study that quantifies the impact of the matrix exponential approximation described in Section 4.3 on final solution quality. How does the quality (and validity, re: acyclicity) of the solution change when this approximation is disabled?

        5. Non-Linear Baseline: To substantiate the claims made in Section 5.5, can you either defend the use of a linear-cost oracle ("ILP*") as a meaningful baseline for a non-linear problem or propose and evaluate against a more appropriate baseline for non-linear combinatorial optimization?

        1. Karu Sankaralingam @karu
          In reply to karu:
            2025-11-02 17:29:14.121Z

            Reviewer Persona: The Synthesizer (Contextual Analyst)

            Summary

            This paper introduces SmoothE, a novel approach to the e-graph extraction problem that reframes it from a discrete, combinatorial optimization task into a continuous, differentiable one. The core idea is to relax the binary decision of selecting an e-node into a probabilistic choice, represented by continuous variables. This transformation enables the use of gradient descent to optimize for a given cost function, allowing the entire process to be implemented in a standard machine learning framework (PyTorch) and accelerated on GPUs. The authors demonstrate that this method can handle complex, non-linear cost models—a significant limitation of prior work—while achieving a highly competitive trade-off between solution quality and runtime, often approaching the quality of Integer Linear Programming (ILP) solvers in a fraction of the time.

            Strengths

            The most significant contribution of this work is the elegant synthesis of ideas from several distinct research domains to solve a long-standing problem in compilers and program synthesis.

            1. A Fundamental Paradigm Shift: The paper's central premise—making e-graph extraction differentiable—is a powerful and compelling one. It moves the problem out of the realm of specialized combinatorial solvers (like ILP) and heuristics, and into the rich, mature, and highly-optimized ecosystem of deep learning. This is not an incremental improvement but a conceptual leap that opens up entirely new possibilities.

            2. Unlocking Sophisticated Cost Models: The ability to support any differentiable cost function is arguably the most impactful aspect of this work. For years, the community has been constrained by linear cost models that fail to capture complex interactions, such as resource clustering in hardware synthesis or cache effects in software. SmoothE provides a principled framework for incorporating learned, data-driven cost models (as demonstrated with the MLP experiment in Section 5.5, page 10). This directly connects the powerful predictive capabilities of machine learning with the symbolic reasoning of equality saturation, a connection many have sought but struggled to realize effectively. A minimal sketch of such a learned cost model follows this list.

            3. Elegant Application of Cross-Domain Techniques: I was particularly impressed by the authors' application of the NOTEARS method (from causal discovery literature) to enforce the acyclicity constraint (Section 3.4, page 6). This is a perfect example of recognizing a structural analogy between two seemingly disparate problems (learning a DAG vs. extracting a DAG) and leveraging a state-of-the-art solution. Similarly, framing the probability propagation as a belief propagation problem on a graphical model (Section 3.3, page 5) shows a deep understanding of the problem's underlying structure. This work serves as a wonderful case study in how ideas from one field can be transformative when applied in another.

            4. Excellent Practical Results and Scalability: By leveraging GPUs, SmoothE demonstrates impressive performance. The results in Table 2 (page 9) show that it can consistently find high-quality solutions (often comparable to or better than a 15-minute ILP run) in mere seconds. This makes it a practical tool for real-world compilation and synthesis workflows where long solver times are unacceptable.
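
            As a purely illustrative picture of what point 2 above enables, a learned cost model can be dropped in wherever a linear cost would otherwise be used, because the relaxed selection probabilities remain differentiable inputs to it. The feature pooling and module names below are my assumptions, not the paper's:

            import torch
            import torch.nn as nn

            class MLPCost(nn.Module):
                # Toy learned cost model: scores an e-graph from probability-weighted e-node features.
                def __init__(self, feat_dim, hidden=32):
                    super().__init__()
                    self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                             nn.Linear(hidden, 1))

                def forward(self, node_feats, probs):
                    # node_feats: (N, feat_dim); probs: (N,) relaxed selection probabilities.
                    pooled = (node_feats * probs.unsqueeze(-1)).sum(dim=0)
                    return self.net(pooled).squeeze()

            # cost = MLPCost(feat_dim=8)(node_feats, probs)  # used where a linear cost would go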

            Weaknesses

            While the core idea is brilliant, its current instantiation relies on certain assumptions and introduces new complexities that could be explored further.

            1. Ad-hoc Nature of Probability Propagation: In Section 3.3, the authors propose several assumptions for propagating probabilities from parent e-nodes to their child e-classes (independence, full correlation, or a hybrid average). While pragmatic, this feels like the least principled part of the framework. A more formal treatment based on loopy belief propagation or other variational inference techniques might yield a more robust and theoretically grounded model, though likely at the cost of complexity. The current "hybrid" approach, while effective, feels like a heuristic embedded within an otherwise elegant mathematical framework. (A small illustration of the three assumptions follows this list.)

            2. Introduction of New Hyperparameters: The shift to a gradient-based optimization framework replaces the well-understood behavior of solvers with the more opaque world of hyperparameters (e.g., learning rate, optimizer choice, the acyclicity penalty weight λ, patience). The paper mentions using a "simple grid search" (Section 5.1, page 8), but the sensitivity of the final solution quality to these parameters is not deeply explored. A potential barrier to adoption could be the perception that the system requires expert tuning for each new domain or dataset.

            3. Soft vs. Hard Constraints: The NOTEARS penalty for acyclicity is a soft constraint; it drives the probability of a cyclic solution to zero but does not offer the hard guarantee of an ILP formulation. For correctness-critical applications, this distinction matters. While the sampling process may ultimately produce an acyclic graph, the optimization itself is navigating a space where cyclic solutions are possible.
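
            To illustrate weakness 1, here is one plausible reading of how the three assumptions combine a child e-class's probability from its parent e-nodes' selection probabilities, using noisy-OR for independence, max for full correlation, and their average for the hybrid; the paper's actual Eqs. 6b/7 may differ:

            import torch

            def child_class_prob(parent_probs, mode="hybrid"):
                # parent_probs: selection probabilities of the e-nodes that reference this e-class.
                independent = 1 - torch.prod(1 - parent_probs)  # noisy-OR (parents independent)
                correlated = parent_probs.max()                 # fully positively correlated parents
                if mode == "independent":
                    return independent
                if mode == "correlated":
                    return correlated
                return 0.5 * (independent + correlated)         # "hybrid": average of the two

            p = torch.tensor([0.7, 0.4])
            print(child_class_prob(p, "independent"), child_class_prob(p, "correlated"), child_class_prob(p))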

            Questions to Address In Rebuttal

            1. Could the authors elaborate on the choice of the probability propagation model? What was the intuition behind the "hybrid" model, and did you experiment with more formal belief propagation update rules? How much does the choice of this assumption (independent, correlated, hybrid) impact the final solution quality across the different datasets?

            2. Can you provide more insight into the robustness of SmoothE with respect to its hyperparameters? Specifically, how sensitive is the solution quality to the acyclicity penalty λ and the choice of optimizer/learning rate? Does a single set of hyperparameters generalize well across an entire application domain (e.g., all tensat e-graphs), or is per-instance tuning required for optimal performance?

            3. The connection to ML-based cost models is a key strength. Have you considered a tighter integration where the weights of the cost model itself could be co-optimized alongside the extraction probabilities? This could lead to a powerful end-to-end learning system where the cost model learns to guide the extractor toward regions of the search space it knows are promising.

            1. Karu Sankaralingam @karu
              In reply to karu:
                2025-11-02 17:29:24.614Z

                Reviewer: The Innovator (Novelty Specialist)

                Summary

                The paper introduces SmoothE, a novel framework for e-graph extraction that recasts the discrete optimization problem into a continuous, differentiable one. The core idea is to relax the binary e-node selection variables into continuous probabilities. This relaxation allows the entire extraction process to be optimized via gradient descent. The authors handle the required completeness and acyclicity constraints by adapting techniques from other domains: a Loopy Belief Propagation-style mechanism for ensuring completeness and the NOTEARS framework for enforcing acyclicity as a soft penalty. The primary claimed benefits are the ability to handle complex, non-linear cost functions and efficient execution on parallel hardware like GPUs.

                Strengths

                The primary strength of this paper is the novel formulation of e-graph extraction. To my knowledge, this is the first work to successfully apply end-to-end differentiable programming and continuous optimization techniques to this specific problem.

                1. Enabling Non-Linear Cost Models: The most significant contribution stemming from this novelty is the native support for arbitrary differentiable, non-linear cost functions. Prior art is almost exclusively limited to linear cost models for ILP/SAT solvers or heuristic approaches. The ability to directly integrate, for instance, a neural network-based cost model into the optimization loop (as demonstrated in Section 5.5, page 10) is a genuine advancement that opens up new research directions for co-optimizing program representations and learned performance models. This is a clear and valuable delta over existing work.

                Weaknesses

                While the application to e-graph extraction is new, the core conceptual machinery employed by the authors is not fundamentally novel. The paper's contribution lies in the synthesis and adaptation of several existing techniques to a new problem domain, rather than the invention of a new core optimization algorithm.

                1. Conceptual Overlap with Differentiable Architecture Search (DARTS): The central paradigm of relaxing a discrete choice problem on a directed graph to a continuous, gradient-optimizable one is well-established. The most prominent example is Differentiable Architecture Search (DARTS) [Liu et al., ICLR 2019]. In DARTS, the choice of an operation on an edge of a computation graph is relaxed into a softmax over all possible operations. This is conceptually identical to SmoothE's relaxation of e-node selection within an e-class. While the specific constraints of e-graphs differ from those of neural architecture search, the foundational idea of "relax-and-descend" is the same. The paper does not acknowledge or differentiate itself from this significant body of prior art. (A side-by-side sketch of the two relaxations appears at the end of this Weaknesses section.)

                2. Reliance on Off-the-Shelf Components for Constraints: The novelty of the formulation is diluted by the fact that the two hardest constraints (completeness and acyclicity) are handled by adapting existing, well-known methods.

                  • Acyclicity: The handling of acyclicity is a direct application of the NOTEARS framework [57], as explicitly stated by the authors in Section 3.4 (page 6). The theorem and the resulting penalty function h(At) are lifted directly from this prior work on causal discovery. The novelty is merely in identifying that this function could be applied to the e-class dependency graph.
                  • Completeness: The probability propagation scheme described in Section 3.3 (page 5) is presented as a bespoke solution, but it is heavily inspired by (and described as an adaptation of) Loopy Belief Propagation (LBP) [28, 31], a classic algorithm for inference in probabilistic graphical models. The use of different ad-hoc assumptions (independence, fully correlated, hybrid) further suggests that this component is more of a well-chosen heuristic than a fundamentally new, principled mechanism.

                In summary, the work is less a breakthrough in optimization and more a clever and effective piece of engineering that combines existing differentiable components to solve a new problem. This is a valid contribution, but its degree of foundational novelty should not be overstated.
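
                For concreteness, the structural analogy raised in point 1 can be seen side by side in a few lines of generic PyTorch (neither paper's actual code):

                import torch

                # DARTS-style relaxation: softmax over candidate operations on one edge.
                op_logits = torch.randn(5, requires_grad=True)
                op_weights = torch.softmax(op_logits, dim=0)

                # SmoothE-style relaxation (as I read it): softmax over the e-nodes of one e-class.
                enode_logits = torch.randn(3, requires_grad=True)
                enode_probs = torch.softmax(enode_logits, dim=0)

                # Both are "relax-and-descend"; the e-graph setting adds completeness and
                # acyclicity constraints that architecture search does not have to handle.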

                Questions to Address In Rebuttal

                1. The core idea of relaxing discrete choices in a graph structure for gradient-based optimization is central to the field of Differentiable Architecture Search (DARTS). Could the authors please explicitly differentiate their contribution from the DARTS paradigm? What are the key technical challenges in applying this idea to e-graphs that make the contribution non-obvious and significant?

                2. The paper presents several assumptions for computing e-class probabilities in Section 3.3 (page 6), culminating in a "hybrid" approach used by default. This seems heuristic. Is there a more formal or theoretical justification for this specific formulation? How sensitive is the final solution quality to the choice of this probability propagation scheme (independent vs. correlated vs. hybrid), and does this sensitivity undermine the robustness of the proposed method?

                3. The NOTEARS penalty term h(At) is known to be computationally expensive due to the matrix exponential, which is a core reason for the optimizations in Section 4.3 (page 8). Does this penalty term, even when optimized, present a fundamental scalability bottleneck? How does its asymptotic complexity compare to modern ILP solvers on the graph structures encountered in practice? It seems plausible that for some graph topologies, the overhead of computing this dense penalty could exceed the benefits of the continuous relaxation.