
Debugger Toolchain Validation via Cross-Level Debugging

By Karu Sankaralingam @karu
    2025-11-02 17:07:00.627Z

    Ensuring the correctness of debugger toolchains is of paramount importance, as they play a vital role in understanding and resolving programming errors during software development. Bugs hidden within these toolchains can significantly mislead developers. ... ACM DL Link

    • 3 replies
  1. Karu Sankaralingam @karu
        2025-11-02 17:07:01.158Z

        Title: Debugger Toolchain Validation via Cross-Level Debugging
        Reviewer: The Guardian


        Summary

        This paper introduces Cross-Level Debugging (CLD), a technique for validating debugger toolchains by comparing execution traces obtained from source-level and instruction-level debugging of the same executable. The core idea is that these two traces, despite their difference in granularity, must adhere to three predefined relations: reachability preservation, order preservation, and value consistency. The authors implement this concept in a tool called DEVIL and evaluate it on GDB and LLDB, reporting the discovery of 27 new bugs, of which 18 have been confirmed or fixed by developers. The work positions itself as an improvement over prior techniques that compare traces from different executables (e.g., optimized vs. unoptimized), which can lead to invalid comparisons.
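
        For readers who want the mechanics concrete, below is a minimal sketch of how such a cross-level comparison can be driven. This is a hypothetical harness, not a description of DEVIL's implementation (which the paper does not detail at this level); it assumes GDB's interactive prompt, the "Line N of ..." output format of info line in common GDB builds, and the Python pexpect package, and it checks only one direction of R#1.

        ```python
        # Hypothetical CLD-style harness: collect a source-level trace
        # (`step`) and an instruction-level trace (`stepi`) from the SAME
        # binary, then compare them. Assumes GDB and pexpect are installed.
        import re
        import pexpect

        PROMPT = r"\(gdb\) "

        def collect_trace(binary, step_cmd, max_steps=10000):
            """Source lines GDB reports while repeatedly issuing step_cmd."""
            gdb = pexpect.spawn(f"gdb -q -nx {binary}", encoding="utf-8")
            gdb.expect(PROMPT)
            gdb.sendline("start")          # run to a temporary bp at main
            gdb.expect(PROMPT)
            lines = []
            for _ in range(max_steps):
                gdb.sendline("info line")  # which source line are we at?
                gdb.expect(PROMPT)
                m = re.search(r'Line (\d+) of "([^"]+)"', gdb.before)
                if m:
                    lines.append((m.group(2), int(m.group(1))))
                gdb.sendline(step_cmd)     # advance: "step" or "stepi"
                if gdb.expect([PROMPT, pexpect.EOF]) == 1 or "exited" in gdb.before:
                    break
            gdb.close(force=True)
            return lines

        src_trace  = collect_trace("./a.out", "step")   # source-level view
        insn_trace = collect_trace("./a.out", "stepi")  # instruction-level view

        # R#1, in the direction discussed below: every line visited at the
        # source level should also be visited at the instruction level,
        # since `step` abstracts over `stepi`.
        print("R#1 violations:", set(src_trace) - set(insn_trace) or "none")
        ```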

        Strengths

        1. The fundamental premise of comparing traces within a single executable is methodologically sound and effectively circumvents a major class of false positives inherent in prior work. The example in Section 6.2 (Figure 5, page 12) provides a convincing demonstration of how comparing different optimization levels can lead to spurious reports, a problem this work correctly avoids.

        2. The empirical results are significant. Discovering 18 confirmed issues (including four P2 critical bugs) in mature, widely-used infrastructure like the GCC/GDB and Clang/LLDB toolchains is a substantial practical contribution and provides strong evidence for the technique's efficacy.

        Weaknesses

        My primary concerns with this work relate to the unexamined assumptions in its core formulation, a superficial comparison with the state-of-the-art, and an underestimation of methodological limitations.

        1. Circular Reasoning in Foundational Assumptions: The entire framework rests on the three relations defined in Section 3.1 (page 4). However, the validity of these relations themselves depends on the correctness of the debug information and the debugger's interpretation of it, which are the very components under test. Specifically, Relation R#1 (Reachability preservation) assumes that if a source line is reachable, it has a corresponding machine instruction that can be stepped to. This presupposes a reasonably correct mapping in the DWARF information. A compiler bug could easily generate code for a source line but emit faulty or no debug information for it, making it "unreachable" at the source level while being present at the instruction level (a sketch of how this surfaces in the DWARF line table follows this list). The paper is therefore not testing for correctness from first principles, but rather for internal consistency under the assumption that the debug information is not catastrophically broken. This circularity is a fundamental conceptual weakness that is not addressed.

        2. Insufficient and Vague Comparison to State-of-the-Art: The comparison against Debug² in RQ4 (Section 5.4, page 10) is unconvincing. Table 5 (page 11) shows that DEVIL finds bugs that Debug² does not, but the authors' explanation is limited to the vague assertion that "DEVIL considers a broader range of program states than Debug²." This is not a scientific analysis. A rigorous comparison would require selecting a specific bug found by DEVIL but not Debug², and providing a mechanistic explanation of why DEVIL's relations (R#1-R#3) trigger a violation while Debug²'s invariants (e.g., hit line consistency, backtrace invariants) do not. Without such a detailed, evidence-based analysis, the claim of complementarity and superiority is unsubstantiated.

        3. Downplaying of Manual Effort and Selection Bias: The authors admit in Section 6.1 (page 12) that their process generates false positives, particularly from uninitialized variables, and requires manual inspection to filter. The claim that this is "generally straightforward" and the effort "remains manageable" is anecdotal. The work would be much more rigorous if it quantified this effort. What is the ratio of raw violations to valid bug reports? How much human time per test program is required? This omission masks a potentially significant limitation in the tool's practical automation. Furthermore, the exclusion of programs that take more than 60 seconds to debug (Section 5.5, page 10) introduces a clear selection bias. This methodology explicitly avoids complex, long-running programs where the most subtle and difficult-to-find debugger bugs (e.g., related to memory management, complex state reconstruction) are likely to manifest.

        4. Lack of Precision on State Comparison: Relation R#3 (Value consistency) is central to the approach, yet the paper is imprecise about what constitutes the "variable values" being compared. Does this scope include only local variables on the stack? What about global variables, heap-allocated data, and machine registers? Debuggers often perform complex reconstructions for optimized-out variables. The paper provides no details on how DEVIL identifies and compares the full, relevant program state, especially in the face of such DWARF-based value reconstruction. This lack of detail makes it difficult to assess the true technical depth and robustness of the implementation. (The decision points at stake are spelled out in the second sketch after this list.)
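
        To make the circularity concern from Weakness 1 concrete, the sketch below uses pyelftools (an assumed dependency; the binary name and the expected-line set are placeholders) to read the DWARF line table directly. A source line for which the compiler emitted machine code but no line-table row is invisible to step yet fully executed under stepi, so R#1 fires even when the debugger itself is blameless.

        ```python
        # Sketch of the R#1 failure mode: find source lines with no DWARF
        # line-table coverage. Assumes pyelftools; "./a.out" and the
        # expected-line set are placeholders for illustration.
        from elftools.elf.elffile import ELFFile

        def lines_with_line_table_rows(binary):
            """Source line numbers that have at least one line-table entry."""
            covered = set()
            with open(binary, "rb") as f:
                dwarf = ELFFile(f).get_dwarf_info()
                for cu in dwarf.iter_CUs():
                    lineprog = dwarf.line_program_for_CU(cu)
                    if lineprog is None:
                        continue
                    for entry in lineprog.get_entries():
                        if entry.state and not entry.state.end_sequence:
                            covered.add(entry.state.line)
            return covered

        expected = {5, 6, 7, 9}   # placeholder: lines the test generator
                                  # knows must execute in this program
        orphans = expected - lines_with_line_table_rows("./a.out")
        print("lines with code but no line info:", orphans or "none")
        ```

        And to pin down the imprecision alleged in Weakness 4, here are the two decision points an R#3-style check must resolve, written out as code. This is a sketch of the policy questions, not DEVIL's actual behavior; it assumes the textual "name = value" format of GDB's info locals.

        ```python
        # Sketch of an R#3-style value comparison at one matched stop.
        # The two `continue` branches mark exactly the policy questions
        # the paper leaves unspecified. Not DEVIL's implementation.
        import re

        OPTIMIZED_OUT = "<optimized out>"

        def parse_locals(text):
            """Map variable name -> printed value from `info locals` output."""
            values = {}
            for line in text.splitlines():
                m = re.match(r"(\w+) = (.*)", line.strip())
                if m:
                    values[m.group(1)] = m.group(2)
            return values

        def value_consistent(src_locals, insn_locals):
            violations = []
            for name, src_val in src_locals.items():
                insn_val = insn_locals.get(name)
                if insn_val is None:
                    continue  # scope question: variable reported at one level only
                if OPTIMIZED_OUT in (src_val, insn_val):
                    continue  # reconstruction question: skip, or treat as a bug?
                if src_val != insn_val:
                    violations.append((name, src_val, insn_val))
            return violations

        # e.g. an enregistered variable the debugger cannot locate:
        print(value_consistent({"x": "5"}, {"x": OPTIMIZED_OUT}))  # -> []
        ```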

        Questions to Address In Rebuttal

        1. Please defend the foundational R#1 (Reachability) relation against the charge of circularity. How can this relation be considered a reliable oracle when a primary class of compiler bugs involves the generation of incorrect or incomplete DWARF information, which would directly cause the relation to fail?

        2. Provide a concrete, step-by-step analysis for at least one of the 13 bugs that DEVIL reportedly found but Debug² could not. You must explicitly detail which of your relations (R#1, R#2, or R#3) was violated and why the program state at that point would not have violated any of the invariants used by Debug².

        3. Please quantify the manual effort required to use DEVIL. For the experiments run, what was the total number of violations flagged by the tool, and what percentage of these led to the 27 valid bug reports?

        4. Clarify the precise scope of "variable values" checked by R#3. How does your implementation handle variables that are not explicitly in memory at a breakpoint (e.g., enregistered variables, values reconstructed via DWARF expressions)? Does your value comparison account for these complex cases?

        1. In reply to karu:
           Karu Sankaralingam @karu
            2025-11-02 17:07:11.646Z

            Reviewer: The Synthesizer (Contextual Analyst)

            Summary

            This paper introduces "Cross-Level Debugging" (CLD), a novel and insightful approach for validating debugger toolchains. The core contribution is a new form of test oracle that avoids the pitfalls of prior work. Instead of comparing the behavior of two different executables (optimized vs. unoptimized), CLD validates a debugger by comparing two different views of a single execution trace: the source-level view (via step) and the instruction-level view (via stepi). The authors posit that for the same executable, these two traces must adhere to a set of consistency properties related to reachability, ordering, and variable values.
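
            One natural reading of the order property, sketched as code below; this is an interpretation of the summary above rather than necessarily DEVIL's exact definition. After collapsing consecutive duplicates (many instructions map to one line), the line sequence visited by step should be a subsequence of the line sequence induced by stepi.

            ```python
            # Sketch of one plausible reading of order preservation.
            from itertools import groupby

            def collapse(trace):
                """Drop consecutive repeats: many instructions share one line."""
                return [line for line, _ in groupby(trace)]

            def order_preserved(step_lines, stepi_lines):
                """Is the step trace a subsequence of the collapsed stepi trace?"""
                it = iter(collapse(stepi_lines))
                return all(line in it for line in collapse(step_lines))

            # Lingering on line 3 for several instructions is fine; seeing
            # line 4 before line 3 when `step` reported 3 then 4 is not.
            assert order_preserved([3, 4, 5], [3, 3, 3, 4, 4, 5])
            assert not order_preserved([3, 4, 5], [4, 3, 5])
            ```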

            The authors implement this idea in a tool called DEVIL and apply it to the GDB and LLDB toolchains. The results are compelling: they identified 27 new bugs, with 18 already confirmed or fixed by developers, including several marked as "P2 critical." This work provides both a new conceptual framework for debugger validation and strong empirical evidence of its practical effectiveness.

            Strengths

            1. An Elegant and More Robust Oracle: The fundamental contribution of CLD is its reframing of the oracle problem in debugger testing. The prevailing approach, comparing optimized and unoptimized traces, is notoriously brittle. Compiler optimizations can legally and drastically alter code structure, making trace comparison a source of numerous false positives. CLD cleverly sidesteps this by creating a self-referential oracle within a single execution. This is a far more robust foundation, as the relationship between the source and instruction levels of a single compiled binary is more constrained and less subject to the radical transformations that occur between optimization levels. The ability of DEVIL to find numerous bugs at the -O0 level (as shown in Table 2, page 8) is a powerful testament to the weakness of relying on unoptimized traces as a "golden reference."

            2. Significant and Immediately Impactful Results: The work's practical significance is beyond doubt. Unearthing 27 bugs (18 confirmed) in mature, critical infrastructure like GCC/GDB and Clang/LLDB is a major achievement. The fact that four of these were deemed "P2 critical" underscores that DEVIL is not just finding cosmetic issues but significant, developer-misleading bugs. This places the work squarely in the tradition of high-impact research that directly improves the tools our entire community relies on.

            3. Excellent Positioning and Comparative Analysis: The paper does a good job of placing itself in the context of prior work, particularly its relationship with the state-of-the-art tool Debug² [3]. The comparative evaluation in Table 5 (page 11) is crucial. It shows that the majority of bugs found by DEVIL are not found by Debug², while DEVIL also rediscovers some of the bugs that Debug² does find. This is the hallmark of a truly complementary technique. It doesn't just incrementally improve upon an existing method; it provides a new and orthogonal axis for validation, demonstrating that the problem space is richer than previously addressed.

            4. A New Conceptual Lens: Beyond its immediate utility, the "Cross-Level" concept is a valuable intellectual contribution. It can be seen as a specific, well-motivated instantiation of differential or metamorphic testing, where the transformation is not on the program's code, but on the level of abstraction used to observe its execution. This idea may be generalizable to other domains where tools provide multiple views of the same underlying artifact (e.g., profilers vs. debuggers, static vs. dynamic analyzers).

            Weaknesses

            While the work is strong, there are areas where its context and future potential could be explored further. These are not so much flaws as they are opportunities for deeper synthesis.

            1. Scope of the Oracle: The three proposed relations (Reachability, Order, and Value consistency) are intuitive and clearly effective. However, they likely do not represent a complete specification of cross-level consistency. For instance, are there potential inconsistencies in the reported call stack structure, type information, or thread states between the two levels? The paper could benefit from a discussion of the potential completeness of their oracle and what other classes of bugs might be missed. A hypothetical sketch of one such additional relation follows this list.

            2. Generalizability to Other Programming Paradigms: The paper rightly notes in Section 6.5 (page 13) that applying CLD to interpreted or JIT-compiled languages is a challenge. This is a key boundary for the work's impact. It would be valuable to see a more detailed discussion of what the fundamental obstacles are. For a language like Python or Java, what constitutes the "instruction level"? Is it the bytecode, or the JIT-compiled native code? Each choice presents different conceptual and technical hurdles. Expanding on this would help contextualize CLD within the broader landscape of programming languages.
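
            As a purely hypothetical illustration of the oracle extension raised in Weakness 1: a call-stack consistency relation (call it R#4) could compare the function-name sequences scraped from GDB's bt at matched stops. The frame data below is invented for illustration.

            ```python
            # Hypothetical "R#4": call stacks observed at the same stop via
            # source-level and instruction-level stepping should agree.
            def backtrace_consistent(src_frames, insn_frames):
                """Compare function-name lists (innermost first) from two `bt`s."""
                return src_frames == insn_frames

            # Invented example: a depth mismatch would be a candidate report.
            assert backtrace_consistent(["leaf", "helper", "main"],
                                        ["leaf", "helper", "main"])
            assert not backtrace_consistent(["leaf", "main"],
                                            ["leaf", "helper", "main"])
            ```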

            Questions to Address In Rebuttal

            1. Regarding the scope of the oracle: Can the authors comment on other potential cross-level relations they may have considered or observed anecdotally? For example, were there any bugs related to inconsistent call stack depth or function parameter reporting between source and instruction stepping that didn't fit neatly into the three existing relations?

            2. Regarding generalizability: Could the authors elaborate on the primary conceptual challenge of applying CLD to a language with a managed runtime like Java? Would comparing source-level stepping with bytecode-level stepping be a viable strategy, and what new classes of bugs might that uncover (e.g., in the JVM's debugger interface)?

            3. The manual effort for bug reporting is mentioned in Section 6.1 (page 12). To better gauge the signal-to-noise ratio of DEVIL, could the authors provide a rough estimate of how many unique raw violations were typically produced for a test case that led to one of the 27 confirmed bug reports?

            1. In reply to karu:
               Karu Sankaralingam @karu
                2025-11-02 17:07:22.148Z

                Reviewer: The Innovator (Novelty Specialist)

                Summary

                This paper proposes a new technique for validating debugger toolchains, termed Cross-Level Debugging (CLD). The central claim to novelty lies in the formulation of the test oracle. Instead of comparing the behavior of a program compiled with optimizations against one without (i.e., comparing two different executables), CLD compares two different execution traces generated from the same executable. Specifically, it uses the fine-grained trace from instruction-level stepping (stepi) as a ground-truth oracle to validate the coarser trace from source-level stepping (step). The authors formalize this relationship with three properties: reachability preservation, order preservation, and value consistency. They implement this concept in a framework called DEVIL and apply it to GDB and LLDB, successfully identifying 27 new bugs, 18 of which have been confirmed or fixed.
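
                For concreteness, here is one plausible formalization of the three properties as summarized above; the paper's exact definitions may differ.

                ```latex
                % One plausible formalization; the paper's definitions may differ.
                % S = <s_1,...,s_m>: source-level trace (step);
                % I = <i_1,...,i_n>: instruction-level trace (stepi);
                % \ell(p): source line at stop p; V(p): variable valuation at p.
                \begin{align*}
                \text{R\#1:}\quad & \{\ell(s) \mid s \in S\} \subseteq \{\ell(i) \mid i \in I\}\\
                \text{R\#2:}\quad & \langle \ell(s_1),\dots,\ell(s_m)\rangle \text{ is a subsequence of } \langle \ell(i_1),\dots,\ell(i_n)\rangle\\
                \text{R\#3:}\quad & V(s) = V(i) \text{ whenever } s \text{ and } i \text{ denote the same stop}
                \end{align*}
                ```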

                Strengths

                The primary strength of this paper is the novel and elegant formulation of the test oracle for debugger validation. My analysis of the prior art confirms the authors' assertion that previous academic work in this area, particularly Di Luna et al. [3] and Li et al. [8], has predominantly relied on comparing optimized and unoptimized executables. This prior approach is fundamentally flawed, as compiler optimizations can introduce drastic, non-equivalent changes that make a direct comparison intractable and prone to false positives, a point the authors correctly make in Section 6.2 (page 12).

                The proposed CLD concept is a significant advancement because it sidesteps this entire problem. By restricting the comparison to a single executable, the authors eliminate the compiler's optimization strategy as a confounding variable. The core insight—that source-level stepping is merely an abstraction over a more fundamental instruction-level execution—is conceptually simple yet powerful. Using one to validate the other within the same execution context is, to my knowledge, a genuinely new approach for systematic debugger validation in the academic literature.

                The value of this novel idea is substantiated by the empirical results. The fact that 11 of the 18 confirmed bugs were found at the -O0 optimization level (Table 2, page 8) is compelling evidence that CLD uncovers a class of bugs that are orthogonal to compiler optimizations and would therefore be missed by prior art that specifically targets optimization-related debug information issues.

                Weaknesses

                While the core concept is novel, its scope and the novelty of its constituent parts warrant closer scrutiny.

                1. Limited Generality of the Core Primitive: The novelty is tightly coupled to debuggers that expose a clear and distinct dichotomy between source-level (step) and instruction-level (stepi) execution. While this is standard for compiled languages like C/C++, the CLD concept may not be directly transferable to other paradigms. The authors briefly acknowledge this limitation for interpreted languages in Section 6.5 (page 13), but the novelty of the paper rests heavily on this specific feature. The contribution is thus more of a point-solution for a specific class of debuggers rather than a universally applicable validation theory.

                2. Obviousness of the Formalized Relations: Given the core idea of using stepi to validate step, the three proposed relations (R#1: Reachability, R#2: Order, R#3: Value) are logical, almost self-evident consequences. If a source line is executed, the instructions comprising it must also be executed. The novelty is not in defining these properties, but in being the first to systematically apply them as a debugger oracle. This is a minor point, as the application itself is the contribution, but the formalization part of the work is less of an intellectual leap than the core CLD concept itself.

                3. Implicit Assumption of Oracle Correctness: The entire methodology relies on the assumption that the instruction-level stepping (stepi) and state inspection at that level are correct. If stepi itself is buggy (e.g., skips an instruction or misreports a register value), CLD might incorrectly flag the step behavior as faulty. The paper does not discuss this potential failure mode, where the oracle itself is compromised.

                Questions to Address In Rebuttal

                1. The core idea is elegant and seems almost obvious in retrospect. Could the authors comment on whether this cross-level comparison has been used informally in debugger development and testing, even if not published academically? Is there any prior art in, for example, internal design documents, technical reports, or developer blogs for GDB/LLDB that proposes or uses such a method for internal regression testing?

                2. The novelty appears tied to the step/stepi dichotomy. How would the core CLD concept be adapted to environments where this distinction is blurred? For example, in a JIT-compiled language, the "instruction level" might be an intermediate bytecode representation before native code is generated. How would the oracle be defined in such a multi-stage execution environment?

                3. The work assumes the instruction-level stepping provides a reliable oracle. Can the authors discuss the case where stepi itself behaves incorrectly? Would CLD be able to detect such a bug, or would it lead to a false positive report against the step command? For example, if stepi incorrectly skips an instruction, CLD might report a violation of R#1 (Reachability) for a source line that step correctly stops at. How does the framework handle a faulty oracle?