
Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads

By Karu Sankaralingam @karu
    2025-11-02 17:30:07.342Z

    GPU underutilization is a significant concern in many production deep learning clusters, leading to prolonged job queues and increased operational expenses. A promising solution to this inefficiency is GPU sharing, which improves resource utilization by ...

    ACM DL Link

    • 3 replies
    1. Karu Sankaralingam @karu
        2025-11-02 17:30:07.912Z

        Paper: Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads
        Reviewer: The Guardian


        Summary

        The paper presents Tally, a non-intrusive GPU sharing system designed to provide strong performance isolation for high-priority (latency-critical) workloads when co-located with low-priority (best-effort) tasks. The central mechanism involves intercepting GPU API calls and applying two types of kernel transformations—slicing and preemption—at the PTX level. These transformations enable fine-grained, block-level scheduling, which the authors argue is necessary to meet the strict tail latency requirements of inference tasks. The system employs a profile-guided scheduler to dynamically select the optimal transformation and configuration for best-effort kernels to minimize interference. The evaluation claims to show that Tally imposes a mere 7.2% average overhead on the P99 latency of high-priority tasks, a significant improvement over prior art, while retaining approximately 80% of the throughput of the state-of-the-art TGS system.
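
        To make the mechanism under review concrete, below is a minimal source-level sketch of the slicing primitive, using a hypothetical element-wise kernel (scale_sliced) and launcher; Tally performs the equivalent rewrite on PTX rather than on CUDA C++ source, so this illustrates the idea, not the paper's implementation.

        ```
        #include <cuda_runtime.h>

        // Hypothetical element-wise kernel, rewritten to take a block offset
        // so its logical grid can be launched in independently schedulable
        // slices while preserving the original thread indexing.
        __global__ void scale_sliced(float *x, int n, int block_offset) {
            int block = blockIdx.x + block_offset;      // logical block id
            int i = block * blockDim.x + threadIdx.x;   // original global index
            if (i < n) x[i] *= 2.0f;
        }

        // Launch the logical grid of `num_blocks` blocks in chunks of
        // `slice` blocks. Between chunks, a scheduler can dispatch any
        // pending high-priority kernels on another stream.
        void launch_sliced(float *x, int n, int num_blocks, int slice,
                           cudaStream_t stream) {
            for (int off = 0; off < num_blocks; off += slice) {
                int blocks = (num_blocks - off < slice) ? (num_blocks - off)
                                                        : slice;
                scale_sliced<<<blocks, 256, 0, stream>>>(x, n, off);
            }
        }
        ```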

        While the motivation is sound, the paper's core claims rest on a set of kernel transformations whose robustness is not sufficiently proven and an evaluation that appears to obscure significant performance costs.

        Strengths

        1. Problem Motivation: The paper does an excellent job of motivating the problem. The analysis in Section 3, particularly Table 1, effectively illustrates why coarse-grained (iteration- or kernel-level) scheduling is fundamentally inadequate for co-locating workloads with millisecond-scale latency targets. This provides a strong foundation for the paper's core thesis.

        2. Core Insight: The central argument that scheduling must occur at a granularity finer than the kernel level is well-defended. The performance decomposition in Figure 7(b) provides clear evidence that simply adding priority to a kernel-level scheduler (Scheduling w/o Transformations) is insufficient and that the block-level mechanisms are indeed the primary source of the claimed isolation.

        3. Experimental Design: The evaluation setup is comprehensive. The choice of workloads covers a reasonable spectrum of modern DL models, and the use of a production-grade server (A100 GPU) and realistic traffic patterns (MAF2 trace) lends credibility to the experimental environment. The set of baselines, including MPS, MPS-Priority, and TGS, is appropriate for a state-of-the-art comparison.

        Weaknesses

        1. Unsubstantiated Robustness of Kernel Transformations: The paper's entire premise hinges on the ability to safely and universally transform arbitrary GPU kernels. The "unified synchronization transformation" (Section 4.1, Figure 3b) is presented as a panacea for ensuring safe preemption by preventing synchronization divergence. However, the paper provides no formal proof or rigorous empirical evidence of its correctness across a wide range of complex kernels. Modern DL frameworks and libraries like cuDNN generate highly complex PTX code with intricate control flow, register usage, and shared memory access patterns. It is highly plausible that there exist kernels for which this transformation is either functionally incorrect or introduces prohibitive performance overhead. The claim of safe, automatic application is a significant one that requires much stronger validation than a small set of benchmark workloads. (A source-level sketch of the divergence hazard this transformation must neutralize appears after this list.)

        2. Obscured Overhead of Best-Effort Tasks: The paper buries a critical performance detail in Section 5.7: the kernel transformation itself imposes an average overhead of 25% on the best-effort kernels. This is a substantial penalty. However, the primary results in Figure 5 report "System Throughput," a normalized metric that sums the throughput of both jobs. This normalization conveniently masks the true cost imposed on the low-priority job. A 25% slowdown is a severe price to pay for co-location, and the paper's presentation minimizes this crucial trade-off. TGS, for all its latency faults, may be providing much better performance for the best-effort job, a detail that is not clear from the presented data.

        3. Practicality of the Online Profiling Mechanism: The priority-aware scheduler (Section 4.2) relies on online profiling to select launch configurations. The paper claims the overhead is "negligible" because measurements are reused (Section 5.7). This assumption is fragile in production environments. Workloads with dynamic shapes or Just-In-Time (JIT) compilation (common in PyTorch 2.0 via TorchInductor, which is used in the benchmarks) can generate a vast number of unique kernel configurations. The paper fails to quantify the latency of profiling a new kernel configuration and its impact on the system. If a new, long-running kernel from a best-effort job arrives, does the system stall high-priority work while it profiles it? Or does it use a suboptimal default, potentially violating the latency SLO? The methodology lacks rigor here.

        4. Dismissal of Critical Edge Cases (Untransformable Kernels): In Section 6, the paper admits that kernels using recent CUDA extensions like Cooperative Groups cannot be transformed and that "Tally refrains from applying block-level scheduling" for them. This is a critical vulnerability in the isolation guarantee. If a best-effort workload submits even a single long-running, untransformable kernel, the system reverts to coarse-grained, non-preemptive scheduling, and all the latency benefits of Tally are lost for the duration of that kernel. The paper dismisses this by noting that "none of the workloads [in their evaluation] employ" them, which is not a sufficient defense for a system proposed for general production use. The prevalence of such kernels in libraries like cuBLAS or framework-generated code for complex reductions needs to be addressed.
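
        To make the concern in Weakness 1 concrete, the sketch below shows the divergence hazard the unified synchronization transformation must neutralize, in hedged source-level form; the `retired` predicate is illustrative only, as the actual transformation rewrites PTX control flow rather than CUDA C++.

        ```
        // Hazard: after a naive persistent-thread-block conversion, threads
        // that exit early never reach the barrier, while the remaining
        // threads block at __syncthreads() forever.
        __global__ void reduce_bad(const float *x, float *out, int n) {
            __shared__ float buf[256];
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;            // divergent exit before a barrier
            buf[threadIdx.x] = x[i];
            __syncthreads();               // potential deadlock
            // ... reduction over buf, result written to out ...
        }

        // Unified-style rewrite (illustrative): the early return becomes a
        // predicate, so every thread executes every barrier; retired
        // threads simply skip the side effects.
        __global__ void reduce_unified(const float *x, float *out, int n) {
            __shared__ float buf[256];
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            bool retired = (i >= n);       // replaces the divergent return
            buf[threadIdx.x] = retired ? 0.0f : x[i];
            __syncthreads();               // now reached by all threads
            if (!retired) {
                // ... reduction over buf, result written to out ...
            }
        }
        ```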

        Questions to Address In Rebuttal

        1. On Transformation Robustness: The claim of safe, automatic kernel transformation is the most critical and least substantiated one in the paper. Can the authors provide evidence of the transformation's correctness beyond the 12 models in the benchmark? For instance, have they applied it to a large, diverse corpus of kernels extracted from other production applications or pathological microbenchmarks designed to stress-test complex control flow and synchronization patterns?

        2. On the True Cost to Best-Effort Tasks: Please provide absolute, non-normalized throughput data for the best-effort training workloads when co-located with inference tasks. How does the 25% transformation overhead (Section 5.7) manifest in these absolute numbers, and how does the throughput of a low-priority Tally job compare to that of a low-priority TGS job?

        3. On Profiling Overhead: Please quantify the "transparent profiler's" performance. Specifically, what is the end-to-end latency for profiling a single, previously unseen kernel configuration? In a scenario with frequently changing kernel configurations (e.g., dynamic batching), how often does this profiling occur, and what is the cumulative impact on the P99 latency of the high-priority task?

        4. On Untransformable Kernels: What is the system's precise fallback behavior when a best-effort task submits a kernel that cannot be transformed (e.g., one using Cooperative Groups)? Does the scheduler block this kernel until the GPU is idle, or does it run it non-preemptively? In the latter case, please provide data on the worst-case latency impact this would have on a high-priority task. Can you provide an analysis of how common such kernels are in the latest versions of cuDNN or other key NVIDIA libraries?

        1. In reply to karu:
          Karu Sankaralingam @karu
            2025-11-02 17:30:18.552Z

            Reviewer: The Synthesizer (Contextual Analyst)


            Summary

            This paper presents Tally, a non-intrusive GPU sharing system designed to provide strong performance isolation for high-priority, latency-sensitive workloads when co-located with best-effort tasks. The core problem it addresses is a critical trade-off in modern ML clusters: high GPU utilization is desired for cost efficiency, but sharing a GPU often leads to unpredictable performance interference, violating the strict service-level objectives (SLOs) of production inference services.

            The central contribution of Tally is a novel, task-agnostic mechanism that achieves fine-grained, block-level scheduling control over GPU execution without requiring any changes to application source code or ML frameworks. It accomplishes this by intercepting GPU API calls and performing on-the-fly transformations of kernel device code (PTX for NVIDIA GPUs). Specifically, it introduces two primitives: "slicing," which breaks large kernels into smaller, independently schedulable sub-kernels, and "preemption," which transforms kernels into a persistent, iterative style that can be interrupted and resumed. A profile-guided, priority-aware scheduler then uses these primitives to ensure high-priority tasks are executed promptly while opportunistically filling idle GPU cycles with best-effort work. The evaluation is comprehensive, demonstrating that Tally maintains near-ideal tail latency for inference tasks (average 7.2% overhead) while achieving system throughput comparable to state-of-the-art, throughput-focused systems like TGS (over 80% of TGS's throughput).
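
            For readers unfamiliar with the persistent-thread-block pattern, the following is a minimal source-level sketch of what the preemption transformation amounts to; the names task_counter and preempt_flag are hypothetical, and Tally generates the equivalent PTX automatically rather than requiring such source changes.

            ```
            #include <cuda_runtime.h>

            // Illustrative PTB-style rewrite: a fixed pool of worker blocks
            // claims logical block indices from a global counter and polls a
            // preemption flag between iterations. Because progress lives in
            // task_counter, a preempted launch can later resume where it
            // left off.
            __global__ void scale_ptb(float *x, int n, unsigned logical_blocks,
                                      unsigned *task_counter,
                                      volatile int *preempt_flag) {
                __shared__ unsigned block;
                for (;;) {
                    if (threadIdx.x == 0) {
                        // Stop claiming work once preempted; otherwise grab
                        // the next logical block. The counter survives exit.
                        block = *preempt_flag ? 0xFFFFFFFFu
                                              : atomicAdd(task_counter, 1u);
                    }
                    __syncthreads();
                    if (block >= logical_blocks) return;  // done or preempted
                    int i = block * blockDim.x + threadIdx.x;
                    if (i < n) x[i] *= 2.0f;              // original kernel body
                    __syncthreads();  // keep `block` stable until all finish
                }
            }
            ```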


            Strengths

            1. Excellent Problem Contextualization and Motivation: The paper does a superb job of situating itself within the existing landscape of GPU sharing solutions. In Section 3 ("GPU Sharing in the Wild," page 4), the authors clearly articulate the limitations of prior art, categorizing their failings into three well-defined issues: high integration cost, lack of performance isolation, and reliance on narrow workload characteristics. This framing effectively carves out a well-motivated niche for Tally as a solution that aims to be simultaneously non-intrusive, isolating, and general.

            2. A Powerful and Practical Core Idea: The central mechanism of automatic, transparent PTX transformation to enable block-level preemption and slicing is both elegant and highly effective. This approach successfully synthesizes ideas from different corners of the field. While concepts like persistent thread blocks (PTB) or kernel preemption have been explored before (e.g., in Effisha or REEF), those systems required source code access or relied on workload-specific properties like idempotency. Tally's key innovation is to make this fine-grained control universally applicable and completely transparent by operating at the device-code level. This is a significant step towards providing true OS-like preemptive multitasking for GPUs.

            3. Bridging the Gap Between Conflicting Goals: The most significant impact of this work is its demonstration of a superior Pareto frontier for the conflicting goals of performance isolation (low latency) and system utilization (high throughput). Existing systems typically force a harsh choice: MPS and TGS achieve high utilization at the cost of massive tail latency spikes, while static partitioning methods like MIG provide strong isolation but can lead to underutilization. Tally shows that with fine-grained control, it is possible to have the best of both worlds: robust SLOs for priority tasks and high throughput for scavenger workloads. The results in Figure 5 (page 10) are a powerful illustration of this achievement.

            4. Strong Systems Engineering and Evaluation: The paper describes a well-engineered system that uses standard, robust techniques (LD_PRELOAD, shared memory) to create a practical virtualization layer. The evaluation is thorough, using a diverse set of modern DL workloads, realistic traffic patterns, and strong baselines. The performance decomposition in Section 5.5 (page 11, Figure 7(b)) is particularly insightful, as it clearly proves that both the priority-aware scheduling and the fine-grained kernel transformations are necessary to achieve the claimed performance isolation.
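
            To ground the "standard, robust techniques" point, here is a minimal LD_PRELOAD shim over the CUDA driver API's cuLaunchKernel; it shows the generic interception technique, not Tally's actual code, and a production shim would also need to intercept cuGetProcAddress, through which newer runtimes resolve driver entry points.

            ```
            // shim.cpp -- build: g++ -shared -fPIC shim.cpp -o shim.so -ldl
            // Run:              LD_PRELOAD=./shim.so ./app
            // Intercepts cuLaunchKernel so a user-space layer can observe
            // (and, in a system like Tally, transform and reschedule) every
            // launch before forwarding it to the real driver.
            #ifndef _GNU_SOURCE
            #define _GNU_SOURCE
            #endif
            #include <dlfcn.h>
            #include <cuda.h>
            #include <cstdio>

            extern "C" CUresult cuLaunchKernel(CUfunction f,
                    unsigned gx, unsigned gy, unsigned gz,
                    unsigned bx, unsigned by, unsigned bz,
                    unsigned sharedMemBytes, CUstream stream,
                    void **params, void **extra) {
                using fn = CUresult (*)(CUfunction, unsigned, unsigned,
                                        unsigned, unsigned, unsigned,
                                        unsigned, unsigned, CUstream,
                                        void **, void **);
                static fn real = (fn)dlsym(RTLD_NEXT, "cuLaunchKernel");
                fprintf(stderr, "launch grid=(%u,%u,%u) block=(%u,%u,%u)\n",
                        gx, gy, gz, bx, by, bz);
                return real(f, gx, gy, gz, bx, by, bz,
                            sharedMemBytes, stream, params, extra);
            }
            ```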


            Weaknesses

            While the work is very strong, a broader contextual analysis reveals areas where its limitations and future challenges could be discussed more explicitly.

            1. Hardware/Software Stack Brittleness: The entire mechanism hinges on the ability to intercept and rewrite PTX code, an intermediate representation specific to NVIDIA's CUDA ecosystem. This makes the solution inherently tied to a single vendor and potentially fragile to changes in the CUDA compiler, driver, and PTX specification. While this is a practical choice given NVIDIA's market dominance, the work would be strengthened by a discussion of the conceptual path to porting this idea to other ecosystems like AMD's ROCm (using its HSAIL or AMDGCN ISA) or Intel's oneAPI (using SPIR-V). This is less a criticism of the current work and more a question of its long-term, generalizable impact across the hardware landscape.

            2. The "Unknown Unknowns" of Kernel Transformation: The "unified synchronization transformation" described in Section 4.1 (page 7) is a clever solution to handle divergent returns before a synchronization point. However, modern GPU kernels, especially from vendor-optimized libraries like cuDNN or CUTLASS, can be extraordinarily complex and employ undocumented behaviors. The paper demonstrates success on a set of representative workloads, but its robustness against the full, untamed "in the wild" spectrum of GPU kernels is an open question. A discussion of the failure modes or the types of kernels that might resist this transformation would add valuable context.

            3. Overhead of Transformation and Profiling: The paper quantifies the runtime overhead of transformed kernels (25% for best-effort tasks, Section 5.7, page 12) and argues that profiling overhead is amortized over long-running jobs. However, it does not discuss the one-time latency of the PTX analysis and recompilation step itself. For environments with very short-lived jobs or a high churn of new, unseen kernel configurations, this initial setup cost could become a non-trivial part of the scheduling latency.


            Questions to Address In Rebuttal

            1. Could the authors elaborate on the robustness of their PTX transformation engine? Have they tested it against a wider corpus of kernels, for example, by extracting kernels from other popular frameworks or applications? What are the primary failure modes, and how does Tally handle a kernel that it cannot safely transform?

            2. Regarding the profiling-guided scheduler, what is the cold-start problem like? How does the system behave when a new, unseen best-effort workload with long-running kernels arrives? Is there a period of poor performance for the high-priority task while Tally profiles and identifies a safe configuration for the new workload?

            3. From a conceptual standpoint, how do the authors see the ideas in Tally influencing the future of GPU architecture and driver design? Given the clear benefits of fine-grained preemption, do you believe this work makes a case for hardware vendors like NVIDIA to provide more direct, low-level support for block-level preemption, potentially obviating the need for complex PTX rewriting?

            1. In reply to karu:
              Karu Sankaralingam @karu
                2025-11-02 17:30:29.062Z

                Reviewer: The Innovator (Novelty Specialist)

                Summary

                The paper presents Tally, a system for GPU sharing that aims to provide strong performance isolation for high-priority tasks when co-located with best-effort workloads. The core technical novelty lies in its non-intrusive, block-level scheduling primitives—slicing and preemption—which are implemented via on-the-fly transformation of kernel PTX code. This is coupled with a profile-guided, priority-aware scheduler that dynamically chooses the most appropriate primitive and its configuration for best-effort kernels to meet the latency requirements of high-priority tasks. The authors claim this synthesis provides robust isolation without requiring source code modifications or application-specific characteristics, differentiating it from prior art.

                Strengths

                The primary strength of this paper lies in its novel synthesis and implementation of existing concepts into a practical, non-intrusive system.

                1. Novelty in Automated Transformation: The central novel contribution is the automated, non-intrusive transformation of general GPU kernels into a preemptible form based on the Persistent Thread Block (PTB) pattern. While the PTB pattern itself is a known programming paradigm [27], automating this conversion at the PTX level for arbitrary, non-idempotent kernels is a non-trivial and novel contribution. This approach successfully differentiates itself from prior work like REEF [28], which achieved fine-grained preemption but was limited to idempotent kernels.

                2. The Unified Synchronization Transformation: The proposed "unified synchronization transformation" (Section 4.1, page 6) is a particularly novel component designed to solve the difficult problem of divergent threads attempting to return or synchronize at different points within a transformed kernel. This is a specific and clever technical solution that enables the broader goal of safely automating the PTB transformation. This is a significant delta over simply stating that kernels can be wrapped in a loop.

                3. Novelty in the Control Plane: While profile-guided scheduling is a known paradigm, its application to dynamically select between two distinct fine-grained scheduling primitives (slicing vs. preemption) based on their observed "turnaround latency" is a novel control strategy in this context. It recognizes that neither primitive is universally optimal and builds a mechanism to make an informed choice at runtime, which has not been explored in prior GPU sharing systems to this degree.
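
                As an illustration of the control strategy described in point 3, the sketch below shows one way a profile-guided scheduler could choose between the two primitives; the types, fields, and selection rule are assumptions made for exposition, not Tally's documented policy.

                ```
                #include <algorithm>
                #include <vector>

                // Each profiled configuration pairs a primitive and
                // granularity with its measured turnaround latency (time to
                // vacate the GPU for a high-priority kernel) and its
                // best-effort throughput.
                enum class Primitive { Slicing, Preemption };

                struct Config {
                    Primitive primitive;
                    int granularity;       // blocks per slice / worker blocks
                    double turnaround_ms;  // profiled
                    double throughput;     // profiled best-effort progress
                };

                // Among configurations meeting the high-priority latency
                // budget, keep the one preserving the most best-effort
                // throughput; if none qualifies, fall back to the fastest
                // to vacate. Assumes `profiled` is non-empty.
                Config select_config(const std::vector<Config> &profiled,
                                     double budget_ms) {
                    const Config *best = nullptr;
                    for (const Config &c : profiled) {
                        if (c.turnaround_ms <= budget_ms &&
                            (!best || c.throughput > best->throughput))
                            best = &c;
                    }
                    if (!best)
                        best = &*std::min_element(
                            profiled.begin(), profiled.end(),
                            [](const Config &a, const Config &b) {
                                return a.turnaround_ms < b.turnaround_ms;
                            });
                    return *best;
                }
                ```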

                Weaknesses

                The paper's claims of novelty must be carefully contextualized, as the foundational concepts have been explored in prior work.

                1. Conceptually Established Primitives: While the implementation is novel, the underlying primitives themselves are conceptually well-established. Kernel slicing for concurrent execution was explored in Kernelet [80]. Block-level preemption via code transformation has been demonstrated in systems like Effisha [14] and GPES [81]. The primary "delta" Tally offers over these is its non-intrusive nature (PTX vs. source/compiler-level modification). The paper correctly identifies this, but the conceptual novelty is therefore an incremental step (non-intrusiveness) rather than a foundational one.

                2. Robustness of PTX Transformation: The reliance on PTX-level transformation raises questions about its robustness and future-proofing. PTX is a volatile intermediate representation, and the complexity of modern kernels is immense. The paper demonstrates success on a set of DL workloads, but the proposed transformations (especially the unified synchronization) may not be robust to all possible control flow constructs, indirect branches, or new instructions introduced in future GPU architectures. The novelty is tied to an engineering approach whose generalizability is not fully proven.

                3. Marginal Novelty of the Scheduler's Logic: The priority-aware scheduler's novelty is primarily in its application rather than in its fundamental design. The logic—prioritize high-priority tasks, preempt low-priority ones, and use a profiler to tune parameters—is a standard approach in real-time and priority-based systems. The contribution is in applying this logic to the specific primitives of transformed GPU kernels, not in inventing a new scheduling theory.

                Questions to Address In Rebuttal

                1. The concept of kernel slicing for concurrent execution was explored in Kernelet (Zhong and He, TPDS 2013). This work also divided a kernel's grid into smaller chunks to be scheduled. Could the authors elaborate on the novel contributions of their slicing implementation beyond its non-intrusive nature and integration with the preemption primitive?

                2. The "unified synchronization transformation" is clever, but how robust is the PTX transformation pipeline to highly complex kernels with intricate control flow, indirect branches, or utilization of newer ISA features not present in the evaluated workloads? The novelty of this approach is contingent on its generality. What are the known limitations or classes of kernels that Tally cannot transform correctly?

                3. The transformation to a PTB-style kernel involves replacing direct returns with branches and adding a global task counter and flag checks. This introduces overhead in the form of additional instructions and contention on the global counter. The paper evaluates end-to-end performance, but for the novelty to be fully assessed, can the authors quantify the per-invocation overhead of the transformation itself? How does this overhead scale with kernel complexity versus the simple slicing approach? This would help clarify the trade-off that the novel scheduler is designed to manage.
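
                One way to obtain the per-invocation numbers requested above is a cudaEvent-based micro-benchmark along the following lines; the kernels named in the usage comment are placeholders, and the harness assumes both variants perform identical logical work.

                ```
                #include <cuda_runtime.h>
                #include <cstdio>

                // Times `iters` launches with CUDA events; comparing an
                // original kernel against its transformed counterpart at
                // equal work isolates the per-invocation cost of the extra
                // instructions and atomic traffic on the task counter.
                template <typename Launch>
                float time_launches(Launch launch, int iters) {
                    cudaEvent_t start, stop;
                    cudaEventCreate(&start);
                    cudaEventCreate(&stop);
                    launch();                      // warm-up (JIT, caches)
                    cudaEventRecord(start);
                    for (int i = 0; i < iters; ++i) launch();
                    cudaEventRecord(stop);
                    cudaEventSynchronize(stop);
                    float ms = 0.0f;
                    cudaEventElapsedTime(&ms, start, stop);
                    cudaEventDestroy(start);
                    cudaEventDestroy(stop);
                    return ms / iters;             // avg per-invocation time
                }

                // Usage (placeholder kernels, equal logical work):
                //   float base = time_launches([&]{ scale<<<g, 256>>>(x, n); }, 100);
                //   float ptb  = time_launches([&]{ scale_ptb<<<w, 256>>>(/*...*/); }, 100);
                //   printf("overhead: %.1f%%\n", 100.0f * (ptb - base) / base);
                ```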