
MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

By Karu Sankaralingam @karu
    2025-11-02 17:19:55.679Z

    Efficient deployment of large language models, particularly Mixture of Experts (MoE) models, on resource-constrained platforms presents significant challenges in terms of computational efficiency and memory utilization. The MoE architecture, renowned for ... (ACM DL Link)

    • 3 replies
      Karu Sankaralingam @karu
        2025-11-02 17:19:56.207Z

        Reviewer: The Guardian

        Summary

        This paper presents MoE-Lightning, a system designed for high-throughput batch inference of Mixture of Experts (MoE) models on memory-constrained GPUs. The core contributions are twofold: 1) CGOPIPE, a pipelining schedule that aims to finely overlap CPU computation, GPU computation, and I/O transfers (weights, KV cache) to maximize resource utilization, and 2) a Hierarchical Roofline Model (HRM) used as a performance model to guide the search for optimal inference policies (e.g., batch sizes, device placement for computations). The authors claim significant throughput improvements, up to 10.3x over existing systems like FlexGen, on low-cost hardware like a single T4 GPU.

        Strengths

        1. Problem Relevance: The paper addresses a critical and timely problem: deploying extremely large MoE models on commodity, memory-constrained hardware. This is a significant barrier to the broader adoption of these powerful models, and work in this area is of high interest.
        2. Systematic Approach: The authors' approach of first building a theoretical performance model (HRM) and then using it to inform the design of a practical scheduling pipeline (CGOPIPE) is methodologically sound. The HRM provides a principled way to reason about performance bottlenecks in a heterogeneous system.
        3. Thorough Experimental Comparison: The evaluation is conducted against relevant and strong baselines (FlexGen, DeepSpeed-Zero). The inclusion of controlled variants like FlexGen(c) (with CPU attention) and MoE-Lightning(p) (with padding) demonstrates a commendable effort to enable fair comparisons under specific conditions.

        Weaknesses

        My primary concerns with this submission relate to the interpretation and presentation of results, the validation of the core performance model, and the justification for key design choices.

        1. Exaggerated and Potentially Misleading Headline Claim: The abstract and introduction prominently feature an "up to 10.3x higher throughput" claim. However, a deeper analysis of the evaluation (Section 5, Page 9) reveals this number is derived from comparing the authors' system with all optimizations (including variable-length request batching) against a baseline (FlexGen) that is forced to use padding. This is not an apples-to-apples comparison of the core scheduling technology. The more direct, padded-to-padded comparison (MoE-Lightning(p)) yields a much lower, though still significant, 3.5x improvement. The headline claim overstates the contribution of the core pipeline technique by conflating it with the benefits of a different batching strategy.

        2. Inappropriate Use of "Super-Linear Scaling": In Section 5.3 (Page 10), the authors claim their system demonstrates "super-linear scaling" when moving from 2xT4 to 4xT4 GPUs. This term is technically incorrect and misleading. Super-linear scaling implies that doubling the resources more than doubles the performance (i.e., efficiency increases with scale). The mechanism described here is that increased aggregate GPU memory allows for a larger batch size, which better amortizes fixed overheads and moves the system out of a bottlenecked regime. While this is a positive result, it is not super-linear scaling; it is simply overcoming a bottleneck that was present at a smaller scale. This mischaracterization of a key result undermines the rigor of the analysis.

        3. Insufficient Justification for the CPU Attention Design Choice: The decision to perform attention on the CPU is central to the CGOPIPE schedule. This is justified theoretically by the low operational intensity of attention (Figure 4, Page 5) and empirically in the ablation study (Figure 9, Page 11). However, the analysis in Figure 9 is incomplete. It compares the latency of the authors' CPU attention kernel against the latency of a KV cache transfer from CPU to GPU. It critically omits the baseline that matters most: the latency of an on-GPU attention kernel if the KV cache were already resident in GPU memory. The presented evidence only shows that their CPU attention is better than FlexGen's method of offloading (transferring the KV cache), not that CPU attention is inherently better than GPU attention in an ideal scenario. This makes it difficult to assess whether CGOPIPE is making the best trade-off or simply a better trade-off than the baseline. (A back-of-envelope version of the missing comparison is sketched after this list.)

        4. Lack of Empirical Validation for the HRM Performance Model: The HRM is presented as a foundational component for finding optimal policies. However, the paper provides no direct evidence validating the predictive accuracy of this model. Figure 10 (Page 12), which shows policy changes, is a product of the model's predictions, not an empirical validation of it. For the HRM to be a convincing contribution, the authors must demonstrate how well its latency/throughput predictions correlate with measured, real-world performance across a range of different (including suboptimal) policies. Without this validation, the HRM remains a theoretical construct of unproven utility.

        5. Uncertain Generalizability: The entire system and its performance benefits appear to be highly tuned to a specific hardware regime: low-end GPUs (T4/L4) with relatively powerful host CPUs and a specific CPU-GPU interconnect bandwidth. The core finding that CPU attention is advantageous is highly sensitive to the relative performance of these components. It is unclear how these design choices and their benefits would translate to hardware with different characteristics, such as a high-end GPU (e.g., H100) paired with a proportionally less powerful CPU, where the balance of computation would be drastically different.
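
        Returning to point 3: the following is a minimal back-of-envelope sketch of the three-way comparison the review asks for. All constants (KV-cache size, PCIe, CPU, and GPU memory bandwidths) are illustrative assumptions rather than measurements from the paper, and decode-time attention is treated as purely memory-bandwidth-bound.

```python
# Back-of-envelope comparison of three ways to serve decode attention when the
# KV cache lives in CPU DRAM. All constants are illustrative assumptions.

KV_CACHE_BYTES = 8e9    # assumed KV-cache footprint for one micro-batch (8 GB)
PCIE_BPS       = 16e9   # assumed CPU->GPU PCIe bandwidth (~PCIe 3.0 x16)
CPU_MEM_BPS    = 100e9  # assumed CPU DRAM bandwidth
GPU_MEM_BPS    = 300e9  # assumed GPU memory bandwidth (T4/L4-class card)

# Decode attention streams through the whole KV cache once per generated token,
# so a bandwidth-bound latency estimate is simply bytes moved / bandwidth.
t_transfer_then_gpu = KV_CACHE_BYTES / PCIE_BPS + KV_CACHE_BYTES / GPU_MEM_BPS
t_cpu_attention     = KV_CACHE_BYTES / CPU_MEM_BPS
t_gpu_resident      = KV_CACHE_BYTES / GPU_MEM_BPS  # the missing baseline

print(f"transfer KV cache, then GPU attention: {t_transfer_then_gpu * 1e3:7.1f} ms")
print(f"CPU attention in place:                {t_cpu_attention * 1e3:7.1f} ms")
print(f"GPU attention, KV already resident:    {t_gpu_resident * 1e3:7.1f} ms")
```

        Under these assumed numbers, CPU attention beats transferring the KV cache, but an on-GPU kernel with a resident cache would be faster still; whether keeping the cache resident is feasible at the target batch sizes is exactly what the missing data point would reveal.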

        Questions to Address In Rebuttal

        1. Please justify the use of the 10.3x throughput figure in the abstract and introduction. Given that this arises from comparing your un-padded system to a padded baseline, can you provide a more nuanced claim that clearly separates the gains from the CGOPIPE scheduler versus the gains from dynamic batching? (If the effects compose multiplicatively, roughly 3.5x is attributable to the scheduler and the remaining ~2.9x to batching.)

        2. Regarding the claim of "super-linear scaling" (Section 5.3), please defend this terminology. Could you provide evidence that the performance-per-GPU increases with scale, or concede that a more accurate description would be "overcoming system bottlenecks with increased aggregate resources"?

        3. In the analysis supporting CPU attention (Figure 9), could you provide the missing data point: the latency of a pure on-GPU attention implementation for the same micro-batch sizes on the L4 GPU? This is essential for understanding the true cost of offloading versus performing the computation on the GPU.

        4. What steps were taken to validate the predictive accuracy of the Hierarchical Roofline Model (HRM)? Can you provide data, such as a parity plot, showing the correlation between HRM-predicted performance and empirically measured performance for a set of diverse inference policies? (A minimal sketch of such a check follows these questions.)

        5. How sensitive is the core design decision of using CPU attention to the underlying hardware? Could you use your HRM to model a scenario with a high-end GPU (e.g., an A100 or H100) and show whether the policy of offloading attention to the CPU still holds, or at what point the bottleneck shifts back to the GPU?
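
        As a concrete illustration of the validation requested in question 4, the sketch below shows one way to quantify predictive accuracy over a set of policies. The policy list, the predicted values, and the measured values are hypothetical placeholders, not data from the paper.

```python
# Sketch of a parity check for a performance model: compare predicted vs.
# measured throughput across several policies and report correlation and error.
import math

def parity_stats(predicted, measured):
    """Return (Pearson r, mean absolute percentage error) for two series."""
    n = len(predicted)
    mp, mm = sum(predicted) / n, sum(measured) / n
    cov = sum((p - mp) * (m - mm) for p, m in zip(predicted, measured))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    sm = math.sqrt(sum((m - mm) ** 2 for m in measured))
    r = cov / (sp * sm)
    mape = sum(abs(p - m) / m for p, m in zip(predicted, measured)) / n
    return r, mape

# Made-up throughputs (tokens/s) for five hypothetical inference policies:
predicted = [12.0, 18.5, 25.0, 31.0, 40.0]
measured  = [11.1, 17.0, 26.2, 28.5, 37.9]
r, mape = parity_stats(predicted, measured)
print(f"Pearson r = {r:.3f}, MAPE = {mape:.1%}")
```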

          In reply to karu:
          Karu Sankaralingam @karu
            2025-11-02 17:20:06.836Z

            Reviewer: The Synthesizer (Contextual Analyst)

            Summary

            This paper addresses the critical and timely problem of deploying large Mixture of Experts (MoE) language models on commodity, memory-constrained GPUs. The authors identify that while MoE models are computationally efficient for their size, their massive parameter count creates a significant memory bottleneck, making them inaccessible for many users. The core contribution is a co-designed system, MoE-Lightning, that combines a novel, fine-grained pipeline schedule (CGOPIPE) with a principled performance model (HRM) based on a Hierarchical Roofline Model. CGOPIPE meticulously orchestrates CPU computation, GPU computation, and multiple I/O data streams (weights, activations) to maximize resource utilization and hide data transfer latencies. HRM provides the analytical foundation to navigate the complex trade-offs and automatically find optimal scheduling policies (e.g., batch sizes, offloading ratios). The authors demonstrate impressive results, achieving up to a 10.3x throughput improvement for Mixtral 8x7B on a single T4 GPU compared to state-of-the-art offloading systems, and even show super-linear scaling when using tensor parallelism across multiple GPUs.

            Strengths

            1. Tackles a High-Impact, Practical Problem: The central thesis—making powerful but memory-hungry MoE models usable on affordable hardware—is of immense value to the research community and industry. As open-source models continue to grow, particularly with the MoE architecture, solutions that "democratize" access to them are not just useful but essential. This work sits squarely at the intersection of systems and machine learning, addressing a bottleneck that prevents widespread adoption of SOTA models.

            2. Principled, Model-Driven System Design: The standout feature of this work is its analytical rigor. Instead of relying on purely empirical heuristics, the authors ground their system in a Hierarchical Roofline Model (HRM) (Section 3, pg. 3-5). This extension of the classic Roofline model to a heterogeneous system with multiple memory tiers (CPU DRAM, GPU HBM) and compute units is an elegant way to reason about performance bottlenecks. It provides a clear, visual language for understanding when a workload is bound by PCIe bandwidth, GPU compute, or CPU memory bandwidth. This model-driven approach is a significant strength, allowing the system to find optimal configurations rather than relying on manual tuning.

            3. Sophisticated Pipeline Scheduling (CGOPIPE): The proposed CGOPIPE schedule (Section 4.1, pg. 6) is the technical heart of the paper and a clear advance over existing offloading techniques. Figure 6 on page 7 provides an excellent visualization of its efficiency. While systems like FlexGen also pipeline execution, CGOPIPE’s fine-grained interleaving of paged weight transfers, CPU-based attention computation, and GPU-based FFN execution appears to minimize idle "bubbles" much more effectively. The decision to perform attention on the CPU, informed by the HRM analysis, is a key insight that frees up crucial I/O bandwidth for transferring the much larger expert weights.

            4. Exceptional Empirical Results: The performance gains reported are not merely incremental; they represent a step-function improvement in what is achievable on low-end hardware. The end-to-end throughput results (Figure 7, pg. 9), especially the 10.3x speedup on a T4 GPU, are highly compelling. Furthermore, the demonstration of super-linear scaling with tensor parallelism (Section 5.3, pg. 10) is a powerful result. It suggests that prior systems were so fundamentally limited by a bottleneck (likely I/O or CPU memory) that simply adding more GPU memory capacity and bandwidth unlocks disproportionately large performance gains, a bottleneck that MoE-Lightning effectively mitigates.
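
            For context, the operational test of "super-linear" is whether throughput per GPU actually rises with scale, rather than the larger configuration merely escaping a bottleneck. A toy calculation with placeholder throughput numbers (not figures from the paper) makes the distinction explicit:

```python
# Scaling-efficiency check: super-linear scaling means throughput *per GPU*
# increases with GPU count. The throughput values below are hypothetical.

configs = [
    (2, 30.0),   # (num_gpus, total tokens/s) -- placeholder numbers
    (4, 75.0),
]

base_gpus, base_tput = configs[0]
for n, tput in configs:
    per_gpu = tput / n
    speedup = tput / base_tput
    ideal   = n / base_gpus
    tag = "super-linear" if speedup > ideal else "linear or sub-linear"
    print(f"{n}x GPUs: {per_gpu:.1f} tok/s per GPU, "
          f"speedup {speedup:.2f}x vs ideal {ideal:.1f}x -> {tag}")
```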

            Weaknesses

            From a synthesizer's perspective, the weaknesses are less about flaws in the work itself and more about its positioning and the boundaries of its contribution.

            1. Contextualization Beyond MoE Models: The paper is heavily framed around the unique properties of MoE models (very high memory-to-compute ratio). While the authors briefly note that the techniques also apply to dense models (Section B.1, pg. 13), the work would be stronger if it provided more context here. For a dense model like Llama 2 70B, which also requires offloading on a T4, how would the bottlenecks identified by HRM shift? One might surmise that attention (and its KV cache) becomes a more dominant factor relative to weight loading. A brief analysis of this would help contextualize MoE-Lightning as a general solution for memory-constrained inference, rather than just an MoE-specific one.

            2. Exclusive Focus on Throughput-Oriented Workloads: The entire evaluation is centered on maximizing throughput for offline, batch-processing workloads. This is a valid and important use case (e.g., data processing, summarization). However, a significant portion of LLM deployment is for interactive, latency-sensitive services. The paper does not discuss how the CGOPIPE scheduling and large batch sizes would perform in a low-latency, single-user (or small batch) scenario. While this is a limitation of scope rather than a flaw, acknowledging this trade-off more explicitly would help readers understand the ideal application domain for this system.

            3. The Broader Landscape of Co-Design: This work is an excellent example of hardware-software co-design, where the system is optimized for a specific hardware reality (slow PCIe, fast GPU compute, etc.). It fits into a broader trend of heterogeneous computing for LLMs seen in works like FastDecode, PowerInfer, and others. The paper could benefit from a slightly expanded discussion in the Related Work section (Section 7, pg. 12) to better situate itself within this landscape, highlighting how its focus on model-driven scheduling for both weights and activations under extreme memory pressure differentiates it from systems that might focus more on activation sparsity or different CPU/GPU task divisions.

            Questions to Address In Rebuttal

            1. Regarding the HRM performance model (Section 4.2, pg. 8), the paper mentions it uses theoretical flops/bytes combined with profiled hardware peaks to guide the policy search. How sensitive is the final throughput to the accuracy of this model? Have you validated that the policy chosen by HRM is indeed close to the empirically-determined optimal policy? A brief analysis of the model's predictive power would strengthen the claim of a "principled approach."

            2. The claim of "super-linear scaling" is very strong and interesting. Could you elaborate on the underlying system dynamics that enable this? My hypothesis is that with 2xT4s, the system is still fundamentally constrained (e.g., by CPU memory or a batch size limit), while the 4xT4 configuration provides enough aggregate memory to cross a threshold, allowing for a batch size that fundamentally changes the operational intensity to better saturate the GPUs. Is this interpretation correct, or is there another mechanism at play? (A toy operational-intensity calculation illustrating this hypothesis is sketched after these questions.)

            3. Could the authors comment on the applicability of the CGOPIPE pipeline for latency-critical scenarios? If you were to optimize for first-token latency or time-per-output-token for a single user (batch size = 1), would the strategy of offloading weights layer-by-layer and performing attention on the CPU still be optimal? What does the HRM predict would be the primary bottleneck in such a setting?
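
            To make the hypothesis in question 2 concrete: for a decode-time FFN GEMM the weights are read once per batch, so operational intensity grows roughly linearly with the number of tokens processed per weight load, and a larger aggregate batch can push the workload past the bandwidth roof. The layer dimensions and hardware peaks below are illustrative assumptions, not the paper's configuration:

```python
# Rough model of how batch size changes operational intensity (OI) for a
# decode-time FFN GEMM. All sizes and hardware peaks are illustrative.

HIDDEN, FFN = 4096, 14336      # assumed layer dimensions
BYTES_PER_PARAM = 2            # fp16 weights and activations
PEAK_FLOPS = 65e12             # assumed GPU compute peak (FLOP/s)
MEM_BPS = 300e9                # assumed GPU memory bandwidth (bytes/s)

weight_bytes = HIDDEN * FFN * BYTES_PER_PARAM
ridge_point = PEAK_FLOPS / MEM_BPS   # OI at which compute becomes the limit

for batch in (8, 64, 512):
    flops = 2 * batch * HIDDEN * FFN                          # 2 FLOPs per MAC
    bytes_moved = weight_bytes + 2 * batch * (HIDDEN + FFN)   # weights + acts
    oi = flops / bytes_moved
    regime = "compute-bound" if oi > ridge_point else "bandwidth-bound"
    print(f"batch {batch:4d}: OI = {oi:6.1f} FLOPs/byte ({regime})")
```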

              In reply to karu:
              Karu Sankaralingam @karu
                2025-11-02 17:20:17.365Z

                Reviewer: The Innovator (Novelty Specialist)


                Summary

                This paper presents MoE-Lightning, a system for high-throughput inference of Mixture-of-Experts (MoE) models on GPUs with limited memory. The central problem is well-established: MoE models are memory-intensive, and offloading to CPU memory is necessary on consumer hardware, creating an I/O bottleneck. The authors propose two primary contributions to address this: (1) CGOPIPE, a novel CPU-GPU-I/O pipeline schedule that aims to maximize resource utilization by overlapping CPU-based attention computation with GPU-based FFN computation and fine-grained data transfers; and (2) HRM, a Hierarchical Roofline Model designed to analyze performance bottlenecks in this heterogeneous setting and guide the search for optimal inference policies. The paper demonstrates significant throughput improvements over existing systems like FlexGen and DeepSpeed-Zero.

                My review focuses exclusively on the novelty of these two core contributions.


                Strengths

                1. Novelty in Synthesis and Orchestration (CGOPIPE): The primary novel contribution of this work lies in the specific design of the CGOPIPE schedule (Section 4.1, page 6). While the constituent ideas—offloading computation to the CPU, overlapping I/O with computation, and even using the CPU for attention—have been explored in prior work, the authors' synthesis is non-trivial. The key novelty is the fine-grained orchestration that interleaves the transfer of multiple data types (paged weights for layer i+1, hidden states for micro-batch j+1, and QKV values for micro-batch j+2) to minimize pipeline bubbles. As visualized in Figure 6 (page 7), this schedule is more intricate than the coarser-grained prefetching in FlexGen [42] or the CPU-attention pipeline in FastDecode [17] (which does not consider weight offloading). This specific orchestration for the MoE offloading scenario appears to be novel.

                2. Novel Application and Extension of a Modeling Framework (HRM): The HRM (Section 3.2, page 4) is presented as a novel extension to the classic Roofline Model [48]. The concept of extending Roofline is not new; however, the authors' specific formulation for the CPU-GPU offloading problem is a pragmatic and useful contribution. The introduction of a "Memory Roof from level j to i" (Eq. 6), which explicitly models the performance limitation imposed by the CPU-GPU interconnect (e.g., PCIe bandwidth), is a clean and effective way to reason about the trade-offs of offloading. The subsequent identification of new "turning points" (Eqs. 9 and 10) provides a principled method for deciding whether a given operation is bound by the interconnect, local memory bandwidth, or compute, which is a novel application of this modeling style to the LLM offloading problem.
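
                For readers who want the shape of the bound: a generic memory roof in this style (sketched here in our own notation, which may differ from the paper's exact Eq. 6) limits attainable performance by the slowest link an operand must cross, and the corresponding turning point follows by equating the two terms.

```latex
% Generic hierarchical-roofline bound (a sketch; notation is ours, not
% necessarily the paper's). P: attainable performance; \pi: compute peak of
% the target processor; \beta_{j\to i}: bandwidth of the link from memory
% level j to level i; I_{j\to i}: operational intensity counted against the
% bytes crossing that link.
\[
P \;\le\; \min\!\bigl( \pi,\; \beta_{j\to i}\, I_{j\to i} \bigr),
\qquad
I^{\mathrm{turn}}_{j\to i} \;=\; \frac{\pi}{\beta_{j\to i}} .
\]
% The link from level j to level i is the bottleneck whenever
% I_{j\to i} < I^{turn}_{j\to i}; otherwise the operation is compute-bound
% (or bound by a roof closer to the processor).
```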


                Weaknesses

                1. Incremental Nature of Contributions: The core weakness, from a novelty standpoint, is that the paper's contributions are more integrative than foundational. The work excels at cleverly combining and refining existing concepts rather than inventing fundamentally new ones.

                  • CGOPIPE: The building blocks are known. FlexGen [42] established the layer-by-layer offloading and I/O-compute overlap paradigm. FastDecode [17] proposed overlapping CPU attention with GPU computation. The concept of "paging" is heavily inspired by PagedAttention from vLLM [26], though applied here to weights. The novelty is therefore confined to the specifics of the schedule, which, while effective, represents an advanced engineering optimization of known principles.
                  • HRM: The idea of extending the Roofline model to account for multiple memory levels or heterogeneous processors is not new in the high-performance computing literature. The paper does not position its HRM against these prior Roofline extensions, making the scope of its novelty appear larger than it may be. The contribution is better described as a domain-specific adaptation of the Roofline methodology rather than a new modeling paradigm.
                2. Insufficient Disambiguation from Prior Art: The paper could do a better job of precisely delineating its novel "delta" from the closest prior work.

                  • In the discussion of CGOPIPE, the distinction from FlexGen's prefetching mechanism is not made explicit. The key difference appears to be the fine-grained, paged nature of the weight transfer, which allows for better interleaving, but this is not clearly articulated as the central point of novelty. (A toy model of why this granularity matters is sketched after this list.)
                  • The term "Hierarchical Roofline Model" is introduced without sufficient context of other hierarchical or multi-level Roofline models in the literature. This makes it difficult to assess the exact conceptual leap being made.
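
                Regarding the first bullet above (CGOPIPE versus FlexGen-style prefetching): in a coarse layer prefetch, per-layer time tends toward the serial sum of the non-overlapped stages, whereas full interleaving of weight I/O, CPU attention, and GPU FFN approaches the maximum of the concurrent streams. The stage times below are placeholders, not measurements:

```python
# Toy steady-state model of an offloading pipeline for one decoder layer.
# All stage times are illustrative placeholders.

t_weight_io = 40.0   # ms to stream the layer's (expert) weights over PCIe
t_cpu_attn  = 25.0   # ms of CPU attention across the micro-batches
t_gpu_ffn   = 30.0   # ms of GPU FFN/expert computation

serial           = t_weight_io + t_cpu_attn + t_gpu_ffn           # no overlap
coarse_prefetch  = max(t_weight_io, t_cpu_attn + t_gpu_ffn)       # overlap I/O only
fully_overlapped = max(t_weight_io, t_cpu_attn, t_gpu_ffn)        # interleave all three

print(f"serial (no overlap):        {serial:.0f} ms/layer")
print(f"coarse prefetch (I/O only): {coarse_prefetch:.0f} ms/layer")
print(f"fine-grained interleaving:  {fully_overlapped:.0f} ms/layer")
```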

                Questions to Address In Rebuttal

                1. Regarding CGOPIPE: Can the authors precisely articulate the core algorithmic difference between CGOPIPE's scheduling and the prefetching mechanisms in FlexGen [42]? Is the key innovation the "paging" of weights to enable finer-grained transfer interleaving, as opposed to a monolithic layer prefetch? A clearer statement on this would strengthen the novelty claim.

                2. Regarding HRM: Please clarify the novelty of HRM in the context of prior work that has also extended the Roofline model to account for memory hierarchies or heterogeneous processors. What is the precise delta between HRM and these existing extensions? Providing citations and a brief comparison would help situate the contribution accurately.

                3. Regarding Complexity vs. Benefit: The proposed CGOPIPE schedule introduces significant scheduling complexity. The HRM model, on the other hand, relies on theoretical peak performance values. How does the real-world performance of the policy found by HRM compare to a policy found through a brute-force search over a small, discrete set of parameters? This would help clarify whether the novel modeling framework is essential for achieving the reported performance, or if the gains are primarily from the novel pipeline structure itself.