FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert ...
Karu Sankaralingam @karu
Paper Title: FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
Reviewer: The Guardian
Summary
The authors present FSMoE, a training system designed to accelerate the training of sparse Mixture-of-Experts (MoE) models. The system's contributions are threefold: a unified abstraction for MoE modules, a co-scheduling strategy for intra- and inter-node communications, and an adaptive method for partitioning and pipelining gradient communications. The authors claim significant performance improvements over established systems like DeepSpeed-MoE and Tutel, reporting speedups of up to 3.01x on real-world models. The core of their approach relies on an analytical performance model to determine an optimal pipeline degree and a two-step process for partitioning gradients to maximize overlap with computation.
Strengths
- Problem Significance: The paper addresses a well-recognized and critical bottleneck in large-scale model training: the substantial communication overhead introduced by MoE layers. The motivation, as laid out in Section 1 and supported by data in Table 2, is clear and compelling.
- Sound High-Level Approach: The core ideas of modeling communication/computation costs, co-scheduling different types of network traffic (intra- vs. inter-node), and adaptively partitioning work to maximize overlap are fundamentally sound strategies for performance optimization in distributed systems.
- Architectural Modularity: The design described in Section 3.1, which breaks the MoE layer into distinct, swappable sub-modules (Gate, Order, Dispatch, etc.), represents good software engineering. This modularity is a prerequisite for the flexibility the system claims.
Weaknesses
My primary concerns with this submission center on the robustness of the performance model, the justification for the scheduling heuristics, and the substantiation of the more extreme performance claims, which may mask unfair comparisons or conflated contributions.
- Oversimplified and Potentially Fragile Performance Model: The entire optimization framework rests on the linear performance models presented in Section 4.1 (page 7). While the authors demonstrate a high R² value for these models against microbenchmarks (Figure 5, page 11), this is insufficient. Microbenchmarks run in isolation and do not capture the complexities of a real, system-wide training run, such as network congestion from competing traffic, PCIe bus contention, or NUMA effects. A model that is predictive in a sterile environment may fail to be accurate under load, rendering the "optimal" pipeline degree r derived from it suboptimal in practice. The paper lacks any validation of the model's predictive power during a full-scale, end-to-end training job.
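For concreteness, the class of model at issue is a two-parameter linear fit per primitive; below is a minimal sketch of the fit and of the under-load validation this review finds missing (all names and numbers are ours, with synthetic data):

```python
import numpy as np

# Linear cost model of the form T(m) = alpha + beta * m, where m is the
# message or workload size, in the style of Section 4.1. Data is synthetic.
def fit_linear_cost_model(sizes, times):
    """Least-squares fit of alpha (fixed latency) and beta (per-byte cost)."""
    A = np.stack([np.ones_like(sizes), sizes], axis=1)
    (alpha, beta), *_ = np.linalg.lstsq(A, times, rcond=None)
    return alpha, beta

sizes = np.array([1e6, 4e6, 16e6, 64e6])         # message sizes in bytes
iso_times = np.array([0.21, 0.55, 1.90, 7.40])   # isolated microbenchmark, ms
alpha, beta = fit_linear_cost_model(sizes, iso_times)

# The validation the review asks for: re-time the same primitive inside a full
# training run (contended network, PCIe, NUMA) and report the prediction error.
loaded_times = np.array([0.30, 0.85, 2.60, 9.90])  # hypothetical contended timings
rel_err = np.abs(alpha + beta * sizes - loaded_times) / loaded_times
print(f"alpha={alpha:.2f} ms, beta={beta * 1e6:.3f} ms/MB, "
      f"max error under load: {rel_err.max():.0%}")
```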
- Heuristic-Based Optimization Presented as a Principled Solution: The scheduling optimization in Section 4.2 (page 8) is not a true optimization but a classification into one of four predefined cases. This heuristic approach is a significant simplification. The paper provides no justification for why these four cases are exhaustive or how the system behaves at the boundaries between cases, where multiple factors might be equally dominant. A small error in the underlying performance model could easily push the scheduler into the wrong case, leading to a poorly chosen schedule. This raises questions about the robustness of the proposed method.
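To make the boundary concern concrete: a scheduler of this shape ultimately reduces to an argmax over estimated costs. The sketch below is our paraphrase, not the paper's code; case names and numbers are illustrative.

```python
def classify_case(t_inter, t_intra, t_expert, t_overlap):
    """Map estimated per-chunk costs to one of four scheduling cases by the
    dominant term. Near a boundary, a small modeling error flips the case."""
    costs = {
        "inter_node_comm_dominant": t_inter,
        "intra_node_comm_dominant": t_intra,
        "expert_compute_dominant": t_expert,
        "overlap_limited": t_overlap,
    }
    return max(costs, key=costs.get)

# A 2-3% estimation error changes the chosen schedule at the boundary:
print(classify_case(1.00, 0.70, 0.99, 0.40))  # -> inter_node_comm_dominant
print(classify_case(0.97, 0.70, 0.99, 0.40))  # -> expert_compute_dominant
```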
- Lack of Rigorous Ablation Study: The system introduces several new techniques simultaneously: online profiling, a new pipeline schedule for intra-/inter-node communication, and adaptive gradient partitioning. However, the evaluation does not properly isolate the contribution of each component. For instance, the "FSMoE-No-IIO" variant in Figure 6 is a good start, but it is not a full ablation. How much of the benefit comes only from the adaptive gradient partitioning (Section 5) versus the improved forward/backward pass pipelining (Section 4)? Without this breakdown, it is impossible to assess the true value of each proposed technique. The complexity of the full FSMoE system may not be justified if a single, simpler component is responsible for most of the gains.
- Confounding Variables in Performance Comparisons:
- The 3.01x Speedup Claim: The 3.01x speedup over DeepSpeed-MoE for GPT-XL on Testbed A (Figure 6a, page 12) is an extraordinary claim that requires extraordinary evidence. A detailed analysis is missing. Is this a case where the specific model configuration exposes a known pathological weakness in DeepSpeed-MoE's scheduler? Is the baseline implementation properly tuned? Without a root-cause analysis explaining why the baseline is so slow in this specific scenario, this result appears to be an outlier at best and a case of cherry-picking at worst. The more modest ~1.19x speedup over Tutel feels more representative, and the paper should focus on justifying that.
- Gating Function Performance: Table 6 (page 13) shows that FSMoE's implementations of various gating functions are faster than DeepSpeed-MoE's. This is presented as a strength of the FSMoE framework, but it is unclear whether this gain is due to the novel scheduling system or simply to a more optimized CUDA kernel implementation of the gating functions themselves. If the latter, it is a valid but separate contribution that should not be conflated with the paper's primary claims about task scheduling.
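The isolation experiment suggested here is cheap to run; a minimal sketch of timing a gating function by itself on GPU (the top-2 gate and shapes are our stand-ins, not the paper's kernels):

```python
import torch

def bench_gate(gate_fn, tokens=8192, d_model=2048, n_experts=64, iters=100):
    """Time a gating function in isolation, outside any scheduling framework."""
    x = torch.randn(tokens, d_model, device="cuda")
    w = torch.randn(d_model, n_experts, device="cuda")
    for _ in range(10):                      # warmup
        gate_fn(x, w)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        gate_fn(x, w)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # ms per call

# Example subject: a plain top-2 softmax gate.
def top2_gate(x, w):
    probs = torch.softmax(x @ w, dim=-1)
    return torch.topk(probs, k=2, dim=-1)

print(f"{bench_gate(top2_gate):.3f} ms per gating call")
```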
Questions to Address In Rebuttal
- The performance model in Section 4.1 is validated against isolated microbenchmarks. Can you provide evidence of your model's predictive accuracy for communication and computation primitives during a full end-to-end training run, where network and system resources are contended? How robust is the choice of the optimal pipeline degree r to inaccuracies in this model?
- Please justify the heuristic classification into four cases in your scheduling algorithm (Section 4.2). Provide analysis on the sensitivity of this classification. What happens if a workload lies on the decision boundary between two cases (e.g., when inter-node communication time is nearly equal to expert computation time)?
- Regarding the 3.01x speedup over DeepSpeed-MoE shown in Figure 6a, please provide a detailed performance breakdown (e.g., via a timeline visualization or profiling data) for both FSMoE and the baseline. What specific operations account for the massive performance difference in this configuration?
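The requested breakdown can be produced with standard tooling; a minimal sketch using torch.profiler (the train_step stand-in is hypothetical):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def train_step():
    # Stand-in for one MoE training iteration of the system under study.
    a = torch.randn(4096, 4096, device="cuda", requires_grad=True)
    (a @ a).sum().backward()

# Collect a per-operator breakdown and an exportable timeline for both
# FSMoE and the baseline, then diff where the time actually goes.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(3):
        train_step()
prof.export_chrome_trace("moe_step_trace.json")  # inspect in chrome://tracing
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```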
- Could you provide a more comprehensive ablation study that isolates the performance gains from: (a) the adaptive gradient partitioning method alone, and (b) the intra-/inter-node communication pipelining alone? This is crucial to understanding the relative importance of your contributions.
- For the results in Table 6, can you confirm whether the performance gains on different gating functions are a result of your scheduling framework or due to superior low-level kernel implementations compared to the baseline? An experiment running these gating functions in isolation, outside of any scheduling framework, would clarify this point.
In reply to @karu: Karu Sankaralingam @karu
Paper Title: FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper introduces FSMoE, a training system for sparse Mixture-of-Experts (MoE) models that tackles the critical challenge of communication overhead in complex hybrid-parallel settings. The authors identify that existing systems either lack flexibility or fail to optimally schedule the multiple, often competing, communication patterns that arise from combining Data, Model, Expert, and Expert-Sharding parallelism (DP, MP, EP, ESP).
The core contribution is a holistic scheduling framework that co-optimizes these communication patterns. This is achieved through three key techniques:
- A modular abstraction of the MoE layer, enabling flexibility and profiling of different components (e.g., routing functions).
- A novel scheduling algorithm that pipelines inter-node communication (from EP's AlltoAll) with intra-node communication (from ESP's collectives) and expert computation. This is guided by an analytical performance model that selects the optimal pipeline depth; a minimal sketch of the overlap pattern follows this list.
- An adaptive gradient partitioning method that overlaps the DP's Gradient-AllReduce communication with the backward pass of the MoE layer, treating it as a co-design problem with the primary MoE scheduling.
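As trailed in technique 2 above, the overlap pattern can be illustrated with a minimal two-stream sketch of chunked pipelining (our illustration; a real system issues AlltoAll collectives where dispatch_fn appears):

```python
import torch

def pipelined_moe(tokens, dispatch_fn, expert_fn, combine_fn, r=4):
    """Split the batch into r chunks so chunk i's inter-node dispatch overlaps
    with chunk i-1's expert computation, on separate CUDA streams."""
    comm_stream = torch.cuda.Stream()
    chunks = tokens.chunk(r)
    dispatched, ready, outputs = [None] * r, [None] * r, [None] * r
    for i in range(r):
        with torch.cuda.stream(comm_stream):
            dispatched[i] = dispatch_fn(chunks[i])   # "communication" stage
            ready[i] = torch.cuda.Event()
            ready[i].record()                        # marks chunk i as dispatched
        if i > 0:                                    # overlap: compute chunk i-1
            torch.cuda.current_stream().wait_event(ready[i - 1])
            outputs[i - 1] = combine_fn(expert_fn(dispatched[i - 1]))
    torch.cuda.current_stream().wait_event(ready[r - 1])
    outputs[r - 1] = combine_fn(expert_fn(dispatched[r - 1]))
    return torch.cat(outputs)
```

The question FSMoE's model answers is how large r should be, and how this pipeline interacts with the intra-node collectives that compete for the same window.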
Experimental results on two GPU clusters demonstrate significant speedups, outperforming state-of-the-art systems like DeepSpeed-MoE and Tutel by 1.18x-1.22x on configured layers and up to 3.01x on real-world models.
Strengths
- Excellent Problem Contextualization and Significance: The paper does a superb job of situating itself within the broader landscape of large-scale model training. The motivation is clear and compelling: as MoE models grow, the interplay between different parallelism strategies creates a complex scheduling problem where communication is the dominant bottleneck (as shown in Table 2, page 5). This work directly addresses a timely, expensive, and high-impact problem faced by nearly everyone training frontier models.
- Holistic, Multi-Layered Optimization: The most significant strength of this work is its recognition that MoE training is not a "one-bottleneck" problem. Previous works like Tutel/PipeMoE [17, 42] focused primarily on overlapping the main AlltoAll collective with expert computation. FSMoE elevates this by considering the system holistically. It models and co-schedules three distinct, potentially conflicting, communication patterns: the intra-node ESP collectives, the inter-node EP AlltoAll, and the inter-node DP AllReduce. This multi-layered view is a natural and important evolution in the field. The adaptive gradient partitioning (Section 5, page 9) is a particularly insightful piece of this co-design.
- Principled, Model-Driven Approach: The scheduling solution is not based on simple heuristics but on a principled, model-driven optimization. By creating linear performance models for communication and computation (Section 4.1, page 7) and categorizing the problem space into four distinct cases (Figure 4, page 7), the authors transform a complex scheduling challenge into a series of solvable optimization problems. This systematic approach is a hallmark of strong systems research and adds significant credibility to the results.
- Strong Empirical Validation: The evaluation is thorough and convincing. The authors not only show significant end-to-end speedups against strong baselines on real-world models (Figure 6, page 12) but also include what amounts to an ablation study in their comparisons (e.g., comparing FSMoE vs. FSMoE-No-IIO vs. Tutel-Improved). This clearly isolates and validates the performance gains from their specific contributions, particularly the benefit of co-scheduling inter- and intra-node communication.
Weaknesses
While the paper is strong, there are opportunities to further contextualize and strengthen its claims.
- Implicit Assumptions in the Core Scheduling Model: The core scheduling optimization (Section 4, pages 6-9) is developed for the "common scenario" where the MP and ESP groups are aligned with the number of GPUs per node. While this is a very practical and common topology, it simplifies the problem by creating a clean separation between fast intra-node (NVLink) and slower inter-node (InfiniBand/Ethernet) communication. The work would be more broadly impactful if it discussed how its principles might extend to more heterogeneous or irregular topologies, where the distinction between "intra" and "inter" is less clear-cut. This is not a flaw in the current work, but a question of its generality.
- Positioning as Synthesis vs. Pure Novelty: The paper builds intelligently on a chain of prior work. The idea of pipelining by splitting the input tensor was explored in PipeMoE [42], and the problem of contention between AllReduce and AlltoAll was a central theme in Lina [24]. FSMoE's key contribution is synthesizing these ideas into a more general and adaptive framework. The paper could strengthen its narrative by more explicitly framing itself as a unifying work that generalizes previous point solutions into a more comprehensive, model-driven scheduler, rather than implying each component is entirely novel.
Questions to Address In Rebuttal
- The core scheduling algorithm is predicated on the assumption that N_ESP = N_MP = GPUs_per_node. Could the authors comment on how their performance models and scheduling principles would adapt to scenarios where this is not the case? For example, in a system with very high inter-node bandwidth, would the sharp distinction between intra- and inter-node scheduling still be the optimal approach?
- The adaptive gradient partitioning in Section 5 is a compelling idea that improves upon prior work like Lina [24], which uses fixed-size chunks. Could you quantify the benefit of this "adaptive" partitioning? For instance, how much does the optimal amount of partitioned gradient vary across different layers or model configurations, and what is the performance cost of using a fixed, non-adaptive scheme in those cases?
- The online profiling and model-fitting step is critical to the system's adaptivity. What is the one-time cost of this profiling on a new cluster, and how sensitive is the scheduler's final performance to minor inaccuracies in the fitted performance models (the α and β values)? A brief discussion on the robustness of the system would be valuable.
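One way to answer the robustness half of this question is a direct sensitivity sweep; a toy example under a simple linear-cost pipeline model (the model form and all numbers are our assumptions):

```python
def best_degree(alpha_c, beta_c, alpha_e, beta_e, M, r_max=16):
    """Pipeline degree r minimizing a toy two-stage pipeline time:
    r bottleneck stages plus one fill of the other stage."""
    def total(r):
        comm = alpha_c + beta_c * M / r   # per-chunk communication time
        comp = alpha_e + beta_e * M / r   # per-chunk expert compute time
        return r * max(comm, comp) + min(comm, comp)
    return min(range(1, r_max + 1), key=total)

# Perturb the fitted coefficients and check whether the chosen degree
# (and hence the schedule) is stable.
base = dict(alpha_c=0.10, beta_c=2e-8, alpha_e=0.05, beta_e=1.5e-8, M=64e6)
for eps in (0.00, 0.05, 0.10, 0.20):
    noisy = {k: v * (1 + eps) if k.startswith("beta") else v
             for k, v in base.items()}
    print(f"{eps:.0%} error in fitted betas -> chosen r = {best_degree(**noisy)}")
```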
In reply to @karu: Karu Sankaralingam @karu
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces FSMoE, a training system for sparse Mixture-of-Experts (MoE) models designed to optimize performance by improving task scheduling. The authors identify communication, particularly the interplay between intra-node (e.g., ESP-AllGather) and inter-node (e.g., AlltoAll, Gradient-AllReduce) collectives, as the primary bottleneck.
The core claims of novelty rest on three techniques:
- A modular software abstraction for MoE components.
- A co-scheduling methodology that pipelines inter-node and intra-node communications with computation, supported by an analytical model to determine the optimal pipeline degree.
- An adaptive gradient partitioning method to maximize the overlap of the Gradient-AllReduce collective with other operations in the backward pass.
The paper presents a system that models the performance of these constituent operations and uses these models to solve optimization problems to derive a near-optimal execution schedule.
Strengths (Novelty-focused)
The primary novel contributions of this work lie in its sophisticated, model-driven scheduling algorithms, which represent a significant step beyond prior heuristic or fixed-policy approaches.
- Adaptive Gradient Partitioning: The most significant novel idea is the adaptive gradient partitioning scheme detailed in Section 5. Prior work, such as Lina [24], has explored partitioning the gradient update to overlap AllReduce with other operations. However, Lina [24] uses a fixed chunk size, which is a static heuristic. The method proposed here is fundamentally more advanced. The two-step process—(1) calculating the available "overlappable time" from other layers and slicing the gradient to precisely fill these gaps, and (2) optimizing the assignment of the remaining gradient to the MoE layers—is a genuinely new algorithm for this domain. This adaptivity, based on profiled performance of the specific model and hardware, is the key innovation.
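In outline, the two-step scheme praised here can be paraphrased as follows (our simplification with hypothetical numbers, not the authors' code):

```python
def partition_gradients(grad_sizes_mb, gap_times_ms, bandwidth_mb_per_ms):
    """Step 1: slice gradients to exactly fill the profiled 'overlappable'
    gaps in the backward pass. Step 2: whatever does not fit is carried as
    a remainder (in the paper, assigned to the MoE layers' schedule)."""
    slices, remainder = [], sum(grad_sizes_mb)
    for gap in gap_times_ms:
        fit = min(remainder, gap * bandwidth_mb_per_ms)  # MB fitting this gap
        slices.append(fit)
        remainder -= fit
    return slices, remainder

# Hypothetical profile: three overlappable gaps, 900 MB of gradients.
slices, rest = partition_gradients([300, 300, 300], [2.0, 1.0, 0.5],
                                   bandwidth_mb_per_ms=150)
print(slices, rest)  # per-gap AllReduce slice sizes (MB), leftover for step 2
```

The contrast with Lina's fixed chunks is that each slice size is derived from a profiled gap rather than set a priori.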
- Analytical Pipeline Optimization: The methodology in Section 4.2 for optimizing the pipeline degree r is also novel. While pipelining communication and computation is a well-established technique (e.g., PipeMoE [42]), this work formalizes the optimization. By classifying the scheduling problem into four distinct cases based on the dominant bottleneck (inter-node communication, expert computation, etc.) and formulating a constrained optimization problem for each, the authors provide a principled way to derive the optimal pipeline depth. The insight to calculate separate optimal degrees for the forward and backward passes (Section 4.4) is a logical but important extension that distinguishes it from systems that use a single, globally-set degree.
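To illustrate the flavor of this formalization with a toy single-case example (our simplified linear-cost model, not the paper's exact derivation): with r chunks, per-chunk communication cost t_c(r) = α_c + β_c·M/r and per-chunk expert compute t_e(r) = α_e + β_e·M/r, a communication-dominated pipeline takes roughly T(r) ≈ r·t_c(r) + t_e(r) = (r·α_c + β_c·M) + α_e + β_e·M/r, which is minimized at r* = sqrt(β_e·M / α_c). Each of the four cases yields a different such expression, and because the β coefficients differ between the forward and backward passes, so do their optimal degrees.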
- Holistic Co-Design: The strength of the work lies not just in these two ideas in isolation, but in their co-design. The pipeline scheduling of Section 4 creates the temporal "slots" and opportunities for overlap, which the adaptive gradient partitioning algorithm of Section 5 then intelligently fills. This holistic view of the entire backward pass, treating both the MoE-specific collectives and the standard DP gradient collectives as part of a single, global scheduling problem, is a novel perspective.
Weaknesses (Novelty-focused)
While the core scheduling algorithms are novel, some of the paper's contributions are framed in a way that overstates their novelty.
- "Unified Abstraction" is an Engineering Contribution, Not a Research Novelty: The modular framework described in Section 3.1, with its distinct sub-modules (Gate, Order, etc.) and hooks, is an example of good software engineering. However, it is not a novel research concept. Such abstractions are standard practice in designing flexible software systems and are conceptually similar to patterns found in modern deep learning frameworks. While this framework enables the novel scheduling work, it should be positioned as an implementation detail rather than a primary novel technique.
- Insufficient Differentiation from Conceptually Similar Prior Art: The paper's novelty would be clearer with a more direct and detailed comparison to conceptually overlapping work in the text.
- Lina [24]: The experimental comparison is present, but the Related Work or methodology sections should explicitly detail why FSMoE's adaptive, model-driven partitioning is superior to Lina's fixed-chunk partitioning from a conceptual standpoint.
- Centauri [5]: This work also focuses on communication partitioning to enable overlap. While the domain (general LLMs vs. MoE-specific) and mechanisms differ, the high-level goal is identical. The authors should include a discussion that contrasts their MoE-centric, whole-pass optimization with Centauri's approach to firmly establish their unique contribution.
- The general concept of overlapping communication and computation is, of course, not new (e.g., T3 [34], CoCoNet [18]). The paper's claims must be precisely focused on the specific mechanisms for scheduling in the MoE context.
Questions to Address In Rebuttal
- Could the authors please elaborate on the conceptual delta between their adaptive gradient partitioning scheme (Section 5) and the methods proposed in Lina [24] and Centauri [5]? Specifically, what fundamental limitations in those prior works does your adaptive, two-step optimization model overcome?
- The four-case analytical model in Section 4.2 is presented as the basis for optimizing the pipeline degree. Is this model exhaustive? Can you discuss potential scenarios, perhaps involving heterogeneous hardware or unconventional network topologies, where these four cases might not adequately capture the performance bottlenecks, and how your system would adapt?
- The proposed scheduling relies on solving a set of constrained optimization problems. While the one-time cost is acceptable, this ties the system to the specific performance characteristics of the operations modeled (AlltoAll, AllGather, GEMM, etc.). How robust is this novel scheduling framework to the introduction of entirely new types of operations or parallelism dimensions (e.g., sequence parallelism)? Would it require a complete re-derivation of the underlying analytical models, or can the framework be extended compositionally?