
WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips

By Karu Sankaralingam @karu
    2025-08-03 22:36:13.958Z · 2025-08-03 22:42:57.148Z

    Link: https://dl.acm.org/doi/10.1145/3695053.3731101

    Abstract: The deployment of large language models (LLMs) imposes significant demands on computing, memory, and communication resources. Wafer-scale technology enables the high-density integration of multiple single-die chips with high-speed Die-to-Die (D2D) interconnections, presenting a promising solution to meet these demands arising from LLMs. However, given the limited wafer area, a trade-off needs to be made among computing, storage, and communication resources. Maximizing the benefits and minimizing the drawbacks of wafer-scale technology is crucial for enhancing the performance of LLM service systems, which poses challenges to both architecture and scheduling. Unfortunately, existing methods cannot effectively address these challenges.
    To bridge the gap, we propose WSC-LLM, an architecture and scheduling co-exploration framework. We first define a highly configurable general hardware template designed to explore optimal architectural parameters for wafer-scale chips. Based on it, we capitalize on the high D2D bandwidth and fine-grained operation advantages inherent to wafer-scale chips to investigate optimal disaggregated scheduling strategies, effectively addressing the highly dynamic demands of LLM workloads. Compared to the state-of-the-art (SOTA) LLM service systems, WSC-LLM can achieve an average overall performance improvement of 3.12× across various LLM models and datasets. Moreover, we leverage WSC-LLM to reveal intriguing insights about wafer-scale architecture design and the execution of LLM workloads.

    • 3 replies
    1. Karu Sankaralingam @karu
        2025-08-03 22:41:25.366Z

        Review from the perspective of "The Guardian," whose primary disposition is one of extreme skepticism, and the goal is to rigorously stress-test this paper to find every potential flaw.

        Review Form

        Summary

        This paper introduces WSC-LLM, a framework designed to co-explore hardware architecture and scheduling strategies for serving Large Language Models (LLMs) on wafer-scale chips. The authors identify the core trade-offs on wafer-scale hardware between computing, storage, and communication resources, particularly concerning DRAM capacity. The proposed solution, WSC-LLM, features a Central Scheduler and a Memory Scheduler. The Central Scheduler maps the distinct prefill and decoding phases of LLM inference to separate, appropriately configured hardware resource partitions. The Memory Scheduler manages the storage and transfer of the KV cache to improve memory utilization. The paper claims that this co-exploration framework significantly outperforms state-of-the-art (SOTA) systems, reporting an average performance improvement of 3.12x.
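
        For concreteness, the control flow this summary describes could be sketched roughly as follows; every name, partition size, and data structure here is the reviewer's assumption for illustration, not the authors' code or API.

        # Illustrative sketch only (Python); names and structures are assumed,
        # not taken from the WSC-LLM artifact.
        from collections import deque
        from dataclasses import dataclass, field

        @dataclass
        class Partition:
            dies: int                      # compute dies assigned to this pool
            dram_gb: float                 # aggregate DRAM capacity of the pool
            queue: deque = field(default_factory=deque)

        class CentralScheduler:
            """Route each request's prefill to a compute-heavy pool and its
            decode to a memory-heavy pool, as the summary above describes."""
            def __init__(self, prefill_pool: Partition, decode_pool: Partition):
                self.prefill_pool = prefill_pool
                self.decode_pool = decode_pool

            def admit(self, request_id: str, prompt_tokens: int) -> None:
                # Prefill is compute-bound: batch it on the compute-optimized pool.
                self.prefill_pool.queue.append((request_id, prompt_tokens))

            def on_prefill_done(self, request_id: str, kv_cache_gb: float) -> None:
                # Hand the KV cache over to the memory-centric pool; a Memory
                # Scheduler would decide where on the wafer the cache lives.
                if kv_cache_gb > self.decode_pool.dram_gb:
                    raise RuntimeError("decode pool lacks KV-cache capacity")
                self.decode_pool.dram_gb -= kv_cache_gb
                self.decode_pool.queue.append(request_id)

        scheduler = CentralScheduler(Partition(dies=24, dram_gb=96.0),
                                     Partition(dies=8, dram_gb=384.0))
        scheduler.admit("req-0", prompt_tokens=2048)
        scheduler.on_prefill_done("req-0", kv_cache_gb=1.5)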


        Strengths

        The paper is well-structured and addresses a relevant and challenging problem. The core strengths are:

        • Clear Problem Motivation: The work does an excellent job of articulating the fundamental trade-offs in designing wafer-scale systems for LLMs, specifically the tension between DRAM capacity, D2D bandwidth, and the number of available compute dies on a fixed-size wafer (Section 2, Page 3).
        • Sound Conceptual Approach: Recognizing the distinct resource requirements of the prefill (computation-intensive) and decoding (memory-intensive) phases is critical. Proposing to partition hardware resources and optimize scheduling for each phase independently is a logical and sound approach to tackling this dichotomy (Section 3.1, Page 4).
        • Demonstrated Component Value: The ablation study effectively isolates and demonstrates the performance contributions of the proposed Central Scheduler and Memory Scheduler, showing that each component provides a clear benefit to the system (Figure 12, Page 10).

        Weaknesses

        Despite the strengths, the paper suffers from several critical weaknesses that undermine the validity and significance of its conclusions.

        • Unsupported Performance Claims Due to Unvalidated Simulation: The entire evaluation rests on a custom simulator. The paper states the performance evaluator is built upon "existing well established frameworks" (Section 5.1, Page 8) but crucially fails to name these frameworks, describe the validation process, or present any data correlating the simulator's output with the performance of real hardware. Without rigorous validation, all quantitative results, including the headline 3.12x performance improvement claim, are unsubstantiated.
        • Overstated "Co-Exploration" Contribution: The paper claims to be a "co-exploration framework", implying a deep, integrated search across both architecture and scheduling. However, the architectural exploration is limited to a simple parameter sweep across four pre-selected, hand-designed configurations (Table 1, Page 8). This is not co-exploration but rather a sensitivity analysis for the proposed scheduling algorithm. The framework does not appear to generate or discover novel architectural designs.
        • Ambiguous and Potentially Flawed Methodological Claims: Algorithm 1 is presented as a systematic optimization of resource partitioning. However, it contains contradictions. The algorithm's description claims to iterate through "all possible TP partition strategies" (Algorithm 1, Page 6), but the complexity analysis and description later refer to "pre-defined TP partition strategies" (Section 4.1, Page 6). This is a critical ambiguity (the distinction is sketched after this list). If the set of strategies is pre-defined and limited, the search is not exhaustive, and the optimality of the solution is not guaranteed. The paper criticizes prior work for relying on "engineering experience" but does not sufficiently prove that its own pre-defined strategies do not fall into the same category.
        • Insufficient SOTA Comparison: The primary SOTA comparison is against "Splitwise-Wafer," an adaptation of a single prior work for the wafer-scale context (Section 5.2, Page 9). This is a narrow baseline. To robustly claim a 3.12x improvement over "the" SOTA (Figure 10, Page 9), a comparison against a broader set of state-of-the-art distributed LLM serving systems, adapted for the same hardware assumptions, is required.
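
        To make the ambiguity in the third weakness concrete, the two readings of the search space differ roughly as follows. This is the reviewer's illustration with assumed resource counts, not a reproduction of Algorithm 1:

        # Illustrative contrast between an exhaustive TP search and a hand-picked set.
        def exhaustive_tp_strategies(dies_in_partition: int):
            """Every tensor-parallel degree that evenly tiles the partition,
            paired with the data-parallel width it implies."""
            return [(tp, dies_in_partition // tp)
                    for tp in range(1, dies_in_partition + 1)
                    if dies_in_partition % tp == 0]

        PREDEFINED_TP = [(1, 16), (4, 4), (8, 2)]    # the kind of hand-picked list the text implies

        dies = 16
        print("exhaustive :", exhaustive_tp_strategies(dies))   # includes (2, 8) and (16, 1)
        print("pre-defined:", PREDEFINED_TP)

        If Algorithm 1 only scans a pre-defined list, configurations such as (2, 8) or (16, 1) are never evaluated, so any optimality claim rests entirely on how that list was chosen, which is the point of question 3 below.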

        Questions to Address In Rebuttal

        1. Please provide a detailed description of the validation methodology for your performance simulator, which you state is built on "well-established frameworks" (Section 5.1, Page 8). Present correlation data showing its accuracy against real, physical hardware (if available) or at least against other named, well-established, and validated simulation frameworks.
        2. Can you justify the use of the term "co-exploration"? Does the framework have the capability to automatically generate and evaluate novel architectural configurations beyond the four hand-picked designs presented in your evaluation (Table 1, Page 8)? If not, a more accurate description of the contribution would be a "scheduling framework with architectural sensitivity analysis."
        3. Please clarify the discrepancy regarding the tensor parallelism (TP) strategies used in your scheduling algorithm. Is the search truly exhaustive over "all possible" strategies (as stated in Algorithm 1, Page 6), or is it limited to a "pre-defined" set (as suggested in your complexity analysis in Section 4.1, Page 6)? If the latter, how was this set of strategies generated, and how can you be certain that it contains the global optimum?
        4. The claim of being the "first work" to perform this co-exploration (Section 1, Page 2) is very strong. Please elaborate on the key differentiators between your work and any prior research that may have explored architecture and software scheduling simultaneously for large-scale models, even on different hardware targets.
        1. In reply to karu:
          Karu Sankaralingam @karu
            2025-08-03 22:43:49.627Z

            Review from the perspective of "The Synthesizer," placing the work in its broader academic and technological context.

            Review Form

            Summary

            This paper presents WSC-LLM, a framework designed to optimize Large Language Model (LLM) inference on wafer-scale hardware. The core contribution is a scheduling system that recognizes and exploits the different computational characteristics of the two main phases of LLM inference: the "prefill" phase (processing the initial prompt) and the "decode" phase (generating subsequent tokens). WSC-LLM partitions the hardware resources of a wafer-scale chip, assigning compute-optimized partitions to the parallelizable prefill phase and memory-centric partitions to the bandwidth-bound decode phase. This is complemented by a memory scheduler that optimizes the storage and movement of the KV cache. The authors claim this approach yields a significant, 3.12x average performance improvement over baseline systems.
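
            The compute-bound versus bandwidth-bound split the summary relies on is easy to see with a back-of-the-envelope estimate. The numbers below are the reviewer's illustrative assumptions (a 70B-parameter dense model served in FP16), not figures from the paper:

            # Rough arithmetic-intensity estimate; model size, precision, and batch
            # size are the reviewer's assumptions.
            params = 70e9                     # dense model parameters
            weight_bytes = params * 2         # FP16 weights

            flops_per_token = 2 * params      # ~2 FLOPs per parameter per token

            # Prefill: one pass over the weights is amortized across the whole prompt.
            prompt_len = 2048
            prefill_intensity = flops_per_token * prompt_len / weight_bytes

            # Decode: each generated token re-reads all weights (batch size 1).
            decode_intensity = flops_per_token * 1 / weight_bytes

            print(f"prefill FLOPs per byte ~ {prefill_intensity:.0f}")   # ~2048: compute-bound
            print(f"decode  FLOPs per byte ~ {decode_intensity:.0f}")    # ~1: bandwidth-bound

            Under these assumptions prefill sits three orders of magnitude higher in arithmetic intensity than decode, which is the asymmetry the partitioning argument exploits.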


            Strengths

            From a conceptual standpoint, this work is well-aligned with the major trends and challenges in high-performance LLM serving.

            • Tackles a Critical, Forward-Looking Problem: As LLMs continue to grow, wafer-scale integration is a compelling hardware direction, promising to alleviate the communication bottlenecks inherent in multi-GPU clusters (as demonstrated by companies like Cerebras). This paper places itself at the intersection of this hardware trend and the critical need for efficient LLM serving software, which is a very relevant and forward-looking research area.
            • Embraces a Key Optimization Principle: The central idea of disaggregating or separating the prefill and decode phases is a powerful optimization principle that is gaining significant traction in the systems community. Research like Splitwise (arXiv:2311.18677) has shown the value of splitting these phases across different machines. This paper cleverly applies the same principle within a single, monolithic piece of hardware, which is a novel and interesting adaptation. A rough cost comparison of the two settings is sketched after this list.
            • Sound Technical Foundation: The proposed solutions—a Central Scheduler to handle phase-based resource partitioning and a Memory Scheduler for KV cache management—are logical and directly address the primary bottlenecks identified in the paper's motivation (Section 2, Page 3). This shows a solid understanding of the underlying system dynamics.
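
            To see why moving this separation onto a single wafer changes the trade-off, consider the KV-cache handoff between the two phases. The bandwidth figures below are the reviewer's assumptions (on-wafer D2D links in the hundreds of GB/s versus a 100 Gb/s datacenter NIC), not measurements from the paper:

            # Back-of-the-envelope handoff cost; both bandwidths are assumed values.
            kv_cache_gb = 2.0                      # KV cache produced by one long prompt

            d2d_bw_gb_s = 400.0                    # assumed on-wafer die-to-die bandwidth
            nic_bw_gb_s = 100.0 / 8                # 100 Gb/s NIC expressed in GB/s

            def transfer_ms(size_gb: float, bw_gb_s: float) -> float:
                return size_gb / bw_gb_s * 1e3

            print(f"on-wafer D2D handoff : {transfer_ms(kv_cache_gb, d2d_bw_gb_s):6.1f} ms")
            print(f"cross-machine handoff: {transfer_ms(kv_cache_gb, nic_bw_gb_s):6.1f} ms")

            Under these assumptions the intra-wafer handoff is roughly 30x cheaper, which is what makes fine-grained, per-request prefill/decode migration plausible inside a wafer where it would be a serious tax between separate servers.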

            Weaknesses

            While the core idea has merit, the paper's execution and positioning could be significantly improved to better realize its potential impact.

            • Limited "Co-Exploration": The paper positions itself as a "co-exploration" framework, which sets a high bar. True co-design implies a feedback loop where software needs drive hardware architecture and vice-versa. However, the current work feels more like a sophisticated scheduling algorithm that is evaluated on a few pre-defined hardware configurations (Table 1, Page 8). The architectural exploration is not a primary output of the framework but rather an input. This is a missed opportunity to fully deliver on the "co-exploration" promise.
            • Narrow Contextualization: The paper's literature review primarily focuses on adapting a single baseline, "Splitwise-Wafer." However, it exists within a rich ecosystem of LLM serving systems. For instance, systems like Orca (OSDI '22) introduced continuous batching and iteration-level scheduling, while vLLM and its PagedAttention mechanism have become de facto standards for efficient KV cache management. This paper would be much stronger if it explicitly situated its contributions relative to these widely-known systems, explaining how its phase-based scheduling complements or improves upon their techniques in the specific context of wafer-scale hardware.
            • Lack of Hardware Validation: The evaluation is entirely simulation-based. While this is common in architecture research, the lack of correlation with real-world hardware or even a discussion of the fidelity of the "well established frameworks" (Section 5.1, Page 8) it's built on makes it difficult to gauge the real-world applicability of the results. Given that wafer-scale systems like those from Cerebras have unique properties (e.g., massive on-chip SRAM, proprietary interconnects), a discussion of how these real-world constraints might affect the scheduler's decisions is crucial for impact.

            Questions to Address In Rebuttal

            1. Your work builds on the powerful idea of separating prefill and decode phases, similar to Splitwise. Could you elaborate on how your intra-chip partitioning approach compares to Splitwise's inter-machine approach? What new challenges and opportunities arise when this separation happens on a single wafer with a high-bandwidth interconnect?
            2. How do you see WSC-LLM's scheduling techniques integrating with or improving upon dominant serving frameworks like vLLM or Orca? Would your Central Scheduler replace their schedulers, or would it work in tandem, providing a lower-level hardware-aware optimization layer?
            3. The future of wafer-scale hardware is still evolving. If you were to use your framework to propose an ideal wafer-scale architecture for LLM serving (e.g., the optimal ratio of compute dies to memory dies), what would it look like, and how would it differ from the configurations you tested?
            4. Beyond throughput, what are the potential impacts of your approach on other important metrics, such as time-to-first-token (TTFT) and tail latency, especially in a multi-tenant environment with diverse workloads?
            1. In reply to karu:
              Karu Sankaralingam @karu
                2025-08-03 22:44:11.968Z

                Review from the perspective of "The Innovator," focusing exclusively on the novelty of the work presented.

                Review Form

                Summary

                The authors present WSC-LLM, a framework whose primary novel claim is the "co-exploration" of LLM serving schedulers and wafer-scale hardware architectures. The specific mechanism proposed is a scheduling system that partitions a wafer into distinct hardware resource pools, dedicating separate partitions to the compute-heavy 'prefill' and memory-heavy 'decode' phases of LLM inference. This is supported by a memory scheduler intended to optimize KV cache placement on the wafer.


                Strengths

                From a novelty perspective, the core strength of this paper lies in its contextual adaptation, not its foundational concepts.

                • Novel Application Domain: The application of phase-disaggregation scheduling (separating prefill and decode) to the specific hardware topology of a monolithic wafer-scale chip appears to be new. While prior art has explored this concept in distributed systems across multiple machines, this paper is the first I have seen to formulate it as a resource partitioning problem within a single, tightly-integrated wafer. The "delta," or novel contribution, is therefore the translation of a known principle to a new and distinct hardware environment with unique communication and resource trade-offs.

                Weaknesses

                While the application context is new, the foundational ideas themselves are not. The novelty of the work is significantly diluted by the existence of prior art that addresses the core concepts.

                • Core Concept is Not New: The fundamental idea of identifying the distinct resource needs of the prefill and decode phases and scheduling them separately is prior art. Splitwise (arXiv:2311.18677) explicitly proposes splitting these phases onto different, appropriately-provisioned machines. Your work applies this same principle, but re-frames it as partitioning resources on a single wafer. While the implementation is different, the core conceptual insight is functionally identical.
                • "Co-Exploration" Claim is Overstated and Not Novel: The claim to be a "co-exploration" or "co-design" framework is a significant overstatement of the work's contribution. The methodology presented does not involve a novel algorithm for jointly discovering optimal software schedules and hardware architectures. Instead, it presents a scheduling algorithm whose performance is then evaluated across a small, pre-determined set of four hardware configurations (Table 1, Page 8). This is a sensitivity analysis, a standard practice, not a novel co-design methodology.
                • Memory Management Lacks Clear Novelty: The paper's description of its Memory Scheduler (Section 4.2, Page 7) does not articulate a clear, novel contribution. The field of KV cache management is extensive, with foundational works like vLLM and its PagedAttention providing sophisticated solutions for memory fragmentation, and numerous other works addressing KV cache offloading, compression, and sharing. The proposed scheduler's goal of optimizing placement based on topology is logical but does not, as described, introduce a fundamentally new technique or algorithm that advances the state of the art in memory management itself. For reference, that baseline block-table bookkeeping is sketched below.
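
                For context on the prior art named in the last weakness, the block-table bookkeeping at the heart of PagedAttention-style KV-cache management can be caricatured in a few lines. This is the reviewer's simplification, not vLLM's actual code, and it is what WSC-LLM's Memory Scheduler would need to clearly go beyond:

                # Reviewer's simplified caricature of paged KV-cache bookkeeping (not vLLM code).
                BLOCK_TOKENS = 16                    # tokens whose KV entries fill one physical block

                class PagedKVCache:
                    def __init__(self, num_blocks: int):
                        self.free_blocks = list(range(num_blocks))
                        self.block_tables = {}       # request id -> list of physical block ids
                        self.token_counts = {}       # request id -> tokens stored so far

                    def append_token(self, req: str) -> None:
                        tokens = self.token_counts.get(req, 0)
                        table = self.block_tables.setdefault(req, [])
                        # Claim a new physical block only when the current one is full, so
                        # memory is committed page by page rather than per maximum sequence.
                        if tokens == len(table) * BLOCK_TOKENS:
                            table.append(self.free_blocks.pop())
                        self.token_counts[req] = tokens + 1

                    def release(self, req: str) -> None:
                        self.free_blocks.extend(self.block_tables.pop(req, []))
                        self.token_counts.pop(req, None)

                cache = PagedKVCache(num_blocks=1024)
                for _ in range(40):                  # 40 decoded tokens -> ceil(40 / 16) = 3 blocks
                    cache.append_token("req-0")
                print(len(cache.block_tables["req-0"]))   # 3

                The open question for the rebuttal, then, is what the Memory Scheduler adds on top of this kind of bookkeeping beyond choosing which die's DRAM each block lands in.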

                Questions to Address In Rebuttal

                1. Please precisely articulate the novel "delta" of your work compared to Splitwise (arXiv:2311.18677). Given that Splitwise already established the benefit of separating prefill and decode phases, what is the fundamental new insight provided by your work, beyond applying this known principle to a different hardware substrate?
                2. The term "co-exploration" implies a generative process where hardware and software designs are jointly optimized. Since your framework appears to test schedules on pre-defined architectures, can you defend this as a novel co-design methodology? Does your algorithm produce novel architectural configurations as an output?
                3. What is the specific, novel mechanism in your Memory Scheduler that is not present in existing, widely-adopted KV cache management systems like vLLM/PagedAttention? Please pinpoint the algorithmic innovation that differentiates your approach from prior art in cache management.
                4. You claim this is the "first work to co-explore the LLM service and architecture" for wafer-scale chips (Section 1, Page 2). While the specific combination may be unique, please clarify how your contribution is fundamentally different from the general body of work on hardware/software co-design for specialized accelerators, which has a long history in computer architecture.