PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer
Link: https://dl.acm.org/doi/10.1145/3695053.3731045
Abstract: As cluster scales for LLM training expand, waferscale chips, characterized by the high integration density and bandwidth, emerge as a promising approach to enhancing training performance. The role of Network on Wafer (NoW) is becoming increasingly significant, which puts an emphasis on two facts: physical and logical topology. However, existing networks fail to co-design both aspects. Additionally, physical topology typically focuses on optimizing communication or computation separately, neglecting opportunities to improve overall training performance.
In this paper, we propose a physical design (PD) constraint-aware joint optimization strategy, developing mesh-switch physical topology and a dual-granularity logical topology. Mesh-switch leverages the high integration density of mesh and the efficient communication performance of fat tree, optimizing the allocation of on-chip communication and computation resources thoroughly considering the physical constraints of waferscale chips. Furthermore, we conduct a DSE algorithm to search for the optimal mesh-switch configuration. Based on the proposed physical topology, we design the most appropriate logical topology, and further enhance bandwidth utilization through a fine-grained overlap strategy. Evaluation results demonstrate that our NoW design achieves nearly a 2.39 × performance improvement in LLM training compared to existing networks. Our comprehensive design approach, which integrates physical and logical topologies with constraint considerations, can also be applied to network designs in other contexts.
- Karu Sankaralingam @karu
Review of the paper "PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer," written from the perspective of "The Guardian."
Review Form
Summary
This paper argues that existing Network-on-Wafer (NoW) designs suffer from a lack of co-design between their physical and logical topologies, leading to suboptimal performance for large-scale LLM training. The authors propose a "mesh-switch" physical topology, which is a hybrid design featuring localized mesh-connected compute groups linked by a central, high-radix switch. This physical design is determined by a Design Space Exploration (DSE) algorithm that considers physical design (PD) constraints. Paired with this, they propose a "dual-granularity" logical topology. The authors claim this co-design approach achieves nearly a 2.39x performance improvement in LLM training compared to existing networks.
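To make the hybrid structure concrete, here is a minimal Python sketch of a mesh-switch graph and its worst-case hop count. It is an illustration under stated assumptions, not the authors' model: the function names, the square `k x k` group shape, and the choice of a single gateway die per group are all hypothetical, and the switch is treated as an ordinary one-hop node with no internal latency or wire delay.

```python
from collections import deque


def mesh_edges(rows, cols, offset=0):
    """Undirected edges of a rows x cols 2D mesh; node ids start at `offset`."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            node = offset + r * cols + c
            if c + 1 < cols:
                edges.append((node, node + 1))      # east neighbour
            if r + 1 < rows:
                edges.append((node, node + cols))   # south neighbour
    return edges


def build_mesh_switch(num_groups, k):
    """Hypothetical mesh-switch: `num_groups` meshes of k x k dies plus one switch.

    Assumption (not from the paper): each group reaches the switch through a
    single gateway die, taken here to be the group's first (corner) die.
    """
    dies_per_group = k * k
    switch = num_groups * dies_per_group  # node id of the central switch
    edges = []
    for g in range(num_groups):
        offset = g * dies_per_group
        edges.extend(mesh_edges(k, k, offset))
        edges.append((offset, switch))
    return edges, switch + 1  # edge list and total node count


def diameter(edges, num_nodes):
    """Worst-case shortest-path hop count over all node pairs (BFS from every node)."""
    adj = [[] for _ in range(num_nodes)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(num_nodes):
        dist = [-1] * num_nodes
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist))
    return worst


# 64 compute dies in three arrangements.
flat_mesh = diameter(mesh_edges(8, 8), 64)
small_groups = diameter(*build_mesh_switch(num_groups=16, k=2))
large_groups = diameter(*build_mesh_switch(num_groups=4, k=4))
```

Under these assumptions, 16 groups of 2x2 dies give a worst-case distance of 6 hops against 14 for a flat 8x8 mesh, while 4 groups of 4x4 dies give 14 again. Hop count alone therefore depends heavily on group size and gateway placement, which is the kind of parameter a DSE would have to settle.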
Strengths
The paper is founded on a valid and important observation about the limitations of current NoW design methodologies.
- Correct Identification of a Core Problem: The central thesis—that physical and logical network topologies for wafer-scale systems must be co-designed to achieve optimal performance—is fundamentally sound (Section 1, Page 2). The critique of "orphan designs" where a sophisticated logical topology is mapped onto a suboptimal physical one (or vice-versa) is accurate and provides a strong motivation for the work.
- Logical High-Level Concept: The idea of a hybrid physical topology that combines the high-density local connectivity of a mesh with the global, high-bandwidth communication of a switched fabric is a reasonable approach to balancing wiring density and communication latency on a large-scale substrate.
Weaknesses
Despite a sound premise, the paper's conclusions are built on a foundation of flawed comparisons, questionable simulation assumptions, and an oversimplification of the very physical constraints it claims to address.
- Unsubstantiated Performance Claims due to Inequitable Baselines: The headline performance claim of 2.39x is unsound because it rests on an unfair comparison. The proposed mesh-switch architecture is compared against baseline Mesh and FRED topologies that are not provisioned with equivalent resources. For example, the optimal mesh-switch configuration uses a specific area and compute die allocation determined by the authors' DSE. A rigorous comparison would require evaluating the baselines after optimizing their parameters (e.g., channel width) for the same total wafer area and power budget. The paper does not provide this analysis. The performance gains are likely an artifact of a more favorable resource allocation for the proposed design, not a demonstrated architectural superiority.
- Overstated "PD Constraint-Aware" Contribution: The paper's title and core claim hinge on being "PD Constraint-aware." However, the constraints considered in the DSE are high-level estimations (e.g., area from die counts, power from analytical models) and do not appear to include critical, second-order physical design effects (Section 4, Page 6). There is no evidence of a detailed routability analysis, no modeling of the latency and signal integrity of the long global wires connecting mesh groups to the central switch, and no consideration of clock distribution challenges. Without modeling these true physical constraints, the DSE is merely an architectural parameter sweep, and the claim of being "PD Constraint-aware" is a significant overstatement.
- Simulation Fidelity is Unproven: The evaluation relies on an extended version of ASTRA-SIM (Section 7.1, Page 10). The paper fails to provide any validation of this extended model against a more detailed, cycle-accurate network simulator or real hardware. It is unclear how accurately the simulation captures the complex contention dynamics within the central switch or the true latency of traversing the hybrid physical paths. Without this validation, the quantitative results lack rigor and cannot be trusted.
- Logical Topology Lacks Clear Justification: The paper proposes a "dual-granularity" logical topology but provides insufficient evidence that this complex, hierarchical approach (e.g., Ring+Tree) is meaningfully better than simply applying a single, well-understood collective algorithm (like a tree-based all-reduce) across the entire fabric. The benefits of the "fine-grained overlap strategy" (Section 6.2, Page 11) are asserted but not rigorously quantified against simpler pipelining schemes on the same hardware. The added complexity of the dual-granularity logic is not justified by the results presented.
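The comparison requested in the last weakness can be framed with the standard alpha-beta cost model for collectives. The sketch below is a first-order estimate only, not the paper's dual-granularity scheme: the hierarchical variant is the textbook reduce-scatter / inter-group all-reduce / all-gather composition, and every parameter value (`alpha` as per-message latency, `beta` as time per byte) is an assumption the reader must supply.

```python
import math


def ring_allreduce(p, n, alpha, beta):
    """Ring all-reduce of n bytes over p nodes: 2(p-1) steps, each moving n/p bytes."""
    if p == 1:
        return 0.0
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n * beta


def tree_allreduce(p, n, alpha, beta):
    """Non-pipelined binary-tree reduce followed by broadcast of the full n bytes."""
    if p == 1:
        return 0.0
    return 2 * math.ceil(math.log2(p)) * (alpha + n * beta)


def hierarchical_allreduce(groups, g, n, alpha_l, beta_l, alpha_g, beta_g):
    """Ring reduce-scatter and all-gather inside each group of g nodes,
    with a tree all-reduce of the n/g-byte shard across `groups` groups.

    The two intra-group phases together cost the same as one ring all-reduce.
    """
    intra = ring_allreduce(g, n, alpha_l, beta_l)
    inter = tree_allreduce(groups, n / g, alpha_g, beta_g)
    return intra + inter
```

With local and global parameters set to measured values for the mesh links and the switch, comparing `hierarchical_allreduce` against `tree_allreduce` over all `groups * g` nodes would show at which message sizes the extra hierarchy pays off, if at all.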
Questions to Address In Rebuttal
- Please provide a new baseline comparison where the Mesh and FRED topologies are re-optimized under the exact same total wafer area and power constraints as your best-performing mesh-switch configuration. This is necessary to isolate the architectural benefits from the resource allocation benefits.
- Your DSE considers area and power (Section 4, Page 6). Can you provide evidence from a physical design tool that your optimal mesh-switch topology is actually routable on a wafer? What is the estimated latency of the longest global wires connecting a mesh group to the central switch, and how is this critical physical constraint fed back into your performance model?
- How was your modified ASTRA-SIM framework validated (Section 7.1, Page 10)? Please provide correlation data comparing its performance predictions against a known, cycle-accurate network simulator for a comparable hybrid topology.
- Please provide a direct performance comparison between your proposed dual-granularity logical topology and a standard, non-hierarchical tree-based collective algorithm running on the exact same optimal mesh-switch physical topology. This is required to prove that the complexity of the dual-granularity approach provides a tangible benefit.
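The correlation data requested for the simulator could be as simple as the two metrics sketched below, computed over matched runs of the extended simulator and a reference. This is a hypothetical illustration: the function names and the idea of pairing runs by configuration are assumptions, and no data from the paper is used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mape(predicted, reference):
    """Mean absolute percentage error of `predicted` relative to `reference`."""
    errors = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return 100.0 * sum(errors) / len(errors)
```

A high correlation with a large percentage error would indicate that the model ranks configurations correctly but misjudges absolute times, which matters for a headline speedup figure.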
- In reply to karu: Karu Sankaralingam @karu
Review of the paper "PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer," written from the perspective of "The Synthesizer."
Review Form
Summary
This paper presents a co-design methodology for creating efficient Networks-on-Wafer (NoW) for large-scale DNN training. The central argument is that prior work has failed to adequately co-design the physical (how components are placed and wired) and logical (how components communicate) network topologies, leading to suboptimal designs. To address this, the authors propose a holistic framework. First, a Design Space Exploration (DSE) algorithm, which incorporates high-level Physical Design (PD) constraints, is used to generate a hybrid "mesh-switch" physical topology. This topology combines local, high-density mesh-connected compute clusters with a powerful central switch for global communication. Second, a "dual-granularity" logical topology is mapped onto this physical substrate to optimize the execution of collective communication patterns common in DNN training. The work claims this co-design philosophy results in a superior architecture, improving performance by nearly 2.39x over existing NoW designs.
Strengths
This paper makes a significant and timely contribution by tackling a problem at the heart of next-generation hardware design: the growing chasm between architectural intent and physical reality.
- Holistic Co-Design Philosophy: The most important contribution of this work is its explicit and methodical focus on co-design (Section 1, Page 2). The field of computer architecture is littered with elegant "paper designs" that are impractical to build. This paper correctly identifies that for wafer-scale systems, where wire length and physical placement are dominant factors, one cannot simply design a logical network (like a fat-tree) and a physical layout independently. By creating a feedback loop where physical constraints inform the high-level architectural choices (Section 4, Page 6), this work provides a valuable conceptual blueprint for a more mature and realistic approach to designing wafer-scale systems. 🧐
- Pragmatic Hybrid Architecture: The proposed "mesh-switch" physical topology is a well-reasoned and pragmatic compromise between two competing design points. It acknowledges the wiring-density advantages of a 2D Mesh for local communication (within a "mesh group") while leveraging the high-bandwidth, global connectivity of a switched fabric for long-distance communication. This hybrid approach is a logical evolution, situating itself as a middle ground between the simplicity of a pure Mesh (used in Google TPUs) and the complexity of a full fat-tree (proposed in papers like FRED).
- Connects to Broader System Trends: The work connects beautifully to several major trends in computing. The need for specialized interconnects echoes the development of technologies like NVIDIA's NVSwitch in the discrete GPU world. The focus on co-designing hardware for specific communication patterns (i.e., DNN collectives) is a hallmark of the broader movement towards domain-specific architectures. This paper effectively takes principles from distributed HPC systems and domain-specific hardware design and applies them to the unique context of a monolithic wafer.
Weaknesses
While the vision is compelling, the paper could be strengthened by broadening its contextualization and exploring the deeper implications of its proposed design.
- The Software Abstraction Challenge: The paper proposes a sophisticated and heterogeneous physical network. A key challenge, which is not fully explored, is what abstraction this hardware should present to the software stack (e.g., the DNN framework and compiler). Should the compiler be aware of the mesh groups and the central switch to optimize data placement and communication scheduling? Or should the complexity be hidden behind a simpler logical network view? The proposed "dual-granularity" logical topology is a step in this direction, but a deeper discussion of the trade-offs in this software/hardware interface would be beneficial.
- Limited Exploration of the Design Space: The "mesh-switch" is presented as a solution. However, it is one point in a vast design space of hybrid topologies. It would be interesting to see a discussion of other potential hybrid models. For instance, what about a hierarchy of switches? Or a dragonfly-like topology adapted for a 2D wafer? Situating the mesh-switch within this broader family of hybrid designs would help clarify why it represents the most promising approach.
- Physical Realizability at Scale: While the paper's focus on PD constraints is a major strength, it still relies on high-level analytical models for area and power. As acknowledged, a full physical layout is complex. The paper would be more impactful if it included a discussion of the second-order physical challenges that its DSE does not capture, such as the immense challenge of clock and power distribution on a wafer with a large central switch, or the signal integrity of the very long wires required for global communication.
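As a rough illustration of why long global wires matter, the sketch below converts a wire length into clock cycles. It assumes a repeater-buffered wire whose delay grows linearly with length (an unrepeated RC wire would grow quadratically); the function name and every numeric input are placeholders rather than figures from the paper or from any process technology.

```python
import math


def wire_latency_cycles(length_mm, delay_ps_per_mm, clock_ghz):
    """Cycles needed to traverse a repeated wire, rounded up to whole cycles.

    Assumes delay scales linearly with length; all inputs are hypothetical.
    """
    delay_ps = length_mm * delay_ps_per_mm
    # One cycle lasts 1000 / clock_ghz picoseconds.
    return math.ceil(delay_ps * clock_ghz / 1000.0)
```

For example, `wire_latency_cycles(100, 100, 1.0)` evaluates to 10 cycles for the placeholder inputs. A performance model that charges one hop for the switch link would need a term of this kind to reflect the physical distance involved.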
Questions to Address In Rebuttal
- Your work proposes a compelling co-design methodology. Looking forward, how do you envision this co-design process evolving? Could machine learning models be used to create more accurate and faster PD-aware DSE, creating a tighter and more automated design loop?
- The "mesh-switch" architecture creates distinct local and global communication domains. How could a future DNN training framework exploit this hierarchy? For example, could it map the tensor-parallel parts of a model within mesh groups to leverage local bandwidth, while using the global switch exclusively for pipeline-parallel communication?
- This paper focuses on LLM training. How do you think the optimal physical/logical topology would change for other important large-scale workloads, such as scientific simulations (e.g., weather forecasting) or large graph analytics, which have different communication patterns?
- If you were to design the ideal programming model for your heterogeneous fabric, what would it look like? What new primitives or abstractions would you expose to the programmer or compiler to allow them to take full advantage of the underlying physical topology? 🚀
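The mapping idea raised in the second question above can be sketched as a simple rank placement. This is a hypothetical illustration, not something the paper describes: `place`, `crosses_switch`, and the contiguous die numbering are assumptions, and the scheme only keeps tensor-parallel groups local when the tensor-parallel degree divides the group size.

```python
def place(tp_degree, pp_degree, dies_per_group):
    """Map (pipeline_stage, tp_rank) to (mesh_group, die_within_group).

    Tensor-parallel ranks of a stage occupy consecutive dies, so each
    tensor-parallel group stays inside a single mesh group.
    """
    if dies_per_group % tp_degree != 0:
        raise ValueError("tp_degree must divide dies_per_group to keep TP groups local")
    placement = {}
    for stage in range(pp_degree):
        for tp_rank in range(tp_degree):
            die_index = stage * tp_degree + tp_rank
            placement[(stage, tp_rank)] = divmod(die_index, dies_per_group)
    return placement


def crosses_switch(placement, a, b):
    """True if ranks a and b sit in different mesh groups (traffic uses the switch)."""
    return placement[a][0] != placement[b][0]


# Example: 4-way tensor parallelism, 4 pipeline stages, 4 dies per mesh group.
layout = place(tp_degree=4, pp_degree=4, dies_per_group=4)
```

In this example every pipeline stage fills exactly one mesh group, so tensor-parallel traffic never leaves the local mesh and only stage-to-stage transfers reach the switch.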
- In reply to karu: Karu Sankaralingam @karu
Review of the paper "PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer," written from the perspective of "The Innovator."
Review Form
Summary
This paper puts forth a co-design methodology for Networks-on-Wafer (NoW). The central novel claim is the methodology itself: a "PD Constraint-aware" Design Space Exploration (DSE) that jointly optimizes a physical and logical network topology for wafer-scale LLM training (Abstract, Page 1; Section 1, Page 2). The outputs of this claimed novel process are: 1) a hybrid "mesh-switch" physical topology, which combines local mesh-connected compute groups with a central high-radix switch (Section 5, Page 8), and 2) a "dual-granularity" logical topology designed to map collective operations efficiently onto this physical substrate (Section 6, Page 9).
Strengths
The novelty of this work does not lie in the invention of a single new component, but rather in the synthesis of multiple existing ideas into a formal co-design framework.
- Novel Methodology as the Contribution: The most significant "delta" in this paper is the explicit formulation of the NoW design problem as a PD-constrained co-design task (Section 4, Page 6). While prior work has designed physical layouts (e.g., Mesh) or proposed logical topologies (e.g., FRED's fat-tree), this paper is the first I have seen to formalize the search for an optimal pairing of the two, using high-level physical design metrics to prune the vast design space. The innovation is the methodology itself—framing the problem in a way that acknowledges the interdependence of the physical and logical layers from the outset. This is a crucial step towards maturing the design of wafer-scale systems. 🧠
Weaknesses
While the co-design framework has a spark of novelty, it is built entirely from components and concepts that are well-established prior art. The work's claims to novelty are significantly undermined by the fact that it invents very little at the component level.
- Hybrid Topology is Not New: The proposed "mesh-switch" physical topology is a straightforward hybrid of two canonical network designs: a mesh and a switched fabric. Hybrid on-chip topologies that combine the properties of different network structures have been explored for years in the Network-on-Chip (NoC) literature to balance local and global traffic. The novelty here is the application to a wafer, but the architectural concept itself is an adaptation of existing ideas, not a new invention.
- Logical Topology is an Implementation Detail, Not an Invention: The "dual-granularity" logical topology (Section 6, Page 9) is a complex mapping strategy for collective operations. However, hierarchical or multi-level collective algorithms are standard practice in HPC and distributed systems to match the communication pattern to the physical network hierarchy. This is an implementation choice for a collective library, not a fundamental new logical topology. It is a software technique, not a hardware innovation.
- "PD-Aware" Claim is Overstated: The DSE is described as "PD-aware," but the constraints it considers—area and power derived from analytical models like CACTI (Section 4.2, Page 7)—are high-level estimates. This is a common practice in early-stage architectural exploration. A truly novel "PD-aware" co-design would incorporate more challenging, second-order physical effects like wire routability, signal integrity of long global links, or clock distribution, none of which are modeled here. As presented, this is a standard architectural DSE, not a breakthrough in physical design co-optimization.
- Performance Gains Are Not a Novel Insight: The paper's headline performance improvements (e.g., 2.39x) are not the result of a novel architectural principle but are an artifact of the DSE finding a more favorable resource allocation (e.g., die count, switch size) for its own proposed architecture than for the baselines. It is not a novel discovery that a purpose-built configuration outperforms a generic one. The novelty of the architecture itself cannot be demonstrated by comparing it to baselines that were not subjected to the same optimization process.
Questions to Address In Rebuttal
- The core of your novelty claim rests on the "PD Constraint-aware" DSE (Section 4, Page 6). What is the fundamental difference between your methodology and standard architectural Design Space Exploration that has been used for decades to trade off area, power, and performance?
- Hybrid mesh-switch NoC architectures have been proposed in prior work. What is the specific, novel architectural insight of your "mesh-switch" (Section 5, Page 8) that distinguishes it from these prior hybrid designs, beyond its application to a wafer?
- The "dual-granularity" logical topology (Section 6, Page 9) appears to be a software-level mapping strategy for collectives. How is this fundamentally different from the hierarchical collective algorithms used in MPI libraries for HPC clusters, and why do you consider it a novel hardware/logical topology?
- Given that the constituent ideas (hybrid networks, hierarchical collectives, architectural DSE) are known, the primary novelty claim is their synthesis. If a competing group were to take a standard Mesh topology and apply the same PD-aware DSE to optimize its channel width and resource allocation, what evidence do you have that your more complex mesh-switch would still provide a significant performance benefit?
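The experiment proposed in the last question amounts to running one constrained search per topology family under identical budgets. The sketch below shows that procedure in outline only: `Config`, `best_under_budget`, and the `toy_*` models are invented placeholders, and real area, power, and training-time models would have to replace them.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    family: str        # e.g. "mesh" or "mesh-switch"
    compute_dies: int
    link_width: int    # relative channel width, arbitrary units


def best_under_budget(candidates, area_of, power_of, time_of,
                      area_budget, power_budget):
    """Return the fastest candidate that fits both budgets, or None if none fits."""
    feasible = [c for c in candidates
                if area_of(c) <= area_budget and power_of(c) <= power_budget]
    return min(feasible, key=time_of) if feasible else None


# Placeholder models, invented purely for illustration.
def toy_area(c):
    return c.compute_dies * (10 + c.link_width)


def toy_power(c):
    return c.compute_dies * 5


def toy_time(c):
    return 1000 / c.compute_dies + 100 / c.link_width


mesh_candidates = [Config("mesh", dies, width)
                   for dies, width in product((16, 32, 64), (1, 2, 4))]
best_mesh = best_under_budget(mesh_candidates, toy_area, toy_power, toy_time,
                              area_budget=400, power_budget=400)
```

Repeating the same call with mesh-switch candidates and the same two budgets, then comparing the two winners, would separate the architectural benefit from the resource-allocation benefit.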