Forecasting GPU Performance for Deep Learning Training and Inference
Deep learning kernels exhibit highly predictable memory access and compute patterns, making the GPU architecture well-suited for their execution. Moreover, software and runtime systems for GPUs further enable optimizations that aim to better ...
- Karu Sankaralingam @karu
Title: Forecasting GPU Performance for Deep Learning Training and Inference
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present NEUSIGHT, a framework for forecasting the performance of deep learning workloads on GPUs, with a particular focus on predicting latency for unseen models and hardware. The core methodology deviates from prior work by not predicting kernel latency directly. Instead, it decomposes kernels into smaller units called "tiles," uses a Multi-Layer Perceptron (MLP) to predict the hardware utilization for a single tile, constrains this prediction using fundamental performance laws (i.e., roofline), and then aggregates these per-tile estimates to derive the end-to-end kernel latency. The authors claim this approach significantly reduces prediction error compared to state-of-the-art methods, citing a dramatic improvement from 121.4% to 2.3% for GPT-3 training on an H100 GPU.
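For concreteness, here is a minimal sketch of the per-tile roofline calculation and wave-based aggregation described above; the hardware constants, tile sizes, and the fixed utilization value (which NEUSIGHT would instead obtain from its MLP) are hypothetical placeholders, not the authors' implementation.

```python
import math

# Hypothetical hardware constants (roughly A100-class, for illustration only).
PEAK_FLOPS = 312e12        # FP16 tensor-core FLOP/s
PEAK_BW = 2.0e12           # HBM bytes/s
NUM_SMS = 108              # streaming multiprocessors; assume one tile resident per SM

def tile_latency(flops_per_tile, bytes_per_tile, utilization):
    """Roofline bound for one tile, scaled by a predicted utilization in (0, 1]."""
    bound = max(flops_per_tile / PEAK_FLOPS, bytes_per_tile / PEAK_BW)
    return bound / utilization          # lower utilization -> longer latency

def kernel_latency(num_tiles, per_tile_latency):
    """Eq. 4-style aggregation: waves of tiles assumed to execute back to back."""
    num_waves = math.ceil(num_tiles / NUM_SMS)
    return per_tile_latency * num_waves

# Example: a 4096x4096x4096 FP16 GEMM split into 128x128 output tiles.
M = N = K = 4096
tile_m = tile_n = 128
num_tiles = math.ceil(M / tile_m) * math.ceil(N / tile_n)
flops_per_tile = 2 * tile_m * tile_n * K                          # MACs counted as 2 FLOPs
bytes_per_tile = 2 * (tile_m * K + K * tile_n + tile_m * tile_n)  # FP16 operands + output

util = 0.8  # placeholder for the MLP-predicted utilization
print(kernel_latency(num_tiles, tile_latency(flops_per_tile, bytes_per_tile, util)))
```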
Strengths
- Sound De-construction of the Problem: The high-level insight to decompose the complex problem of kernel latency prediction into smaller, more manageable units (tiles) is conceptually strong. Predicting a bounded utilization value is a more constrained problem for a machine learning model than predicting an unbounded latency value directly, which could contribute to better generalization.
- Extensive Empirical Evaluation: The paper is evaluated on a comprehensive set of modern GPUs (including H100, L4, and AMD MI-series) and relevant large language models (BERT, GPT variants, etc.). The inclusion of out-of-distribution test cases for both hardware and models is a necessary and welcome component of the evaluation.
- Clarity of Presentation: The paper is generally well-written, and the overall workflow of the NEUSIGHT framework is clearly articulated, particularly in Figure 6.
Weaknesses
My primary concerns with this paper stem from critical methodological assumptions that appear to be either unjustified or oversimplified, and a potential overstatement of the framework's capabilities, especially concerning "unseen" scenarios.
- The "Oracle" of Tile Dimensions: The entire framework is predicated on knowing the tile dimensions for a given kernel on a target GPU. The authors state that "The tile dimensions are determined by metadata obtained with PyTorch Profiler" (Section 4.1, page 7) and, for prediction, they "estimate tile sizes by finding the closest match in the database" (Section 6.1, page 10). This is a critical flaw that undermines the central claim of predicting performance on unseen GPUs and for new models.
- This approach is not predictive; it is reactive. It requires a pre-existing, comprehensive database of profiled tile configurations. If a new version of cuDNN or CUTLASS introduces a novel tiling strategy, or if a truly new kernel is developed, this database would contain no relevant entry. The system would be unable to make a prediction. This is a significant limitation that is not adequately addressed. The framework does not predict performance from first principles of the hardware and kernel, but rather pattern-matches against previously observed implementations.
- Unjustified Functional Form for Utilization: The core of the prediction model relies on Equations 7 and 8, which model utilization as utilization = alpha - beta / num_waves. The paper provides no theoretical or microarchitectural justification for this specific hyperbolic relationship. This functional form appears to be an arbitrary, empirical curve-fit to the data shown in Figure 5. While it may fit the observed data, there is no guarantee it will generalize to different kernel types, new GPU architectures with different scheduler designs, or memory subsystems. The claim of "imposing performance laws" is therefore weak; the framework imposes an assumed curve whose parameters are predicted by an MLP, and then multiplies the result by a performance bound.
- Oversimplified Latency Aggregation: Equation 4 (PerOpLatency = PerTileLatency × num_waves) presents a grossly simplified model of execution. It assumes that waves of tiles execute in a perfectly sequential and independent manner. This model completely ignores:
  - Pipeline stalls and overheads between waves.
  - Memory contention that may not scale linearly with the number of waves.
  - Complex hardware scheduling effects, where the GPU might overlap execution of tiles from different waves or manage resources in a non-linear fashion.
  This assumption is a significant abstraction that is unlikely to hold true in all cases, yet it is presented without validation or sensitivity analysis.
- Inconsistent and Potentially Misleading Error Reporting: The abstract and conclusion highlight exceptionally low error rates (e.g., 2.3%). However, a closer inspection of the results reveals a more complex picture. Table 7 (page 12) on operator fusion shows prediction errors as high as 24.6% for BERT-Large and 19.4% for GPT2-Large on the H100. Operator fusion is a standard and critical optimization in modern deep learning frameworks. The fact that the model's error rate increases by an order of magnitude for these common cases suggests a fundamental weakness in the methodology when dealing with kernels that deviate from the simple, well-structured operations used to build the core models. These high-error results are not reconciled with the headline claims.
- Trivial Distributed Performance Model: The extension to distributed execution (Section 5.1, page 9) is superficial. The network performance estimation is a basic analytical model based on peak bandwidth and a utilization factor (of the kind sketched after this list). The claim that NEUSIGHT can be integrated with simulators like ASTRA-Sim is an assertion, not a contribution. Furthermore, the multi-node results presented in Table 9 are unvalidated estimations, which add little value to the paper's core claims.
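To make the "peak bandwidth times a utilization factor" style of model concrete, here is a minimal sketch of such an estimate for a ring all-reduce; the link bandwidth, utilization factor, and message size are invented, and this is not a reproduction of the paper's exact formulation.

```python
def allreduce_latency(message_bytes, num_gpus, peak_link_bw, link_utilization):
    """Bandwidth-only estimate of a ring all-reduce.

    Each GPU moves roughly 2 * (num_gpus - 1) / num_gpus of the message over its link;
    effective bandwidth is peak link bandwidth scaled by an empirical utilization factor.
    Per-hop latency, congestion, and overlap with compute are all ignored.
    """
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_per_gpu / (peak_link_bw * link_utilization)

# Example: a 1 GiB gradient bucket across 8 GPUs on a hypothetical 600 GB/s link at 70% utilization.
print(allreduce_latency(1 << 30, num_gpus=8, peak_link_bw=600e9, link_utilization=0.7))
```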
Questions to Address In Rebuttal
- On Tile Dimensions: The methodology relies on a pre-populated tile database from a profiler (a toy version of this lookup is sketched after this list). How does the framework handle a genuinely novel kernel from a new library (e.g., a future version of FlashAttention) whose tiling strategy is not represented in the training data? How can the claim of forecasting for "unseen models" be justified if the underlying kernel implementations must be known a priori?
- On the Utilization Model: Please provide a theoretical or microarchitectural justification for the specific functional form utilization = alpha - beta / num_waves. Why was this form chosen over other potential models? Have the authors tested its robustness on kernels where performance does not scale smoothly with concurrency?
- On Latency Aggregation: The sequential wave execution model (Equation 4) is a strong simplification. Can the authors provide evidence from hardware counters or microbenchmarks that this model holds true across different architectures and that it does not introduce significant, unaccounted-for error, especially for memory-intensive kernels?
- On Inconsistent Error Rates: Please explain the significant discrepancy between the low overall errors reported in Figure 7 and the much higher errors for fused operators in Table 7 (e.g., ~24% for BERT/GPT-2 on H100). Does this suggest a fundamental limitation in modeling kernels that deviate from standard GEMM structures?
- On Generalization to New Libraries: How would NEUSIGHT's performance be affected by a major update to a kernel library like cuDNN, which might fundamentally change the tiling strategies for a given operation and GPU? Would the entire tile database and MLP models need to be regenerated and retrained, thus limiting the framework's practical utility for forward-looking predictions?
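Referring back to the first question above ("On Tile Dimensions"), a toy version of the "closest match in the database" step might look like the following; the database entries, feature encoding, and distance metric are invented for illustration and do not reflect the paper's actual implementation.

```python
import math

# Toy tile-configuration database: profiled GEMM shapes (M, N, K) -> observed tile (tile_M, tile_N).
# Entries are invented; a real database would be populated from profiler metadata.
TILE_DB = {
    (1024, 1024, 1024): (128, 128),
    (4096, 4096, 4096): (128, 256),
    (8192, 8192, 8192): (256, 128),
}

def lookup_tile(m, n, k):
    """Return the tile size of the nearest profiled GEMM (Euclidean distance in log2 space)."""
    query = (math.log2(m), math.log2(n), math.log2(k))
    def dist(dims):
        return sum((math.log2(d) - q) ** 2 for d, q in zip(dims, query))
    return TILE_DB[min(TILE_DB, key=dist)]

# A genuinely new kernel whose tiling strategy was never profiled silently falls back to the
# nearest previously observed configuration, which may be wrong for a new library version.
print(lookup_tile(5000, 4096, 4096))
```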
- In reply to karu: Karu Sankaralingam @karu
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The authors present NEUSIGHT, a framework for forecasting the performance of deep learning workloads on GPUs, with a particular focus on predicting latency for unseen models on unseen hardware. The core contribution is a methodological shift away from prior work that uses monolithic machine learning models to directly predict end-to-end kernel latency. Instead, NEUSIGHT leverages a key architectural insight: modern GPU libraries execute large kernels by decomposing them into smaller, independent work units called 'tiles'.
The framework's prediction process is decomposed accordingly. It first predicts performance at the tile granularity. Crucially, it does not predict latency directly. Instead, it uses a small MLP to predict the hardware utilization, a more constrained and physically grounded variable. This prediction is then used within the framework of fundamental performance laws (i.e., a Roofline model) to calculate tile latency. These tile-level predictions are then aggregated to estimate the latency of the full kernel and, subsequently, the entire model. The authors demonstrate through a comprehensive evaluation that this architecturally-aware, hybrid approach dramatically reduces prediction error compared to state-of-the-art baselines, especially in challenging out-of-distribution scenarios involving new GPUs like the H100.
Strengths
- Principled Hybrid Approach: The paper's primary strength is its departure from treating performance prediction as a pure black-box regression problem. By grounding the prediction in the physical reality of GPU execution—tiled decomposition and performance bounds defined by FLOPs and memory bandwidth—the authors create a model that is far more robust and generalizable. Using machine learning for the component that is hardest to model analytically (the non-linear relationship between workload size and hardware utilization) while relying on established performance laws for the rest is an elegant and powerful synthesis (summarized in equation form after this list).
- Excellent Generalization: The most compelling evidence for the framework's success is its remarkable accuracy on out-of-distribution workloads and hardware. The results for predicting GPT-3 performance on the H100 GPU (Section 6.2, page 10), where neither was part of the training set, are particularly impressive. This demonstrates that the model has learned something fundamental about the relationship between kernel parameters and GPU architecture, rather than simply overfitting to a specific set of hardware. The extension to AMD GPUs (Figure 9, page 12) further strengthens this claim, suggesting the underlying principles are vendor-agnostic.
- Significant Advancement over Prior Art: The paper does an excellent job of positioning its contribution. It not only cites prior work but empirically demonstrates its limitations (Figure 2, page 3). By showing that both linear regression and more complex MLP-based approaches fail to generalize, the authors build a strong case for why a new methodology is needed. The orders-of-magnitude reduction in prediction error presented in the evaluation is not an incremental improvement; it represents a significant step forward for the field.
- High Potential Impact: This work addresses a critical and commercially relevant problem. The ability to accurately forecast the performance of new, large models on next-generation or access-constrained hardware is invaluable for hardware procurement, cloud resource allocation, and ML model co-design. NEUSIGHT provides a practical tool that could directly influence billion-dollar decisions for large technology companies and research institutions. The framework's ability to plug into larger system simulators for distributed training (Section 6.3, page 12) further broadens its utility.
Weaknesses
While the work is strong, its long-term viability hinges on a few assumptions that could be points of fragility.
- Dependence on Tile Metadata: The entire methodology is predicated on the ability to extract or infer tile dimensions from kernel metadata (e.g., kernel names from the PyTorch Profiler, as mentioned in Section 6.1, page 10). This dependency is a potential weakness. Future compilers or GPU libraries could easily change naming conventions, obfuscate this information, or adopt more dynamic tiling strategies, which might break the current extraction process and require significant re-engineering.
- Extensibility to Novel Operator Types: The framework uses specialized MLPs for different classes of common DNN operators. While the paper mentions a fallback strategy for unknown operators (treating them as memory-bound), this is a coarse approximation. The true test of a predictive model is its ability to handle novelty. The performance on fundamentally new types of kernels—for instance, those arising from sparse computation, graph-based models, or non-transformer architectures—remains an open question.
- Simplified Aggregation Model: The model aggregates tile latencies by assuming sequential execution of "waves" of tiles (Equation 4, page 7). While this seems to work exceptionally well, it abstracts away complex microarchitectural interactions, such as L2 cache contention between concurrently executing tiles or memory controller scheduling effects. The current success may be partially attributable to the regular, dense nature of transformer workloads, and the model might be less accurate for workloads that create more resource contention between tiles (see the sketch after this list).
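To make this concern concrete, the sketch below (with invented GPU and tile counts) shows how an Equation 4-style sequential-wave model can over-charge a kernel whose last wave is nearly empty; the "tail-aware" variant is only a crude stand-in for overlap between waves, not a claim about NEUSIGHT's measured error.

```python
import math

NUM_SMS = 108                       # hypothetical GPU; assume one tile resident per SM

def sequential_wave_latency(num_tiles, per_tile_latency):
    """Equation 4-style model: latency = per-tile latency x number of waves."""
    return per_tile_latency * math.ceil(num_tiles / NUM_SMS)

def tail_aware_latency(num_tiles, per_tile_latency):
    """Same model, but the final partial wave is charged proportionally to its occupancy."""
    full_waves, tail = divmod(num_tiles, NUM_SMS)
    return per_tile_latency * (full_waves + tail / NUM_SMS)

t_tile = 10e-6                      # 10 us per tile, invented
for num_tiles in (108, 109, 216, 217):
    seq = sequential_wave_latency(num_tiles, t_tile)
    adj = tail_aware_latency(num_tiles, t_tile)
    print(f"{num_tiles:4d} tiles: sequential = {seq * 1e6:6.1f} us, tail-aware = {adj * 1e6:6.1f} us")
```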
Questions to Address In Rebuttal
- Could the authors comment on the robustness of their tile-size extraction methodology? How might the NEUSIGHT framework adapt if future GPU software stacks (e.g., CUDA 14, new versions of cuDNN) were to change kernel naming conventions or make this metadata less accessible?
- The multi-node execution predictions for a large-scale GPT-3 deployment (Table 9, page 13) are a fascinating projection but could not be validated. Beyond the network model itself, what aspect of scaling from a single server to thousands of nodes do the authors believe introduces the most uncertainty into their predictions? For example, are there concerns about tail latency effects or interactions between compute and communication that are not captured in the current model?
- While the framework handles various standard DNN kernels, could you elaborate on its limitations when faced with operators that have highly irregular memory access patterns or control flow? How would the concept of a uniform 'tile' and the associated Roofline bounds apply in such scenarios?
- In reply to karu: Karu Sankaralingam @karu
Paper Title: Forecasting GPU Performance for Deep Learning Training and Inference
Review Form: The Innovator
Summary
The authors present NEUSIGHT, a framework designed to forecast the performance of deep learning models on unseen GPUs. The central claim to novelty lies in the framework's core methodology. Instead of applying machine learning directly to predict the end-to-end latency of a full DNN kernel—a common approach in prior work—NEUSIGHT decomposes the problem. It identifies that GPU libraries execute large kernels as a collection of smaller, independent work units, which the authors term 'tiles.' The framework's key contribution is to predict performance at this finer tile granularity. Specifically, it uses a Multi-Layer Perceptron (MLP) not to predict a raw latency value, but to predict a utilization factor. This predicted utilization is then used to scale a theoretical performance bound derived from fundamental hardware limits (i.e., the roofline model). These tile-level predictions are subsequently aggregated to produce a forecast for the entire kernel and, ultimately, the full model.
Strengths
The primary strength of this paper is the novelty of its core architectural insight. It moves beyond the prevalent black-box, end-to-end prediction paradigm for GPU kernels and introduces a more physically-grounded, gray-box approach. The specific points of novelty are:
- Problem Decomposition: The decision to model performance at the tile level, rather than the kernel level, is a significant departure from cited prior art such as Habitat [62] and Li et al. [26]. While the concept of tiling itself is fundamental to GPU programming (e.g., in CUTLASS), its application as the foundational unit for an ML-based performance predictor appears to be a novel contribution. This decomposition simplifies the learning task, as the model only needs to generalize across a smaller set of tile dimensions rather than a vast space of kernel dimensions.
- Framing the Prediction Target: The most elegant element of novelty is the choice to predict a bounded utilization factor instead of an unbounded latency value. Prior work that attempts to directly predict latency forces the ML model to learn the complex, non-linear physics of the underlying hardware. By instead predicting a value between 0 and 1 that scales a known theoretical bound (the roofline), the authors constrain the problem in a principled way. This is a conceptually significant advance, as it likely accounts for the framework's superior generalization to out-of-distribution GPUs and workloads, which is the primary failure mode of existing methods (a minimal sketch of this bounded-output design follows this list).
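A minimal sketch of this bounded-output design, in PyTorch-style code with an invented feature layout, is shown below; it is meant only to illustrate why a sigmoid-bounded utilization head is an easier regression target than raw latency, and is not the authors' model.

```python
import torch
import torch.nn as nn

class UtilizationMLP(nn.Module):
    """Predicts a utilization in (0, 1) that scales an analytically computed roofline bound,
    instead of regressing an unbounded latency directly. Sizes and features are invented."""
    def __init__(self, num_features: int = 8, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),           # bounded prediction target
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

model = UtilizationMLP()
features = torch.randn(1, 8)        # e.g. tile dims, arithmetic intensity, peak FLOPS, peak BW
roofline_bound = 1.2e-5             # seconds, computed analytically for one tile
print(roofline_bound / model(features).item())
```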
Weaknesses
While the core idea is novel, its foundation and implementation raise questions about the robustness and generality of the contribution.
- Dependence on Existing Software Conventions: The novelty is tightly coupled to the current implementation paradigm of GPU libraries like cuDNN and CUTLASS, which rely on tiling. The framework's ability to extract tile dimensions is contingent on metadata from tools like the PyTorch Profiler (Section 6.1, page 10). This introduces a degree of brittleness. The proposed novel method is not a first-principles model of GPU execution but rather a model of a specific software ecosystem's execution strategy. A future shift in library design—for example, a move to dynamic or irregular tiling strategies—could potentially invalidate the core assumptions of the framework. The contribution is thus more of a highly effective engineering solution tied to current software, rather than a fundamental and enduring modeling technique.
- Ambiguity in the "Tile" Abstraction: The paper is convincing for GEMM operators, where the concept of a tile is crisp and well-defined. However, the generalization of this novel abstraction to other operators is less clear. The framework uses five separate MLPs for different operator classes. For element-wise or reduction operators, the notion of a "tile" is conceptually different and less uniform than for matrix multiplication. The paper does not sufficiently detail how the tile-based decomposition is novelly and consistently applied to these other operator types, especially in the context of operator fusion (Section 4.4, page 8), where the control and data flow can become highly complex. The claim of a general "tile-granularity" approach feels somewhat over-extended when its clearest articulation is for GEMM.
- Compositional Novelty: The contribution is a novel combination of existing concepts. The roofline model [59] is decades-old. MLPs for performance prediction are common. The concept of tile-based execution on GPUs is standard. The novelty here is in the synthesis. While effective, this is an incremental, architectural form of novelty, not a radical new theory of performance modeling.
Questions to Address In Rebuttal
To strengthen the claims of novelty and robustness, the authors should address the following:
- Generality of the Tile Abstraction: How does the tile-based prediction framework handle operators where the concept of a static, regular "tile" is ill-defined? For instance, how would it model a sparse matrix operation or a complex, hand-optimized kernel from a library like Triton, which may not expose clear tiling metadata in the way that standard cuDNN GEMM kernels do?
- Future-Proofing the Contribution: The method's dependency on tile metadata from existing profilers is a potential weakness. How would NEUSIGHT adapt if a future version of a major library (e.g., CUTLASS 4.0) fundamentally changes its tiling strategy or ceases to expose this metadata? Is the novel approach extensible to inferring tiling strategies, or is it permanently reliant on reverse-engineering specific library implementations?
- Operator Fusion Complexity: The description of handling operator fusion in Section 4.4 appears to be an oversimplification (e.g., summing FLOPs and using the first operator's predictor). A fused kernel is not simply the sum of its parts; it has a unique execution profile. Could the authors provide a more detailed example of how the tile-level prediction model, which is the core novel idea, is applied to a non-trivial fused kernel, such as a convolution followed by a bias addition and a ReLU activation? Does a single "tile" prediction still make sense in this context? (A back-of-the-envelope version of this concern is sketched after this list.)
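A back-of-the-envelope version of the fusion concern, with invented shapes and FP16 storage, is sketched below: summing FLOPs is straightforward, but fusion also eliminates the intermediate HBM traffic, so the fused kernel's arithmetic intensity (and hence which roofline regime it sits in) differs from that of the first operator alone.

```python
# Conv -> bias -> ReLU on a [N=1, C=64, H=56, W=56] input with a 3x3 kernel and 64 output
# channels, stored in FP16 (2 bytes per element). All shapes are invented for illustration.
N, C_in, H, W, C_out, K = 1, 64, 56, 56, 64, 3
elems_in = N * C_in * H * W
elems_out = N * C_out * H * W
weight_elems = C_out * C_in * K * K

fused_flops = (2 * elems_out * C_in * K * K   # conv MACs counted as 2 FLOPs
               + elems_out                    # bias adds
               + elems_out)                   # ReLU comparisons

# Unfused: every operator reads its input from and writes its output to HBM.
bytes_unfused = 2 * ((elems_in + weight_elems + elems_out)   # conv
                     + (elems_out + C_out + elems_out)       # bias
                     + (elems_out + elems_out))              # ReLU
# Fused: only the external input, weights, bias vector, and final output touch HBM.
bytes_fused = 2 * (elems_in + weight_elems + C_out + elems_out)

print("FLOPs:", fused_flops)
print("arithmetic intensity, unfused chain:", fused_flops / bytes_unfused)
print("arithmetic intensity, fused kernel :", fused_flops / bytes_fused)
```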