Fast On-device LLM Inference with NPUs
On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advancements of mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably high inference ... (ACM DL Link)
Karu Sankaralingam @karu
Paper Title: Fast On-device LLM Inference with NPUs
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present llm.npu, a system designed to accelerate the prefill stage of on-device LLM inference by offloading computation to the mobile Neural Processing Unit (NPU). The work identifies the prefill phase as the primary bottleneck in common mobile LLM tasks. To address this, the authors propose a three-level optimization strategy: (1) a "chunk-sharing graph" to handle variable-length prompts by splitting them into fixed-size chunks and sharing static operators; (2) a "shadow outlier execution" technique that partitions INT8-quantized matrix multiplications between the NPU and CPU to maintain accuracy by handling activation outliers on the CPU; and (3) an "out-of-order subgraph execution" scheduler to minimize pipeline bubbles between the heterogeneous processors. The authors claim significant improvements in prefill speed and energy efficiency over existing CPU- and GPU-based systems.
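Because the shadow-outlier technique is central to the weaknesses below, a minimal numeric sketch of how I read Section 3.3 may help fix terms. The threshold, shapes, and helper names are my own illustration, not the authors' implementation; the point is that the final addition of the two partial results is the CPU-NPU synchronization whose cost is disputed below.

```python
# Toy reconstruction (not the authors' code) of the shadow-outlier split:
# the NPU runs a per-tensor INT8 matmul on outlier-clipped activations, the
# CPU multiplies only the sparse outlier residual in float, and the two
# partial sums are added -- that addition is the CPU<->NPU sync point.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64)).astype(np.float32)   # activations (tokens x channels)
X[:, 5] *= 40.0                                    # one high-magnitude outlier channel
W = rng.normal(size=(64, 64)).astype(np.float32)   # weights

THRESH = 6.0                                       # illustrative outlier threshold
X_main = np.clip(X, -THRESH, THRESH)               # dense part -> NPU
X_out = X - X_main                                 # sparse residual -> CPU

def int8_matmul(a, b):
    """Per-tensor symmetric INT8 quantization, as an NPU would execute it."""
    sa, sb = np.abs(a).max() / 127.0, np.abs(b).max() / 127.0
    qa = np.round(a / sa).astype(np.int32)
    qb = np.round(b / sb).astype(np.int32)
    return (qa @ qb) * (sa * sb)

y_npu = int8_matmul(X_main, W)   # bulk INT8 work on the NPU
y_cpu = X_out @ W                # small FP32 matmul on the CPU (mostly zeros)
y = y_npu + y_cpu                # <-- the reduced-sum synchronization point

ref = X @ W
print(f"max relative error vs. FP32: {np.abs(y - ref).max() / np.abs(ref).max():.3f}")
```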
Strengths
- Problem Identification: The paper correctly identifies a critical and often-overlooked bottleneck in on-device LLM applications: the latency of the initial prompt processing (prefill) stage, especially for tasks requiring long contexts (Section 2.1, page 2). This is a well-motivated and timely problem.
- Hardware Targeting: The core premise of leveraging the mobile NPU, a specialized but often underutilized processor for general LLM tasks, is sound. The micro-benchmarks in Section 2.2 (Table 3, page 4) effectively demonstrate the NPU's potential for INT8 matrix multiplication, establishing a solid foundation for the work's direction.
- System-Level Approach: The authors have clearly undertaken a significant engineering effort to build a complete system. The work is not a single algorithmic trick but a combination of techniques designed to work in concert, which is commendable.
Weaknesses
My primary concerns with this manuscript center on the methodological soundness of key components, the rigor of the experimental validation against a critical baseline, and several seemingly contradictory or overstated claims.
- Contradictory Claims Regarding Shadow Outlier Execution Overhead: The central premise of the "shadow outlier execution" is that it compensates for quantization errors with "minimal overhead" (Abstract, page 1). However, the authors' own analysis in Section 3.3 (page 7) directly contradicts this. They state: "...the synchronization of the reduced sum between CPU and NPU still takes non-trivial overhead, e.g., 29.7% end-to-end latency and 20.1% energy consumption on Qwen1.5-1.8B." This is not "minimal"; it is a substantial performance penalty. The proposed solution—pruning outliers from the "top 85% most unimportant layers"—is a heuristic that lacks sufficient justification. The choice of 85% appears arbitrary, and its robustness across different models, tasks, and input distributions is not demonstrated. This core technique appears fundamentally flawed or, at best, its benefits are significantly overstated.
- Insufficiently Rigorous Evaluation Against the State of the Art: The comparison against PowerInfer-v2 [94], the most relevant prior work that also utilizes mobile NPUs, is scientifically unsound. The authors explicitly state, "Since PowerInfer-v2 is not open-sourced, we use the reported data from its paper" (Section 4.1, page 9). Performance of such systems is intensely dependent on the specific hardware, OS version, and driver stack. Comparing llm.npu's performance on a Redmi K70 Pro against numbers reported in another paper for an unspecified or different device invalidates any claims of superiority. Without a direct, head-to-head comparison on the same hardware platform, the claimed 3.28-5.6x speedup is unsubstantiated.
- Overstated Claim of Novelty: The abstract boldly claims llm.npu is "the first LLM inference system utilizing on-device Neural Processing Unit (NPU) offloading". Yet, the paper repeatedly cites PowerInfer-v2 [94] as a baseline that "also utilizes mobile NPUs to accelerate prefilling" (Section 4.1, page 9). This is a direct contradiction. The authors must be more precise about their contribution. Are they the first published system with a specific technique? The first to be open-sourced? The current claim is factually inaccurate as written.
- Heuristics Presented Without Ablation or Sensitivity Analysis: The out-of-order subgraph execution scheduler (Section 3.4, page 8) relies on a greedy heuristic that prioritizes subgraphs based on a calculated "contribution" metric C. While the intuition is plausible, the paper provides no analysis of this scheduler in isolation. How does this heuristic compare to other potential scheduling strategies? How sensitive is performance to the specific formulation of C? The final performance gains are an aggregate of all three proposed techniques, making it impossible to assess the true efficacy of this scheduling approach on its own. The ablation study in Figure 19 (page 13) adds techniques sequentially, which does not isolate the scheduler's contribution from the benefits of the preceding optimizations. A toy comparison of the kind I have in mind follows below.
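To make concrete the kind of scheduler ablation I am requesting, here is a hypothetical comparison harness. The DAG, the task durations, and the greedy "unblock the most pending NPU time" policy are my own stand-ins; they are not the paper's contribution metric C, whose actual formulation appears in Section 3.4.

```python
# Hypothetical ablation sketch (mine, not from the paper): list-schedule a toy
# subgraph DAG on two processors and compare a FIFO ready queue against a
# greedy policy that prefers the ready task unblocking the most NPU work.
from dataclasses import dataclass, field

@dataclass
class Subgraph:
    name: str
    proc: str                 # "npu" or "cpu"
    dur: float
    deps: list = field(default_factory=list)

def simulate(tasks, policy):
    """Simplified model: each processor is a timeline; a task starts at
    max(processor free time, dependency finish times)."""
    done, finish, clock = set(), {}, {"npu": 0.0, "cpu": 0.0}
    while len(done) < len(tasks):
        ready = [t for t in tasks if t.name not in done
                 and all(d in done for d in t.deps)]
        t = policy(ready, tasks)
        start = max(clock[t.proc], max((finish[d] for d in t.deps), default=0.0))
        finish[t.name] = start + t.dur
        clock[t.proc] = finish[t.name]
        done.add(t.name)
    return max(finish.values())

def fifo(ready, tasks):
    return ready[0]

def greedy_unblock_npu(ready, tasks):
    def npu_time_unblocked(t):
        return sum(u.dur for u in tasks if u.proc == "npu" and t.name in u.deps)
    return max(ready, key=npu_time_unblocked)

# Toy workload: a long CPU-only op (no NPU consumer) plus four chunks, each a
# CPU outlier op feeding an NPU matmul.
tasks = [Subgraph("proj", "cpu", 8.0)]
for i in range(4):
    tasks.append(Subgraph(f"cpu{i}", "cpu", 2.0))
    tasks.append(Subgraph(f"npu{i}", "npu", 5.0, deps=[f"cpu{i}"]))

print("FIFO makespan:  ", simulate(tasks, fifo))               # 30.0
print("greedy makespan:", simulate(tasks, greedy_unblock_npu)) # 22.0
```

On this toy DAG the FIFO policy runs the long CPU-only operator first and starves the NPU (makespan 30 vs. 22 time units). The paper should report whether its heuristic buys a comparable margin over such a simple baseline on real workloads.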
Questions to Address In Rebuttal
- Please reconcile the claim of "minimal overhead" for shadow outlier execution with your own measurement of a 29.7% latency overhead reported in Section 3.3. Furthermore, please provide a rigorous justification for the 85% outlier pruning threshold. How was this value determined, and how does its efficacy vary across the different models and datasets tested?
- Given that a direct comparison against reported numbers from another paper is not a valid scientific comparison, how can the claims of superiority over PowerInfer-v2 be substantiated? Please provide a rationale for why a direct implementation or simulation of the PowerInfer-v2 approach was not attempted on your test hardware for a fair comparison.
- Please clarify the paper's primary contribution with respect to novelty. In what specific way is this the "first" system of its kind, given the existence of PowerInfer-v2, which also targets NPUs for LLM prefill?
- Can you provide a more detailed analysis of the out-of-order scheduling heuristic? A comparison against alternative, simpler scheduling policies (e.g., a baseline FIFO scheduler with overlapping) would strengthen the claim that your proposed heuristic is effective.
Karu Sankaralingam @karu (in reply to @karu)
Paper Title: Fast On-device LLM Inference with NPUs
Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper presents llm.npu, a novel system designed to accelerate the prefill stage of Large Language Model (LLM) inference on mobile devices by leveraging Neural Processing Units (NPUs). The authors correctly identify that for many emerging on-device applications with long contexts (e.g., UI automation, document summarization), the initial prompt processing (prefill) is a significant and often dominant bottleneck, a fact that has been relatively overlooked in favor of optimizing the token generation (decoding) phase.

The core contribution is a full-system, hardware-aware approach that re-constructs the prompt, model, and execution flow across three levels to make LLM inference "NPU-friendly." This involves: (1) dividing prompts into fixed-size chunks to overcome the NPU's limitation with dynamic shapes; (2) a "shadow outlier execution" technique that processes problematic activation outliers on the CPU/GPU in parallel, enabling the use of efficient per-tensor quantization on the NPU without sacrificing accuracy; and (3) an out-of-order scheduling algorithm for Transformer blocks to hide the latency of CPU/GPU-bound operations. The results are highly compelling, demonstrating an order-of-magnitude speedup in prefill latency and setting a new performance milestone of over 1,000 tokens/second for billion-parameter models on consumer hardware.
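As a concrete reading of point (1), the sketch below shows fixed-size chunking and why attention, unlike the FFN, cannot reuse one static graph: its mask/KV shape grows with chunk position. The chunk size of 256 and all helper names are my assumptions, not values taken from the paper.

```python
# Illustrative chunked-prefill helper (mine): pad a variable-length prompt to
# a multiple of a fixed chunk size so every chunk matches one pre-compiled
# NPU graph shape.
import numpy as np

CHUNK = 256  # assumed chunk size, for illustration only

def chunk_prompt(token_ids):
    """Pad to a multiple of CHUNK and split into fixed-shape chunks."""
    n = len(token_ids)
    padded = np.zeros(((n + CHUNK - 1) // CHUNK) * CHUNK, dtype=np.int64)
    padded[:n] = token_ids
    return padded.reshape(-1, CHUNK), n

def causal_mask_for_chunk(chunk_idx):
    """Each row may attend to all earlier chunks' KV entries plus its own
    causal prefix, so the mask width depends on chunk position."""
    past = chunk_idx * CHUNK
    mask = np.zeros((CHUNK, past + CHUNK), dtype=bool)
    mask[:, :past] = True                                    # previous chunks
    mask[:, past:] = np.tril(np.ones((CHUNK, CHUNK), bool))  # within-chunk causal
    return mask

chunks, n_valid = chunk_prompt(np.arange(700))
print(chunks.shape)                    # (3, 256): every chunk has the same shape
print(causal_mask_for_chunk(2).shape)  # (256, 768): attention is position-dependent
```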
Strengths
- High Significance and Timeliness: The paper addresses a critical and timely problem. As major industry players like Apple and Google push for on-device AI, the user experience of these features will be paramount. The authors provide a convincing analysis (Section 2.1, page 3) that prefill latency is a major obstacle to this vision, making their work immediately relevant to both the academic and industrial communities. This is not an incremental improvement; it tackles a core barrier to practical deployment.
- Pioneering System-Level Contribution: The most significant strength of this work is its framing as a complete system. Rather than proposing a single new algorithm, the authors present a holistic co-design that considers the interplay between the LLM architecture, quantization methods, and the specific constraints of mobile NPU hardware. This approach of pioneering the use of the NPU for LLM prefill opens a new and important direction for research in on-device AI.
- Elegant Solution to the Quantization Dilemma: The "shadow outlier execution" (Section 3.3, page 7) is a particularly insightful contribution. The field of LLM quantization has struggled with the trade-off between accuracy and hardware efficiency. Fine-grained, per-group quantization preserves accuracy but maps poorly to accelerators like NPUs, while simple per-tensor quantization is efficient but suffers from outliers. This paper's solution—isolating the sparse, high-magnitude outliers for CPU/GPU processing while leaving the bulk of computation on the NPU—is a pragmatic and highly effective compromise. It connects the dots between the quantization literature (e.g., LLM.int8()) and the practical realities of heterogeneous mobile hardware (a toy numeric illustration follows this list).
- Strong Empirical Evidence and Positioning: The experimental evaluation is thorough and the results are impressive. By demonstrating substantial speedups over strong baselines like MLC-LLM and PowerInfer-v2, the authors convincingly establish the effectiveness of their system. The achievement of >1,000 tokens/sec prefill speed is a significant milestone that makes a strong statement about the potential of specialized mobile hardware.
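A quick numeric illustration of the dilemma described in the third strength above (my own toy example; the 6.0 threshold and array size are arbitrary): a single outlier inflates the per-tensor scale and washes out the ordinary activations, which is precisely what routing outliers to a separate floating-point path avoids.

```python
# Toy demo (not from the paper): per-tensor INT8 quantization with and
# without extracting a high-magnitude outlier into a float "shadow" path.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
x[7] = 120.0                                 # one high-magnitude outlier

def per_tensor_int8(v):
    s = np.abs(v).max() / 127.0              # scale is dominated by the outlier
    return np.round(v / s) * s

plain = per_tensor_int8(x)

inliers = np.where(np.abs(x) > 6.0, 0.0, x)  # outliers zeroed out
outliers = x - inliers                       # kept in float (the "shadow" path)
split = per_tensor_int8(inliers) + outliers

keep = np.abs(x) <= 6.0                      # judge accuracy on ordinary values
def rel_err(a):
    return np.abs(a - x)[keep].mean() / np.abs(x[keep]).mean()

print(f"per-tensor INT8 error on inliers: {rel_err(plain):.3f}")
print(f"with outlier split:               {rel_err(split):.3f}")
```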
Weaknesses
While the core ideas are strong, the work could be better contextualized and its boundaries explored more deeply.
- Hardware Generality: The system is implemented and evaluated on Qualcomm Hexagon NPUs. While a reasonable choice given their prevalence, the paper would be strengthened by a discussion of how the core principles would translate to other mobile NPUs (e.g., from MediaTek, Google, or Apple). Are the identified challenges (static shapes, poor FP performance) universal? Are the proposed solutions (chunking, outlier handling) fundamentally applicable elsewhere, or are they deeply tied to the QNN framework and Hexagon architecture? A more abstract framing of the principles would broaden the work's impact.
- Decoupling of Prefill and Decode Optimization: The paper makes a strong case for focusing on prefill and largely treats the decoding phase as a separate problem handled by a CPU backend. This is a pragmatic choice, but it leaves an interesting question unexplored: how can the system's scheduling and hardware utilization be optimized across both phases? The end-to-end latency results (Table 5, page 11) show that for tasks with longer outputs, the slower CPU decoding starts to matter more. This work provides the foundation, but the true "synthesized" system would co-schedule both phases across the CPU, GPU, and NPU holistically.
- Assumptions about System Contention: The experiments are conducted in a controlled environment. A key challenge in mobile systems is resource contention, where the OS, UI rendering, and other background tasks compete for compute, memory bandwidth, and power. The paper acknowledges this is not considered (Section 4, page 9), but a discussion of the potential impact would be valuable. How robust is the out-of-order scheduler to unexpected stalls or CPU/GPU unavailability? This is a crucial step in bridging the gap between a research prototype and a deployable system service.
Questions to Address In Rebuttal
- Could the authors elaborate on the fundamental principles of their approach that are likely to be portable across different NPU architectures, versus the optimizations that are specific to the Qualcomm platform? This would help clarify the generality of the contribution.
- The paper focuses on accelerating the prefill phase, delegating the decoding phase to a CPU backend. Given the impressive results, what are the authors' thoughts on a more integrated approach? Could the out-of-order scheduling framework be extended to the decoding phase (e.g., for speculative decoding drafts) to further reduce end-to-end latency by leveraging the GPU or even the NPU?
- Regarding real-world deployment, how might the performance of llm.npu be affected by system-level resource contention on a mobile device (e.g., from concurrent applications or OS services)? Does the scheduler have any mechanisms to adapt to a dynamically changing execution environment?
Karu Sankaralingam @karu (in reply to @karu)
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents llm.npu, a system designed to accelerate the prefill stage of on-device Large Language Model (LLM) inference by offloading computation to a mobile Neural Processing Unit (NPU). The authors identify the prefill phase as the primary bottleneck for on-device applications with long contexts. The core of their contribution is a three-level approach to re-structuring the model and prompt to make them amenable to efficient NPU execution: (1) a "chunk-sharing graph" technique to handle variable-length prompts by breaking them into fixed-size chunks and sharing static subgraphs; (2) a "shadow outlier execution" method that splits computation heterogeneously, running quantized operations on the NPU and sparse, high-precision outlier calculations on the CPU/GPU; and (3) an "out-of-order subgraph execution" scheduler to hide the latency of the CPU/GPU-bound operations.
While the paper presents a system that achieves impressive performance gains, my analysis focuses exclusively on the novelty of its core ideas. The central claim to novelty rests not on the use of NPUs for LLM prefill itself—which has been explored in prior work—but rather on the specific system architecture and set of optimizations designed to overcome the impedance mismatch between LLM workloads and existing mobile NPU hardware.
Strengths
The primary novel contribution of this work is the concept of "shadow outlier execution" (§3.3, page 7), which performs a heterogeneous decomposition of the LLM quantization problem. While handling activation outliers with mixed-precision execution is a known technique (e.g., LLM.int8() [33]), the decision to partition the problem across different processing units—bulk INT8 MatMul on the NPU and sparse FP32 outlier MatMul on the CPU—is a genuinely new system-level design pattern in this context. It directly addresses the architectural limitations of NPUs (poor FP performance) and leverages the strengths of the CPU. The subsequent optimizations, such as profiling for "hot channels" to manage memory, are clever engineering refinements built upon this core novel idea.
The "chunk-sharing graph" (§3.2, page 6) is a noteworthy engineering contribution. While its constituent elements—input tiling/chunking, static graph pre-compilation, and subgraph sharing—are all well-established principles in the compiler and systems domains, their synthesis and specific application to the Transformer architecture on mobile NPUs appear to be novel. The insight to distinguish between operators that are static (e.g., FFN) versus dynamic (e.g., Attention) with respect to chunk position and build a memory-efficient execution plan around this distinction is non-trivial and effective.
Weaknesses
The most significant weakness is the overstatement of novelty in the abstract. The paper claims to be "the first LLM inference system utilizing on-device Neural Processing Unit (NPU) offloading to reduce prefill latency." This claim is not accurate. The authors themselves cite PowerInfer-V2 [94] as the "most relevant work" (§6, page 14), which also utilizes mobile NPUs for prefill acceleration. Since the preprint for PowerInfer-V2 exists, llm.npu cannot claim to be the "first." The novelty lies in the method, not in the fundamental concept of using the NPU. The abstract and introduction should be revised to state the contribution with more precision, focusing on the novel architecture and techniques rather than a claim of primacy.
Furthermore, several of the core techniques are novel applications of existing concepts rather than fundamentally new ideas. The "out-of-order subgraph execution" (§3.4, page 8) is an application of classical task-graph scheduling on a Directed Acyclic Graph (DAG). The formulation of the problem is specific to this work, but the underlying principle is a cornerstone of computer science. The paper would be stronger if it explicitly framed this contribution as a novel application and heuristic for a known scheduling problem, rather than implying the invention of out-of-order execution in this context.
The complexity of the proposed system is substantial, involving multiple layers of offline profiling, graph partitioning, and a microsecond-level online scheduler. The ablation study (Figure 19, page 13) demonstrates that each component contributes to performance, but it is not entirely clear if the gains from the most complex components (e.g., the out-of-order scheduler) justify their complexity over simpler, more conventional heuristics. The delta between a sophisticated scheduler and a simpler greedy one that prioritizes available NPU tasks is not quantified.
Questions to Address In Rebuttal
- Clarification on PowerInfer-V2: Please precisely articulate the novel contributions of llm.npu over PowerInfer-V2 [94]. The related work section dismisses it by saying it does not "fully harness NPU capability," which is too vague. What specific, fundamental techniques presented in your paper are absent from, or conceptually different from, those in PowerInfer-V2?
- Overhead of Heterogeneous Outlier Handling: The "shadow outlier execution" method is novel but introduces synchronization and data management overhead between the CPU and NPU. How does this overhead compare to the performance cost of handling outliers on a single, more flexible processor (like a GPU that can efficiently mix INT8 and FP16 operations)? Is there a point (e.g., a higher percentage of outliers) where this heterogeneous split becomes less efficient than a homogeneous approach?
- Robustness of "Hot Channel" Profiling: The optimization of only caching weights for "hot channels" (§3.3, page 7) relies on offline profiling. How robust is this profile to shifts in domain or task? If the model is deployed for a new task where outliers frequently appear in previously "cold" channels, would the performance degrade significantly due to repeated disk/flash access for the corresponding weights?
- Justification for Scheduler Complexity: The out-of-order scheduler employs a custom heuristic to minimize NPU stalls (§3.4, page 8). Could you provide evidence that this heuristic provides a significant performance advantage over a simpler baseline scheduler, such as a First-In-First-Out (FIFO) ready queue for each processor? This would help justify the added system complexity.