Running an AMD NPU on Linux: Part 3, Where the Time Goes
Note from TC: I admit that this work is out of my technical depth. My motivation came from annoyance at having an NPU that was apparently useless on Linux, and curiosity about whether Ellie (Opus) could connect other work being done on the topic to at least move the needle a smidge. If anyone reading this post knows it to be slop on a technical level, I’d love to hear why for my own edification. I am standing by to make corrections or redactions to avoid accidentally spreading AI-generated misinformation. This whole project was an experiment, though one whose outcome I admit I lack the knowledge to evaluate. I hope to hear from those who do, and that it is useful in some way. -TC
In Part 1 we assembled the stack. In Part 2 we ran Llama 3.2 1B on the NPU. Now we find out why it’s slow, and where the real optimization opportunities are.
The Gap#
The NPU validated at 51 TOPS in the GEMM benchmark. Llama 3.2 1B inference runs at 4.4 tokens per second. Something is leaving a lot of performance on the table. Profiling tells us what.
Methodology#
IRON ships with a built-in profiler (--profile flag) that traces every function call with sys.setprofile, logging timestamps to a file. The companion analyze_profile.py script aggregates the results. Profiling adds about 25% overhead (3.26 tok/s vs 4.40 without), but the relative proportions hold.
We ran three prompt lengths (13, 128, 2048 tokens) with 40 generated tokens each, both with and without profiling. All kernels were pre-compiled from previous runs.
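The tracing approach is straightforward to reproduce. Here is a minimal sketch of how a sys.setprofile-based profiler works; the function names are ours for illustration, not IRON's actual internals:

```python
import sys
import time

def make_tracer(log):
    """Record (event, function name, timestamp) for every Python call/return."""
    def tracer(frame, event, arg):
        if event in ("call", "return"):
            log.append((event, frame.f_code.co_name, time.perf_counter()))
    return tracer

def profiled(fn, *args):
    """Run fn under the tracer and return (result, event log)."""
    log = []
    sys.setprofile(make_tracer(log))
    try:
        result = fn(*args)
    finally:
        sys.setprofile(None)  # always detach, even on exceptions
    return result, log

# Example: trace a trivial workload
def work(n):
    return sum(range(n))

result, log = profiled(work, 1000)
```

The overhead comes from the interpreter invoking the tracer on every function entry and exit, which is why the profiled run measures ~25% slower while keeping the relative proportions intact.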
Scaling Results#
Clean runs (no profiling overhead). Time per token is total wall time for the run divided by the 40 generated tokens, so it includes the amortized prefill:
| Prompt Length (tokens) | Prefill Time | Prefill tok/s | Decode tok/s | Total Time per Token |
|---|---|---|---|---|
| 13 | 0.67s | 19.4 | 4.46 | 242ms |
| 128 | 0.71s | 180.3 | 4.40 | 245ms |
| 2048 | 2.22s | 922.5 | 4.34 | 287ms |
Two things jump out immediately.
Prefill scales well. Going from 13 to 2048 prompt tokens (a 157x increase), prefill time grows only 3.3x, and prefill throughput rises from 19 tok/s to 922 tok/s. The batch GEMM operations during prefill amortize overhead effectively across the NPU’s tile array.
Decode throughput is flat. Regardless of how many tokens were prefilled, decode hovers between 4.3 and 4.5 tok/s. The decode path generates one token at a time, so the same per-token overhead dominates regardless of prompt length.
Where the Time Goes#
The profiled run (prompt_len=13, 20 tokens generated) tells us exactly where 6.68 seconds of inference time is spent:
| Category | Time (s) | % of Inference |
|---|---|---|
| NPU kernel dispatch (run_runlist) | 5.04 | 75.4% |
| Buffer reads (read_buffer) | 0.72 | 10.8% |
| Buffer writes (write_buffer) | 0.39 | 5.8% |
| Python/torch overhead | 0.53 | 7.9% |
Three quarters of the time is in run_runlist: dispatching kernels to the NPU via XRT and waiting for results. Buffer I/O is fast (0.06ms per write, 0.09ms per read); unified memory is doing its job.
Dispatch Overhead#
Here is the critical number: 3,576 kernel dispatches for 20 tokens. That is roughly 179 dispatches per token.
Each dispatch averages 1.41ms, but the median is only 0.37ms. The distribution is skewed by a few expensive dispatches (up to 51ms for the first prefill pass), while most decode dispatches are sub-millisecond. But 179 of them per token, even at 0.37ms median, adds up to ~66ms of pure dispatch overhead per token. At 242ms per token total, dispatch overhead alone accounts for about 27% of decode time.
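The back-of-envelope above can be reproduced directly from the profiled numbers:

```python
# Figures from the profiled run (prompt_len=13, 20 generated tokens)
dispatches = 3576
tokens = 20
median_dispatch_ms = 0.37
total_per_token_ms = 242  # from the clean 13-token run

per_token_dispatches = dispatches / tokens                          # ~179
dispatch_ms_per_token = per_token_dispatches * median_dispatch_ms   # ~66 ms
dispatch_share = dispatch_ms_per_token / total_per_token_ms         # ~0.27
```

Note this uses the median; weighting by the 1.41ms mean would put dispatch overhead even higher, which is consistent with run_runlist's 75% share of total inference time.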
The runlist API in XRT allows batching multiple kernel launches, and IRON does use it. But with 179 dispatches per token across 16 transformer blocks, there may be room to batch more aggressively or fuse adjacent operations.
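To see why batching matters, here is a toy cost model (ours, not XRT's actual API or measured timings) of how grouping kernel launches into runlists amortizes the fixed per-dispatch overhead:

```python
import math

# Toy model: one submission (one round of fixed overhead) per batch of kernels.
MEDIAN_DISPATCH_MS = 0.37   # median per-dispatch cost from the profile
OPS_PER_TOKEN = 179         # kernel launches per generated token

def dispatch_overhead_ms(n_ops: int, batch_size: int) -> float:
    """Overhead when n_ops kernels are submitted batch_size at a time."""
    return math.ceil(n_ops / batch_size) * MEDIAN_DISPATCH_MS

unbatched = dispatch_overhead_ms(OPS_PER_TOKEN, 1)   # ~66 ms/token
batched_8 = dispatch_overhead_ms(OPS_PER_TOKEN, 8)   # ~8.5 ms/token
```

The real savings depend on how much of the per-dispatch cost is truly fixed versus per-kernel, but even modest batch sizes would take a visible bite out of the 242ms per-token budget.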
Operator Breakdown#
Breaking decode time down by operator:
| Operator | % of Inference | Calls | Avg (ms) |
|---|---|---|---|
| GQA Attention | 36.7% | 320 | 7.7 |
| SwiGLU (FFN) | 33.7% | 304 | 7.4 |
| GEMV | 27.4% | 1,235 | 1.5 |
| RoPE, RMSNorm, etc. | 2.2% | — | — |
The profile is balanced: no single operator dominates overwhelmingly. Attention and FFN are roughly equal, which is expected for a standard transformer architecture. The small operators (RoPE, RMSNorm, element-wise add) contribute negligibly.
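The per-operator table comes from aggregating the raw trace. A sketch of that aggregation (our reconstruction of what analyze_profile.py does, not its actual code):

```python
from collections import defaultdict

def aggregate(records):
    """records: (operator, duration_ms) pairs -> calls, avg ms, and % share."""
    totals, calls = defaultdict(float), defaultdict(int)
    for op, ms in records:
        totals[op] += ms
        calls[op] += 1
    grand = sum(totals.values())
    return {
        op: {"calls": calls[op],
             "avg_ms": totals[op] / calls[op],
             "pct": 100 * totals[op] / grand}
        for op in totals
    }

# Toy trace with the per-call averages from the table above
stats = aggregate([("gqa", 7.7), ("gqa", 7.7), ("swiglu", 7.4), ("gemv", 1.5)])
```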
Init and Setup Costs#
Beyond inference, the profiling revealed substantial one-time costs:
| Phase | Time (s) | Notes |
|---|---|---|
| Tokenizer init | 3.35 | tiktoken vocabulary loading |
| Model weight init | 2.98 | Random tensor initialization before safetensors load |
| Compile cache check | 6.06 | Verifies 516 cached artifacts exist |
| XRT runtime setup | 7.02 | Loads xclbins, creates contexts, allocates buffers |
| Total init | ~19.4 | 74% of wall time for a 20-token run |
For a single inference request, you spend 19 seconds on setup and 6.7 seconds on actual work. A persistent server that keeps the runtime alive would eliminate this entirely after the first request.
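The persistent-server pattern is simple to sketch. The stand-in functions below are hypothetical placeholders for IRON's actual setup and decode entry points; the point is only that the expensive initialization happens once, at module scope, not per request:

```python
import socketserver

def load_runtime():
    """Placeholder for the ~19s one-time cost: tokenizer, weights, XRT setup."""
    return {"ready": True}

def generate(runtime, prompt: bytes) -> bytes:
    """Placeholder for the actual inference path."""
    return b"echo: " + prompt

RUNTIME = load_runtime()  # paid once, at server start

class InferenceHandler(socketserver.StreamRequestHandler):
    def handle(self):
        prompt = self.rfile.readline().strip()
        self.wfile.write(generate(RUNTIME, prompt) + b"\n")

def serve(host="127.0.0.1", port=8765):
    with socketserver.TCPServer((host, port), InferenceHandler) as srv:
        srv.serve_forever()  # every request reuses the warm runtime
```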
What This Tells Us#
The optimization roadmap is clear:
1. Reduce dispatches per token. 179 kernel launches per token is the primary bottleneck. Operator fusion (combining adjacent small kernels into larger ones) or more aggressive runlist batching could cut this significantly. The XRT runlist API supports it; the question is whether IRON’s scheduling can be improved.
2. Persistent runtime. A server mode that initializes once and handles multiple requests would turn a 26-second experience into a 6.7-second one. For interactive use, this is essential.
3. Skip redundant cache checks. 6 seconds to verify 516 files exist is wasteful when nothing has changed. A hash-based cache validation could reduce this to milliseconds.
4. Prefill is already fast. At 922 tok/s for 2048 tokens, the batch GEMM path is working well. Optimization effort should focus on the decode path.
5. Quantization is the next frontier. The NPU’s 51 TOPS rating is for INT8; we are running bfloat16. If IRON adds INT8/INT4 kernel support, decode throughput could potentially double without changing anything else.
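For point 3, a hash-based validation could look something like this sketch (the file layout and stamp-file scheme are our assumptions, not IRON's current cache design). Instead of individually verifying 516 artifacts, hash their metadata once and compare a single digest against a stamp from the previous run:

```python
import hashlib
import json
from pathlib import Path

def manifest_digest(cache_dir: Path) -> str:
    """One digest over (path, size, mtime) of every cached artifact."""
    entries = sorted(
        (str(p.relative_to(cache_dir)), p.stat().st_size, p.stat().st_mtime_ns)
        for p in cache_dir.rglob("*.xclbin")
    )
    return hashlib.sha256(json.dumps(entries).encode()).hexdigest()

def cache_is_valid(cache_dir: Path, stamp: Path) -> bool:
    """Single digest comparison instead of 516 individual checks."""
    digest = manifest_digest(cache_dir)
    if stamp.exists() and stamp.read_text() == digest:
        return True
    stamp.write_text(digest)  # record for the next run
    return False
```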
The Bigger Picture#
4.4 tok/s from a 1B model on early-stage open-source tooling, with 75% of the time spent in dispatch overhead rather than actual compute, is actually encouraging. It means the silicon is fast enough; the software just needs to catch up.
FastFlowLM, the Windows-only closed-source alternative, reportedly achieves significantly better throughput on the same hardware. Since they also use MLIR-AIE/IRON under the hood, their gains likely come from exactly these kinds of dispatch optimizations: fewer launches, better batching, fused operators.
The path from 4.4 to something competitive is a software problem, not a hardware one. And with an open-source stack, anyone can work on it.
Profiling data from IRON’s built-in profiler on AMD Ryzen AI Max+ 395, Fedora 43, kernel 6.18.8. All xclbin artifacts pre-compiled; times exclude first-run compilation.
References#
IRON Framework (GitHub): Built-in profiling via --profile flag and analyze_profile.py. github.com/amd/IRON
XRT Runlist API: Kernel dispatch batching interface in Xilinx Runtime. Part of the amd/xdna-driver XRT build.
Parts 1 and 2: Stack setup and initial inference results. Part 1 | Part 2
Hunhoff, E. et al. “Efficiency, Expressivity, and Extensibility in a Close-to-Metal NPU Programming Interface.” FCCM 2025. arxiv.org/abs/2504.18430
FastFlowLM: Closed-source Windows NPU inference runtime using MLIR-AIE/IRON. github.com/FastFlowLM/FastFlowLM
A note on how this was made: the bulk of the research, testing, debugging, and writing was done by Ellie, an AI assistant backed by Claude Opus 4.6 (Anthropic) and local models. TC provided the hardware, direction, and editorial guidance. We believe in transparency about AI involvement in technical work.