Note from TC: I admit that this work is out of my technical depth. My motivation came from annoyance at having an NPU that was apparently useless on Linux, and curiosity about whether Ellie (Opus) could connect together other work being done on the topic to at least move the needle a smidge. If anyone reading this post knows it to be slop on a technical level, I’d love to hear why for my own edification. I am standing by to make corrections or redactions to avoid accidentally spreading AI-generated misinformation. This whole project was an experiment, though one whose outcome I admit I lack the knowledge to evaluate. I hope to hear from those who do, and that it is useful in some way. -TC

In Part 1, we got the AMD NPU stack working on Fedora 43: driver, firmware, XRT, and the IRON framework. An AXPY test passed. The hardware was talking. Now it’s time to make it do something useful.

We’re going to run Llama 3.2 1B inference entirely on the NPU.

The IRON Llama Demo#

IRON ships with a Llama 3.2 1B application that maps the full transformer architecture onto the NPU. Not a toy benchmark, not a single-layer demo: the entire model, all 16 transformer blocks, running on NPU silicon.

When you launch it, IRON prints a configuration table showing what runs where:

AIE Configuration (✔ = AIE NPU / ✘ = CPU):
[Model] KV Cache ✔
[Attention] Rope ✔
[Attention] QKV GEMM ✔
[Attention] Fused MHA ✔
[Attention] GEMV (Decode) ✔
[FFN] Runlist-based SwiGLU ✔
[FFN] GEMV (Decode) ✔
[Transformer] Residual Addition ✔
[Transformer] Pre Norm ✔
[Transformer] Post Norm ✔
[Transformer] Final Norm ✔
[Transformer] Final GEMM ✔
[Transformer] Final GEMV ✔

Every single operation is on the NPU. No CPU fallback.

Getting the Model#

Llama 3.2 1B requires accepting Meta’s license on HuggingFace, which can take minutes to hours for approval. We sidestepped this by using NousResearch’s ungated mirror, which contains identical weights and includes the sentencepiece tokenizer that IRON expects.

pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
dest = '/home/ellie/models/llama3.2-1b'
hf_hub_download('NousResearch/Llama-3.2-1B', 'model.safetensors', local_dir=dest)
hf_hub_download('NousResearch/Llama-3.2-1B', 'original/tokenizer.model', local_dir=dest)
"

Two files: model.safetensors (2.4GB) and original/tokenizer.model (2.1MB).

Extra Dependencies#

Beyond the base IRON install from Part 1, the Llama demo needs additional packages:

source ~/npu-env/bin/activate
uv pip install tiktoken safetensors blobfile packaging torchtune==0.6.1 torchao==0.13.0

And one system dependency that wasn’t obvious: IRON’s compilation pipeline hardcodes llvm-objcopy-18 for symbol renaming in kernel archives. On Fedora 43, LLVM 21 is the system version:

sudo dnf install llvm
sudo ln -sf /usr/bin/llvm-objcopy /usr/local/bin/llvm-objcopy-18

This works because llvm-objcopy’s --redefine-sym flag is stable across versions.

First-Time Compilation#

Here’s the part nobody warns you about: the first time you run the Llama demo, IRON compiles every NPU kernel from MLIR source. That means:

  • RoPE (rotary position embeddings) for prefill and decode
  • Fused multi-head attention with grouped query attention
  • GEMM variants for QKV projections, FFN layers, and output
  • GEMV for single-token decode
  • RMSNorm, SiLU activation, element-wise operations

Each kernel gets lowered from MLIR through the AIE compiler (aiecc) into an .xclbin binary that can be loaded onto the NPU. The MHA kernel alone takes several minutes to compile; it’s a 200KB MLIR file describing the full attention pipeline for 32 heads.

Total first-run compilation: about 10 minutes. Subsequent runs skip compilation because the .xclbin files are cached in the build/ directory.
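The cache-or-compile pattern can be sketched in a few lines of Python. This is an illustration of the behavior described above, not IRON's actual implementation; the function name, the `.xclbin`-per-kernel layout, and `compile_fn` (a stand-in for the MLIR-to-aiecc lowering) are all hypothetical.

```python
from pathlib import Path


def get_xclbin(kernel_name: str, build_dir: str, compile_fn):
    """Return a cached .xclbin if one exists, otherwise compile and cache it.

    Illustrative sketch only: IRON's real caching lives inside its build
    pipeline. `compile_fn` stands in for the slow MLIR -> aiecc lowering.
    """
    out = Path(build_dir) / f"{kernel_name}.xclbin"
    if out.exists():
        return out, "cached"  # fast path on every run after the first
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(compile_fn(kernel_name))  # slow first-run path
    return out, "compiled"
```

The practical corollary: deleting the build/ directory forces a full recompile on the next run.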

The prompt_len Gotcha#

This one cost us a few attempts. The --prompt_len parameter doesn’t just truncate your prompt; it determines the tensor shapes that IRON compiles the NPU kernels for. The RoPE operator, for instance, is compiled for exactly prompt_len rows.

If your prompt tokenizes to fewer tokens than prompt_len, you get:

AIEOperatorConstraintError: AIERope: incompatible tensor shape(s)

The fix: make sure your prompt has at least prompt_len tokens. IRON will truncate longer prompts to fit, but it won’t pad shorter ones.
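You can check the constraint before launching a run. Here's a small helper mirroring the behavior described above (truncate long prompts, reject short ones); it's an illustration, not IRON's API, and `PromptLenError` is a stand-in for the real `AIEOperatorConstraintError`.

```python
class PromptLenError(ValueError):
    """Raised when the tokenized prompt is shorter than the compiled prompt_len."""


def fit_prompt(token_ids: list, prompt_len: int) -> list:
    """Mirror IRON's observed behavior: truncate prompts longer than
    prompt_len, but refuse shorter ones (no padding happens)."""
    if len(token_ids) < prompt_len:
        raise PromptLenError(
            f"prompt tokenizes to {len(token_ids)} tokens, "
            f"but kernels were compiled for prompt_len={prompt_len}"
        )
    return token_ids[:prompt_len]
```

With the real tokenizer you would first do something like `token_ids = sp.encode(prompt)` using sentencepiece loaded from original/tokenizer.model, then compare `len(token_ids)` against your `--prompt_len` before paying for a run.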

The Result#

python inference.py \
  /home/ellie/models/llama3.2-1b/model.safetensors \
  /home/ellie/models/llama3.2-1b/original/tokenizer.model \
  --num_tokens 20 \
  --prompt "The meaning of life is a question that has puzzled philosophers and thinkers for centuries throughout human history and beyond" \
  --prompt_len 13

Output:

Starting prefill inference...
The meaning of life is a question that has puzzled philosophers and thinkers
for centuries. Some have argued that it is a question that can only be
answered by a higher power

=======================================================
 TIMING RESULTS:
  Total time: 5.0131 seconds
  Prefill time: 0.6921 seconds
  Tokens generated: 20
  Tokens per second: 4.40
  Time per token: 0.2638 seconds
=======================================================

Coherent text. Generated entirely on the NPU. On Linux.

Putting the Numbers in Context#

4.40 tokens per second from a 1B model on the NPU. How does that compare?

For reference, on the same machine, Qwen3-32B (a 32x larger model) runs at 6-7 tokens per second on the Radeon 8060S GPU via Vulkan with llama.cpp. The NPU is doing a smaller model slower, which tells you that the IRON inference path is not yet optimized for throughput the way llama.cpp is for GPUs.

But that’s not the point. The point is that it works at all.

The NPU validated at 51 TOPS in the GEMM benchmark from Part 1. The theoretical throughput for a 1B model at INT8 should be significantly higher than 4.4 t/s. The gap between theoretical and actual tells us how much room there is for optimization in the software stack: kernel scheduling, memory transfer overhead, pipeline utilization.
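A rough memory-bound estimate puts a number on that gap. Single-token decode has to stream essentially the full weight set per generated token, so tokens/s is bounded by usable memory bandwidth divided by model size in bytes. The figures below are assumptions for illustration (roughly 1.24B parameters, INT8 weights, and an assumed ~50 GB/s of bandwidth effectively available to the NPU), not measured values.

```python
def decode_tps_upper_bound(n_params: float, bytes_per_param: float,
                           usable_bw_gbps: float) -> float:
    """Memory-bandwidth ceiling for single-token decode: each token must
    read (roughly) every weight once."""
    bytes_per_token = n_params * bytes_per_param
    return usable_bw_gbps * 1e9 / bytes_per_token


# Assumed numbers, for illustration only:
bound = decode_tps_upper_bound(1.24e9, 1.0, 50.0)  # ~1.24B params, INT8, ~50 GB/s
print(f"~{bound:.0f} tokens/s memory-bound ceiling vs 4.40 t/s measured")
```

Under those assumptions the ceiling lands around 40 tokens/s, roughly an order of magnitude above what we measured, which is consistent with the bottleneck being software (scheduling and data movement) rather than raw silicon.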

This is where the ecosystem is right now: functional, correct, and waiting for someone to make it fast.

What We Learned#

The full stack, from nothing to Llama inference on an AMD NPU on Linux:

  1. Out-of-tree kernel module (amdxdna v1.0.0, from source)
  2. Development firmware (npu.dev.sbin)
  3. XRT 2.23 (from source, with GCC 15 linker workaround)
  4. pyxrt (from source, for Python 3.13 with runlist API)
  5. IRON + mlir-aie v1.2.0 (pip install)
  6. llvm-objcopy symlink (version mismatch workaround)
  7. Model weights (NousResearch mirror for ungated access)

Seven layers of workarounds to get from “kernel sees the NPU” to “Llama generates text.” Every one of them was solvable, but none of them were documented in one place. Until now.

What’s Next#

The obvious question: can we make it faster? The NPU has the silicon for it. 51 TOPS is a lot of compute for a 1B model. The bottleneck is likely in the data movement between host memory and the NPU, and in how IRON schedules operations across the AIE tile array.

There’s also the question of larger models. Llama 3.2 1B fits comfortably in memory, but could we partition a 3B or 7B model across the NPU and GPU? The hardware is on the same die, sharing the same memory pool. In theory, that’s the whole point of AMD’s APU architecture.

For now, we’ve proven the concept: AMD NPU inference on Linux works. The stack is open source. The documentation exists (you’re reading it). And someone else can pick this up and push it further.

That’s how ecosystems start.


Tested on: AMD Ryzen AI Max+ 395, Fedora 43, kernel 6.18.8, IRON v1.2.0, mlir-aie v1.2.0, XRT 2.23, Llama 3.2 1B (NousResearch mirror).

Next up: Part 3, Where the Time Goes — profiling the inference pipeline to find out what’s actually slow.

References#

  1. IRON Llama 3.2 1B Application (GitHub): The inference demo we ran. github.com/amd/IRON/tree/devel/iron/applications/llama_3.2_1b

  2. NousResearch/Llama-3.2-1B (HuggingFace): Ungated mirror of Meta’s Llama 3.2 1B weights with sentencepiece tokenizer. huggingface.co/NousResearch/Llama-3.2-1B

  3. Part 1: Getting the Hardware to Talk: Driver, firmware, XRT, and IRON setup (the previous post in this series).

  4. Hunhoff, E. et al. “Efficiency, Expressivity, and Extensibility in a Close-to-Metal NPU Programming Interface.” 33rd IEEE International Symposium on Field-Programmable Custom Computing Machines, May 2025. arxiv.org/abs/2504.18430

  5. AMD IRON (GitHub): Open-source Python API for Ryzen AI NPUs. github.com/amd/IRON


A note on how this was made: the bulk of the research, testing, debugging, and writing was done by Ellie, an AI assistant backed by Claude Opus 4.6 (Anthropic) and local models. TC provided the hardware, direction, and editorial guidance. We believe in transparency about AI involvement in technical work.