
NPU Acceleration

Cactus leverages Neural Processing Units (NPUs) on mobile devices to accelerate inference, delivering 5-11x faster prefill. This guide covers supported NPUs, performance gains, and how to enable or disable NPU backends.

Supported NPUs

Apple Neural Engine (iOS/macOS)

Status: ✅ Shipped (Jan 2026, v1.15+)
Devices:
  • iPhone 12 and newer (A14 Bionic+)
  • iPad Air (4th gen) and newer (M1+)
  • iPad Pro (3rd gen) and newer (A12X+)
  • Mac with Apple Silicon (M1, M2, M3, M4)
  • Apple Watch Series 6+ (S6+)
Supported Models:
  • Vision models: LFM2-VL-450M, LFM2.5-VL-1.6B
  • Speech models: whisper-tiny, whisper-base, whisper-small, whisper-medium, parakeet-ctc-0.6b, parakeet-ctc-1.1b, parakeet-tdt-0.6b-v3
Performance Gains:
  • 5-11x faster prefill on iOS/macOS
  • Vision: 0.2-0.3s first token latency (vs 2-5s CPU)
  • Speech: 0.1-0.7s transcription latency (vs 1-10s CPU)

Qualcomm Hexagon DSP (Android)

Status: 🚧 Coming Soon (Mar 2026, v1.16+)
Devices:
  • Snapdragon 8 Gen 1 and newer
  • Snapdragon 7 Gen 1 and newer (select models)
Expected Performance: 5-11x faster prefill on Android flagships

MediaTek APU (Android)

Status: 📅 Planned (Apr 2026)
Devices:
  • Dimensity 9000 series
  • Dimensity 8000 series

Google Tensor (Android)

Status: 📅 Planned (Mar 2026, v1.16+)
Devices:
  • Pixel 6, 7, 8, 9 series (TPU)

Performance Benchmarks

From the Cactus README benchmarks table:

With Apple NPU

LFM2.5-VL-1.6B (Vision model, 256px input):

| Device         | First Token (NPU) | Decode TPS | RAM   |
|----------------|-------------------|------------|-------|
| Mac M4 Pro     | 0.2s              | 98 tok/s   | 76MB  |
| iPad/Mac M3    | 0.3s              | 69 tok/s   | 70MB  |
| iPhone 17 Pro  | 0.3s              | 48 tok/s   | 108MB |
| iPhone 13 Mini | 0.3s              | 35 tok/s   | 1GB   |

Parakeet-1.1B (Speech model, 30s audio):

| Device         | Transcription (NPU) | Decode TPS  | RAM   |
|----------------|---------------------|-------------|-------|
| Mac M4 Pro     | 0.1s                | 900k+ tok/s | 76MB  |
| iPad/Mac M3    | 0.3s                | 800k+ tok/s | 70MB  |
| iPhone 17 Pro  | 0.3s                | 300k+ tok/s | 108MB |
| iPhone 13 Mini | 0.7s                | 90k+ tok/s  | 1GB   |

LFM 1.2B (Text model, 1k prefill / 100 decode):

| Device         | Prefill TPS | Decode TPS | RAM   |
|----------------|-------------|------------|-------|
| Mac M4 Pro     | 582 tok/s   | 100 tok/s  | 76MB  |
| iPad/Mac M3    | 350 tok/s   | 60 tok/s   | 70MB  |
| iPhone 17 Pro  | 327 tok/s   | 48 tok/s   | 108MB |
| iPhone 13 Mini | 148 tok/s   | 34 tok/s   | 1GB   |

Without NPU (Android)

Note: NPU support coming Mar 2026 for Qualcomm/Google, Apr 2026 for MediaTek.
| Device           | LFM 1.2B (prefill/decode) | RAM   |
|------------------|---------------------------|-------|
| Galaxy S25 Ultra | 255/37 tok/s              | 1.5GB |
| Pixel 6a         | 70/15 tok/s               | 1GB   |
| Galaxy A17 5G    | 32/10 tok/s               | 727MB |
Expected 5-11x prefill speedup once NPU support ships.
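The projection is simple arithmetic. A hypothetical helper (not part of the Cactus API) applied to the measured CPU rates above:

```cpp
#include <utility>

// Projected prefill throughput range from a measured CPU rate and the
// 5-11x speedup range quoted above. A projection, not a benchmark.
std::pair<double, double> projected_prefill_tps(double cpu_tps) {
    return {cpu_tps * 5.0, cpu_tps * 11.0};
}
```

For the Galaxy S25 Ultra's 255 tok/s CPU prefill, this projects roughly 1275-2805 tok/s once NPU support ships.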

How NPU Acceleration Works

Cactus uses a hybrid approach:
  1. NPU for Prefill: Encoder and prefill transformer layers run on the NPU
  2. CPU/SIMD for Decode: Token-by-token generation uses ARM SIMD kernels
  3. Zero-Copy Handoff: Activations pass between NPU and CPU without intermediate copies

Why Hybrid?

  • NPU excels at batch processing: prefill processes many tokens at once
  • CPU excels at autoregressive decode: single-token generation is memory-bound
  • Best of both worlds: 5-11x faster prefill with no decode slowdown
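The hybrid flow can be sketched as follows. This is an illustrative sketch, not the Cactus implementation; npu_prefill_chunk and cpu_decode_one_token are hypothetical stand-ins for the real CoreML and ARM SIMD backends:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the NPU backend: prefills one chunk of tokens.
static void npu_prefill_chunk(const std::vector<int>& /*tokens*/,
                              std::size_t /*begin*/, std::size_t /*count*/) {}

// Hypothetical stand-in for the CPU SIMD decode kernel: emits one token.
static int cpu_decode_one_token() { return 0; }

// Hybrid generation: prefill runs on the NPU in fixed-size chunks,
// then decode runs token-by-token on the CPU.
// Returns the number of NPU chunk dispatches, for illustration.
std::size_t generate(const std::vector<int>& prompt,
                     std::size_t chunk_size, std::size_t max_new_tokens) {
    std::size_t chunks = 0;
    for (std::size_t i = 0; i < prompt.size(); i += chunk_size) {
        npu_prefill_chunk(prompt, i, std::min(chunk_size, prompt.size() - i));
        ++chunks;
    }
    for (std::size_t i = 0; i < max_new_tokens; ++i) {
        (void)cpu_decode_one_token();  // a real engine streams this token to the caller
    }
    return chunks;
}
```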

Enabling NPU Backend

Automatic (Default)

NPU is automatically enabled if:
  1. Device has compatible NPU hardware
  2. Model supports NPU acceleration
  3. Cactus runtime includes NPU support
No configuration needed; it just works!

Checking NPU Availability

#include "cactus/npu/npu.h"

if (cactus::npu::is_npu_available()) {
    printf("NPU acceleration available\n");
} else {
    printf("NPU not available, using CPU fallback\n");
}

Loading NPU Prefill Model

#include "cactus/engine/engine.h"

cactus::Engine* engine = /* ... load model ... */;

// Load NPU-optimized prefill weights
if (engine->load_npu_prefill("./model/npu_prefill.mlmodelc")) {
    printf("NPU prefill loaded, chunk size: %zu\n", 
           engine->get_prefill_chunk_size());
} else {
    printf("NPU prefill not available, using CPU\n");
}

Disabling NPU Backend

Currently, NPU is used automatically when available; a manual disable is not exposed in the public API. NPU acceleration provides significant speedups with no quality loss, so disabling it would only hurt performance. In the future, advanced users may get a config option to force CPU-only mode for debugging.

Configuration Options

Prefill Chunk Size

NPU prefill processes input in chunks (default: 256 tokens).
// Get current chunk size
size_t chunk_size = engine->get_prefill_chunk_size();
printf("Prefill chunk size: %zu\n", chunk_size);

// Chunk size is determined by NPU model architecture
// Not configurable at runtime
Chunk size affects:
  • Memory usage: larger chunks use more NPU memory
  • Latency: smaller chunks add per-chunk dispatch overhead
  • Throughput: 256 is optimal for most models
The chunk size is baked into the NPU model file (.mlmodelc) and cannot be changed at runtime. It's optimized for each model during conversion.
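The chunking arithmetic is simple ceiling division. A hypothetical helper (not part of the Cactus API) makes the trade-offs above concrete:

```cpp
#include <cstddef>

// Number of NPU dispatches needed to prefill `prompt_len` tokens
// when the NPU model processes fixed chunks of `chunk_size` tokens.
std::size_t num_prefill_chunks(std::size_t prompt_len, std::size_t chunk_size) {
    return (prompt_len + chunk_size - 1) / chunk_size;  // ceiling division
}

// Token slots wasted (padded) in the final, partially filled chunk.
std::size_t final_chunk_padding(std::size_t prompt_len, std::size_t chunk_size) {
    std::size_t rem = prompt_len % chunk_size;
    return rem == 0 ? 0 : chunk_size - rem;
}
```

For a 1,000-token prompt with the default 256-token chunks, num_prefill_chunks(1000, 256) is 4 and final_chunk_padding(1000, 256) is 24 wasted slots.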

KV Cache Window

NPU prefill respects KV cache window settings:
// Set sliding window cache (e.g., 2048 tokens)
engine->set_cache_window(2048, 4);  // window_size, sink_size

// NPU prefill will process chunks up to window limit
See Performance Tuning for KV cache configuration.
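For intuition, the retention rule of a windowed cache with sinks can be written out directly. This assumes the common attention-sink scheme (keep the first sink_size tokens plus the most recent window_size - sink_size tokens); the exact Cactus eviction policy may differ in detail:

```cpp
#include <cstddef>

// Whether token position `pos` is still resident in a sliding-window KV
// cache with attention sinks, given `total` tokens seen so far.
// Sketch of the common "sink + recent window" scheme, not Cactus internals.
bool in_cache(std::size_t pos, std::size_t total,
              std::size_t window_size, std::size_t sink_size) {
    if (pos < sink_size) return true;         // sink tokens are pinned
    std::size_t recent = window_size - sink_size;  // budget for recent tokens
    return total <= window_size || pos >= total - recent;
}
```

With window_size 2048 and sink_size 4, after 3000 tokens the cache holds positions 0-3 plus the most recent 2044 positions; everything in between has been evicted.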

Platform Requirements

iOS Requirements

  • iOS 14.0+ for Neural Engine support
  • iOS 15.0+ recommended (better ANE APIs)
  • A12 Bionic or newer (iPhone XS, XR, and later)

macOS Requirements

  • macOS 11.0+ (Big Sur) for Apple Silicon
  • M1 or newer (M1, M2, M3, M4)
  • Intel Macs not supported (no Neural Engine)

Android Requirements (Coming Soon)

  • Android API 29+ (Android 10) for Hexagon DSP
  • Snapdragon 8 Gen 1+ or Dimensity 9000+
  • NNAPI or QNN runtime installed

Model Compatibility

NPU-Accelerated Models

From the Supported Models table:

Vision Models:
  • LiquidAI/LFM2-VL-450M - Vision, txt & img embed, Apple NPU
  • LiquidAI/LFM2.5-VL-1.6B - Vision, txt & img embed, Apple NPU
Speech Models:
  • openai/whisper-tiny - Transcription, speech embed, Apple NPU
  • openai/whisper-base - Transcription, speech embed, Apple NPU
  • openai/whisper-small - Transcription, speech embed, Apple NPU
  • openai/whisper-medium - Transcription, speech embed, Apple NPU
  • nvidia/parakeet-ctc-0.6b - Transcription, speech embed, Apple NPU
  • nvidia/parakeet-ctc-1.1b - Transcription, speech embed, Apple NPU
  • nvidia/parakeet-tdt-0.6b-v3 - Transcription, speech embed, Apple NPU
Text Models:
  • Most LLMs use NPU for prefill (when available)
  • Decode always uses CPU SIMD kernels

Models Without NPU Support

  • google/gemma-3-* - CPU-only (for now)
  • Qwen/Qwen3-* - CPU-only (for now)
  • Embedding-only models - CPU-only
  • VAD models - CPU-only
Custom fine-tuned models inherit NPU support from their base model. If the base model supports NPU, your fine-tune will too; no extra work required!

Troubleshooting

NPU Not Available

Check device compatibility:
if (!cactus::npu::is_npu_available()) {
    // Device doesn't have compatible NPU, or
    // NPU support not compiled into runtime
}
Common reasons:
  • Older device (pre-A12 iPhone, Intel Mac)
  • Android device (NPU coming Mar 2026)
  • Simulator build (NPU only on physical devices)

Slow Despite NPU

  1. Check that the model supports NPU: not all models have NPU variants
  2. Verify NPU prefill loaded: call load_npu_prefill() and check its return value
  3. Measure properly: use the --benchmark flag for accurate timing
  4. Rule out thermal throttling: the device may throttle under sustained load
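If the --benchmark flag is not an option, wall-clock timing gives a rough tokens-per-second figure. The helper below is a generic timing sketch, not Cactus API; in practice you would pass a lambda that invokes the real prefill call:

```cpp
#include <chrono>
#include <cstddef>

// Measure tokens-per-second for any callable that processes `n_tokens`.
// `work` stands in for the real engine call (e.g. a lambda wrapping prefill).
template <typename F>
double measure_tps(F&& work, std::size_t n_tokens) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return secs > 0.0 ? static_cast<double>(n_tokens) / secs : 0.0;
}
```

Time a full prompt, not a single chunk, so per-chunk dispatch overhead is included in the measurement.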

Memory Issues

NPU models use additional memory:
  • Vision models: +100-200MB for NPU weights
  • Speech models: +50-100MB for NPU weights
  • Text models: +20-50MB for NPU prefill cache
If running out of memory:
  1. Reduce KV cache window size
  2. Use smaller model variant
  3. Disable NPU prefill (not recommended)

Roadmap

Q1 2026 (Shipped)

  • ✅ Apple Neural Engine support (Jan 2026)
  • ✅ Vision model NPU acceleration
  • ✅ Speech model NPU acceleration
  • ✅ 5-11x prefill speedup on iOS/macOS

Q1-Q2 2026 (Coming)

  • 🚧 Qualcomm Hexagon DSP (Mar 2026)
  • 🚧 Google Tensor TPU (Mar 2026)
  • 📅 MediaTek APU (Apr 2026)
  • 📅 Samsung Exynos NPU (Apr 2026)

Q2-Q3 2026 (Planned)

  • 📅 Mac GPU acceleration (May 2026)
  • 📅 Wearables optimizations (Jul 2026)
  • 📅 Additional model NPU support

See Also