
NPU Acceleration

Cactus leverages Neural Processing Units (NPUs) on mobile devices to accelerate inference, delivering 5-11x faster prefill. This guide covers supported NPUs, performance gains, and how to enable or disable NPU backends.

Supported NPUs

Apple Neural Engine (iOS/macOS)

Status: ✅ Shipped (Jan 2026, v1.15+)
Devices:
  • iPhone 12 and newer (A14 Bionic+)
  • iPad Air (4th gen) and newer (M1+)
  • iPad Pro (3rd gen) and newer (A12X+)
  • Mac with Apple Silicon (M1, M2, M3, M4)
  • Apple Watch Series 6+ (S6+)
Supported Models:
  • Vision models: LFM2-VL-450M, LFM2.5-VL-1.6B
  • Speech models: whisper-tiny, whisper-base, whisper-small, whisper-medium, parakeet-ctc-0.6b, parakeet-ctc-1.1b, parakeet-tdt-0.6b-v3
Performance Gains:
  • 5-11x faster prefill on iOS/macOS
  • Vision: 0.2-0.3s first token latency (vs 2-5s CPU)
  • Speech: 0.1-0.7s transcription latency (vs 1-10s CPU)

Qualcomm Hexagon DSP (Android)

Status: 🚧 Coming Soon (Mar 2026, v1.16+)
Devices:
  • Snapdragon 8 Gen 1 and newer
  • Snapdragon 7 Gen 1 and newer (select models)
Expected Performance: 5-11x faster prefill on Android flagships

MediaTek APU (Android)

Status: 📅 Planned (Apr 2026)
Devices:
  • Dimensity 9000 series
  • Dimensity 8000 series

Google Tensor (Android)

Status: 📅 Planned (Mar 2026, v1.16+)
Devices:
  • Pixel 6, 7, 8, 9 series (TPU)

Performance Benchmarks

From the Cactus README benchmarks table:

With Apple NPU

LFM2.5-VL-1.6B (Vision model, 256px input):

| Device         | First Token (NPU) | Decode TPS | RAM   |
|----------------|-------------------|------------|-------|
| Mac M4 Pro     | 0.2s              | 98 tok/s   | 76MB  |
| iPad/Mac M3    | 0.3s              | 69 tok/s   | 70MB  |
| iPhone 17 Pro  | 0.3s              | 48 tok/s   | 108MB |
| iPhone 13 Mini | 0.3s              | 35 tok/s   | 1GB   |

Parakeet-1.1B (Speech model, 30s audio):

| Device         | Transcription (NPU) | Decode TPS  | RAM   |
|----------------|---------------------|-------------|-------|
| Mac M4 Pro     | 0.1s                | 900k+ tok/s | 76MB  |
| iPad/Mac M3    | 0.3s                | 800k+ tok/s | 70MB  |
| iPhone 17 Pro  | 0.3s                | 300k+ tok/s | 108MB |
| iPhone 13 Mini | 0.7s                | 90k+ tok/s  | 1GB   |

LFM 1.2B (Text model, 1k prefill / 100 decode):

| Device         | Prefill TPS | Decode TPS | RAM   |
|----------------|-------------|------------|-------|
| Mac M4 Pro     | 582 tok/s   | 100 tok/s  | 76MB  |
| iPad/Mac M3    | 350 tok/s   | 60 tok/s   | 70MB  |
| iPhone 17 Pro  | 327 tok/s   | 48 tok/s   | 108MB |
| iPhone 13 Mini | 148 tok/s   | 34 tok/s   | 1GB   |

Without NPU (Android)

Note: NPU support coming Mar 2026 for Qualcomm/Google, Apr 2026 for MediaTek.
| Device           | LFM 1.2B (prefill/decode) | RAM   |
|------------------|---------------------------|-------|
| Galaxy S25 Ultra | 255/37 tok/s              | 1.5GB |
| Pixel 6a         | 70/15 tok/s               | 1GB   |
| Galaxy A17 5G    | 32/10 tok/s               | 727MB |
Expected 5-11x prefill speedup once NPU support ships.
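The projection is simple arithmetic. A hypothetical helper (not part of the Cactus API) applied to the measured CPU rates above:

```cpp
#include <utility>

// Projected prefill throughput range from a measured CPU rate and the
// 5-11x speedup range quoted above. A projection, not a benchmark.
std::pair<double, double> projected_prefill_tps(double cpu_tps) {
    return {cpu_tps * 5.0, cpu_tps * 11.0};
}
```

For the Galaxy S25 Ultra's 255 tok/s CPU prefill, this projects roughly 1275-2805 tok/s once NPU support ships.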

How NPU Acceleration Works

Cactus uses a hybrid approach:
  1. NPU for Prefill: Encoder and prefill transformer layers run on the NPU
  2. CPU/SIMD for Decode: Token-by-token generation uses ARM SIMD kernels
  3. Zero-Copy Handoff: Activations pass between NPU and CPU without intermediate copies

Why Hybrid?

  • NPU excels at batch processing: prefill processes many tokens at once
  • CPU excels at autoregressive decode: single-token generation is memory-bound
  • Best of both worlds: 5-11x faster prefill with no decode slowdown
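The hybrid flow can be sketched as follows. This is an illustrative sketch, not the Cactus implementation; npu_prefill_chunk and cpu_decode_one_token are hypothetical stand-ins for the real CoreML and ARM SIMD backends:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the NPU backend: prefills one chunk of tokens.
static void npu_prefill_chunk(const std::vector<int>& /*tokens*/,
                              std::size_t /*begin*/, std::size_t /*count*/) {}

// Hypothetical stand-in for the CPU SIMD decode kernel: emits one token.
static int cpu_decode_one_token() { return 0; }

// Hybrid generation: prefill runs on the NPU in fixed-size chunks,
// then decode runs token-by-token on the CPU.
// Returns the number of NPU chunk dispatches, for illustration.
std::size_t generate(const std::vector<int>& prompt,
                     std::size_t chunk_size, std::size_t max_new_tokens) {
    std::size_t chunks = 0;
    for (std::size_t i = 0; i < prompt.size(); i += chunk_size) {
        npu_prefill_chunk(prompt, i, std::min(chunk_size, prompt.size() - i));
        ++chunks;
    }
    for (std::size_t i = 0; i < max_new_tokens; ++i) {
        (void)cpu_decode_one_token();  // a real engine streams this token to the caller
    }
    return chunks;
}
```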

Enabling NPU Backend

Automatic (Default)

NPU is automatically enabled if:
  1. Device has compatible NPU hardware
  2. Model supports NPU acceleration
  3. Cactus runtime includes NPU support
No configuration needed; it just works!

Checking NPU Availability

#include "cactus/npu/npu.h"

if (cactus::npu::is_npu_available()) {
    printf("NPU acceleration available\n");
} else {
    printf("NPU not available, using CPU fallback\n");
}

Loading NPU Prefill Model

#include "cactus/engine/engine.h"

cactus::Engine* engine = /* ... load model ... */;

// Load NPU-optimized prefill weights
if (engine->load_npu_prefill("./model/npu_prefill.mlmodelc")) {
    printf("NPU prefill loaded, chunk size: %zu\n", 
           engine->get_prefill_chunk_size());
} else {
    printf("NPU prefill not available, using CPU\n");
}

Disabling NPU Backend

Currently, NPU is used automatically when available; a manual disable is not exposed in the public API. NPU acceleration provides significant speedups with no quality loss, so disabling it would only hurt performance. In the future, advanced users may get a config option to force CPU-only mode for debugging.

Configuration Options

Prefill Chunk Size

NPU prefill processes input in chunks (default: 256 tokens).
// Get current chunk size
size_t chunk_size = engine->get_prefill_chunk_size();
printf("Prefill chunk size: %zu\n", chunk_size);

// Chunk size is determined by NPU model architecture
// Not configurable at runtime
Chunk size affects:
  • Memory usage: larger chunks use more NPU memory
  • Latency: smaller chunks add per-chunk dispatch overhead
  • Throughput: 256 is optimal for most models
The chunk size is baked into the NPU model file (.mlmodelc) and cannot be changed at runtime. It's optimized for each model during conversion.
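The chunking arithmetic is simple ceiling division. A hypothetical helper (not part of the Cactus API) makes the trade-offs above concrete:

```cpp
#include <cstddef>

// Number of NPU dispatches needed to prefill `prompt_len` tokens
// when the NPU model processes fixed chunks of `chunk_size` tokens.
std::size_t num_prefill_chunks(std::size_t prompt_len, std::size_t chunk_size) {
    return (prompt_len + chunk_size - 1) / chunk_size;  // ceiling division
}

// Token slots wasted (padded) in the final, partially filled chunk.
std::size_t final_chunk_padding(std::size_t prompt_len, std::size_t chunk_size) {
    std::size_t rem = prompt_len % chunk_size;
    return rem == 0 ? 0 : chunk_size - rem;
}
```

For a 1,000-token prompt with the default 256-token chunks, num_prefill_chunks(1000, 256) is 4 and final_chunk_padding(1000, 256) is 24 wasted slots.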

KV Cache Window

NPU prefill respects KV cache window settings:
// Set sliding window cache (e.g., 2048 tokens)
engine->set_cache_window(2048, 4);  // window_size, sink_size

// NPU prefill will process chunks up to window limit
See Performance Tuning for KV cache configuration.
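For intuition, the retention rule of a windowed cache with sinks can be written out directly. This assumes the common attention-sink scheme (keep the first sink_size tokens plus the most recent window_size - sink_size tokens); the exact Cactus eviction policy may differ in detail:

```cpp
#include <cstddef>

// Whether token position `pos` is still resident in a sliding-window KV
// cache with attention sinks, given `total` tokens seen so far.
// Sketch of the common "sink + recent window" scheme, not Cactus internals.
bool in_cache(std::size_t pos, std::size_t total,
              std::size_t window_size, std::size_t sink_size) {
    if (pos < sink_size) return true;         // sink tokens are pinned
    std::size_t recent = window_size - sink_size;  // budget for recent tokens
    return total <= window_size || pos >= total - recent;
}
```

With window_size 2048 and sink_size 4, after 3000 tokens the cache holds positions 0-3 plus the most recent 2044 positions; everything in between has been evicted.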

Platform Requirements

iOS Requirements

  • iOS 14.0+ for Neural Engine support
  • iOS 15.0+ recommended (better ANE APIs)
  • A12 Bionic or newer (iPhone XS, XR, and later)

macOS Requirements

  • macOS 11.0+ (Big Sur) for Apple Silicon
  • M1 or newer (M1, M2, M3, M4)
  • Intel Macs not supported (no Neural Engine)

Android Requirements (Coming Soon)

  • Android API 29+ (Android 10) for Hexagon DSP
  • Snapdragon 8 Gen 1+ or Dimensity 9000+
  • NNAPI or QNN runtime installed

Model Compatibility

NPU-Accelerated Models

From the Supported Models table:

Vision Models:
  • LiquidAI/LFM2-VL-450M - Vision, txt & img embed, Apple NPU
  • LiquidAI/LFM2.5-VL-1.6B - Vision, txt & img embed, Apple NPU
Speech Models:
  • openai/whisper-tiny - Transcription, speech embed, Apple NPU
  • openai/whisper-base - Transcription, speech embed, Apple NPU
  • openai/whisper-small - Transcription, speech embed, Apple NPU
  • openai/whisper-medium - Transcription, speech embed, Apple NPU
  • nvidia/parakeet-ctc-0.6b - Transcription, speech embed, Apple NPU
  • nvidia/parakeet-ctc-1.1b - Transcription, speech embed, Apple NPU
  • nvidia/parakeet-tdt-0.6b-v3 - Transcription, speech embed, Apple NPU
Text Models:
  • Most LLMs use NPU for prefill (when available)
  • Decode always uses CPU SIMD kernels

Models Without NPU Support

  • google/gemma-3-* - CPU-only (for now)
  • Qwen/Qwen3-* - CPU-only (for now)
  • Embedding-only models - CPU-only
  • VAD models - CPU-only
Custom fine-tuned models inherit NPU support from their base model. If the base model supports NPU, your fine-tune will too; no extra work required!

Troubleshooting

NPU Not Available

Check device compatibility:
if (!cactus::npu::is_npu_available()) {
    // Device doesn't have compatible NPU, or
    // NPU support not compiled into runtime
}
Common reasons:
  • Older device (pre-A12 iPhone, Intel Mac)
  • Android device (NPU coming Mar 2026)
  • Simulator build (NPU only on physical devices)

Slow Despite NPU

  1. Check that the model supports NPU: not all models have NPU variants
  2. Verify NPU prefill loaded: call load_npu_prefill() and check its return value
  3. Measure properly: use the --benchmark flag for accurate timing
  4. Rule out thermal throttling: the device may throttle under sustained load
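If the --benchmark flag is not an option, wall-clock timing gives a rough tokens-per-second figure. The helper below is a generic timing sketch, not Cactus API; in practice you would pass a lambda that invokes the real prefill call:

```cpp
#include <chrono>
#include <cstddef>

// Measure tokens-per-second for any callable that processes `n_tokens`.
// `work` stands in for the real engine call (e.g. a lambda wrapping prefill).
template <typename F>
double measure_tps(F&& work, std::size_t n_tokens) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return secs > 0.0 ? static_cast<double>(n_tokens) / secs : 0.0;
}
```

Time a full prompt, not a single chunk, so per-chunk dispatch overhead is included in the measurement.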

Memory Issues

NPU models use additional memory:
  • Vision models: +100-200MB for NPU weights
  • Speech models: +50-100MB for NPU weights
  • Text models: +20-50MB for NPU prefill cache
If running out of memory:
  1. Reduce KV cache window size
  2. Use smaller model variant
  3. Disable NPU prefill (not recommended)

Roadmap

Q1 2026 (Shipped)

  • ✅ Apple Neural Engine support (Jan 2026)
  • ✅ Vision model NPU acceleration
  • ✅ Speech model NPU acceleration
  • ✅ 5-11x prefill speedup on iOS/macOS

Q1-Q2 2026 (Coming)

  • 🚧 Qualcomm Hexagon DSP (Mar 2026)
  • 🚧 Google Tensor TPU (Mar 2026)
  • 📅 MediaTek APU (Apr 2026)
  • 📅 Samsung Exynos NPU (Apr 2026)

Q2-Q3 2026 (Planned)

  • 📅 Mac GPU acceleration (May 2026)
  • 📅 Wearables optimizations (Jul 2026)
  • 📅 Additional model NPU support

See Also