Introduction to Cactus

Cactus is a hybrid, low-latency, energy-efficient AI engine designed specifically for mobile devices and wearables. It enables on-device AI inference with OpenAI-compatible APIs across all major platforms and programming languages.

What is Cactus?

Cactus provides a complete stack for running AI models on-device:
┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for all major languages
└─────────────────┘     Chat, vision, STT, RAG, tool call, cloud handoff

┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph (PyTorch for mobile)
└─────────────────┘     Custom models, optimised for RAM & quantisation

┌─────────────────┐
│ Cactus Kernels  │ ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc)
└─────────────────┘     Custom attention, KV-cache quant, chunked prefill

Key Benefits

NPU Acceleration

Cactus leverages Neural Processing Units (NPUs) on Apple Silicon for 5-11x faster inference on iOS and macOS devices. Support for Qualcomm, Google, Mediatek, and Exynos NPUs is coming soon.
On an iPhone 17 Pro, Cactus achieves 327 tokens/sec prefill and 48 tokens/sec decode for LFM 1.2B models with only 108MB RAM usage.

Aggressive Quantization

Models are quantized to INT4 by default, with no loss in quality:
  • 1.5x faster inference with hybrid quantization
  • 70-90% smaller model sizes compared to FP16
  • Support for INT4, INT8, and FP16 precision levels
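As a rough illustration of the storage math behind these savings (ignoring per-group quantization scales and metadata, which add a few percent of overhead):

```python
# Rough model-size arithmetic for weight-only quantization.
# A parameter stored in FP16 takes 2 bytes; in INT4, 0.5 bytes.

def weights_size_mb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in megabytes."""
    return num_params * bits_per_weight / 8 / 1e6

params = 1_200_000_000  # a 1.2B-parameter model, like LFM 1.2B

fp16_mb = weights_size_mb(params, 16)  # 2400.0 MB
int4_mb = weights_size_mb(params, 4)   # 600.0 MB
savings = 1 - int4_mb / fp16_mb        # 0.75 -> 75% smaller
```

The 75% figure is for pure INT4 over FP16; mixed-precision layouts (e.g. some layers kept at INT8 or FP16) land anywhere in the quoted 70-90% range.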

Multi-Modal Support

Run multiple AI tasks on the same device:
  • Text Generation: Chat completion with tool calling
  • Vision-Language: Image understanding and captioning
  • Speech-to-Text: Real-time audio transcription with streaming
  • Embeddings: Text, image, and audio embeddings for RAG
  • Voice Activity Detection: Detect speech segments in audio

Cloud Handoff

Cactus automatically detects when local models have low confidence and can hand off to cloud-based models for better results:
{
  "cloud_handoff": true,
  "confidence": 0.18,
  "time_to_first_token_ms": 45.2
}
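A client might branch on this response as sketched below. The threshold value and the helper name are illustrative, not part of the Cactus API:

```python
import json

# Example response metadata from a local inference call (see above).
raw = '{"cloud_handoff": true, "confidence": 0.18, "time_to_first_token_ms": 45.2}'
meta = json.loads(raw)

CONFIDENCE_FLOOR = 0.5  # hypothetical app-specific threshold

def should_escalate(meta: dict, floor: float = CONFIDENCE_FLOOR) -> bool:
    """Trust the engine's handoff flag, or apply your own confidence floor."""
    return meta.get("cloud_handoff", False) or meta.get("confidence", 1.0) < floor

if should_escalate(meta):
    # Re-run the request against a cloud-hosted model of your choice.
    pass
```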

Performance Benchmarks

All weights INT4 quantized, LFM 1.2B model (1k-prefill / 100-decode):
Device             Prefill/Decode TPS   RAM Usage
Mac M4 Pro         582/100              76MB
iPad/Mac M3        350/60               70MB
iPhone 17 Pro      327/48               108MB
Galaxy S25 Ultra   255/37               1.5GB
Raspberry Pi 5     69/11                869MB
Android NPU support (Qualcomm/Google/Mediatek) is coming in March-April 2026. Current Android performance uses CPU kernels.
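To turn the table's throughput figures into end-to-end latency for this workload, divide token counts by tokens/sec, a back-of-envelope estimate that ignores time-to-first-token overhead:

```python
# End-to-end latency for the 1k-prefill / 100-decode workload above,
# derived directly from the table's tokens/sec figures.

def e2e_seconds(prefill_tps: float, decode_tps: float,
                prompt_tokens: int = 1000, output_tokens: int = 100) -> float:
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

iphone = e2e_seconds(327, 48)   # ~3.06s prefill + ~2.08s decode, about 5.1s
m4_pro = e2e_seconds(582, 100)  # ~1.72s prefill + 1.0s decode, about 2.7s
```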

Supported Platforms

Cactus runs on:
  • iOS 14.0+ (iPhone, iPad)
  • macOS 13.0+ (Intel & Apple Silicon)
  • Android API 24+ (arm64-v8a)
  • Linux (Ubuntu, Debian, Raspberry Pi)
  • watchOS & tvOS (via Swift SDK)

Language SDKs

Integrate Cactus using your preferred language:
SDK            Platforms                   Description
C API          All                         Core FFI for maximum performance
Python         Mac, Linux                  Pythonic bindings with streaming
Swift          iOS, macOS, tvOS, watchOS   Native async/await support
Kotlin         Android, iOS (KMP)          Multiplatform mobile development
Flutter        iOS, macOS, Android         Cross-platform mobile apps
Rust           Mac, Linux                  Type-safe bindings via bindgen
React Native   iOS, Android                JavaScript for mobile

Supported Models

Cactus works with popular open-source models:
Model Family   Features                                Sizes
LiquidAI LFM   completion, vision, tools, embeddings   350M - 8B
Google Gemma   completion, tools                       270M - 1B
Qwen           completion, tools, embeddings           600M - 1.7B
Whisper        transcription, speech embed, NPU        tiny - medium
Parakeet       transcription, speech embed, NPU        600M - 1.1B
Moonshine      transcription, speech embed             base
Nomic Embed    text embeddings                         MoE
See the Models page for the complete list of 25+ supported models.

Quick Demo

Get started in 2 steps:
# Step 1: Install Cactus
brew install cactus-compute/cactus/cactus

# Step 2: Run a model
cactus run
The CLI automatically downloads models and opens an interactive playground.

Next Steps

1. Quickstart: Follow the Quickstart to run your first inference in under 5 minutes.
2. Installation: See Installation for detailed setup instructions for your platform.
3. API Reference: Explore the Engine API to understand all available functions.
4. SDK Guides: Check out SDK-specific guides for Python, Swift, or Kotlin.

Architecture Highlights

Cactus Engine

OpenAI-compatible APIs for chat completion, vision, transcription, embeddings, RAG, and tool calling.

Cactus Graph

Zero-copy computation graph optimized for mobile RAM constraints with support for custom models.

Cactus Kernels

Hand-optimized ARM SIMD kernels for:
  • Custom attention mechanisms
  • KV-cache quantization (2x prefill speedup)
  • Chunked prefill (10-token & 1k-token prefills have same decode speed)
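A toy sketch of the chunked-prefill idea (illustrative only, not the actual kernels, which operate on tensors): the prompt is processed in fixed-size chunks, each appending its entries to the KV cache, so peak working memory scales with the chunk size rather than the full prompt length.

```python
# Toy illustration of chunked prefill: process the prompt in fixed-size
# chunks; lists stand in for KV-cache entries here.

CHUNK = 128

def chunked_prefill(prompt_tokens, chunk_size=CHUNK):
    kv_cache = []
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        # Attend over kv_cache + chunk, then append the chunk's KV entries.
        kv_cache.extend(chunk)
    return kv_cache

cache = chunked_prefill(list(range(1000)))
# After prefill, the cache covers every prompt token; decode then proceeds
# one token at a time against the same cache, whatever the prompt length,
# which is why 10-token and 1k-token prefills decode at the same speed.
```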

Community & Support

Cactus is maintained by Cactus Compute, Inc. (YC S25) and leading university AI societies.
Cactus is open-source and actively accepting contributions. See the Contributing Guide to get involved.