Introduction to Cactus
Cactus is a hybrid, low-latency, energy-efficient AI engine designed specifically for mobile devices and wearables. It enables on-device AI inference with OpenAI-compatible APIs across all major platforms and programming languages.
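Here, "OpenAI-compatible" means requests and responses follow the familiar OpenAI chat-completions shape. A minimal sketch of that shape using plain Python dicts (the model name and assistant text are placeholders for illustration, not Cactus output):

```python
# Shape of an OpenAI-style chat-completion exchange, shown as plain
# dicts. This is the generic OpenAI chat format -- the model name and
# assistant text below are placeholders, not Cactus output.

request = {
    "model": "lfm-1.2b",  # hypothetical local model identifier
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize on-device inference in one line."},
    ],
    "stream": False,
}

# A compatible engine answers with the familiar `choices` structure:
response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "Models run locally: fast, private, offline."}}
    ]
}

answer = response["choices"][0]["message"]["content"]
print(answer)
```

Because the shapes match, code written against the OpenAI client conventions can target a local engine with minimal changes.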
What is Cactus?
Cactus provides a complete stack for running AI models on-device.
Key Benefits
NPU Acceleration
Cactus leverages Neural Processing Units (NPUs) on Apple Silicon for 5-11x faster inference on iOS and macOS devices. Support for Qualcomm, Google, MediaTek, and Exynos NPUs is coming soon.
On an iPhone 17 Pro, Cactus achieves 327 tokens/sec prefill and 48 tokens/sec decode on LFM 1.2B with only 108MB of RAM.
Aggressive Quantization
Models are quantized to INT4 by default with lossless quality:
- 1.5x faster inference with hybrid quantization
- 70-90% smaller model sizes compared to FP16
- Support for INT4, INT8, and FP16 precision levels
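As a rough illustration of what INT4 quantization does, here is a toy symmetric-quantization sketch in plain Python (Cactus's real kernels use smarter packing and scaling; the weight values are made up):

```python
# Toy symmetric INT4 quantization of one weight row. Illustrative math
# only -- a production engine packs and scales weights far more cleverly.

def quantize_int4(weights):
    """Map floats to integers in [-7, 7] with a per-row scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

row = [0.12, -0.7, 0.33, 0.08, -0.21]
q, scale = quantize_int4(row)
approx = dequantize(q, scale)

# INT4 packs two weights per byte vs two bytes per FP16 weight,
# i.e. roughly a 4x size reduction before scale overhead.
fp16_bytes = 2 * len(row)
int4_bytes = (len(row) + 1) // 2
print(q, f"scale={scale:.3f}", f"{fp16_bytes}B -> {int4_bytes}B")
```

The per-row scale bounds the rounding error at half a quantization step, which is why small models tolerate INT4 so well.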
Multi-Modal Support
Run multiple AI tasks on the same device:
- Text Generation: Chat completion with tool calling
- Vision-Language: Image understanding and captioning
- Speech-to-Text: Real-time audio transcription with streaming
- Embeddings: Text, image, and audio embeddings for RAG
- Voice Activity Detection: Detect speech segments in audio
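The embeddings above enable RAG-style retrieval. A minimal sketch of cosine-similarity ranking (the vectors are invented for illustration; in practice they would come from one of the embedding models):

```python
# Minimal RAG-style retrieval: rank documents by cosine similarity of
# their embeddings. The vectors here are invented for illustration; in
# practice they would come from an embedding model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

docs = {
    "battery tips": [0.9, 0.1, 0.0],
    "recipe ideas": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "how do I save battery?"

best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)
```

Running the ranking on-device means private documents never leave the phone.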
Cloud Handoff
Cactus automatically detects when local models have low confidence and can hand off to cloud-based models for better results.
Performance Benchmarks
All weights INT4-quantized, LFM 1.2B model (1k-token prefill / 100-token decode):
| Device | Prefill/Decode TPS | RAM Usage |
|---|---|---|
| Mac M4 Pro | 582/100 | 76MB |
| iPad/Mac M3 | 350/60 | 70MB |
| iPhone 17 Pro | 327/48 | 108MB |
| Galaxy S25 Ultra | 255/37 | 1.5GB |
| Raspberry Pi 5 | 69/11 | 869MB |
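The throughput figures above translate directly into wall-clock latency. For example, using the iPhone 17 Pro row for the 1k-prefill / 100-decode workload:

```python
# Wall-clock latency implied by a table row: prefill and decode each
# contribute tokens / tokens-per-second. iPhone 17 Pro figures above.
prefill_tokens, decode_tokens = 1000, 100
prefill_tps, decode_tps = 327, 48

latency_s = prefill_tokens / prefill_tps + decode_tokens / decode_tps
print(f"~{latency_s:.1f}s total")  # ~3.1s prefill + ~2.1s decode
```

The same arithmetic applied to any row shows why decode speed, not prefill, dominates for short prompts with long answers.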
Supported Platforms
Cactus runs on:
- iOS 14.0+ (iPhone, iPad)
- macOS 13.0+ (Intel & Apple Silicon)
- Android API 24+ (arm64-v8a)
- Linux (Ubuntu, Debian, Raspberry Pi)
- watchOS & tvOS (via Swift SDK)
Language SDKs
Integrate Cactus using your preferred language:
| SDK | Platforms | Description |
|---|---|---|
| C API | All | Core FFI for maximum performance |
| Python | Mac, Linux | Pythonic bindings with streaming |
| Swift | iOS, macOS, tvOS, watchOS | Native async/await support |
| Kotlin | Android, iOS (KMP) | Multiplatform mobile development |
| Flutter | iOS, macOS, Android | Cross-platform mobile apps |
| Rust | Mac, Linux | Type-safe bindings via bindgen |
| React Native | iOS, Android | JavaScript for mobile |
Supported Models
Cactus works with popular open-source models:
| Model Family | Features | Sizes |
|---|---|---|
| LiquidAI LFM | completion, vision, tools, embeddings | 350M - 8B |
| Google Gemma | completion, tools | 270M - 1B |
| Qwen | completion, tools, embeddings | 600M - 1.7B |
| Whisper | transcription, speech embed, NPU | tiny - medium |
| Parakeet | transcription, speech embed, NPU | 600M - 1.1B |
| Moonshine | transcription, speech embed | base |
| Nomic Embed | text embeddings | MoE |
See the Models page for the complete list of 25+ supported models.
Quick Demo
Get started in 2 steps.
Next Steps
Quickstart
Follow the Quickstart to run your first inference in under 5 minutes.
Installation
See Installation for detailed setup instructions for your platform.
API Reference
Explore the Engine API to understand all available functions.
Architecture Highlights
Cactus Engine
OpenAI-compatible APIs for chat completion, vision, transcription, embeddings, RAG, and tool calling.
Cactus Graph
Zero-copy computation graph optimized for mobile RAM constraints, with support for custom models.
Cactus Kernels
Hand-optimized ARM SIMD kernels for:
- Custom attention mechanisms
- KV-cache quantization (2x prefill speedup)
- Chunked prefill (10-token and 1k-token prefills reach the same decode speed)
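As a toy illustration of chunked prefill, this sketch splits a long prompt into fixed-size chunks processed one at a time, so peak memory stays flat regardless of prompt length (the chunk size and `process_chunk` stand-in are assumptions, not Cactus internals):

```python
# Toy chunked prefill: feed a long prompt to the model in fixed-size
# chunks so peak memory stays flat regardless of prompt length.
# `process_chunk` and the chunk size are stand-ins, not Cactus internals.

processed = []

def process_chunk(chunk):
    # A real engine would run a forward pass and update the KV cache here.
    processed.append(len(chunk))

def chunked_prefill(tokens, chunk_size=128):
    for start in range(0, len(tokens), chunk_size):
        process_chunk(tokens[start:start + chunk_size])

chunked_prefill(list(range(1000)))
print(processed)  # seven full chunks, then the 104-token remainder
```

Because each chunk's activations are the same size, the KV cache grows with the prompt but working memory does not.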
Community & Support
Cactus is maintained by Cactus Compute, Inc. (YC S25) and leading university AI societies.
Cactus is open-source and actively accepting contributions. See the Contributing Guide to get involved.