Introduction to Cactus

Cactus is a hybrid, low-latency, energy-efficient AI engine designed specifically for mobile devices and wearables. It enables on-device AI inference with OpenAI-compatible APIs across all major platforms and programming languages.

What is Cactus?

Cactus provides a complete stack for running AI models on-device:
┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for all major languages
└─────────────────┘     Chat, vision, STT, RAG, tool call, cloud handoff

┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph (PyTorch for mobile)
└─────────────────┘     Custom models, optimised for RAM & quantisation

┌─────────────────┐
│ Cactus Kernels  │ ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc)
└─────────────────┘     Custom attention, KV-cache quant, chunked prefill

Key Benefits

NPU Acceleration

Cactus leverages Neural Processing Units (NPUs) on Apple Silicon for 5-11x faster inference on iOS and macOS devices. Support for Qualcomm, Google, Mediatek, and Exynos NPUs is coming soon.
On an iPhone 17 Pro, Cactus achieves 327 tokens/sec prefill and 48 tokens/sec decode for LFM 1.2B models with only 108MB RAM usage.

Aggressive Quantization

Models are quantized to INT4 by default, with no loss in quality:
  • 1.5x faster inference with hybrid quantization
  • 70-90% smaller model sizes compared to FP16
  • Support for INT4, INT8, and FP16 precision levels
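As a rough illustration of the storage math behind these savings (ignoring per-group quantization scales and metadata, which add a few percent of overhead):

```python
# Rough model-size arithmetic for weight-only quantization.
# A parameter stored in FP16 takes 2 bytes; in INT4, 0.5 bytes.

def weights_size_mb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in megabytes."""
    return num_params * bits_per_weight / 8 / 1e6

params = 1_200_000_000  # a 1.2B-parameter model, like LFM 1.2B

fp16_mb = weights_size_mb(params, 16)  # 2400.0 MB
int4_mb = weights_size_mb(params, 4)   # 600.0 MB
savings = 1 - int4_mb / fp16_mb        # 0.75 -> 75% smaller
```

The 75% figure is for pure INT4 over FP16; mixed-precision layouts (e.g. some layers kept at INT8 or FP16) land anywhere in the quoted 70-90% range.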

Multi-Modal Support

Run multiple AI tasks on the same device:
  • Text Generation: Chat completion with tool calling
  • Vision-Language: Image understanding and captioning
  • Speech-to-Text: Real-time audio transcription with streaming
  • Embeddings: Text, image, and audio embeddings for RAG
  • Voice Activity Detection: Detect speech segments in audio

Cloud Handoff

Cactus automatically detects when local models have low confidence and can hand off to cloud-based models for better results:
{
  "cloud_handoff": true,
  "confidence": 0.18,
  "time_to_first_token_ms": 45.2
}
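A client might branch on this response as sketched below. The threshold value and the helper name are illustrative, not part of the Cactus API:

```python
import json

# Example response metadata from a local inference call (see above).
raw = '{"cloud_handoff": true, "confidence": 0.18, "time_to_first_token_ms": 45.2}'
meta = json.loads(raw)

CONFIDENCE_FLOOR = 0.5  # hypothetical app-specific threshold

def should_escalate(meta: dict, floor: float = CONFIDENCE_FLOOR) -> bool:
    """Trust the engine's handoff flag, or apply your own confidence floor."""
    return meta.get("cloud_handoff", False) or meta.get("confidence", 1.0) < floor

if should_escalate(meta):
    # Re-run the request against a cloud-hosted model of your choice.
    pass
```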

Performance Benchmarks

All weights INT4 quantized, LFM 1.2B model (1k-prefill / 100-decode):
Device             Prefill/Decode TPS   RAM Usage
Mac M4 Pro         582/100              76MB
iPad/Mac M3        350/60               70MB
iPhone 17 Pro      327/48               108MB
Galaxy S25 Ultra   255/37               1.5GB
Raspberry Pi 5     69/11                869MB
Android NPU support (Qualcomm/Google/Mediatek) is coming in March-April 2026. Current Android performance uses CPU kernels.
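To turn the table's throughput figures into end-to-end latency for this workload, divide token counts by tokens/sec, a back-of-envelope estimate that ignores time-to-first-token overhead:

```python
# End-to-end latency for the 1k-prefill / 100-decode workload above,
# derived directly from the table's tokens/sec figures.

def e2e_seconds(prefill_tps: float, decode_tps: float,
                prompt_tokens: int = 1000, output_tokens: int = 100) -> float:
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

iphone = e2e_seconds(327, 48)   # ~3.06s prefill + ~2.08s decode, about 5.1s
m4_pro = e2e_seconds(582, 100)  # ~1.72s prefill + 1.0s decode, about 2.7s
```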

Supported Platforms

Cactus runs on:
  • iOS 14.0+ (iPhone, iPad)
  • macOS 13.0+ (Intel & Apple Silicon)
  • Android API 24+ (arm64-v8a)
  • Linux (Ubuntu, Debian, Raspberry Pi)
  • watchOS & tvOS (via Swift SDK)

Language SDKs

Integrate Cactus using your preferred language:
SDK            Platforms                   Description
C API          All                         Core FFI for maximum performance
Python         Mac, Linux                  Pythonic bindings with streaming
Swift          iOS, macOS, tvOS, watchOS   Native async/await support
Kotlin         Android, iOS (KMP)          Multiplatform mobile development
Flutter        iOS, macOS, Android         Cross-platform mobile apps
Rust           Mac, Linux                  Type-safe bindings via bindgen
React Native   iOS, Android                JavaScript for mobile

Supported Models

Cactus works with popular open-source models:
Model Family   Features                                Sizes
LiquidAI LFM   completion, vision, tools, embeddings   350M - 8B
Google Gemma   completion, tools                       270M - 1B
Qwen           completion, tools, embeddings           600M - 1.7B
Whisper        transcription, speech embed, NPU        tiny - medium
Parakeet       transcription, speech embed, NPU        600M - 1.1B
Moonshine      transcription, speech embed             base
Nomic Embed    text embeddings                         MoE
See the Models page for the complete list of 25+ supported models.

Quick Demo

Get started in 2 steps:
# Step 1: Install Cactus
brew install cactus-compute/cactus/cactus

# Step 2: Run a model
cactus run
The CLI automatically downloads models and opens an interactive playground.

Next Steps

1. Quickstart: Follow the Quickstart to run your first inference in under 5 minutes.
2. Installation: See Installation for detailed setup instructions for your platform.
3. API Reference: Explore the Engine API to understand all available functions.
4. SDK Guides: Check out SDK-specific guides for Python, Swift, or Kotlin.

Architecture Highlights

Cactus Engine

OpenAI-compatible APIs for chat completion, vision, transcription, embeddings, RAG, and tool calling.

Cactus Graph

Zero-copy computation graph optimized for mobile RAM constraints with support for custom models.

Cactus Kernels

Hand-optimized ARM SIMD kernels for:
  • Custom attention mechanisms
  • KV-cache quantization (2x prefill speedup)
  • Chunked prefill (10-token & 1k-token prefills have same decode speed)
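A toy sketch of the chunked-prefill idea (illustrative only, not the actual kernels, which operate on tensors): the prompt is processed in fixed-size chunks, each appending its entries to the KV cache, so peak working memory scales with the chunk size rather than the full prompt length.

```python
# Toy illustration of chunked prefill: process the prompt in fixed-size
# chunks; lists stand in for KV-cache entries here.

CHUNK = 128

def chunked_prefill(prompt_tokens, chunk_size=CHUNK):
    kv_cache = []
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        # Attend over kv_cache + chunk, then append the chunk's KV entries.
        kv_cache.extend(chunk)
    return kv_cache

cache = chunked_prefill(list(range(1000)))
# After prefill, the cache covers every prompt token; decode then proceeds
# one token at a time against the same cache, whatever the prompt length,
# which is why 10-token and 1k-token prefills decode at the same speed.
```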

Community & Support

Cactus is maintained by Cactus Compute, Inc. (YC S25) and leading university AI societies.
Cactus is open-source and actively accepting contributions. See the Contributing Guide to get involved.