Architecture

Cactus is designed as a three-layer architecture that separates high-level AI workflows from low-level hardware optimizations. This design enables efficient on-device inference across diverse mobile hardware while maintaining clean, maintainable code.

Three-Layer Design

Cactus is built on a modular architecture that cleanly separates concerns:
┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for all major languages
└─────────────────┘     Chat, vision, STT, RAG, tool call, cloud handoff

┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph (PyTorch for mobile)
└─────────────────┘     Custom models, optimised for RAM & quantisation

┌─────────────────┐
│ Cactus Kernels  │ ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc)
└─────────────────┘     Custom attention, KV-cache quant, chunked prefill

The Engine Layer provides developer-friendly APIs for common AI tasks:
  • Text Completion: Chat, instruction following, conversation
  • Vision: Image understanding, multi-modal inputs
  • Transcription: Speech-to-text (Whisper, Parakeet, Moonshine)
  • Embeddings: Text and image embeddings for RAG
  • Tool Calling: Function calling with JSON schema validation
  • Cloud Handoff: Automatic fallback to cloud models based on confidence
Key Features:
  • OpenAI-compatible message format
  • Streaming token generation
  • RAG with automatic vector indexing
  • Multi-language SDKs (Python, Swift, Kotlin, Rust, Flutter)
See the Engine API Reference for complete documentation.

The Graph Layer is a zero-copy computation graph framework inspired by PyTorch:
  • Node-Based Operations: Tensor operations as graph nodes
  • Lazy Execution: Build graph first, execute later
  • Memory Efficiency: Zero-copy operations, smart buffer pooling
  • Precision Support: INT4, INT8, FP16, FP32
  • Mixed Precision: Automatic precision casting where needed
Key Operations:
// Matrix operations
size_t result = graph.matmul(a, b, pretransposed_rhs);
size_t transposed = graph.transpose(input);

// Attention mechanisms
size_t attn = graph.attention(query, key, value, scale, 
                              position_offset, window_size);

// Normalization
size_t norm = graph.rms_norm(input, weight, epsilon);
size_t rope = graph.rope(input, theta, position_offset);
The graph layer handles precision conversions, broadcasting, and operator fusion automatically. See the Graph API Reference for complete documentation.

The Kernel Layer contains highly optimized ARM SIMD implementations:
  • INT4/INT8 Quantized Operations: Group quantization with FP16 scales
  • Custom Attention: Optimized for mobile memory hierarchies
  • KV Cache Quantization: INT8 cache compression
  • NEON/SME2 SIMD: ARM vector intrinsics for maximum throughput
  • Cache-Friendly Access: Optimized memory access patterns
Key Kernels:
  • cactus_matmul_int4(): INT4 matrix multiplication
  • cactus_matmul_int8(): INT8 matrix multiplication with group scales
  • cactus_attention_f16(): FP16 attention with causal masking
  • cactus_attention_hybrid_int8_fp16(): Hybrid KV cache attention
  • cactus_quantize_kv_fp16_to_int8(): KV cache quantization
All kernels are implemented in cactus/kernel/ and optimized for Apple, Qualcomm, and Samsung processors.
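As a rough illustration of what the group-scaled INT8 kernels compute, here is a scalar reference sketch. The function name, signature, and group size of 32 are our assumptions for illustration, not the actual kernel API; the production kernels vectorize this inner loop with NEON/SME2 intrinsics.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for a group-scaled INT8 dot product: each run of
// `group_size` elements shares one scale, so the true value of a[i] is
// a[i] * a_scales[i / group_size]. Accumulation is done in INT32 within
// each group, then rescaled once, which is what makes the scheme cheap.
float dot_int8_grouped(const std::vector<int8_t>& a,
                       const std::vector<int8_t>& b,
                       const std::vector<float>& a_scales,
                       const std::vector<float>& b_scales,
                       size_t group_size = 32) {
    float total = 0.0f;
    for (size_t g = 0; g * group_size < a.size(); ++g) {
        int32_t acc = 0;  // integer accumulation within one group
        size_t end = std::min((g + 1) * group_size, a.size());
        for (size_t i = g * group_size; i < end; ++i)
            acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
        total += acc * a_scales[g] * b_scales[g];  // rescale once per group
    }
    return total;
}
```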

Hybrid NPU/CPU Execution

Cactus intelligently distributes computation between NPU (Neural Processing Unit) and CPU for optimal performance:
Supported Operations:
  • Matrix multiplication (up to 11x faster than CPU)
  • Convolution layers
  • Attention mechanisms
  • Layer normalization
Apple NPU Features:
  • Available on A14+ (iPhone 12+) and M1+ Macs
  • FP16 precision
  • Automatic chunked prefill for long sequences
  • Zero-copy integration with CPU execution
Performance Gains:
Device          CPU Only   With NPU   Speedup
iPhone 17 Pro   48 t/s     327 t/s    6.8x
Mac M4 Pro      100 t/s    582 t/s    5.8x
iPad M3         60 t/s     350 t/s    5.8x
Models with NPU Support:
  • LFM2-VL (vision models)
  • Whisper (all sizes)
  • Parakeet transcription models

Key Architectural Features

Chunked Prefill

Cactus processes long input sequences in chunks to maintain consistent decode speeds:
// Automatic chunking in prefill
void Model::prefill(const std::vector<uint32_t>& tokens, 
                   size_t chunk_size = 256) {
    // Process tokens in chunks of 256 (configurable)
    for (size_t i = 0; i < tokens.size(); i += chunk_size) {
        size_t chunk_end = std::min(i + chunk_size, tokens.size());
        // Process chunk...
    }
}
Benefits:
  • Long prefill (1000 tokens) = same decode speed as short prefill (10 tokens)
  • Prevents memory pressure during prefill
  • Enables streaming responses sooner
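The chunk boundaries the prefill loop visits can be exercised in isolation. The helper below is our own sketch (its name is not part of the engine API); it just enumerates the ranges the loop above iterates over:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Enumerate the [start, end) token ranges that chunked prefill processes.
// With the default chunk_size of 256, a 1000-token prompt becomes four
// chunks: [0,256), [256,512), [512,768), [768,1000).
std::vector<std::pair<size_t, size_t>> prefill_chunks(size_t num_tokens,
                                                      size_t chunk_size = 256) {
    std::vector<std::pair<size_t, size_t>> chunks;
    for (size_t i = 0; i < num_tokens; i += chunk_size)
        chunks.emplace_back(i, std::min(i + chunk_size, num_tokens));
    return chunks;
}
```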

KV Cache Quantization

Cactus compresses the key-value cache using INT8 quantization with group scales:
struct KVCache {
    // INT8 quantized keys/values
    std::vector<int8_t> keys;
    std::vector<int8_t> values;
    
    // Per-group scales (default group size: 32 elements); stored as
    // FP16 on device, shown as float here for clarity
    std::vector<float> key_scales;
    std::vector<float> value_scales;
    
    size_t window_size = 1024;  // Sliding window
    size_t sink_size = 4;        // Keep first 4 tokens
};
Memory Savings:
  • 2x reduction in cache size (FP16 → INT8)
  • Negligible accuracy loss thanks to per-group scales
  • Critical for long context on mobile devices
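As a sketch of what the group quantization does to cache values, the function below quantizes with one scale per group of 32 and can be round-tripped to check the error. The max-abs/127 scaling rule is an assumption about the scheme for illustration; the actual kernel is cactus_quantize_kv_fp16_to_int8().

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize floats to INT8 with one scale per group of `group_size`.
// scale = max_abs / 127 maps the largest element in each group to ±127,
// halving cache size versus FP16 at a small per-element rounding error.
void quantize_group_int8(const std::vector<float>& src,
                         std::vector<int8_t>& dst,
                         std::vector<float>& scales,
                         size_t group_size = 32) {
    dst.resize(src.size());
    scales.clear();
    for (size_t g = 0; g * group_size < src.size(); ++g) {
        size_t end = std::min((g + 1) * group_size, src.size());
        float max_abs = 0.0f;
        for (size_t i = g * group_size; i < end; ++i)
            max_abs = std::max(max_abs, std::fabs(src[i]));
        float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
        scales.push_back(scale);
        for (size_t i = g * group_size; i < end; ++i)
            dst[i] = static_cast<int8_t>(std::lround(src[i] / scale));
    }
}
```

Dequantization is simply `dst[i] * scales[i / group_size]`, which the attention kernel folds into its inner loop.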

Cactus Attention

Custom attention implementation optimized for mobile:
size_t attention = graph.attention(
    query,           // [batch, seq_len, num_heads, head_dim]
    key,             // [batch, kv_len, num_kv_heads, head_dim]
    value,           // [batch, kv_len, num_kv_heads, head_dim]
    scale,           // 1/sqrt(head_dim)
    position_offset, // For incremental decoding
    window_size,     // Sliding window attention
    is_causal        // Causal masking
);
Optimizations:
  • Fused softmax and matrix multiply
  • Cache-friendly memory access
  • Multi-head attention with GQA (Grouped Query Attention)
Sliding window attention for long sequences:
// Only attend to the last 1024 tokens, keeping the first 4 as sink tokens
kv_cache.set_window_size(/*window_size=*/1024, /*sink_size=*/4);

size_t attention = graph.attention(
    query, key, value, scale,
    position_offset,
    /*window_size=*/1024  // Sliding window
);
The cache keeps the first 4 tokens (the attention sink) plus a sliding window of the most recent 1024 tokens.
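For reference, here is a scalar sketch of the computation graph.attention performs for a single head (causal masking, no sliding window, FP32 only). This is illustrative; the production kernel fuses the softmax with the matmuls, handles GQA head grouping, and reads a quantized KV cache.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Single-head causal attention over row-major [seq_len, head_dim] buffers:
// out[i] = softmax(q[i] . k[j] * scale for j <= i) weighted sum of v rows.
std::vector<float> attention_ref(const std::vector<float>& q,
                                 const std::vector<float>& k,
                                 const std::vector<float>& v,
                                 size_t seq_len, size_t head_dim) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    std::vector<float> out(seq_len * head_dim, 0.0f);
    for (size_t i = 0; i < seq_len; ++i) {
        std::vector<float> w(i + 1);
        float max_w = -INFINITY;
        for (size_t j = 0; j <= i; ++j) {           // causal: only j <= i
            float dot = 0.0f;
            for (size_t d = 0; d < head_dim; ++d)
                dot += q[i * head_dim + d] * k[j * head_dim + d];
            w[j] = dot * scale;
            max_w = std::max(max_w, w[j]);
        }
        float denom = 0.0f;
        for (size_t j = 0; j <= i; ++j) {           // numerically stable softmax
            w[j] = std::exp(w[j] - max_w);
            denom += w[j];
        }
        for (size_t j = 0; j <= i; ++j)
            for (size_t d = 0; d < head_dim; ++d)
                out[i * head_dim + d] += (w[j] / denom) * v[j * head_dim + d];
    }
    return out;
}
```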

Memory-Mapped Weights

Weights are memory-mapped for efficient loading:
// Weights loaded on-demand via mmap
size_t weight = graph.mmap_weights("model/layer.0.weight");

// Pages released after use
graph.release_weight_pages(weight);

// Prefetch for next layer
graph.prefetch_weight_pages(next_weight);
Benefits:
  • Fast model loading (no copying into RAM)
  • Only active weights in memory
  • OS handles paging automatically
  • Supports models larger than RAM
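The mechanism underneath graph.mmap_weights and the release/prefetch hooks is ordinary POSIX mmap plus madvise. The sketch below (our own helper names, not the Cactus API) shows the pattern:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a weight file read-only. Pages fault in on first access instead of
// being copied into RAM up front, so "loading" is nearly instant and the
// OS can evict clean pages under memory pressure.
void* map_weights(const char* path, size_t* size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping keeps the file alive
    if (p == MAP_FAILED) return nullptr;
    *size_out = static_cast<size_t>(st.st_size);
    madvise(p, st.st_size, MADV_WILLNEED);   // prefetch, like the next-layer hint
    return p;
}

void release_weight_pages(void* p, size_t size) {
    madvise(p, size, MADV_DONTNEED);         // let the OS drop the pages
}
```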

Related Pages

  • Models: supported models and their features
  • Quantization: INT4/INT8/FP16 precision options
  • Engine API: high-level API reference
  • Graph API: computation graph reference