Architecture

Cactus is designed as a three-layer architecture that separates high-level AI workflows from low-level hardware optimizations. This design enables efficient on-device inference across diverse mobile hardware while maintaining clean, maintainable code.

Three-Layer Design

Cactus is built on a modular architecture that cleanly separates concerns:
┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for all major languages
└─────────────────┘     Chat, vision, STT, RAG, tool call, cloud handoff

┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph (PyTorch for mobile)
└─────────────────┘     Custom models, optimised for RAM & quantisation

┌─────────────────┐
│ Cactus Kernels  │ ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc)
└─────────────────┘     Custom attention, KV-cache quant, chunked prefill

The Engine Layer provides developer-friendly APIs for common AI tasks:
  • Text Completion: Chat, instruction following, conversation
  • Vision: Image understanding, multi-modal inputs
  • Transcription: Speech-to-text (Whisper, Parakeet, Moonshine)
  • Embeddings: Text and image embeddings for RAG
  • Tool Calling: Function calling with JSON schema validation
  • Cloud Handoff: Automatic fallback to cloud models based on confidence
Key Features:
  • OpenAI-compatible message format
  • Streaming token generation
  • RAG with automatic vector indexing
  • Multi-language SDKs (Python, Swift, Kotlin, Rust, Flutter)
See the Engine API Reference for complete documentation.

The Graph Layer is a zero-copy computation graph framework inspired by PyTorch:
  • Node-Based Operations: Tensor operations as graph nodes
  • Lazy Execution: Build graph first, execute later
  • Memory Efficiency: Zero-copy operations, smart buffer pooling
  • Precision Support: INT4, INT8, FP16, FP32
  • Mixed Precision: Automatic precision casting where needed
Key Operations:
// Matrix operations
size_t result = graph.matmul(a, b, pretransposed_rhs);
size_t transposed = graph.transpose(input);

// Attention mechanisms
size_t attn = graph.attention(query, key, value, scale, 
                              position_offset, window_size);

// Normalization
size_t norm = graph.rms_norm(input, weight, epsilon);
size_t rope = graph.rope(input, theta, position_offset);
The graph layer handles precision conversions, broadcasting, and operator fusion automatically. See the Graph API Reference for complete documentation.

The Kernel Layer contains highly optimized ARM SIMD implementations:
  • INT4/INT8 Quantized Operations: Group quantization with FP16 scales
  • Custom Attention: Optimized for mobile memory hierarchies
  • KV Cache Quantization: INT8 cache compression
  • NEON/SME2 SIMD: ARM vector intrinsics for maximum throughput
  • Cache-Friendly Access: Optimized memory access patterns
Key Kernels:
  • cactus_matmul_int4(): INT4 matrix multiplication
  • cactus_matmul_int8(): INT8 matrix multiplication with group scales
  • cactus_attention_f16(): FP16 attention with causal masking
  • cactus_attention_hybrid_int8_fp16(): Hybrid KV cache attention
  • cactus_quantize_kv_fp16_to_int8(): KV cache quantization
All kernels are implemented in cactus/kernel/ and optimized for Apple, Qualcomm, and Samsung processors.
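As a rough illustration of what the group-scaled INT8 kernels compute, here is a scalar reference sketch. The function name, signature, and group size of 32 are our assumptions for illustration, not the actual kernel API; the production kernels vectorize this inner loop with NEON/SME2 intrinsics.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for a group-scaled INT8 dot product: each run of
// `group_size` elements shares one scale, so the true value of a[i] is
// a[i] * a_scales[i / group_size]. Accumulation is done in INT32 within
// each group, then rescaled once, which is what makes the scheme cheap.
float dot_int8_grouped(const std::vector<int8_t>& a,
                       const std::vector<int8_t>& b,
                       const std::vector<float>& a_scales,
                       const std::vector<float>& b_scales,
                       size_t group_size = 32) {
    float total = 0.0f;
    for (size_t g = 0; g * group_size < a.size(); ++g) {
        int32_t acc = 0;  // integer accumulation within one group
        size_t end = std::min((g + 1) * group_size, a.size());
        for (size_t i = g * group_size; i < end; ++i)
            acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
        total += acc * a_scales[g] * b_scales[g];  // rescale once per group
    }
    return total;
}
```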

Hybrid NPU/CPU Execution

Cactus intelligently distributes computation between NPU (Neural Processing Unit) and CPU for optimal performance:
Supported Operations:
  • Matrix multiplication (up to 11x faster than CPU)
  • Convolution layers
  • Attention mechanisms
  • Layer normalization
Apple NPU Features:
  • Available on A14+ (iPhone 12+) and M1+ Macs
  • FP16 precision
  • Automatic chunked prefill for long sequences
  • Zero-copy integration with CPU execution
Performance Gains:
Device          CPU Only   With NPU   Speedup
iPhone 17 Pro   48 t/s     327 t/s    6.8x
Mac M4 Pro      100 t/s    582 t/s    5.8x
iPad M3         60 t/s     350 t/s    5.8x
Models with NPU Support:
  • LFM2-VL (vision models)
  • Whisper (all sizes)
  • Parakeet transcription models

Key Architectural Features

Chunked Prefill

Cactus processes long input sequences in chunks to maintain consistent decode speeds:
// Automatic chunking in prefill
void Model::prefill(const std::vector<uint32_t>& tokens, 
                   size_t chunk_size = 256) {
    // Process tokens in chunks of 256 (configurable)
    for (size_t i = 0; i < tokens.size(); i += chunk_size) {
        size_t chunk_end = std::min(i + chunk_size, tokens.size());
        // Process chunk...
    }
}
Benefits:
  • Long prefill (1000 tokens) = same decode speed as short prefill (10 tokens)
  • Prevents memory pressure during prefill
  • Enables streaming responses sooner
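The chunk boundaries the prefill loop visits can be exercised in isolation. The helper below is our own sketch (its name is not part of the engine API); it just enumerates the ranges the loop above iterates over:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Enumerate the [start, end) token ranges that chunked prefill processes.
// With the default chunk_size of 256, a 1000-token prompt becomes four
// chunks: [0,256), [256,512), [512,768), [768,1000).
std::vector<std::pair<size_t, size_t>> prefill_chunks(size_t num_tokens,
                                                      size_t chunk_size = 256) {
    std::vector<std::pair<size_t, size_t>> chunks;
    for (size_t i = 0; i < num_tokens; i += chunk_size)
        chunks.emplace_back(i, std::min(i + chunk_size, num_tokens));
    return chunks;
}
```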

KV Cache Quantization

Cactus compresses the key-value cache using INT8 quantization with group scales:
struct KVCache {
    // INT8 quantized keys/values
    std::vector<int8_t> keys;
    std::vector<int8_t> values;
    
    // Per-group scales (default group size: 32 elements); stored as
    // FP16 on device, shown as float here for clarity
    std::vector<float> key_scales;
    std::vector<float> value_scales;
    
    size_t window_size = 1024;  // Sliding window
    size_t sink_size = 4;        // Keep first 4 tokens
};
Memory Savings:
  • 2x reduction in cache size (FP16 → INT8)
  • Negligible accuracy loss thanks to per-group scales
  • Critical for long context on mobile devices
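As a sketch of what the group quantization does to cache values, the function below quantizes with one scale per group of 32 and can be round-tripped to check the error. The max-abs/127 scaling rule is an assumption about the scheme for illustration; the actual kernel is cactus_quantize_kv_fp16_to_int8().

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize floats to INT8 with one scale per group of `group_size`.
// scale = max_abs / 127 maps the largest element in each group to ±127,
// halving cache size versus FP16 at a small per-element rounding error.
void quantize_group_int8(const std::vector<float>& src,
                         std::vector<int8_t>& dst,
                         std::vector<float>& scales,
                         size_t group_size = 32) {
    dst.resize(src.size());
    scales.clear();
    for (size_t g = 0; g * group_size < src.size(); ++g) {
        size_t end = std::min((g + 1) * group_size, src.size());
        float max_abs = 0.0f;
        for (size_t i = g * group_size; i < end; ++i)
            max_abs = std::max(max_abs, std::fabs(src[i]));
        float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
        scales.push_back(scale);
        for (size_t i = g * group_size; i < end; ++i)
            dst[i] = static_cast<int8_t>(std::lround(src[i] / scale));
    }
}
```

Dequantization is simply `dst[i] * scales[i / group_size]`, which the attention kernel folds into its inner loop.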

Cactus Attention

Custom attention implementation optimized for mobile:
size_t attention = graph.attention(
    query,           // [batch, seq_len, num_heads, head_dim]
    key,             // [batch, kv_len, num_kv_heads, head_dim]
    value,           // [batch, kv_len, num_kv_heads, head_dim]
    scale,           // 1/sqrt(head_dim)
    position_offset, // For incremental decoding
    window_size,     // Sliding window attention
    is_causal        // Causal masking
);
Optimizations:
  • Fused softmax and matrix multiply
  • Cache-friendly memory access
  • Multi-head attention with GQA (Grouped Query Attention)
Sliding window attention for long sequences:
// Only attend to the last 1024 tokens, keeping the first 4 as sink tokens
kv_cache.set_window_size(/*window_size=*/1024, /*sink_size=*/4);

size_t attention = graph.attention(
    query, key, value, scale,
    position_offset,
    /*window_size=*/1024  // Sliding window
);
The cache keeps the first 4 tokens (the attention sink) plus a sliding window of the most recent 1024 tokens.
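For reference, here is a scalar sketch of the computation graph.attention performs for a single head (causal masking, no sliding window, FP32 only). This is illustrative; the production kernel fuses the softmax with the matmuls, handles GQA head grouping, and reads a quantized KV cache.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Single-head causal attention over row-major [seq_len, head_dim] buffers:
// out[i] = softmax(q[i] . k[j] * scale for j <= i) weighted sum of v rows.
std::vector<float> attention_ref(const std::vector<float>& q,
                                 const std::vector<float>& k,
                                 const std::vector<float>& v,
                                 size_t seq_len, size_t head_dim) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    std::vector<float> out(seq_len * head_dim, 0.0f);
    for (size_t i = 0; i < seq_len; ++i) {
        std::vector<float> w(i + 1);
        float max_w = -INFINITY;
        for (size_t j = 0; j <= i; ++j) {           // causal: only j <= i
            float dot = 0.0f;
            for (size_t d = 0; d < head_dim; ++d)
                dot += q[i * head_dim + d] * k[j * head_dim + d];
            w[j] = dot * scale;
            max_w = std::max(max_w, w[j]);
        }
        float denom = 0.0f;
        for (size_t j = 0; j <= i; ++j) {           // numerically stable softmax
            w[j] = std::exp(w[j] - max_w);
            denom += w[j];
        }
        for (size_t j = 0; j <= i; ++j)
            for (size_t d = 0; d < head_dim; ++d)
                out[i * head_dim + d] += (w[j] / denom) * v[j * head_dim + d];
    }
    return out;
}
```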

Memory-Mapped Weights

Weights are memory-mapped for efficient loading:
// Weights loaded on-demand via mmap
size_t weight = graph.mmap_weights("model/layer.0.weight");

// Pages released after use
graph.release_weight_pages(weight);

// Prefetch for next layer
graph.prefetch_weight_pages(next_weight);
Benefits:
  • Fast model loading (no copying into RAM)
  • Only active weights in memory
  • OS handles paging automatically
  • Supports models larger than RAM
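The mechanism underneath graph.mmap_weights and the release/prefetch hooks is ordinary POSIX mmap plus madvise. The sketch below (our own helper names, not the Cactus API) shows the pattern:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a weight file read-only. Pages fault in on first access instead of
// being copied into RAM up front, so "loading" is nearly instant and the
// OS can evict clean pages under memory pressure.
void* map_weights(const char* path, size_t* size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping keeps the file alive
    if (p == MAP_FAILED) return nullptr;
    *size_out = static_cast<size_t>(st.st_size);
    madvise(p, st.st_size, MADV_WILLNEED);   // prefetch, like the next-layer hint
    return p;
}

void release_weight_pages(void* p, size_t size) {
    madvise(p, size, MADV_DONTNEED);         // let the OS drop the pages
}
```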

Related Pages

  • Models: supported models and their features
  • Quantization: INT4/INT8/FP16 precision options
  • Engine API: high-level API reference
  • Graph API: computation graph reference