Supported Models
Cactus supports a growing list of state-of-the-art models optimized for mobile and edge devices. All models support INT4, INT8, and FP16 quantization.

Language Models

Text generation models for chat, completion, and tool calling.

Gemma 3 Models (Google)
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| google/gemma-3-270m-it | 270M | completion | ~200MB |
| google/functiongemma-270m-it | 270M | completion, tools | ~200MB |
| google/gemma-3-1b-it | 1B | completion | ~800MB |
LFM2 Models (Liquid AI)
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| LiquidAI/LFM2-350M | 350M | completion, tools, embed | ~250MB |
| LiquidAI/LFM2-700M | 700M | completion, tools, embed | ~500MB |
| LiquidAI/LFM2.5-1.2B-Thinking | 1.2B | completion, tools, embed | ~700MB |
| LiquidAI/LFM2.5-1.2B-Instruct | 1.2B | completion, tools, embed | ~700MB |
| LiquidAI/LFM2-2.6B | 2.6B | completion, tools, embed | ~1.8GB |
| LiquidAI/LFM2-8B-A1B | 8B (1B active) | completion, tools, embed | ~6GB |
On-device benchmarks:

| Device | Prefill | Decode | RAM |
|---|---|---|---|
| Mac M4 Pro | 582 t/s | 100 t/s | 76MB |
| iPhone 17 Pro | 327 t/s | 48 t/s | 108MB |
| Galaxy S25 Ultra | 255 t/s | 37 t/s | 1.5GB |
Qwen 3 Models
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 600M | completion, tools, embed | ~400MB |
| Qwen/Qwen3-1.7B | 1.7B | completion, tools, embed | ~1.2GB |
| Qwen/Qwen3-Embedding-0.6B | 600M | embed | ~400MB |
Vision Models
Multi-modal models that understand both text and images.

LFM2-VL (Liquid AI)
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| LiquidAI/LFM2-VL-450M | 450M | vision, txt & img embed | ~300MB |
| LiquidAI/LFM2.5-VL-1.6B | 1.6B | vision, txt & img embed | ~1.1GB |
On-device benchmarks:

| Device | First Token | Decode |
|---|---|---|
| Mac M4 Pro | 0.2s | 98 t/s |
| iPad M3 | 0.3s | 69 t/s |
| iPhone 17 Pro | 0.3s | 48 t/s |
| Galaxy S25 Ultra | - | 34 t/s |
Transcription Models
Speech-to-text models for audio transcription.

- Whisper (OpenAI)
- Parakeet (NVIDIA)
- Moonshine (Useful Sensors)
| Model | Size | Features | RAM (INT4) | NPU |
|---|---|---|---|---|
| openai/whisper-tiny | 39M | transcription, embed | ~100MB | ✅ |
| openai/whisper-base | 74M | transcription, embed | ~150MB | ✅ |
| openai/whisper-small | 244M | transcription, embed | ~200MB | ✅ |
| openai/whisper-medium | 769M | transcription, embed | ~600MB | ✅ |
Specialized Models
Voice Activity Detection (VAD)
| Model | Size | Features | RAM |
|---|---|---|---|
| snakers4/silero-vad | 1.5M | vad | ~10MB |
Embedding Models
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| nomic-ai/nomic-embed-text-v2-moe | 137M | embed | ~100MB |
| Qwen/Qwen3-Embedding-0.6B | 600M | embed | ~400MB |
Model Download & Conversion
Downloading Models
Use the `cactus download` command to fetch and convert models:
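A minimal sketch of the invocation, using model identifiers from the tables above; the exact flags may differ in your installed version, so check `cactus download --help`:

```shell
# Fetch a model by its Hugging Face repo id and prepare it for on-device use.
# (Illustrative invocation -- flag names are assumptions.)
cactus download LiquidAI/LFM2-350M

# A larger chat model:
cactus download google/gemma-3-1b-it
```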
Converting Custom Models
Convert your own fine-tuned models. Supported architectures:

- Gemma (1, 2, 3)
- Qwen (2, 3)
- LFM2 / LFM2.5
- Whisper
- Parakeet (CTC, TDT)
- SigLIP-2 (vision encoders)
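For a fine-tuned checkpoint in one of the architectures above, conversion might look like the following sketch; the subcommand name and flags here are assumptions, not the confirmed CLI surface:

```shell
# Convert a local fine-tuned checkpoint into Cactus's on-device format.
# (Hypothetical subcommand and flags -- consult the CLI help for the
# real interface.)
cactus convert ./my-finetuned-gemma --output ./weights
```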
Model Storage
Models are stored in the `weights/` directory:

- `config.json`: Model configuration
- `tokenizer.json`: BPE tokenizer
- `*.weight`: Memory-mapped weight files (one per layer)
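Putting those files together, a downloaded model directory looks roughly like this (the per-layer file names are illustrative, and the layer count varies by model):

```
weights/
├── config.json
├── tokenizer.json
├── layer_0.weight
├── layer_1.weight
└── ...
```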
RAM Usage & Performance
Memory Requirements by Precision
| Precision | Memory per Param | 1B Model | 2.6B Model |
|---|---|---|---|
| INT4 | 0.5 bytes | ~500MB | ~1.3GB |
| INT8 | 1 byte | ~1GB | ~2.6GB |
| FP16 | 2 bytes | ~2GB | ~5.2GB |
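The table follows from multiplying parameter count by bytes per parameter; INT4 packs two weights per byte. A quick shell check for a 2.6B-parameter model:

```shell
# Estimate raw weight memory from parameter count and precision.
# INT4 = 0.5 bytes/param, INT8 = 1 byte/param, FP16 = 2 bytes/param.
params=2600000000                        # 2.6B-parameter model
echo "INT4: $((params / 2)) bytes"       # ~1.3 GB
echo "INT8: $((params)) bytes"           # ~2.6 GB
echo "FP16: $((params * 2)) bytes"       # ~5.2 GB
```

Actual runtime RAM is typically somewhat higher than these raw weight sizes, since activations and the KV cache also occupy memory.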
Device Recommendations
High-End Phones (iPhone 15 Pro+, Galaxy S24 Ultra, Pixel 9 Pro):
- LFM2.5-1.2B (INT4) - Excellent
- Gemma-3-1B (INT4) - Excellent
- LFM2-VL-1.6B (INT4) - Good
- Whisper-Small (INT4) - Excellent
Related Resources
- Architecture: How Cactus's three-layer design works
- Quantization: INT4/INT8/FP16 precision guide
- Engine API: Using models in your app
- Fine-Tuning: Train custom models