# Custom Models
Cactus supports converting custom models and fine-tuned adapters for deployment to mobile devices. This guide covers model conversion, quantization options, and testing your custom models.
## Converting Models with LoRA

The `cactus convert` command merges LoRA adapters with base models and converts them to Cactus format.

### Basic Conversion
```shell
# Convert from a local LoRA adapter
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora ./my-lora-adapter

# Convert from HuggingFace Hub
cactus convert google/gemma-3-270m-it ./my-gemma3 --lora username/my-lora-adapter

# With specific quantization
cactus convert LiquidAI/LFM2.5-1.2B-Instruct ./my-lfm --lora ./adapters/my-lora --precision INT8
```
### Command Options

| Flag | Description | Default |
|---|---|---|
| `--precision INT4\|INT8\|FP16` | Weight quantization level | `INT4` |
| `--lora <path>` | Path to LoRA adapter (local or HF Hub) | None |
| `--token <token>` | HuggingFace API token for gated models | None |
| `--reconvert` | Force reconversion from source | False |
**Base model match:** Always use the base model that matches your LoRA adapter. A mismatched base model will produce incorrect outputs.
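One way to catch a mismatch before converting is to read the adapter's `adapter_config.json`: PEFT-style trainers record the training base model in its `base_model_name_or_path` field. A minimal sketch, assuming that field layout (Cactus may perform its own check):

```python
import json
from pathlib import Path

def check_base_model(adapter_dir: str, expected_base: str) -> bool:
    """Warn if the adapter was trained on a different base model."""
    config = json.loads(Path(adapter_dir, "adapter_config.json").read_text())
    recorded = config.get("base_model_name_or_path", "")
    if recorded != expected_base:
        print(f"Mismatch: adapter was trained on {recorded!r}, not {expected_base!r}")
        return False
    return True
```

Run this against the adapter directory before invoking `cactus convert` to fail fast instead of shipping a silently broken model.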
## Supported Base Models

Cactus supports the following model architectures:

- **Gemma 3:** `google/gemma-3-270m-it`, `google/gemma-3-1b-it`
- **Qwen 3:** `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-1.7B`
- **LFM 2/2.5:** `LiquidAI/LFM2-350M`, `LiquidAI/LFM2.5-1.2B-Instruct`, `LiquidAI/LFM2-8B-A1B`
- **SmolLM 2:** Coming soon

For the complete list, see the Supported Models section in the README.
### LoRA Adapter Requirements

Your LoRA adapter must:

- Be trained with Unsloth, PEFT, or a compatible LoRA library
- Target standard transformer modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Use a rank (`r`) between 8 and 64 (16-32 recommended for mobile)
- Include a valid `adapter_config.json` file
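The checklist above can be verified mechanically from the adapter's `adapter_config.json`. A hedged sketch, assuming PEFT-style field names (`target_modules`, `r`); Cactus itself may apply different or stricter checks:

```python
import json
from pathlib import Path

# Module names Cactus documents as supported LoRA targets
SUPPORTED_MODULES = {"q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"}

def validate_adapter(adapter_dir: str) -> list[str]:
    """Return a list of problems found in the adapter config (empty = looks fine)."""
    config_path = Path(adapter_dir, "adapter_config.json")
    if not config_path.exists():
        return ["missing adapter_config.json"]
    config = json.loads(config_path.read_text())
    problems = []
    unsupported = set(config.get("target_modules", [])) - SUPPORTED_MODULES
    if unsupported:
        problems.append(f"unsupported target modules: {sorted(unsupported)}")
    r = config.get("r", 0)
    if not 8 <= r <= 64:
        problems.append(f"rank r={r} outside the supported 8-64 range")
    return problems
```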
## Weight Quantization Options

### INT4 (Default)

```shell
cactus convert Qwen/Qwen3-0.6B ./my-model --lora ./adapter --precision INT4
```

- **Benefits:** ~50% memory reduction vs INT8; fastest inference
- **Trade-offs:** Minimal quality loss with task-specific fine-tunes
- **Best for:** Production deployment on budget devices
### INT8

```shell
cactus convert Qwen/Qwen3-0.6B ./my-model --lora ./adapter --precision INT8
```

- **Benefits:** Better quality retention than INT4
- **Trade-offs:** 2x memory usage vs INT4
- **Best for:** Quality-critical applications, mid-range devices
- **Performance:** 60-70 tok/s on iPhone 17 Pro, 13-18 tok/s on Pixel 6a
### FP16

```shell
cactus convert Qwen/Qwen3-0.6B ./my-model --lora ./adapter --precision FP16
```

- **Benefits:** Full precision, no quality loss
- **Trade-offs:** 4x memory usage vs INT4; slower inference
- **Best for:** Development, benchmarking, and high-end devices only
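A quick way to reason about these trade-offs is bytes per weight: roughly 0.5 for INT4, 1 for INT8, and 2 for FP16. The sketch below estimates the weight-only footprint; actual on-device RAM also depends on activations, the KV cache, and runtime packing, so measured usage can differ substantially:

```python
# Approximate storage per weight at each precision (bytes)
BYTES_PER_PARAM = {"INT4": 0.5, "INT8": 1.0, "FP16": 2.0}

def weight_memory_mb(num_params: float, precision: str) -> float:
    """Approximate weight-only footprint in MB (ignores activations and KV cache)."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 2)
```

For a 0.6B-parameter model this gives roughly 286 MB at INT4, doubling at each step up, which matches the 2x/4x ratios quoted above.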
**Note:** Cactus v1.15+ uses hybrid inference with lossless quantization techniques, providing a ~1.5x performance improvement while maintaining quality.
## Testing Converted Models

### Interactive Testing (Mac/Linux)

```shell
# Test your converted model locally
cactus run ./my-qwen3-0.6b
```

This opens an interactive playground where you can test completions, tool calls, and streaming.
### Benchmark Mode

```shell
# Run performance benchmarks
cactus test --model ./my-qwen3-0.6b --benchmark
```

Outputs:

- Prefill tokens per second (TPS)
- Decode tokens per second
- Time to first token
- RAM usage
- Model confidence scores
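If you instrument your own harness instead, the first two numbers fall out of two wall-clock measurements: prefill TPS is prompt tokens over time-to-first-token, and decode TPS is generated tokens over the remaining time. A simplified model of that calculation (a sketch, not the harness Cactus itself uses):

```python
def throughput(prompt_tokens: int, gen_tokens: int,
               ttft_s: float, total_s: float) -> dict:
    """Derive prefill/decode tokens-per-second from wall-clock timings."""
    return {
        "prefill_tps": prompt_tokens / ttft_s,          # prompt processed before first token
        "decode_tps": gen_tokens / (total_s - ttft_s),  # generation after first token
    }
```

For example, a 128-token prompt answered with 256 tokens, with 0.4 s to first token and 9.0 s total, gives 320 prefill TPS and ~30 decode TPS.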
### Testing on iOS Device

```shell
# Build and test on a connected iPhone
cactus build --apple
cactus test --model ./my-model --ios
```

Requires:

- Xcode installed
- iPhone connected via USB
- Device unlocked and trusted
### Testing on Android Device

```shell
# Build and test on a connected Android phone
cactus build --android
cactus test --model ./my-model --android
```

Requires:

- Android SDK/NDK installed
- USB debugging enabled
- Device connected via ADB
## Example: End-to-End Custom Model

### 1. Train a LoRA Adapter (Colab/GPU)

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-0.6B",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # your prepared SFT dataset
    max_seq_length=2048,
)
trainer.train()

model.save_pretrained("./task-specific-adapter")
```
### 2. Convert for Cactus

```shell
cactus convert Qwen/Qwen3-0.6B ./qwen3-task-specific \
  --lora ./task-specific-adapter \
  --precision INT8
```
### 3. Test Locally

```shell
cactus run ./qwen3-task-specific
```
### 4. Deploy to iOS

```shell
cactus build --apple
# Copy the model folder into your Xcode project
# Link cactus-ios.xcframework
```

```swift
let modelPath = Bundle.main.path(forResource: "qwen3-task-specific", ofType: nil)!
let model = cactus_init(modelPath, nil, false)

let messages = "[{\"role\":\"user\",\"content\":\"Your query\"}]"
var response = [CChar](repeating: 0, count: 4096)
cactus_complete(model, messages, &response, response.count, nil, nil, nil, nil)
print(String(cString: response))

cactus_destroy(model)
```
## Performance Benchmarks

### INT8 Qwen3-0.6B (Custom Fine-Tune)

| Device | Decode TPS | RAM Usage |
|---|---|---|
| iPhone 17 Pro | 60-70 | ~200MB |
| iPhone 13 Mini | 25-35 | ~400MB |
| Galaxy S25 Ultra | 30-40 | ~500MB |
| Pixel 6a | 13-18 | ~450MB |
| Raspberry Pi 5 | 10-15 | ~350MB |
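Decode TPS maps directly onto perceived latency: after the first token, a reply of N tokens streams in roughly N / TPS seconds. A small helper for sizing responses against the numbers above (the device figures plugged in come from the table, not new measurements):

```python
def reply_latency_s(tokens: int, decode_tps: float) -> float:
    """Approximate seconds to stream a `tokens`-long reply at a given decode rate."""
    return tokens / decode_tps

# A 100-token answer at 15 tok/s (Pixel 6a) takes ~6.7 s;
# at 65 tok/s (iPhone 17 Pro) the same answer takes ~1.5 s.
```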
### INT8 Gemma3-270m (Task-Specific)

| Device | Decode TPS | RAM Usage |
|---|---|---|
| iPhone 17 Pro | 150+ | ~120MB |
| Raspberry Pi 5 | 23 | ~200MB |
Performance varies based on model complexity, context length, and device thermal state. Use the `--benchmark` flag for accurate measurements on your target device.
## Troubleshooting

### Conversion Fails

**Error:** Base model architecture mismatch

**Solution:** Verify your LoRA adapter was trained on the exact base model you're converting. Check the adapter's `adapter_config.json` file.
### Poor Quality After Conversion

- Try INT8 instead of INT4: `--precision INT8`
- Verify the adapter trained properly (check validation loss)
- Test with different prompts and temperatures
### Model Too Large for Device

- Use INT4 quantization: `--precision INT4`
- Try a smaller base model (e.g., Gemma3-270m instead of Qwen3-1.7B)
- Reduce the context window at runtime (see Performance Tuning)
## See Also