# Custom Models
Cactus supports converting custom models and fine-tuned adapters for deployment to mobile devices. This guide covers model conversion, quantization options, and testing your custom models.
## Converting Models with LoRA

The `cactus convert` command merges LoRA adapters with base models and converts them to Cactus format.

### Basic Conversion
```shell
# Convert from a local LoRA adapter
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora ./my-lora-adapter

# Convert from HuggingFace Hub
cactus convert google/gemma-3-270m-it ./my-gemma3 --lora username/my-lora-adapter

# With specific quantization
cactus convert LiquidAI/LFM2.5-1.2B-Instruct ./my-lfm --lora ./adapters/my-lora --precision INT8
```
### Command Options

| Flag | Description | Default |
|---|---|---|
| `--precision INT4\|INT8\|FP16` | Weight quantization level | `INT4` |
| `--lora <path>` | Path to LoRA adapter (local or HF Hub) | None |
| `--token <token>` | HuggingFace API token for gated models | None |
| `--reconvert` | Force reconversion from source | False |
**Base model match:** Always use the base model that matches your LoRA adapter. A mismatched base model will produce incorrect outputs.
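One way to catch a mismatch before converting is to read the adapter's `adapter_config.json`: PEFT-style trainers record the training base model in its `base_model_name_or_path` field. A minimal sketch, assuming that field layout (Cactus may perform its own check):

```python
import json
from pathlib import Path

def check_base_model(adapter_dir: str, expected_base: str) -> bool:
    """Warn if the adapter was trained on a different base model."""
    config = json.loads(Path(adapter_dir, "adapter_config.json").read_text())
    recorded = config.get("base_model_name_or_path", "")
    if recorded != expected_base:
        print(f"Mismatch: adapter was trained on {recorded!r}, not {expected_base!r}")
        return False
    return True
```

Run this against the adapter directory before invoking `cactus convert` to fail fast instead of shipping a silently broken model.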
## Supported Base Models

Cactus supports the following model architectures:

- **Gemma 3:** `google/gemma-3-270m-it`, `google/gemma-3-1b-it`
- **Qwen 3:** `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-1.7B`
- **LFM 2/2.5:** `LiquidAI/LFM2-350M`, `LiquidAI/LFM2.5-1.2B-Instruct`, `LiquidAI/LFM2-8B-A1B`
- **SmolLM 2:** Coming soon

For the complete list, see the Supported Models section in the README.
### LoRA Adapter Requirements

Your LoRA adapter must:

- Be trained with Unsloth, PEFT, or a compatible LoRA library
- Target standard transformer modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Use a rank (`r`) between 8 and 64 (16-32 recommended for mobile)
- Include a valid `adapter_config.json` file
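The checklist above can be verified mechanically from the adapter's `adapter_config.json`. A hedged sketch, assuming PEFT-style field names (`target_modules`, `r`); Cactus itself may apply different or stricter checks:

```python
import json
from pathlib import Path

# Module names Cactus documents as supported LoRA targets
SUPPORTED_MODULES = {"q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"}

def validate_adapter(adapter_dir: str) -> list[str]:
    """Return a list of problems found in the adapter config (empty = looks fine)."""
    config_path = Path(adapter_dir, "adapter_config.json")
    if not config_path.exists():
        return ["missing adapter_config.json"]
    config = json.loads(config_path.read_text())
    problems = []
    unsupported = set(config.get("target_modules", [])) - SUPPORTED_MODULES
    if unsupported:
        problems.append(f"unsupported target modules: {sorted(unsupported)}")
    r = config.get("r", 0)
    if not 8 <= r <= 64:
        problems.append(f"rank r={r} outside the supported 8-64 range")
    return problems
```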
## Weight Quantization Options

### INT4 (Default)

```shell
cactus convert Qwen/Qwen3-0.6B ./my-model --lora ./adapter --precision INT4
```

- **Benefits:** ~50% memory reduction vs INT8; fastest inference
- **Trade-offs:** Minimal quality loss with task-specific fine-tunes
- **Best for:** Production deployment on budget devices
### INT8

```shell
cactus convert Qwen/Qwen3-0.6B ./my-model --lora ./adapter --precision INT8
```

- **Benefits:** Better quality retention than INT4
- **Trade-offs:** 2x memory usage vs INT4
- **Best for:** Quality-critical applications, mid-range devices
- **Performance:** 60-70 tok/s on iPhone 17 Pro, 13-18 tok/s on Pixel 6a
### FP16

```shell
cactus convert Qwen/Qwen3-0.6B ./my-model --lora ./adapter --precision FP16
```

- **Benefits:** Full precision, no quality loss
- **Trade-offs:** 4x memory usage vs INT4; slower inference
- **Best for:** Development, benchmarking, and high-end devices only
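A quick way to reason about these trade-offs is bytes per weight: roughly 0.5 for INT4, 1 for INT8, and 2 for FP16. The sketch below estimates the weight-only footprint; actual on-device RAM also depends on activations, the KV cache, and runtime packing, so measured usage can differ substantially:

```python
# Approximate storage per weight at each precision (bytes)
BYTES_PER_PARAM = {"INT4": 0.5, "INT8": 1.0, "FP16": 2.0}

def weight_memory_mb(num_params: float, precision: str) -> float:
    """Approximate weight-only footprint in MB (ignores activations and KV cache)."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 2)
```

For a 0.6B-parameter model this gives roughly 286 MB at INT4, doubling at each step up, which matches the 2x/4x ratios quoted above.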
**Note:** Cactus v1.15+ uses hybrid inference with lossless quantization techniques, providing a ~1.5x performance improvement while maintaining quality.
## Testing Converted Models

### Interactive Testing (Mac/Linux)

```shell
# Test your converted model locally
cactus run ./my-qwen3-0.6b
```

This opens an interactive playground where you can test completions, tool calls, and streaming.
### Benchmark Mode

```shell
# Run performance benchmarks
cactus test --model ./my-qwen3-0.6b --benchmark
```

Outputs:

- Prefill tokens per second (TPS)
- Decode tokens per second
- Time to first token
- RAM usage
- Model confidence scores
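If you instrument your own harness instead, the first two numbers fall out of two wall-clock measurements: prefill TPS is prompt tokens over time-to-first-token, and decode TPS is generated tokens over the remaining time. A simplified model of that calculation (a sketch, not the harness Cactus itself uses):

```python
def throughput(prompt_tokens: int, gen_tokens: int,
               ttft_s: float, total_s: float) -> dict:
    """Derive prefill/decode tokens-per-second from wall-clock timings."""
    return {
        "prefill_tps": prompt_tokens / ttft_s,          # prompt processed before first token
        "decode_tps": gen_tokens / (total_s - ttft_s),  # generation after first token
    }
```

For example, a 128-token prompt answered with 256 tokens, with 0.4 s to first token and 9.0 s total, gives 320 prefill TPS and ~30 decode TPS.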
### Testing on iOS Device

```shell
# Build and test on a connected iPhone
cactus build --apple
cactus test --model ./my-model --ios
```

Requires:

- Xcode installed
- iPhone connected via USB
- Device unlocked and trusted
### Testing on Android Device

```shell
# Build and test on a connected Android phone
cactus build --android
cactus test --model ./my-model --android
```

Requires:

- Android SDK/NDK installed
- USB debugging enabled
- Device connected via ADB
## Example: End-to-End Custom Model

### 1. Train a LoRA Adapter (Colab/GPU)

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-0.6B",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # your prepared SFT dataset
    max_seq_length=2048,
)
trainer.train()

model.save_pretrained("./task-specific-adapter")
```
### 2. Convert for Cactus

```shell
cactus convert Qwen/Qwen3-0.6B ./qwen3-task-specific \
  --lora ./task-specific-adapter \
  --precision INT8
```
### 3. Test Locally

```shell
cactus run ./qwen3-task-specific
```
### 4. Deploy to iOS

```shell
cactus build --apple
# Copy the model folder into your Xcode project
# Link cactus-ios.xcframework
```

```swift
let modelPath = Bundle.main.path(forResource: "qwen3-task-specific", ofType: nil)!
let model = cactus_init(modelPath, nil, false)

let messages = "[{\"role\":\"user\",\"content\":\"Your query\"}]"
var response = [CChar](repeating: 0, count: 4096)
cactus_complete(model, messages, &response, response.count, nil, nil, nil, nil)
print(String(cString: response))

cactus_destroy(model)
```
## Performance Benchmarks

### INT8 Qwen3-0.6B (Custom Fine-Tune)

| Device | Decode TPS | RAM Usage |
|---|---|---|
| iPhone 17 Pro | 60-70 | ~200MB |
| iPhone 13 Mini | 25-35 | ~400MB |
| Galaxy S25 Ultra | 30-40 | ~500MB |
| Pixel 6a | 13-18 | ~450MB |
| Raspberry Pi 5 | 10-15 | ~350MB |
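Decode TPS maps directly onto perceived latency: after the first token, a reply of N tokens streams in roughly N / TPS seconds. A small helper for sizing responses against the numbers above (the device figures plugged in come from the table, not new measurements):

```python
def reply_latency_s(tokens: int, decode_tps: float) -> float:
    """Approximate seconds to stream a `tokens`-long reply at a given decode rate."""
    return tokens / decode_tps

# A 100-token answer at 15 tok/s (Pixel 6a) takes ~6.7 s;
# at 65 tok/s (iPhone 17 Pro) the same answer takes ~1.5 s.
```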
### INT8 Gemma3-270m (Task-Specific)

| Device | Decode TPS | RAM Usage |
|---|---|---|
| iPhone 17 Pro | 150+ | ~120MB |
| Raspberry Pi 5 | 23 | ~200MB |
Performance varies based on model complexity, context length, and device thermal state. Use the `--benchmark` flag for accurate measurements on your target device.
## Troubleshooting

### Conversion Fails

**Error:** Base model architecture mismatch

**Solution:** Verify your LoRA adapter was trained on the exact base model you're converting. Check the adapter's `adapter_config.json` file.
### Poor Quality After Conversion

- Try INT8 instead of INT4: `--precision INT8`
- Verify the adapter trained properly (check validation loss)
- Test with different prompts and temperatures
### Model Too Large for Device

- Use INT4 quantization: `--precision INT4`
- Try a smaller base model (e.g., Gemma3-270m instead of Qwen3-1.7B)
- Reduce the context window at runtime (see Performance Tuning)
## See Also