# Fine-Tuning

> Train LoRA adapters with Unsloth and deploy them to iOS and Android devices using Cactus

Cactus enables you to train custom LoRA adapters on GPU and deploy them to mobile devices with minimal quality loss. This guide covers training with Unsloth, merging adapters, and deploying to phones.

## Overview

The fine-tuning workflow:

1. **Train on GPU**: Use Unsloth on Google Colab or a local GPU to train LoRA adapters
2. **Merge & Convert**: Use `cactus convert` to merge the adapter with the base model and quantize the result
3. **Deploy to Mobile**: Package the converted model with your iOS/Android app
4. **Run On-Device**: Inference runs entirely on-device with the Cactus engine

## Training LoRA Adapters

### Prerequisites

* Google Colab with a GPU (free tier works), or a local machine with a CUDA GPU
* Unsloth library installed
* Training dataset in instruction format
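
The training script in the next section reads an `input` and an `output` field from each example. As a minimal sketch of one way to store and sanity-check such a dataset, the snippet below writes JSON Lines and rejects rows with missing fields. The file name `train.jsonl` and both helper functions are illustrative assumptions, not part of Cactus or Unsloth.

```python
import json
import os
import tempfile

def write_jsonl(path, examples):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

def load_and_validate(path, required=("input", "output")):
    """Load a JSONL file and reject rows with missing or empty fields."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            for key in required:
                if not isinstance(row.get(key), str) or not row[key].strip():
                    raise ValueError(f"line {line_no}: missing or empty '{key}'")
            rows.append(row)
    return rows

# Placeholder examples; replace with your own task data
examples = [
    {"input": "Translate to French: Hello", "output": "Bonjour"},
    {"input": "Translate to French: Thank you", "output": "Merci"},
]
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(path, examples)
rows = load_and_validate(path)
print(f"{len(rows)} valid examples")
```

Catching empty or malformed rows before training is cheaper than discovering them partway through a Colab session.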

### Basic Training Script

```python theme={null}
from unsloth import FastLanguageModel
import torch  # needed for the bf16 capability check below
from trl import SFTTrainer
from transformers import TrainingArguments

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-270m-it",  # or Qwen3-0.6B, LFM2-350M
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=None,
)

# Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # Rank: 16-32 recommended for mobile
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha=16,
    lora_dropout=0,            # Training-only setting; 0 uses Unsloth's optimized path
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

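# Prepare dataset
# Placeholder path: point this at your own JSONL file with "input" and "output" fields
from datasets import load_dataset
dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(lambda x: {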
    "text": tokenizer.apply_chat_template(
        [{"role": "user", "content": x["input"]},
         {"role": "assistant", "content": x["output"]}],
        tokenize=False
    )
})

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
    ),
)

trainer.train()

# Save adapter
model.save_pretrained("my-lora-adapter")
tokenizer.save_pretrained("my-lora-adapter")

# Optional: Push to Hub
model.push_to_hub("username/my-lora-adapter", token="...")
```
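
To see how the batch settings in the script interact, the sketch below computes the effective batch size and an approximate number of optimizer steps. The dataset size of 1000 examples is a hypothetical value, and the step count is an estimate: how the trainer rounds a final partial batch can differ between library versions.

```python
import math

def effective_batch_size(per_device_batch, grad_accum, num_devices=1):
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum * num_devices

def approx_optimizer_steps(num_examples, effective_batch, epochs):
    """Approximate total optimizer steps (rounds partial batches up)."""
    return math.ceil(num_examples / effective_batch) * epochs

batch = effective_batch_size(2, 4)              # values from the script above
steps = approx_optimizer_steps(1000, batch, 3)  # 1000 examples is hypothetical
warmup_fraction = 10 / steps                    # warmup_steps=10 in the script

print(batch, steps, round(warmup_fraction, 3))
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size unchanged, which is why that pair of changes is a common response to out-of-memory errors.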

### Recommended Hyperparameters

#### For Mobile Deployment

| Parameter          | Recommended Value | Notes                                     |
| ------------------ | ----------------- | ----------------------------------------- |
| `r` (rank)         | 16-32             | Lower = smaller adapter, faster training  |
| `lora_alpha`       | Same as rank      | Typically set equal to rank               |
| `lora_dropout`     | 0                 | Training-only; 0 is Unsloth's fast path   |
| `max_seq_length`   | 2048              | Balance memory and context                |
| `learning_rate`    | 2e-4 to 5e-4      | Higher for small datasets                 |
| `num_train_epochs` | 3-5               | More epochs for small datasets            |

#### Target Modules

Always include these projection layers:

```python theme={null}
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"]
```

<Warning>
  **Smaller Models for Mobile**: Use Gemma3-270m, Qwen3-0.6B, or LFM2-350M as base models. Larger models (>1B params) may not run smoothly on budget devices.
</Warning>

## Merging Adapters with Base Models

### Setup Cactus

```bash theme={null}
git clone https://github.com/cactus-compute/cactus
cd cactus
source ./setup
```

### Merge and Convert

The `cactus convert` command merges your LoRA adapter with the base model and converts to Cactus format:

```bash theme={null}
# From local adapter
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora ./my-lora-adapter

# From HuggingFace Hub
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora username/my-lora-adapter

# With INT8 quantization (better quality)
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b \
  --lora ./my-lora-adapter \
  --precision INT8

# With HuggingFace token (for gated models)
cactus convert google/gemma-3-1b-it ./my-gemma3 \
  --lora ./my-lora-adapter \
  --token hf_...
```
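
Conceptually, merging a LoRA adapter folds the low-rank update into the base weight: `W' = W + (lora_alpha / r) * B @ A`. The pure-Python sketch below shows that arithmetic on a tiny 2x2 example. It illustrates the standard LoRA merge formula only and is not a description of how `cactus convert` is implemented; the matrices are made-up values.

```python
def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix (lists of lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy values: d_out = d_in = 2, rank r = 1
W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, shape (d_out, d_in)
A = [[1.0, 2.0]]               # LoRA A, shape (r, d_in)
B = [[0.5], [-0.5]]            # LoRA B, shape (d_out, r)

merged = merge_lora(W, A, B, lora_alpha=2, r=1)

# The merged weight gives the same output as base path + adapter path
x = [[1.0], [1.0]]
merged_out = matmul(merged, x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate_out = [[base_out[i][0] + 2.0 * adapter_out[i][0]] for i in range(2)]
print(merged, merged_out == separate_out)
```

Because the merged matrix has the same shape as the base weight, the merged model has the same inference cost as the base model regardless of the adapter's rank.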

### Quantization During Merge

| Precision | Memory (vs. INT4) | Quality | Best For                            |
| --------- | ----------------- | ------- | ----------------------------------- |
| INT4      | Lowest (1x)       | Good    | Production, budget devices          |
| INT8      | Medium (2x)       | Better  | Mid-range devices, quality-critical |
| FP16      | Highest (4x)      | Best    | Development, high-end only          |

**Recommendation**: Start with INT8 for testing, switch to INT4 for production if quality is acceptable.
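
The 1x/2x/4x ratios in the table follow directly from bits per weight. As a back-of-the-envelope sketch, the snippet below estimates weight storage for a model with a nominal 0.6 billion parameters (an approximation taken from the model name). It counts raw weights only: real files also store quantization scales and metadata, and on-device RAM depends on the runtime, for example on the KV cache and on how weights are loaded.

```python
BITS_PER_WEIGHT = {"INT4": 4, "INT8": 8, "FP16": 16}

def weight_bytes(num_params, precision):
    """Raw weight storage in bytes, ignoring scales and metadata."""
    return num_params * BITS_PER_WEIGHT[precision] // 8

num_params = 600_000_000  # nominal size, assumed for illustration

for precision in BITS_PER_WEIGHT:
    size_mb = weight_bytes(num_params, precision) / 1e6
    ratio = weight_bytes(num_params, precision) / weight_bytes(num_params, "INT4")
    print(f"{precision}: ~{size_mb:.0f} MB ({ratio:.0f}x INT4)")
```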

<Note>
  Cactus v1.15+ uses **lossless quantization** techniques, providing 1.5x performance improvement while maintaining quality.
</Note>

## Deployment to Mobile

### iOS/macOS Deployment

#### 1. Build Native Library

```bash theme={null}
cactus build --apple
```

Output:

```
Build complete!
Static libraries:
  Device: /path/to/cactus/apple/libcactus-device.a
  Simulator: /path/to/cactus/apple/libcactus-simulator.a
XCFrameworks:
  iOS: /path/to/cactus/apple/cactus-ios.xcframework
  macOS: /path/to/cactus/apple/cactus-macos.xcframework
```

#### 2. Add to Xcode Project

1. Copy the `my-qwen3-0.6b/` folder into your Xcode project
2. Link `cactus-ios.xcframework` in **Frameworks, Libraries, and Embedded Content**
3. Set the framework to **Embed & Sign**

#### 3. Use in Swift

```swift theme={null}
import Foundation

class CactusModel {
    private var model: OpaquePointer?
    
    init(modelName: String) {
        let modelPath = Bundle.main.path(forResource: modelName, ofType: nil)!
        model = cactus_init(modelPath, nil, false)
    }
    
    func complete(messages: [[String: String]]) -> String {
        let jsonData = try! JSONSerialization.data(withJSONObject: messages)
        let messagesJson = String(data: jsonData, encoding: .utf8)!
        
        var response = [CChar](repeating: 0, count: 4096)
        cactus_complete(model, messagesJson, &response, response.count, 
                        nil, nil, nil, nil)
        
        return String(cString: response)
    }
    
    deinit {
        if let model = model {
            cactus_destroy(model)
        }
    }
}

// Usage
let model = CactusModel(modelName: "my-qwen3-0.6b")
let result = model.complete(messages: [
    ["role": "user", "content": "Hello!"]
])
print(result)
```

### Android Deployment

#### 1. Build Native Library

```bash theme={null}
cactus build --android
```

Output:

```
Build complete!
Shared library: /path/to/cactus/android/libcactus.so
Static library: /path/to/cactus/android/libcactus.a
```

#### 2. Add to Android Project

1. Copy `libcactus.so` to `app/src/main/jniLibs/arm64-v8a/`
2. Copy the `my-qwen3-0.6b/` folder to `app/src/main/assets/`

#### 3. Use in Kotlin

```kotlin theme={null}
import android.content.Context
import java.io.File
import org.json.JSONArray

class CactusWrapper {
    init {
        System.loadLibrary("cactus")
    }
    
    external fun init(modelPath: String, contextSize: Long, corpusDir: String?): Long
    external fun complete(model: Long, messagesJson: String, bufferSize: Int): String
    external fun destroy(model: Long)
}

class CactusModel(context: Context, modelName: String) {
    private val cactus = CactusWrapper()
    private val model: Long
    
    init {
        // Copy model from assets to cache.
        // copyAssetFolder is not defined in this snippet; supply your own helper.
        val modelDir = File(context.cacheDir, modelName)
        copyAssetFolder(context, modelName, modelDir.absolutePath)
        
        model = cactus.init(modelDir.absolutePath, 2048, null)
    }
    
    fun complete(messages: List<Map<String, String>>): String {
        val messagesJson = JSONArray(messages).toString()
        return cactus.complete(model, messagesJson, 4096)
    }
    
    fun close() {
        cactus.destroy(model)
    }
}

// Usage
val model = CactusModel(context, "my-qwen3-0.6b")
val result = model.complete(listOf(
    mapOf("role" to "user", "content" to "Hello!")
))
println(result)
model.close()
```

## Performance Benchmarks

### INT8 Qwen3-0.6B Fine-Tune

| Device           | Decode TPS  | RAM Usage |
| ---------------- | ----------- | --------- |
| iPhone 17 Pro    | 60-70 tok/s | \~200MB   |
| iPhone 13 Mini   | 25-35 tok/s | \~400MB   |
| Galaxy S25 Ultra | 30-40 tok/s | \~500MB   |
| Pixel 6a         | 13-18 tok/s | \~450MB   |
| Raspberry Pi 5   | 10-15 tok/s | \~350MB   |

### INT8 Gemma3-270m Task-Specific

| Device         | Decode TPS | RAM Usage |
| -------------- | ---------- | --------- |
| iPhone 17 Pro  | 150+ tok/s | \~120MB   |
| iPhone 13 Mini | 80+ tok/s  | \~200MB   |
| Raspberry Pi 5 | 23 tok/s   | \~200MB   |
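
To translate decode throughput into user-facing latency, divide the number of generated tokens by the decode rate. The sketch below does this for a 256-token response using the Pixel 6a range from the Qwen3-0.6B table. It covers decode time only and assumes a steady rate; prompt processing (prefill) and thermal throttling are not modeled.

```python
def decode_seconds(num_tokens, tokens_per_second):
    """Time to generate num_tokens at a steady decode rate (prefill excluded)."""
    return num_tokens / tokens_per_second

def decode_range(num_tokens, tps_low, tps_high):
    """Return (best_case, worst_case) seconds for a throughput range."""
    return (decode_seconds(num_tokens, tps_high),
            decode_seconds(num_tokens, tps_low))

# Pixel 6a, INT8 Qwen3-0.6B: 13-18 tok/s from the table above
fast, slow = decode_range(256, 13, 18)
print(f"256 tokens: {fast:.1f}-{slow:.1f} s")
```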

## Testing Your Fine-Tune

### Local Testing (Mac/Linux)

```bash theme={null}
# Interactive playground
cactus run ./my-qwen3-0.6b

# Benchmark mode
cactus test --model ./my-qwen3-0.6b --benchmark
```

### On-Device Testing

```bash theme={null}
# Test on connected iPhone
cactus test --model ./my-qwen3-0.6b --ios

# Test on connected Android phone
cactus test --model ./my-qwen3-0.6b --android
```

<Note>
  Device must be connected via USB, unlocked, and trusted. For iOS, Xcode must be installed. For Android, USB debugging must be enabled.
</Note>

## Best Practices

### Training

1. **Start small**: Use Gemma3-270m or Qwen3-0.6B for mobile
2. **Low rank**: Use r=16 or r=32 to minimize adapter size
3. **No dropout**: Set `lora_dropout=0`; dropout only applies during training, and 0 keeps Unsloth's optimized path
4. **Validate quality**: Test on a holdout set before deployment
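
One simple way to get a holdout set is to shuffle the examples with a fixed seed and set aside a fraction before training. The sketch below is a minimal version of that idea; the function name, the 10% default, and the placeholder rows are assumptions for illustration.

```python
import random

def train_holdout_split(rows, holdout_fraction=0.1, seed=42):
    """Shuffle deterministically and split into (train, holdout)."""
    shuffled = list(rows)                  # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Placeholder rows in the same input/output shape the training script expects
rows = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(20)]
train, holdout = train_holdout_split(rows)
print(len(train), len(holdout))
```

Keeping the seed fixed means the same examples stay in the holdout set across runs, so scores from different adapters remain comparable.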

### Deployment

1. **Test quantization**: Compare INT4 and INT8 quality on your task
2. **Measure on-device**: Use `cactus test --ios` or `cactus test --android` for accurate benchmarks
3. **Monitor memory**: Check RAM usage under different context lengths
4. **Thermal management**: Long inference sessions may cause thermal throttling on phones
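
For the quantization comparison, one lightweight approach is to run the same holdout prompts through both converted models and score the outputs against reference answers. The sketch below uses normalized exact match, which suits short, closed-form tasks; open-ended generation needs a different metric. The output lists are placeholder strings, not measured results.

```python
def normalize(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())

def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

# Placeholder data: collect real outputs from each converted model
references = ["Bonjour", "Merci", "Au revoir"]
int8_outputs = ["Bonjour", "merci", "Au revoir"]
int4_outputs = ["Bonjour", "Merci", "Adieu"]

int8_score = exact_match(int8_outputs, references)
int4_score = exact_match(int4_outputs, references)
print(f"INT8: {int8_score:.2f}  INT4: {int4_score:.2f}")
```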

## Troubleshooting

### Training Issues

**Out of memory during training**

```python theme={null}
# Reduce batch size
per_device_train_batch_size=1
gradient_accumulation_steps=8

# Use gradient checkpointing
use_gradient_checkpointing="unsloth"
```

**Poor validation loss**

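* If training loss is also high (underfitting), increase training epochs or try a higher learning rate (3e-4 to 5e-4)
* If training loss is low but validation loss keeps rising (overfitting), reduce rank or train for fewer epochs
* Add more training data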

### Deployment Issues

**Model too slow on device**

* Use INT4 quantization
* Switch to smaller base model
* Reduce KV cache window (see [Performance Tuning](/advanced/performance-tuning))

**Quality degraded after quantization**

* Use INT8 instead of INT4
* Verify training quality first
* Check adapter was properly merged

## See Also

* [Custom Models](/advanced/custom-models) — Model conversion and testing
* [Performance Tuning](/advanced/performance-tuning) — Optimize runtime performance
* [NPU Acceleration](/advanced/npu-acceleration) — Enable hardware acceleration
* [Unsloth Documentation](https://github.com/unslothai/unsloth) — Training library reference
