Quickstart
Get up and running with Cactus in under 5 minutes. This guide will walk you through installation, model download, and your first AI inference.
Prerequisites
- macOS: Homebrew installed
- Linux: Ubuntu 20.04+ or Debian 11+ with python3, cmake, build-essential
- Disk space: ~500MB for Cactus + 200-500MB per model
For iOS, Android, and other platforms, see the full Installation guide.
Step 1: Install Cactus
brew install cactus-compute/cactus/cactus
The setup script will:
- Create a Python 3.12 virtual environment
- Install Python dependencies
- Install the cactus CLI tool
Linux requires Python 3.12. Install it with:
sudo apt-get install python3.12 python3.12-venv
Step 2: Download a Model
Cactus automatically downloads models, but you can pre-download them:
cactus download google/gemma-3-270m-it
This downloads the Gemma 270M model quantized to INT4 (~150MB) to the weights/ directory.
Models are downloaded from HuggingFace and automatically converted to Cactus format with INT4 quantization.
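As a sanity check on the ~150MB figure: INT4 stores four bits, i.e. roughly half a byte, per weight. A back-of-envelope estimate (parameter count inferred from the model name; the remaining gap to ~150MB is assumed to be quantization scales, embeddings, and metadata overhead):

```python
# Rough size of an INT4-quantized model: ~0.5 bytes per parameter,
# plus some overhead for quantization scales and metadata.
params = 270e6             # gemma-3-270m-it parameter count
int4_bytes = params * 0.5  # 4 bits per weight
print(f"~{int4_bytes / 1e6:.0f} MB before overhead")
```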
Step 3: Run Your First Inference
Option A: Interactive Playground (CLI)
The fastest way to get started:
cactus run google/gemma-3-270m-it
This opens an interactive chat playground:
┌──────────────────────────────────────────────────┐
│ Cactus Playground - gemma-3-270m-it │
└──────────────────────────────────────────────────┘
Loaded model: google/gemma-3-270m-it (INT4)
RAM usage: 167MB
You: What is 2+2?
Assistant: 2+2 equals 4.
Time to first token: 45.2ms
Total time: 163.7ms
Prefill: 619.5 tokens/sec
Decode: 168.4 tokens/sec
Option B: C API Example
Create hello.c:
#include "cactus.h"
#include <stdio.h>

int main(void) {
    // Initialize model
    cactus_model_t model = cactus_init(
        "weights/gemma-3-270m-it",
        NULL  // no RAG corpus
    );
    if (!model) {
        fprintf(stderr, "Failed to load model\n");
        return 1;
    }

    // Prepare messages
    const char* messages =
        "[{\"role\": \"user\", \"content\": \"What is 2+2?\"}]";

    // Run inference
    char response[4096];
    int result = cactus_complete(
        model,             // model handle
        messages,          // JSON messages array
        response,          // response buffer
        sizeof(response),  // buffer size
        NULL,              // options (use defaults)
        NULL,              // tools (none)
        NULL,              // callback (no streaming)
        NULL               // user data
    );
    if (result > 0) {
        printf("Response: %s\n", response);
    }

    // Cleanup
    cactus_destroy(model);
    return 0;
}
Compile and run:
gcc hello.c -I./cactus -L./build -lcactus -o hello
./hello
Expected output:
{
"success": true,
"response": "2+2 equals 4.",
"function_calls": [],
"cloud_handoff": false,
"confidence": 0.92,
"time_to_first_token_ms": 45.2,
"total_time_ms": 163.7,
"prefill_tps": 619.5,
"decode_tps": 168.4,
"ram_usage_mb": 167.3,
"prefill_tokens": 28,
"decode_tokens": 6,
"total_tokens": 34
}
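Note the confidence and cloud_handoff fields in that payload. A plausible reading (the exact semantics are an assumption here, based on the confidence_threshold option described under Common Options) is that the local answer is kept whenever confidence meets the threshold:

```python
# Handoff decision implied by the sample output above: with the default
# confidence_threshold of 0.7, a confidence of 0.92 keeps the local answer.
confidence = 0.92
confidence_threshold = 0.7

cloud_handoff = confidence < confidence_threshold
print(cloud_handoff)  # False, matching the sample output
```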
Option C: Python SDK Example
Create hello.py:
from cactus import cactus_init, cactus_complete, cactus_destroy
import json
# Initialize model
model = cactus_init("weights/gemma-3-270m-it", None, False)
# Prepare messages
messages = json.dumps([
{"role": "user", "content": "What is 2+2?"}
])
# Run inference
result_json = cactus_complete(model, messages, None, None, None)
result = json.loads(result_json)
print(f"Response: {result['response']}")
print(f"Time to first token: {result['time_to_first_token_ms']:.1f}ms")
print(f"Decode speed: {result['decode_tps']:.1f} tokens/sec")
# Cleanup
cactus_destroy(model)
Run:
python hello.py
Expected output:
Response: 2+2 equals 4.
Time to first token: 45.2ms
Decode speed: 168.4 tokens/sec
Option D: Swift SDK Example (iOS/macOS)
Create a new Swift file:
import Foundation

// Initialize model
let model = try cactusInit(
    "/path/to/weights/gemma-3-270m-it",
    nil,    // no RAG corpus
    false   // don't cache index
)
defer { cactusDestroy(model) }

// Prepare messages
let messages = #"[{"role":"user","content":"What is 2+2?"}]"#

// Run inference
let resultJson = try cactusComplete(model, messages, nil, nil, nil)
if let data = resultJson.data(using: .utf8),
   let result = try? JSONSerialization.jsonObject(with: data) as? [String: Any] {
    print("Response: \(result["response"] ?? "")")
    print("TTFT: \(result["time_to_first_token_ms"] ?? 0)ms")
}
Option E: Kotlin SDK Example (Android)
import com.cactus.*
import org.json.JSONObject
// Initialize model
val model = cactusInit("/path/to/weights/gemma-3-270m-it", null, false)
// Prepare messages
val messages = """[{"role":"user","content":"What is 2+2?"}]"""
// Run inference
val resultJson = cactusComplete(model, messages, null, null, null)
val result = JSONObject(resultJson)
println("Response: ${result.getString("response")}")
println("TTFT: ${result.getDouble("time_to_first_token_ms")}ms")
// Cleanup
cactusDestroy(model)
Step 4: Try Streaming Output
For a better user experience, stream tokens as they’re generated:
# model and messages as in the Python example above
def on_token(token, token_id):
    print(token, end="", flush=True)

options = json.dumps({"max_tokens": 100})
result = cactus_complete(model, messages, options, None, on_token)
print()  # newline after streaming output
Step 5: Try More Models
Vision-Language Model
Run image understanding:
cactus download LiquidAI/LFM2-VL-450M
messages = json.dumps([{
"role": "user",
"content": "Describe this image",
"images": ["/path/to/photo.jpg"]
}])
model = cactus_init("weights/lfm2-vl-450m", None, False)
result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])
Speech-to-Text Model
Transcribe audio:
cactus transcribe openai/whisper-small --file audio.wav
Or use the API:
from cactus import cactus_init, cactus_transcribe, cactus_destroy
import json
model = cactus_init("weights/whisper-small", None, False)
result = json.loads(cactus_transcribe(
model,
"/path/to/audio.wav", # audio file
None, # no prompt
None, # default options
None, # no streaming callback
None # no PCM buffer (using file)
))
print(f"Transcription: {result['response']}")
cactus_destroy(model)
Next Steps
Common Options
Customize inference with options:
{
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 40,
  "stop_sequences": ["<|im_end|>"],
  "confidence_threshold": 0.7
}
| Option | Default | Description |
|---|---|---|
| max_tokens | 100 | Maximum tokens to generate |
| temperature | 0.0 | Sampling randomness (0.0 = greedy) |
| top_p | 0.0 | Nucleus sampling threshold |
| top_k | 0 | Top-k sampling (0 = disabled) |
| stop_sequences | [] | Stop generation on these strings |
| confidence_threshold | 0.7 | Minimum confidence before cloud handoff |
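In the Python SDK from Option C, these options are passed as a JSON string in the third argument of cactus_complete (as in the streaming example in Step 4). A minimal sketch of building them:

```python
import json

# Build generation options as a JSON string; omitted fields fall back to
# the defaults in the table above.
options = json.dumps({
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.95,
    "stop_sequences": ["<|im_end|>"],
})

# Then: result_json = cactus_complete(model, messages, options, None, None)
print(options)
```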
Troubleshooting
Model fails to load
# Check model exists
ls weights/gemma-3-270m-it
# Re-download if corrupted
cactus download google/gemma-3-270m-it --reconvert
Python import error
# Rebuild Python bindings
cactus build --python
# Verify virtual environment is active
which python # should show ./venv/bin/python
Out of memory
# Use a smaller model
cactus download google/gemma-3-270m-it # ~150MB RAM
# Or quantize a larger model more aggressively
cactus download LiquidAI/LFM2-1.2B --precision INT4
If you encounter a “Failed to initialize model” error, check that:
- The model path is correct (use absolute paths)
- The model was fully downloaded (check file sizes in weights/)
- You have sufficient RAM (at least 500MB free)
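The first two checks can be scripted. A small pre-flight helper (a sketch; the exact file layout inside weights/ is an assumption) that catches a missing or partially downloaded model before cactus_init is called:

```python
import os

def model_looks_ok(path):
    """Return True if the model directory exists and every file in it has
    nonzero size (a partial download often leaves truncated files)."""
    if not os.path.isdir(path):
        return False
    files = [os.path.join(path, f) for f in os.listdir(path)]
    return bool(files) and all(
        os.path.getsize(f) > 0 for f in files if os.path.isfile(f)
    )

if not model_looks_ok("weights/gemma-3-270m-it"):
    print("Model missing or incomplete; try: "
          "cactus download google/gemma-3-270m-it --reconvert")
```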
Get Help
Need assistance?
Join the community to stay updated on new models, features, and optimizations.