Quickstart

Get up and running with Cactus in under 5 minutes. This guide will walk you through installation, model download, and your first AI inference.

Prerequisites

  • macOS: Homebrew installed
  • Linux: Ubuntu 20.04+ or Debian 11+ with python3, cmake, build-essential
  • Disk space: ~500MB for Cactus + 200-500MB per model
For iOS, Android, and other platforms, see the full Installation guide.

Step 1: Install Cactus

brew install cactus-compute/cactus/cactus
The setup script will:
  • Create a Python 3.12 virtual environment
  • Install Python dependencies
  • Install the cactus CLI tool
Linux requires Python 3.12. Install it with:
sudo apt-get install python3.12 python3.12-venv

Step 2: Download a Model

Cactus automatically downloads models, but you can pre-download them:
cactus download google/gemma-3-270m-it
This downloads the Gemma 270M model quantized to INT4 (~150MB) to the weights/ directory.
Models are downloaded from HuggingFace and automatically converted to Cactus format with INT4 quantization.
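As a back-of-envelope sanity check on that size (rough arithmetic, not an official figure): 270M parameters at 4 bits each come to about 135MB, which lines up with the ~150MB download once tokenizer files and overhead are included.

```python
# Approximate on-disk size of a 270M-parameter model quantized to INT4
params = 270e6          # parameter count
bits_per_param = 4      # INT4 quantization
size_mb = params * bits_per_param / 8 / 1e6  # bits -> bytes -> MB
print(f"{size_mb:.0f} MB")  # 135 MB, close to the ~150MB figure above
```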

Step 3: Run Your First Inference

Option A: Interactive Playground (CLI)

The fastest way to get started:
cactus run google/gemma-3-270m-it
This opens an interactive chat playground:
┌──────────────────────────────────────────────────┐
│  Cactus Playground - gemma-3-270m-it             │
└──────────────────────────────────────────────────┘

Loaded model: google/gemma-3-270m-it (INT4)
RAM usage: 167MB

You: What is 2+2?
Assistant: 2+2 equals 4.

  Time to first token: 45.2ms
  Total time: 163.7ms
  Prefill: 619.5 tokens/sec
  Decode: 168.4 tokens/sec

Option B: C API Example

Create hello.c:
#include "cactus.h"
#include <stdio.h>

int main() {
    // Initialize model
    cactus_model_t model = cactus_init(
        "weights/gemma-3-270m-it",
        NULL  // no RAG corpus
    );
    
    if (!model) {
        fprintf(stderr, "Failed to load model\n");
        return 1;
    }
    
    // Prepare messages
    const char* messages = 
        "[{\"role\": \"user\", \"content\": \"What is 2+2?\"}]";
    
    // Run inference
    char response[4096];
    int result = cactus_complete(
        model,            // model handle
        messages,         // JSON messages array
        response,         // response buffer
        sizeof(response), // buffer size
        NULL,             // options (use defaults)
        NULL,             // tools (none)
        NULL,             // callback (no streaming)
        NULL              // user data
    );
    
    if (result > 0) {
        printf("Response: %s\n", response);
    }
    
    // Cleanup
    cactus_destroy(model);
    return 0;
}
Compile and run:
gcc hello.c -I./cactus -L./build -lcactus -o hello
./hello
Expected output:
{
  "success": true,
  "response": "2+2 equals 4.",
  "function_calls": [],
  "cloud_handoff": false,
  "confidence": 0.92,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 163.7,
  "prefill_tps": 619.5,
  "decode_tps": 168.4,
  "ram_usage_mb": 167.3,
  "prefill_tokens": 28,
  "decode_tokens": 6,
  "total_tokens": 34
}
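The throughput fields are internally consistent: `prefill_tps` is just `prefill_tokens` divided by the time to first token. A quick standalone check on the sample output above (plain Python, no Cactus SDK needed):

```python
import json

# Subset of the sample output above
sample = json.loads("""{
  "prefill_tokens": 28,
  "time_to_first_token_ms": 45.2,
  "prefill_tps": 619.5
}""")

# tokens divided by seconds-to-first-token gives prefill throughput
computed = sample["prefill_tokens"] / (sample["time_to_first_token_ms"] / 1000)
print(round(computed, 1))  # 619.5, matching the reported prefill_tps
```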

Option C: Python SDK Example

Create hello.py:
from cactus import cactus_init, cactus_complete, cactus_destroy
import json

# Initialize model
model = cactus_init("weights/gemma-3-270m-it", None, False)

# Prepare messages
messages = json.dumps([
    {"role": "user", "content": "What is 2+2?"}
])

# Run inference
result_json = cactus_complete(model, messages, None, None, None)
result = json.loads(result_json)

print(f"Response: {result['response']}")
print(f"Time to first token: {result['time_to_first_token_ms']:.1f}ms")
print(f"Decode speed: {result['decode_tps']:.1f} tokens/sec")

# Cleanup
cactus_destroy(model)
Run:
python hello.py
Expected output:
Response: 2+2 equals 4.
Time to first token: 45.2ms
Decode speed: 168.4 tokens/sec

Option D: Swift SDK Example (iOS/macOS)

Create a new Swift file:
import Foundation

// Initialize model
let model = try cactusInit(
    "/path/to/weights/gemma-3-270m-it",
    nil,   // no RAG corpus
    false  // don't cache index
)
defer { cactusDestroy(model) }

// Prepare messages
let messages = #"[{"role":"user","content":"What is 2+2?"}]"#

// Run inference
let resultJson = try cactusComplete(model, messages, nil, nil, nil)

if let data = resultJson.data(using: .utf8),
   let result = try? JSONSerialization.jsonObject(with: data) as? [String: Any] {
    print("Response: \(result["response"] ?? "")")
    print("TTFT: \(result["time_to_first_token_ms"] ?? 0)ms")
}

Option E: Kotlin SDK Example (Android)

import com.cactus.*
import org.json.JSONObject

// Initialize model
val model = cactusInit("/path/to/weights/gemma-3-270m-it", null, false)

// Prepare messages  
val messages = """[{"role":"user","content":"What is 2+2?"}]"""

// Run inference
val resultJson = cactusComplete(model, messages, null, null, null)
val result = JSONObject(resultJson)

println("Response: ${result.getString("response")}")
println("TTFT: ${result.getDouble("time_to_first_token_ms")}ms")

// Cleanup
cactusDestroy(model)

Step 4: Try Streaming Output

For a better user experience, stream tokens as they’re generated:
def on_token(token, token_id):
    print(token, end="", flush=True)

options = json.dumps({"max_tokens": 100})
result = cactus_complete(model, messages, options, None, on_token)
print()  # newline after streaming
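If you are not printing tokens directly, the same callback shape can accumulate them into a full response instead. A minimal standalone sketch of the pattern (the token strings here are made up for illustration; the real engine supplies them):

```python
tokens = []

def on_token(token, token_id):
    # Same signature as the streaming callback above; collect instead of print
    tokens.append(token)

# Simulate the engine invoking the callback once per generated token
for i, tok in enumerate(["2+2 ", "equals ", "4."]):
    on_token(tok, i)

full_response = "".join(tokens)
print(full_response)  # 2+2 equals 4.
```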

Step 5: Try More Models

Vision-Language Model

Download the model:
cactus download LiquidAI/LFM2-VL-450M
Then run image understanding by adding an images field to the message:
messages = json.dumps([{
    "role": "user",
    "content": "Describe this image",
    "images": ["/path/to/photo.jpg"]
}])

model = cactus_init("weights/lfm2-vl-450m", None, False)
result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

Speech-to-Text Model

Transcribe audio:
cactus transcribe openai/whisper-small --file audio.wav
Or use the API:
from cactus import cactus_init, cactus_transcribe, cactus_destroy
import json

model = cactus_init("weights/whisper-small", None, False)
result = json.loads(cactus_transcribe(
    model,
    "/path/to/audio.wav",  # audio file
    None,                  # no prompt
    None,                  # default options
    None,                  # no streaming callback
    None                   # no PCM buffer (using file)
))

print(f"Transcription: {result['response']}")
cactus_destroy(model)

Next Steps

  1. Explore the API — Learn about all available functions in the Engine API Reference.
  2. Platform-Specific Setup — See detailed installation for iOS, Android, or Linux.
  3. Advanced Features — Explore Tool Calling, RAG, and Embeddings.
  4. Production Deployment

Common Options

Customize inference with options:
{
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 40,
  "stop_sequences": ["<|im_end|>"],
  "confidence_threshold": 0.7
}
Option                Default  Description
max_tokens            100      Maximum tokens to generate
temperature           0.0      Sampling randomness (0.0 = greedy)
top_p                 0.0      Nucleus sampling threshold
top_k                 0        Top-k sampling (0 = disabled)
stop_sequences        []       Stop generation on these strings
confidence_threshold  0.7      Minimum confidence before cloud handoff
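One way to see how these interact client-side (a sketch that mirrors the defaults table above; the engine applies its own defaults internally when you pass NULL/None for options):

```python
import json

# Defaults from the table above
defaults = {
    "max_tokens": 100,
    "temperature": 0.0,
    "top_p": 0.0,
    "top_k": 0,
    "stop_sequences": [],
    "confidence_threshold": 0.7,
}

# Overlay user-supplied options on the defaults; unset keys keep their defaults
user = json.loads('{"max_tokens": 256, "temperature": 0.7}')
options = {**defaults, **user}
print(options["max_tokens"], options["temperature"], options["top_k"])  # 256 0.7 0
```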

Troubleshooting

Model fails to load

# Check model exists
ls weights/gemma-3-270m-it

# Re-download if corrupted
cactus download google/gemma-3-270m-it --reconvert

Python import error

# Rebuild Python bindings
cactus build --python

# Verify virtual environment is active
which python  # should show ./venv/bin/python

Out of memory

# Use a smaller model
cactus download google/gemma-3-270m-it  # ~150MB RAM

# Or increase quantization
cactus download LiquidAI/LFM2-1.2B --precision INT4
If you encounter a “Failed to initialize model” error, check that:
  1. The model path is correct (use absolute paths)
  2. The model was fully downloaded (check file sizes in weights/)
  3. You have sufficient RAM (at least 500MB free)

Get Help

Need assistance?
Join the community to stay updated on new models, features, and optimizations.