Quickstart

Get up and running with Cactus in under 5 minutes. This guide will walk you through installation, model download, and your first AI inference.

Prerequisites

  • macOS: Homebrew installed
  • Linux: Ubuntu 20.04+ or Debian 11+ with python3, cmake, build-essential
  • Disk space: ~500MB for Cactus + 200-500MB per model
For iOS, Android, and other platforms, see the full Installation guide.

Step 1: Install Cactus

brew install cactus-compute/cactus/cactus
The setup script will:
  • Create a Python 3.12 virtual environment
  • Install Python dependencies
  • Install the cactus CLI tool
Linux requires Python 3.12. Install it with:
sudo apt-get install python3.12 python3.12-venv

Step 2: Download a Model

Cactus automatically downloads models, but you can pre-download them:
cactus download google/gemma-3-270m-it
This downloads the Gemma 270M model quantized to INT4 (~150MB) to the weights/ directory.
Models are downloaded from HuggingFace and automatically converted to Cactus format with INT4 quantization.
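As a back-of-envelope sanity check on that size (rough arithmetic, not an official figure): 270M parameters at 4 bits each come to about 135MB, which lines up with the ~150MB download once tokenizer files and overhead are included.

```python
# Approximate on-disk size of a 270M-parameter model quantized to INT4
params = 270e6          # parameter count
bits_per_param = 4      # INT4 quantization
size_mb = params * bits_per_param / 8 / 1e6  # bits -> bytes -> MB
print(f"{size_mb:.0f} MB")  # 135 MB, close to the ~150MB figure above
```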

Step 3: Run Your First Inference

Option A: Interactive Playground (CLI)

The fastest way to get started:
cactus run google/gemma-3-270m-it
This opens an interactive chat playground:
┌──────────────────────────────────────────────────┐
│  Cactus Playground - gemma-3-270m-it             │
└──────────────────────────────────────────────────┘

Loaded model: google/gemma-3-270m-it (INT4)
RAM usage: 167MB

You: What is 2+2?
Assistant: 2+2 equals 4.

  Time to first token: 45.2ms
  Total time: 163.7ms
  Prefill: 619.5 tokens/sec
  Decode: 168.4 tokens/sec

Option B: C API Example

Create hello.c:
#include "cactus.h"
#include <stdio.h>

int main() {
    // Initialize model
    cactus_model_t model = cactus_init(
        "weights/gemma-3-270m-it",
        NULL  // no RAG corpus
    );
    
    if (!model) {
        fprintf(stderr, "Failed to load model\n");
        return 1;
    }
    
    // Prepare messages
    const char* messages = 
        "[{\"role\": \"user\", \"content\": \"What is 2+2?\"}]";
    
    // Run inference
    char response[4096];
    int result = cactus_complete(
        model,            // model handle
        messages,         // JSON messages array
        response,         // response buffer
        sizeof(response), // buffer size
        NULL,             // options (use defaults)
        NULL,             // tools (none)
        NULL,             // callback (no streaming)
        NULL              // user data
    );
    
    if (result > 0) {
        printf("Response: %s\n", response);
    }
    
    // Cleanup
    cactus_destroy(model);
    return 0;
}
Compile and run:
gcc hello.c -I./cactus -L./build -lcactus -o hello
./hello
Expected output:
{
  "success": true,
  "response": "2+2 equals 4.",
  "function_calls": [],
  "cloud_handoff": false,
  "confidence": 0.92,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 163.7,
  "prefill_tps": 619.5,
  "decode_tps": 168.4,
  "ram_usage_mb": 167.3,
  "prefill_tokens": 28,
  "decode_tokens": 6,
  "total_tokens": 34
}
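The throughput fields are internally consistent: `prefill_tps` is just `prefill_tokens` divided by the time to first token. A quick standalone check on the sample output above (plain Python, no Cactus SDK needed):

```python
import json

# Subset of the sample output above
sample = json.loads("""{
  "prefill_tokens": 28,
  "time_to_first_token_ms": 45.2,
  "prefill_tps": 619.5
}""")

# tokens divided by seconds-to-first-token gives prefill throughput
computed = sample["prefill_tokens"] / (sample["time_to_first_token_ms"] / 1000)
print(round(computed, 1))  # 619.5, matching the reported prefill_tps
```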

Option C: Python SDK Example

Create hello.py:
from cactus import cactus_init, cactus_complete, cactus_destroy
import json

# Initialize model
model = cactus_init("weights/gemma-3-270m-it", None, False)

# Prepare messages
messages = json.dumps([
    {"role": "user", "content": "What is 2+2?"}
])

# Run inference
result_json = cactus_complete(model, messages, None, None, None)
result = json.loads(result_json)

print(f"Response: {result['response']}")
print(f"Time to first token: {result['time_to_first_token_ms']:.1f}ms")
print(f"Decode speed: {result['decode_tps']:.1f} tokens/sec")

# Cleanup
cactus_destroy(model)
Run:
python hello.py
Expected output:
Response: 2+2 equals 4.
Time to first token: 45.2ms
Decode speed: 168.4 tokens/sec

Option D: Swift SDK Example (iOS/macOS)

Create a new Swift file:
import Foundation

// Initialize model
let model = try cactusInit(
    "/path/to/weights/gemma-3-270m-it",
    nil,   // no RAG corpus
    false  // don't cache index
)
defer { cactusDestroy(model) }

// Prepare messages
let messages = #"[{"role":"user","content":"What is 2+2?"}]"#

// Run inference
let resultJson = try cactusComplete(model, messages, nil, nil, nil)

if let data = resultJson.data(using: .utf8),
   let result = try? JSONSerialization.jsonObject(with: data) as? [String: Any] {
    print("Response: \(result["response"] ?? "")")
    print("TTFT: \(result["time_to_first_token_ms"] ?? 0)ms")
}

Option E: Kotlin SDK Example (Android)

import com.cactus.*
import org.json.JSONObject

// Initialize model
val model = cactusInit("/path/to/weights/gemma-3-270m-it", null, false)

// Prepare messages  
val messages = """[{"role":"user","content":"What is 2+2?"}]"""

// Run inference
val resultJson = cactusComplete(model, messages, null, null, null)
val result = JSONObject(resultJson)

println("Response: ${result.getString("response")}")
println("TTFT: ${result.getDouble("time_to_first_token_ms")}ms")

// Cleanup
cactusDestroy(model)

Step 4: Try Streaming Output

For a better user experience, stream tokens as they’re generated:
def on_token(token, token_id):
    print(token, end="", flush=True)

options = json.dumps({"max_tokens": 100})
result = cactus_complete(model, messages, options, None, on_token)
print()  # newline after streaming
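If you are not printing tokens directly, the same callback shape can accumulate them into a full response instead. A minimal standalone sketch of the pattern (the token strings here are made up for illustration; the real engine supplies them):

```python
tokens = []

def on_token(token, token_id):
    # Same signature as the streaming callback above; collect instead of print
    tokens.append(token)

# Simulate the engine invoking the callback once per generated token
for i, tok in enumerate(["2+2 ", "equals ", "4."]):
    on_token(tok, i)

full_response = "".join(tokens)
print(full_response)  # 2+2 equals 4.
```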

Step 5: Try More Models

Vision-Language Model

Download the model:
cactus download LiquidAI/LFM2-VL-450M
Then run image understanding by adding an images field to the message:
messages = json.dumps([{
    "role": "user",
    "content": "Describe this image",
    "images": ["/path/to/photo.jpg"]
}])

model = cactus_init("weights/lfm2-vl-450m", None, False)
result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

Speech-to-Text Model

Transcribe audio:
cactus transcribe openai/whisper-small --file audio.wav
Or use the API:
from cactus import cactus_init, cactus_transcribe, cactus_destroy
import json

model = cactus_init("weights/whisper-small", None, False)
result = json.loads(cactus_transcribe(
    model,
    "/path/to/audio.wav",  # audio file
    None,                  # no prompt
    None,                  # default options
    None,                  # no streaming callback
    None                   # no PCM buffer (using file)
))

print(f"Transcription: {result['response']}")
cactus_destroy(model)

Next Steps

  1. Explore the API — Learn about all available functions in the Engine API Reference.
  2. Platform-Specific Setup — See detailed installation for iOS, Android, or Linux.
  3. Advanced Features — Explore Tool Calling, RAG, and Embeddings.
  4. Production Deployment

Common Options

Customize inference with options:
{
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 40,
  "stop_sequences": ["<|im_end|>"],
  "confidence_threshold": 0.7
}
Option                Default  Description
max_tokens            100      Maximum tokens to generate
temperature           0.0      Sampling randomness (0.0 = greedy)
top_p                 0.0      Nucleus sampling threshold
top_k                 0        Top-k sampling (0 = disabled)
stop_sequences        []       Stop generation on these strings
confidence_threshold  0.7      Minimum confidence before cloud handoff
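One way to see how these interact client-side (a sketch that mirrors the defaults table above; the engine applies its own defaults internally when you pass NULL/None for options):

```python
import json

# Defaults from the table above
defaults = {
    "max_tokens": 100,
    "temperature": 0.0,
    "top_p": 0.0,
    "top_k": 0,
    "stop_sequences": [],
    "confidence_threshold": 0.7,
}

# Overlay user-supplied options on the defaults; unset keys keep their defaults
user = json.loads('{"max_tokens": 256, "temperature": 0.7}')
options = {**defaults, **user}
print(options["max_tokens"], options["temperature"], options["top_k"])  # 256 0.7 0
```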

Troubleshooting

Model fails to load

# Check model exists
ls weights/gemma-3-270m-it

# Re-download if corrupted
cactus download google/gemma-3-270m-it --reconvert

Python import error

# Rebuild Python bindings
cactus build --python

# Verify virtual environment is active
which python  # should show ./venv/bin/python

Out of memory

# Use a smaller model
cactus download google/gemma-3-270m-it  # ~150MB RAM

# Or increase quantization
cactus download LiquidAI/LFM2-1.2B --precision INT4
If you encounter a “Failed to initialize model” error, check that:
  1. The model path is correct (use absolute paths)
  2. The model was fully downloaded (check file sizes in weights/)
  3. You have sufficient RAM (at least 500MB free)

Get Help

Need assistance?
Join the community to stay updated on new models, features, and optimizations.