Overview

The Cactus chat completion API enables you to build conversational AI applications with support for multi-turn conversations, streaming responses, tool calling, and automatic cloud fallback.

Basic Completion

C API

#include <cactus.h>
#include <stdio.h>

cactus_model_t model = cactus_init("weights/lfm2-1.2b", NULL, false);

const char* messages =
    "["
    "  {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},"
    "  {\"role\": \"user\", \"content\": \"What is 2+2?\"}"
    "]";

char response[4096];
int result = cactus_complete(
    model,
    messages,
    response,
    sizeof(response),
    NULL,  // options
    NULL,  // tools
    NULL,  // callback
    NULL   // user_data
);

if (result == 0) {
    printf("%s\n", response);
}

cactus_destroy(model);

Python SDK

from cactus import cactus_init, cactus_complete, cactus_destroy
import json

model = cactus_init("weights/lfm2-1.2b", None, False)

messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"}
])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])
print(f"Time to first token: {result['time_to_first_token_ms']:.2f}ms")
print(f"Decode speed: {result['decode_tps']:.2f} tokens/sec")

cactus_destroy(model)

Response Format

Every completion returns a JSON object:
{
    "success": true,
    "error": null,
    "cloud_handoff": false,
    "response": "4",
    "function_calls": [],
    "confidence": 0.92,
    "time_to_first_token_ms": 45.2,
    "total_time_ms": 163.7,
    "prefill_tps": 619.5,
    "decode_tps": 168.4,
    "ram_usage_mb": 245.67,
    "prefill_tokens": 28,
    "decode_tokens": 12,
    "total_tokens": 40
}
- success (boolean): Whether the generation succeeded
- response (string): The model's generated text response
- cloud_handoff (boolean): True if confidence fell below the threshold and the cloud model was used
- confidence (number): Model confidence score (0-1) based on token probabilities
- function_calls (array): Parsed tool/function calls, if tools were provided

Options

Control generation behavior with an options JSON object:
{
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 20,
    "stop_sequences": ["<|im_end|>", "User:"],
    "cloud_handoff_threshold": 0.8
}
- max_tokens (integer, default 512): Maximum number of tokens to generate
- temperature (number, default 0.6): Sampling temperature (0.0-2.0); higher values produce more random output
- top_p (number, default 0.95): Nucleus sampling threshold (0.0-1.0)
- top_k (integer, default 20): Top-k sampling limit; 0 disables top-k
- stop_sequences (array): Strings that stop generation when encountered
- cloud_handoff_threshold (number, default 0.0): Minimum confidence (0-1) required to stay on-device; anything below triggers cloud fallback
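In Python, the options object is just a dict serialized with json.dumps; a minimal sketch (keys you omit fall back to the documented defaults):

```python
import json

# Override only what you need; omitted keys use the documented defaults.
options = {
    "max_tokens": 256,
    "temperature": 0.7,
    "stop_sequences": ["<|im_end|>", "User:"],
}

# This JSON string is what you pass as the options argument
# to cactus_complete.
options_json = json.dumps(options)
print(options_json)
```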

Streaming Responses

Stream tokens one at a time as they are generated for a more responsive UX:

Python Example

def on_token(token, token_id):
    print(token, end="", flush=True)

options = json.dumps({"max_tokens": 256, "temperature": 0.7})
result = json.loads(cactus_complete(model, messages, options, None, on_token))
print(f"\n\nGeneration complete: {result['total_time_ms']:.2f}ms")

Swift Example

let options = #"{"max_tokens":256,"temperature":0.7}"#

let result = try cactusComplete(model, messagesJson, options, nil) { token, tokenId in
    print(token, terminator: "")
}

Kotlin Example

val options = """{"max_tokens":256,"temperature":0.7}"""

val result = cactusComplete(model, messagesJson, options, null) { token, _ ->
    print(token)
}
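The callback pattern is the same in every SDK: the runtime invokes your function once per decoded token. A Python sketch that accumulates tokens into a full transcript (the token stream here is hard-coded to stand in for what cactus_complete would deliver):

```python
tokens_seen = []

def on_token(token, token_id):
    # Called once per token during generation; print for live output
    # and keep a copy so the full text can be reused afterwards.
    tokens_seen.append(token)
    print(token, end="", flush=True)

# Simulated stream standing in for a real generation.
for i, tok in enumerate(["2", "+", "2", " equals", " 4", "."]):
    on_token(tok, i)

transcript = "".join(tokens_seen)
```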

Multi-Turn Conversations

Maintain conversation history by including previous messages:
conversation = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "What about 3+3?"}
]

messages = json.dumps(conversation)
result = json.loads(cactus_complete(model, messages, None, None, None))
conversation.append({"role": "assistant", "content": result["response"]})
The model’s KV cache is automatically managed. Use cactus_reset(model) to clear the cache and start a fresh conversation.
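A small helper can encapsulate the append-and-resend pattern above; in this sketch the Chat class and its complete_fn parameter are illustrative, not part of the SDK, and a stub stands in for the real cactus_complete call:

```python
import json

class Chat:
    """Illustrative wrapper that keeps multi-turn history in sync.

    complete_fn stands in for cactus_complete: it takes a messages
    JSON string and returns a response JSON string.
    """

    def __init__(self, complete_fn, system_prompt):
        self.complete_fn = complete_fn
        self.history = [{"role": "system", "content": system_prompt}]

    def send(self, user_text):
        self.history.append({"role": "user", "content": user_text})
        result = json.loads(self.complete_fn(json.dumps(self.history)))
        reply = result["response"]
        # Record the assistant turn so the next send() resends it.
        self.history.append({"role": "assistant", "content": reply})
        return reply

# Usage with a stub in place of the real model call:
def fake_complete(messages_json):
    return json.dumps({"success": True, "response": "2+2 equals 4."})

chat = Chat(fake_complete, "You are a helpful math tutor.")
reply = chat.send("What is 2+2?")
```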

Cloud Fallback

Automatically hand off complex queries to cloud models:
options = json.dumps({
    "max_tokens": 512,
    "cloud_handoff_threshold": 0.8  # Hand off if confidence < 0.8
})

result = json.loads(cactus_complete(model, messages, options, None, None))

if result["cloud_handoff"]:
    print("Query handled by cloud model")
else:
    print(f"On-device inference (confidence: {result['confidence']:.2f})")
Set your Cactus Cloud API key with cactus auth to enable cloud fallback.

Error Handling

try:
    result = json.loads(cactus_complete(model, messages, None, None, None))
    if not result["success"]:
        print(f"Generation failed: {result['error']}")
except RuntimeError as e:
    print(f"API error: {e}")
    error = cactus_get_last_error()
    if error:
        print(f"Details: {error}")

Next Steps

Tool Calling

Add function calling for agentic workflows

Vision Models

Use vision-language models with image inputs

API Reference

Complete API documentation

Streaming Guide

Advanced streaming patterns