
Overview

The completion API supports:
  • Multi-turn conversations with chat templates
  • Tool calling (function calling)
  • Streaming token callbacks
  • Vision-language models (images in messages)
  • Retrieval-augmented generation (RAG)
  • Cloud handoff for low-confidence responses

cactus_complete

Generates a chat completion from a JSON array of messages.
int cactus_complete(
    cactus_model_t model,
    const char* messages_json,
    char* response_buffer,
    size_t buffer_size,
    const char* options_json,
    const char* tools_json,
    cactus_token_callback callback,
    void* user_data
);
model (cactus_model_t, required): Model handle from cactus_init
messages_json (string, required): JSON array of message objects (see format below)
response_buffer (char*, required): Buffer to write the JSON response into
buffer_size (size_t, required): Size of the response buffer in bytes
options_json (string, optional): JSON object with generation parameters
tools_json (string, optional): JSON array of tool definitions
callback (cactus_token_callback, optional): Streaming callback: void callback(const char* token, uint32_t token_id, void* user_data)
user_data (void*, optional): Pointer passed through to the callback

Returns (int): Number of bytes written to response_buffer on success, -1 on error

Messages Format

[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "What is the capital of France?"
  },
  {
    "role": "assistant",
    "content": "The capital of France is Paris."
  },
  {
    "role": "user",
    "content": "What is its population?",
    "images": ["file:///path/to/map.jpg"]
  }
]
role (string, required): Message role: system, user, assistant, or tool
content (string, required): Message text content
name (string, optional): Speaker name or tool name
images (array<string>, optional): Image file paths or URLs (VLM models only)
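Since messages_json is plain JSON text, simple message arrays can be assembled with snprintf. A minimal sketch (the helper name is illustrative, not part of the API, and it assumes the content contains no characters that need JSON escaping):

```c
#include <stdio.h>

/* Build a single-turn messages array. Minimal sketch: assumes `content`
 * needs no JSON escaping (no quotes, backslashes, or control chars). */
static int build_user_message(char* buf, size_t buf_size, const char* content) {
    return snprintf(buf, buf_size,
        "[{\"role\":\"user\",\"content\":\"%s\"}]", content);
}
```

The resulting string can be passed directly as the messages_json argument of cactus_complete.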

Options JSON

{
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 40,
  "max_tokens": 2048,
  "stop": ["\n\n", "User:"],
  "include_stop_sequences": false,
  "force_tools": false,
  "tool_rag_top_k": 5,
  "confidence_threshold": 0.5,
  "cloud_handoff_threshold": 0.0
}
temperature (float, default 0.6): Sampling temperature (0.0 = greedy, higher = more random)
top_p (float, default 0.95): Nucleus sampling threshold
top_k (int, default 20): Sample from the top K tokens
max_tokens (int, default 2048): Maximum number of tokens to generate
stop (array<string>): Stop sequences that end generation
include_stop_sequences (bool, default false): Include the matched stop sequence in the output
force_tools (bool, default false): Constrain output to valid tool calls
tool_rag_top_k (int, default 5): Number of RAG documents to retrieve
confidence_threshold (float, default 0.5): Minimum confidence for accepting a response
cloud_handoff_threshold (float, default 0.0): Entropy threshold that triggers cloud handoff (0.0 = disabled)
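Options are likewise plain JSON, and only overridden fields need to appear; omitted keys keep the defaults listed above. A small sketch composing an options string (helper name illustrative, values assumed to need no escaping):

```c
#include <stdio.h>

/* Compose an options object overriding a few generation parameters.
 * Keys left out of the string fall back to the documented defaults. */
static int build_options(char* buf, size_t buf_size,
                         double temperature, int top_k) {
    return snprintf(buf, buf_size,
        "{\"temperature\":%.2f,\"top_k\":%d,\"stop\":[\"\\n\\n\"]}",
        temperature, top_k);
}
```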

Tools JSON

[
  {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "City name"
        },
        "units": {
          "type": "string",
          "enum": ["celsius", "fahrenheit"]
        }
      },
      "required": ["location"]
    }
  }
]

Response Format

Success Response

{
  "success": true,
  "error": null,
  "text": "Paris has a population of approximately 2.1 million.",
  "stop_reason": "stop",
  "tool_calls": [],
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 523.1,
  "prefill_tokens_per_second": 890.3,
  "decode_tokens_per_second": 42.7,
  "prompt_tokens": 128,
  "completion_tokens": 15,
  "confidence": 0.94,
  "cloud_handoff": false,
  "ram_usage_mb": 412.5
}
success (bool): Whether generation succeeded
error (string | null): Error message if generation failed
text (string): Generated text
stop_reason (string): Why generation stopped: stop, length, or tool_call
tool_calls (array): Parsed tool invocations (if applicable)
time_to_first_token_ms (float): Latency to the first generated token
total_time_ms (float): Total generation time
prefill_tokens_per_second (float): Prompt-processing throughput
decode_tokens_per_second (float): Token-generation throughput
prompt_tokens (int): Number of input tokens
completion_tokens (int): Number of generated tokens
confidence (float): Average confidence score (0.0-1.0)
cloud_handoff (bool): Whether the response should be retried in the cloud
ram_usage_mb (float): Current memory usage in MB

Error Response

{
  "success": false,
  "error": "Model not initialized",
  "text": "",
  "stop_reason": "error"
}
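The response lands in response_buffer as JSON text. Without committing to a JSON library, a deliberately minimal check can look for the literal "success":true marker; real applications should parse the response with a proper JSON parser (the helper name here is illustrative):

```c
#include <string.h>
#include <stdbool.h>

/* Minimal sketch: detect success without a JSON parser by searching for
 * the "success":true marker. Fragile by design; prefer a real parser
 * (e.g. cJSON) in production code. */
static bool completion_succeeded(const char* response_json) {
    return strstr(response_json, "\"success\":true") != NULL;
}
```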

Example: Basic Completion

#include "cactus_ffi.h"
#include <stdio.h>

int main() {
    cactus_model_t model = cactus_init("/path/to/model", NULL, false);

    const char* messages = "["
        "{\"role\":\"user\",\"content\":\"Hello!\"}"
    "]";

    char response[8192];
    int result = cactus_complete(
        model,
        messages,
        response,
        sizeof(response),
        NULL,  // default options
        NULL,  // no tools
        NULL,  // no streaming
        NULL   // no user_data
    );

    if (result > 0) {
        printf("%s\n", response);
    } else {
        fprintf(stderr, "cactus_complete failed\n");  // returns -1 on error
    }

    cactus_destroy(model);
    return 0;
}

Example: Streaming

#include "cactus_ffi.h"
#include <stdio.h>
#include <stdint.h>

void token_handler(const char* token, uint32_t token_id, void* user_data) {
    (void)token_id;   // unused here
    (void)user_data;  // unused here
    printf("%s", token);
    fflush(stdout);
}

int main() {
    cactus_model_t model = cactus_init("/path/to/model", NULL, false);

    const char* messages = "[{\"role\":\"user\",\"content\":\"Tell a story.\"}]";
    const char* options = "{\"temperature\":0.8,\"max_tokens\":500}";

    char response[8192];
    cactus_complete(
        model,
        messages,
        response,
        sizeof(response),
        options,
        NULL,           // no tools
        token_handler,  // called once per generated token
        NULL            // no user_data
    );

    cactus_destroy(model);
    return 0;
}
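The user_data pointer lets the callback carry state without globals. A sketch that accumulates streamed tokens into a caller-owned buffer (the struct and names are illustrative, not part of the API):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: collect streamed tokens into a caller-owned buffer via user_data. */
typedef struct {
    char*  buf;
    size_t len;
    size_t cap;
} token_sink;

static void collect_token(const char* token, uint32_t token_id, void* user_data) {
    (void)token_id;  /* id not needed when only the text is wanted */
    token_sink* sink = (token_sink*)user_data;
    size_t n = strlen(token);
    if (sink->len + n < sink->cap) {
        memcpy(sink->buf + sink->len, token, n);
        sink->len += n;
        sink->buf[sink->len] = '\0';
    }
}
```

Pass collect_token and a pointer to a token_sink as the last two arguments of cactus_complete.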

Example: Tool Calling

const char* tools = "["
    "{"
        "\"name\":\"get_weather\","
        "\"description\":\"Get weather for a city\","
        "\"parameters\":{"
            "\"type\":\"object\","
            "\"properties\":{"
                "\"location\":{\"type\":\"string\"}"
            "},"
            "\"required\":[\"location\"]"
        "}"
    "}"
"]";

const char* messages = "[{\"role\":\"user\",\"content\":\"What's the weather in Tokyo?\"}]";
const char* options = "{\"force_tools\":true}";

char response[8192];
cactus_complete(model, messages, response, sizeof(response), options, tools, NULL, NULL);

// Response will contain tool_calls array
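The exact schema of tool_calls entries is not shown on this page; assuming each entry carries a "name" field, e.g. "tool_calls":[{"name":"get_weather", ...}], a minimal extraction sketch without a JSON parser could look like this (helper name illustrative; prefer a real JSON parser in production):

```c
#include <string.h>

/* Sketch: copy the first tool name out of the response text into `out`.
 * Returns the name length, or -1 if no tool call is found. Assumes the
 * tool_calls entries start with a "name" field (schema assumed here). */
static int first_tool_name(const char* response_json, char* out, size_t out_size) {
    const char* marker = "\"tool_calls\":[{\"name\":\"";
    const char* p = strstr(response_json, marker);
    if (!p) return -1;
    p += strlen(marker);
    size_t i = 0;
    while (p[i] && p[i] != '"' && i + 1 < out_size) {
        out[i] = p[i];
        i++;
    }
    out[i] = '\0';
    return (int)i;
}
```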

See Also

C FFI: Complete FFI reference
Python SDK: Python completion API
Chat Guide: Building chat applications
Tool Calling: Function calling guide