Overview

Streaming enables token-by-token output as the model generates text, providing a better user experience for interactive applications.

Basic Streaming

Python

from cactus import cactus_init, cactus_complete, cactus_destroy
import json

model = cactus_init("weights/lfm2-1.2b", None, False)

def on_token(token, token_id):
    print(token, end="", flush=True)

messages = json.dumps([{"role": "user", "content": "Tell me a story"}])
result = json.loads(cactus_complete(model, messages, None, None, on_token))

print(f"\n\nGenerated {result['decode_tokens']} tokens in {result['total_time_ms']:.2f}ms")
cactus_destroy(model)
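If you need the full response after streaming, for example to append it to a conversation history, you can accumulate tokens inside the callback. A minimal sketch in plain Python; `make_collector` is a hypothetical helper, not part of the cactus API:

```python
def make_collector():
    """Return a streaming callback plus the list it accumulates tokens into."""
    tokens = []

    def on_token(token, token_id):
        tokens.append(token)              # keep the token for later
        print(token, end="", flush=True)  # still stream to the terminal

    return on_token, tokens

# Pass the callback to cactus_complete(...), then join the pieces:
# on_token, tokens = make_collector()
# cactus_complete(model, messages, None, None, on_token)
# full_text = "".join(tokens)
```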

Swift

let result = try cactusComplete(model, messagesJson, nil, nil) { token, tokenId in
    print(token, terminator: "")
}

Kotlin

val result = cactusComplete(model, messagesJson, null, null) { token, _ ->
    print(token)
}

Streaming Transcription

Stream audio transcription results in real-time:

from cactus import (
    cactus_stream_transcribe_start,
    cactus_stream_transcribe_process,
    cactus_stream_transcribe_stop
)
import json

stream = cactus_stream_transcribe_start(model, None)

for audio_chunk in microphone_stream():
    partial = json.loads(cactus_stream_transcribe_process(stream, audio_chunk))
    print(f"\r{partial['text']}", end="")

final = json.loads(cactus_stream_transcribe_stop(stream))
print(f"\nFinal: {final['text']}")
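`microphone_stream()` above stands in for any source of audio chunks. For testing without a microphone you can slice a raw PCM buffer into fixed-size chunks; a sketch in plain Python (the chunk size and file name are illustrative assumptions, not cactus requirements):

```python
def pcm_chunks(data, chunk_bytes=3200):
    """Yield fixed-size chunks of raw PCM audio.

    3200 bytes is roughly 100 ms of 16 kHz, 16-bit mono audio.
    """
    for start in range(0, len(data), chunk_bytes):
        yield data[start:start + chunk_bytes]

# Hypothetical usage, feeding a prerecorded file through the stream:
# with open("speech.pcm", "rb") as f:
#     for chunk in pcm_chunks(f.read()):
#         cactus_stream_transcribe_process(stream, chunk)
```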

Buffering Strategies

The simplest strategy is to flush every token to the terminal as soon as it arrives:

def on_token(token, token_id):
    print(token, end="", flush=True)
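When per-token output is too chatty, for instance when forwarding tokens over a network, you can buffer and flush only at word boundaries. A sketch in plain Python, independent of the cactus API (`WordBuffer` is a hypothetical helper):

```python
class WordBuffer:
    """Buffer streamed tokens and emit whole words at whitespace boundaries."""

    def __init__(self, emit=print):
        self.emit = emit
        self.pending = ""

    def on_token(self, token, token_id=None):
        self.pending += token
        # Flush everything up to and including the last whitespace character.
        cut = max(self.pending.rfind(" "), self.pending.rfind("\n"))
        if cut >= 0:
            self.emit(self.pending[:cut + 1])
            self.pending = self.pending[cut + 1:]

    def close(self):
        """Emit any trailing partial word once generation has finished."""
        if self.pending:
            self.emit(self.pending)
            self.pending = ""
```

Pass `buf.on_token` as the streaming callback and call `buf.close()` after `cactus_complete` returns, so the final word is not lost.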

Error Handling

Check the result's success flag for generation failures, and catch runtime errors raised during streaming:

try:
    result = json.loads(cactus_complete(model, messages, None, None, on_token))
    if not result["success"]:
        print(f"\nGeneration failed: {result['error']}")
except RuntimeError as e:
    print(f"\nStream error: {e}")
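Transient failures can be retried with a thin wrapper around the completion call. A generic sketch; the retry logic is ordinary Python and not part of the cactus API, and the attempt count and delay are example values:

```python
import time

def with_retries(fn, attempts=3, delay_s=1.0):
    """Call fn(); on RuntimeError, retry up to `attempts` times with a fixed delay."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RuntimeError as e:
            if attempt == attempts:
                raise  # out of attempts; surface the last error
            print(f"\nAttempt {attempt} failed ({e}); retrying...")
            time.sleep(delay_s)

# Hypothetical usage:
# result = with_retries(lambda: json.loads(
#     cactus_complete(model, messages, None, None, on_token)))
```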

Stop Generation

from cactus import cactus_stop

# In another thread
cactus_stop(model)  # Aborts ongoing generation
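A common pattern is to enforce a generation timeout by scheduling the stop call on a timer before the blocking completion starts. A sketch using only the standard library; `cactus_stop` comes from the source above, while `stop_after` and the 30-second deadline are illustrative assumptions:

```python
import threading

def stop_after(seconds, stop_fn):
    """Schedule stop_fn() to fire after `seconds`; cancel the timer if done early."""
    timer = threading.Timer(seconds, stop_fn)
    timer.start()
    return timer

# Hypothetical usage:
# timer = stop_after(30.0, lambda: cactus_stop(model))
# result = cactus_complete(model, messages, None, None, on_token)
# timer.cancel()  # generation finished before the deadline
```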

Next Steps

Chat Completion

Build conversational AI

Transcription

Real-time speech-to-text