Transcription - Cactus

Overview

The transcription API supports:

Batch transcription from audio files or PCM buffers
Streaming transcription for real-time audio
Language detection
Voice activity detection (VAD)
Multiple ASR models (Whisper, Moonshine, Parakeet)

cactus_transcribe

Transcribe audio to text.

int cactus_transcribe(
    cactus_model_t model,
    const char* audio_file_path,
    const char* prompt,
    char* response_buffer,
    size_t buffer_size,
    const char* options_json,
    cactus_token_callback callback,
    void* user_data,
    const uint8_t* pcm_buffer,
    size_t pcm_buffer_size
);

model (cactus_model_t, required): ASR model handle from cactus_init
audio_file_path (string): Path to a WAV file. NULL if using pcm_buffer
prompt (string, required): Initial decoder prompt (e.g., <|startoftranscript|><|en|><|transcribe|><|notimestamps|>)
response_buffer (char*, required): Buffer that receives the JSON response
buffer_size (size_t, required): Size of response_buffer in bytes
options_json (string): Optional JSON object with transcription options
callback (cactus_token_callback): Optional streaming callback for partial results
user_data (void*): Optional pointer passed to callback
pcm_buffer (uint8_t*): Raw PCM audio (16-bit mono 16 kHz). NULL if using audio_file_path
pcm_buffer_size (size_t): Size of pcm_buffer in bytes (must be even)
return (int): Number of bytes written to response_buffer on success, -1 on error
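
Applications that capture audio as floating-point samples have to convert it before filling pcm_buffer. The following is a minimal sketch of that conversion. float_to_pcm16 is a hypothetical helper, not part of the Cactus API. It assumes the samples are already mono at 16 kHz and that the buffer is expected in little-endian byte order, which the parameter description does not state.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts float samples in [-1.0, 1.0] to 16-bit
 * signed PCM bytes. Little-endian output is an assumption.
 * Returns the number of bytes written (always even), or 0 if `out` is
 * too small to hold `count` samples. */
size_t float_to_pcm16(const float* samples, size_t count,
                      uint8_t* out, size_t out_size) {
    if (count > out_size / 2) {
        return 0;
    }
    for (size_t i = 0; i < count; i++) {
        float s = samples[i];
        /* Clamp out-of-range input instead of letting it wrap around. */
        if (s > 1.0f) s = 1.0f;
        if (s < -1.0f) s = -1.0f;
        int16_t v = (int16_t)(s * 32767.0f);
        uint16_t u = (uint16_t)v;
        out[2 * i] = (uint8_t)(u & 0xFF);      /* low byte first */
        out[2 * i + 1] = (uint8_t)(u >> 8);    /* then high byte */
    }
    return count * 2;
}
```

The return value can be passed as pcm_buffer_size, with audio_file_path set to NULL.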

Options JSON

{
  "temperature": 0.0,
  "max_tokens": 448,
  "use_vad": true,
  "cloud_handoff_threshold": 0.65
}

temperature (float, default 0.0): Sampling temperature (0.0 = greedy decoding, recommended)
max_tokens (int, default 448): Maximum tokens per audio chunk
use_vad (bool, default false): Split audio using voice activity detection
cloud_handoff_threshold (float, default 0.0): Entropy threshold for cloud handoff (0.0 = disabled)
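
When the option values are only known at runtime, the options string can be assembled with snprintf rather than hard-coded. The sketch below shows one way to do that. build_transcribe_options is a hypothetical helper, not part of the Cactus API, and the two-decimal formatting of the float fields is a choice made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes an options JSON object into `out`.
 * Returns the string length on success, or -1 if `out` is too small. */
int build_transcribe_options(char* out, size_t out_size,
                             double temperature, int max_tokens,
                             bool use_vad, double cloud_handoff_threshold) {
    int n = snprintf(out, out_size,
                     "{\"temperature\":%.2f,\"max_tokens\":%d,"
                     "\"use_vad\":%s,\"cloud_handoff_threshold\":%.2f}",
                     temperature, max_tokens,
                     use_vad ? "true" : "false",
                     cloud_handoff_threshold);
    /* snprintf reports the length it needed; treat truncation as failure. */
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }
    return n;
}
```

Calling it with 0.0, 448, true, and 0.65 produces the same object as the example at the top of this section, with 0.0 rendered as 0.00.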

Response Format

{
  "success": true,
  "error": null,
  "text": "Hello, how are you today?",
  "segments": [],
  "time_to_first_token_ms": 38.5,
  "total_time_ms": 156.2,
  "prefill_tokens_per_second": 1200.0,
  "decode_tokens_per_second": 85.3,
  "prompt_tokens": 6,
  "completion_tokens": 12,
  "confidence": 0.96,
  "cloud_handoff": false,
  "ram_usage_mb": 245.1
}

text (string): Transcribed text
confidence (float): Average confidence score (1.0 - mean_entropy)
cloud_handoff (bool): Whether transcription should be retried with cloud ASR
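
The response arrives as a JSON string, so the caller has to extract the fields it needs. A JSON library is the robust choice. For illustration only, the sketch below pulls one string field such as text out of the response. json_get_string is a hypothetical helper, not part of the Cactus API. It takes the first occurrence of the quoted key, so it can be fooled if the same quoted word appears earlier inside a value, and it does not decode escapes such as \n.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Skips JSON whitespace. */
static const char* skip_ws(const char* p) {
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') p++;
    return p;
}

/* Hypothetical helper: copies the string value of `key` into `out`.
 * Returns the value length, or -1 if the key is missing, the value is
 * not a string (null, number, ...), or `out` is too small. */
int json_get_string(const char* json, const char* key,
                    char* out, size_t out_size) {
    char pattern[64];
    int n = snprintf(pattern, sizeof(pattern), "\"%s\"", key);
    if (n < 0 || (size_t)n >= sizeof(pattern) || out_size == 0) return -1;

    const char* p = strstr(json, pattern);
    if (p == NULL) return -1;
    p = skip_ws(p + n);
    if (*p != ':') return -1;
    p = skip_ws(p + 1);
    if (*p != '"') return -1;
    p++;

    size_t len = 0;
    while (*p != '\0' && *p != '"') {
        /* A backslash escape is reduced to the escaped character itself. */
        if (*p == '\\' && p[1] != '\0') p++;
        if (len + 1 >= out_size) return -1;
        out[len++] = *p++;
    }
    if (*p != '"') return -1;   /* unterminated string */
    out[len] = '\0';
    return (int)len;
}
```

With the sample response above, json_get_string(response, "text", out, sizeof(out)) fills out with the transcribed sentence, while asking for error returns -1 because its value is null rather than a string.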

Streaming Transcription

cactus_stream_transcribe_start

Start streaming session.

cactus_stream_transcribe_t cactus_stream_transcribe_start(
    cactus_model_t model,
    const char* options_json
);

cactus_stream_transcribe_process

Process audio chunk.

int cactus_stream_transcribe_process(
    cactus_stream_transcribe_t stream,
    const uint8_t* pcm_buffer,
    size_t pcm_buffer_size,
    char* response_buffer,
    size_t buffer_size
);

stream (cactus_stream_transcribe_t, required): Stream handle from cactus_stream_transcribe_start
pcm_buffer (uint8_t*, required): PCM audio chunk (16-bit mono 16 kHz)
pcm_buffer_size (size_t, required): Chunk size in bytes
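
Because the chunk format is fixed (16-bit mono at 16 kHz), the byte size of a chunk follows directly from its duration: 16000 samples per second times 2 bytes per sample. The sketch below does that arithmetic. pcm16_chunk_bytes is a hypothetical helper, not part of the Cactus API.

```c
#include <stddef.h>

#define SAMPLE_RATE_HZ 16000   /* from the stated audio format */
#define BYTES_PER_SAMPLE 2     /* 16-bit samples */

/* Hypothetical helper: number of bytes in `duration_ms` milliseconds of
 * 16-bit mono 16 kHz PCM. The result is always a whole number of samples,
 * so it is always even. */
size_t pcm16_chunk_bytes(unsigned duration_ms) {
    size_t samples = (size_t)duration_ms * SAMPLE_RATE_HZ / 1000;
    return samples * BYTES_PER_SAMPLE;
}
```

For example, a 100 ms chunk is 3200 bytes, and a 4096-byte buffer holds 128 ms of audio.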

cactus_stream_transcribe_stop

Finalize streaming session.

int cactus_stream_transcribe_stop(
    cactus_stream_transcribe_t stream,
    char* response_buffer,
    size_t buffer_size
);

Language Detection

cactus_detect_language

Detect spoken language (Whisper only).

int cactus_detect_language(
    cactus_model_t model,
    const char* audio_file_path,
    char* response_buffer,
    size_t buffer_size,
    const char* options_json,
    const uint8_t* pcm_buffer,
    size_t pcm_buffer_size
);

Response Format

{
  "success": true,
  "error": null,
  "language": "en",
  "language_token": "<|en|>",
  "token_id": 50259,
  "confidence": 0.98,
  "entropy": 0.02,
  "total_time_ms": 42.1,
  "ram_usage_mb": 210.5
}

language (string): ISO language code (e.g., en, es, zh)
language_token (string): Whisper language token
confidence (float): Detection confidence (1.0 - entropy)
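
One use for the detected language is to fill in the language token of the decoder prompt for a later cactus_transcribe call. The sketch below builds such a prompt from a language code. build_whisper_prompt is a hypothetical helper, not part of the Cactus API. The token layout is copied from the example prompt in the cactus_transcribe parameter description; that the same layout is valid for every language code is an assumption.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes a Whisper-style decoder prompt for `lang`
 * (e.g. "en") into `out`.
 * Returns 0 on success, -1 on empty input or if `out` is too small. */
int build_whisper_prompt(const char* lang, char* out, size_t out_size) {
    if (lang == NULL || strlen(lang) == 0) {
        return -1;
    }
    int n = snprintf(out, out_size,
                     "<|startoftranscript|><|%s|><|transcribe|><|notimestamps|>",
                     lang);
    /* Treat truncation as failure rather than passing on a partial prompt. */
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }
    return 0;
}
```

Passing the language value from the response above ("en") reproduces the prompt used in the batch transcription example.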

Example: Batch Transcription

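#include "cactus_ffi.h"
#include <stdbool.h>
#include <stdio.h>

int main(void) {
    // Load the ASR model. A NULL handle on failure is assumed here.
    cactus_model_t model = cactus_init("/path/to/whisper", NULL, false);
    if (!model) {
        fprintf(stderr, "Error: %s\n", cactus_get_last_error());
        return 1;
    }

    const char* prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>";
    const char* options = "{\"use_vad\":true}";

    char response[16384];
    int result = cactus_transcribe(
        model,
        "/path/to/audio.wav",   // audio_file_path
        prompt,
        response,
        sizeof(response),
        options,
        NULL, NULL,             // no streaming callback, no user_data
        NULL, 0                 // no PCM buffer: input comes from the file
    );

    if (result > 0) {
        printf("%s\n", response);
    } else {
        fprintf(stderr, "Error: %s\n", cactus_get_last_error());
    }

    cactus_destroy(model);
    return result > 0 ? 0 : 1;
}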

Example: Streaming

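#include "cactus_ffi.h"
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Placeholders for the application's audio source (not part of the Cactus API).
bool has_audio_data(void);
size_t read_audio_chunk(uint8_t* buffer, size_t capacity);

void process_audio_stream(cactus_model_t model) {
    cactus_stream_transcribe_t stream = cactus_stream_transcribe_start(
        model,
        "{}"
    );

    // Process audio chunks (16-bit mono 16 kHz PCM)
    char response[4096];
    while (has_audio_data()) {
        uint8_t chunk[4096];
        size_t chunk_size = read_audio_chunk(chunk, sizeof(chunk));

        int result = cactus_stream_transcribe_process(
            stream,
            chunk,
            chunk_size,
            response,
            sizeof(response)
        );

        if (result > 0) {
            printf("Partial: %s\n", response);
        }
    }

    // Finalize; the same return convention as the process call is assumed
    int result = cactus_stream_transcribe_stop(stream, response, sizeof(response));
    if (result > 0) {
        printf("Final: %s\n", response);
    }
}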

Example: Language Detection

cactus_model_t model = cactus_init("/path/to/whisper", NULL, false);

char response[2048];
int result = cactus_detect_language(
    model,
    "/path/to/audio.wav",
    response,
    sizeof(response),
    NULL,       // options_json
    NULL, 0     // no PCM buffer: input comes from the file
);

if (result > 0) {
    // response holds the JSON document; parse it to extract the language field
    printf("%s\n", response);
} else {
    printf("Error: %s\n", cactus_get_last_error());
}

cactus_destroy(model);

Audio Format Requirements

The models expect audio in this format:

Sample rate: 16 kHz
Channels: Mono (1 channel)
Format: 16-bit signed PCM

WAV files are resampled automatically. Raw PCM buffers must already be 16 kHz mono.
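
If the capture device delivers interleaved stereo, the samples have to be reduced to one channel before they are passed as raw PCM. The sketch below averages the left and right samples of each frame. stereo_to_mono is a hypothetical helper, not part of the Cactus API, and averaging is only one common way to downmix. It does not change the sample rate, so the input is assumed to be at 16 kHz already.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: averages interleaved stereo frames (L, R, L, R, ...)
 * into mono samples. `stereo` holds 2 * frame_count samples and `mono`
 * must have room for frame_count samples.
 * Returns the number of mono samples written. */
size_t stereo_to_mono(const int16_t* stereo, size_t frame_count,
                      int16_t* mono) {
    for (size_t i = 0; i < frame_count; i++) {
        /* Sum in 32 bits so that two large samples cannot overflow. */
        int32_t sum = (int32_t)stereo[2 * i] + (int32_t)stereo[2 * i + 1];
        mono[i] = (int16_t)(sum / 2);
    }
    return frame_count;
}
```

The resulting buffer holds frame_count samples, so its size in bytes is frame_count * 2.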

See Also

VAD API: Voice activity detection
Python SDK: Python transcription API
Transcription Guide: Speech recognition guide