Overview
The transcription API supports:
- Batch transcription from audio files or PCM buffers
- Streaming transcription for real-time audio
- Language detection
- Voice activity detection (VAD)
- Multiple ASR models (Whisper, Moonshine, Parakeet)
cactus_transcribe
Transcribe audio to text.
int cactus_transcribe(
    cactus_model_t model,
    const char* audio_file_path,
    const char* prompt,
    char* response_buffer,
    size_t buffer_size,
    const char* options_json,
    cactus_token_callback callback,
    void* user_data,
    const uint8_t* pcm_buffer,
    size_t pcm_buffer_size
);
- model: ASR model handle from cactus_init
- audio_file_path: Path to a WAV file. NULL if using pcm_buffer
- prompt: Initial decoder prompt (e.g., <|startoftranscript|><|en|><|transcribe|><|notimestamps|>)
- response_buffer: Buffer that receives the JSON response
- buffer_size: Size of response_buffer in bytes
- options_json: Optional JSON object with transcription options
- callback: Optional streaming callback for partial results
- user_data: Optional pointer passed to callback
- pcm_buffer: Raw PCM audio (16-bit mono 16kHz). NULL if using audio_file_path
- pcm_buffer_size: Size of pcm_buffer in bytes (must be even)
Returns: Number of bytes written to response_buffer on success, -1 on error
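Applications that capture audio as floating-point samples need to convert it before passing it as pcm_buffer. The following is a minimal sketch of such a conversion, not part of the Cactus API: float_to_pcm16 is a hypothetical helper, and little-endian byte order is an assumption (it matches WAV, but the byte order is not stated here).

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: convert float samples in [-1.0, 1.0] to 16-bit signed
// PCM bytes. Little-endian output is an assumption.
// `out` must hold at least 2 * n bytes. Returns the number of bytes written,
// which is always even.
size_t float_to_pcm16(const float* samples, size_t n, uint8_t* out) {
    for (size_t i = 0; i < n; i++) {
        float s = samples[i];
        if (s != s) s = 0.0f;        // NaN becomes silence
        if (s > 1.0f) s = 1.0f;      // clamp to the valid range
        if (s < -1.0f) s = -1.0f;
        // Scale and round to the nearest integer
        int16_t v = (int16_t)(s * 32767.0f + (s >= 0.0f ? 0.5f : -0.5f));
        uint16_t u = (uint16_t)v;
        out[2 * i] = (uint8_t)(u & 0xFF);   // low byte first
        out[2 * i + 1] = (uint8_t)(u >> 8); // high byte second
    }
    return 2 * n;
}
```

The returned byte count can be passed as pcm_buffer_size, with audio_file_path set to NULL.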
Options JSON
{
  "temperature": 0.0,
  "max_tokens": 448,
  "use_vad": true,
  "cloud_handoff_threshold": 0.65
}
- temperature: Sampling temperature (0.0 = greedy decoding, recommended)
- max_tokens: Maximum tokens per audio chunk
- use_vad: Split audio using voice activity detection
- cloud_handoff_threshold: Entropy threshold for cloud handoff (0.0 = disabled)
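When option values are only known at runtime, the options string can be assembled with snprintf instead of being hard-coded. This is a sketch under stated assumptions: build_transcribe_options is a hypothetical helper, and it relies on the default "C" locale so that the decimal separator is a period.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical helper: write the four documented options as a JSON object.
// Returns the string length on success, or -1 if `out` is too small.
// Assumes the "C" locale so that %.2f prints a '.' decimal separator.
int build_transcribe_options(char* out, size_t out_size,
                             double temperature, int max_tokens,
                             bool use_vad, double cloud_handoff_threshold) {
    int n = snprintf(out, out_size,
                     "{\"temperature\":%.2f,\"max_tokens\":%d,"
                     "\"use_vad\":%s,\"cloud_handoff_threshold\":%.2f}",
                     temperature, max_tokens,
                     use_vad ? "true" : "false",
                     cloud_handoff_threshold);
    if (n < 0 || (size_t)n >= out_size) {
        return -1; // encoding error or truncated output
    }
    return n;
}
```

The resulting string can be passed directly as options_json.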
Response JSON
{
  "success": true,
  "error": null,
  "text": "Hello, how are you today?",
  "segments": [],
  "time_to_first_token_ms": 38.5,
  "total_time_ms": 156.2,
  "prefill_tokens_per_second": 1200.0,
  "decode_tokens_per_second": 85.3,
  "prompt_tokens": 6,
  "completion_tokens": 12,
  "confidence": 0.96,
  "cloud_handoff": false,
  "ram_usage_mb": 245.1
}
- confidence: Average confidence score (1.0 - mean_entropy)
- cloud_handoff: Whether transcription should be retried with cloud ASR
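To make the relationship between entropy, confidence, and cloud handoff concrete, here is an illustrative sketch. It is not the library's implementation: decide_handoff and handoff_decision_t are hypothetical names, the per-token entropies are assumed to be normalized to [0, 1], and the rule "hand off when mean entropy exceeds the threshold" is an assumption inferred from the option and field descriptions.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical result type for this sketch.
typedef struct {
    double confidence;   // 1.0 - mean entropy, as described for the response
    bool cloud_handoff;  // true if a cloud retry would be suggested
} handoff_decision_t;

// Assumed rule: hand off when the mean entropy exceeds the threshold.
// A threshold of 0.0 disables handoff, matching the documented option.
handoff_decision_t decide_handoff(const double* token_entropy, size_t n,
                                  double threshold) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += token_entropy[i];
    }
    double mean = (n > 0) ? sum / (double)n : 0.0;

    handoff_decision_t decision;
    decision.confidence = 1.0 - mean;
    decision.cloud_handoff = (threshold > 0.0) && (mean > threshold);
    return decision;
}
```

Under these assumptions, a mean entropy of 0.04 gives a confidence of 0.96 and no handoff at a threshold of 0.65, which is consistent with the sample response.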
Streaming Transcription
cactus_stream_transcribe_start
Start streaming session.
cactus_stream_transcribe_t cactus_stream_transcribe_start(
    cactus_model_t model,
    const char* options_json
);
cactus_stream_transcribe_process
Process audio chunk.
int cactus_stream_transcribe_process(
    cactus_stream_transcribe_t stream,
    const uint8_t* pcm_buffer,
    size_t pcm_buffer_size,
    char* response_buffer,
    size_t buffer_size
);
- stream (cactus_stream_transcribe_t, required): Stream handle from cactus_stream_transcribe_start
- pcm_buffer: PCM audio chunk (16-bit mono 16kHz)
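Because each sample is two bytes, chunks passed to the streaming call should contain a whole number of samples. The sketch below shows one way to size chunks by duration and walk a larger buffer; pcm_bytes_for_ms and next_chunk_size are hypothetical helpers, and the chunk duration is left to the application.

```c
#include <stddef.h>

enum { SAMPLE_RATE_HZ = 16000, BYTES_PER_SAMPLE = 2 };

// Bytes of 16-bit mono 16 kHz PCM that cover `ms` milliseconds.
// 16 samples per millisecond, 2 bytes each, so the result is always even.
size_t pcm_bytes_for_ms(size_t ms) {
    return ms * SAMPLE_RATE_HZ / 1000 * BYTES_PER_SAMPLE;
}

// Size of the next chunk starting at byte `offset` in a buffer of `total`
// bytes, or 0 when the buffer is exhausted. A trailing odd byte (half a
// sample) is ignored. `offset` and `chunk_bytes` are expected to be even.
size_t next_chunk_size(size_t total, size_t offset, size_t chunk_bytes) {
    total &= ~(size_t)1; // drop a trailing half sample
    if (offset >= total) {
        return 0;
    }
    size_t remaining = total - offset;
    return remaining < chunk_bytes ? remaining : chunk_bytes;
}
```

A caller would advance offset by the returned size after each call to cactus_stream_transcribe_process and stop once the size is 0.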
cactus_stream_transcribe_stop
Finalize streaming session.
int cactus_stream_transcribe_stop(
    cactus_stream_transcribe_t stream,
    char* response_buffer,
    size_t buffer_size
);
Language Detection
cactus_detect_language
Detect spoken language (Whisper only).
int cactus_detect_language(
    cactus_model_t model,
    const char* audio_file_path,
    char* response_buffer,
    size_t buffer_size,
    const char* options_json,
    const uint8_t* pcm_buffer,
    size_t pcm_buffer_size
);
{
  "success": true,
  "error": null,
  "language": "en",
  "language_token": "<|en|>",
  "token_id": 50259,
  "confidence": 0.98,
  "entropy": 0.02,
  "total_time_ms": 42.1,
  "ram_usage_mb": 210.5
}
- language: ISO language code (e.g., en, es, zh)
- confidence: Detection confidence (1.0 - entropy)
Example: Batch Transcription
#include "cactus_ffi.h"
#include <stdio.h>

int main(void) {
    cactus_model_t model = cactus_init("/path/to/whisper", NULL, false);

    const char* prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>";
    const char* options = "{\"use_vad\":true}";
    char response[16384];

    // Transcribe from a WAV file: no callback, no PCM buffer
    int result = cactus_transcribe(
        model,
        "/path/to/audio.wav",
        prompt,
        response,
        sizeof(response),
        options,
        NULL, NULL,
        NULL, 0
    );

    if (result > 0) {
        printf("%s\n", response);
    } else {
        printf("Error: %s\n", cactus_get_last_error());
    }

    cactus_destroy(model);
    return result > 0 ? 0 : 1;
}
Example: Streaming
#include "cactus_ffi.h"
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Application-provided audio source (not part of the Cactus API)
int has_audio_data(void);
size_t read_audio_chunk(uint8_t* buffer, size_t capacity);

void process_audio_stream(cactus_model_t model) {
    cactus_stream_transcribe_t stream = cactus_stream_transcribe_start(
        model,
        "{}"
    );

    // Process audio chunks
    char response[4096];
    while (has_audio_data()) {
        uint8_t chunk[4096];
        size_t chunk_size = read_audio_chunk(chunk, sizeof(chunk));

        int result = cactus_stream_transcribe_process(
            stream,
            chunk,
            chunk_size,
            response,
            sizeof(response)
        );
        if (result > 0) {
            printf("Partial: %s\n", response);
        }
    }

    // Finalize (assumes stop uses the same return convention as process)
    if (cactus_stream_transcribe_stop(stream, response, sizeof(response)) > 0) {
        printf("Final: %s\n", response);
    }
}
Example: Language Detection
cactus_model_t model = cactus_init("/path/to/whisper", NULL, false);

char response[2048];
int result = cactus_detect_language(
    model,
    "/path/to/audio.wav",
    response,
    sizeof(response),
    NULL,
    NULL, 0
);

if (result > 0) {
    // Parse JSON to extract language field
    printf("%s\n", response);
}

cactus_destroy(model);
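The example above leaves JSON parsing to the caller. For a flat response like the one shown for cactus_detect_language, a string field such as language can be pulled out with a small scan. This is a sketch only: json_get_string is a hypothetical helper, it does not decode escape sequences, and a real application would normally use a proper JSON library.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copy the string value of `key` from a flat JSON
// object into `out`. Returns 0 on success, -1 if the key is missing, the
// value is not a string, or `out` is too small. Escape sequences are not
// decoded, which is acceptable for short values such as "en".
int json_get_string(const char* json, const char* key,
                    char* out, size_t out_size) {
    size_t key_len = strlen(key);
    const char* p = json;
    while ((p = strchr(p, '"')) != NULL) {
        p++; // step past the quote
        if (strncmp(p, key, key_len) != 0 || p[key_len] != '"') {
            continue; // not the key we want
        }
        const char* v = p + key_len + 1;
        while (*v == ' ' || *v == '\t' || *v == '\n' || *v == '\r') v++;
        if (*v != ':') {
            continue; // matched a value, not a key
        }
        v++;
        while (*v == ' ' || *v == '\t' || *v == '\n' || *v == '\r') v++;
        if (*v != '"') {
            return -1; // value is not a string
        }
        v++;
        const char* end = strchr(v, '"');
        if (end == NULL) {
            return -1; // unterminated string
        }
        size_t len = (size_t)(end - v);
        if (len >= out_size) {
            return -1; // does not fit with the terminator
        }
        memcpy(out, v, len);
        out[len] = '\0';
        return 0;
    }
    return -1;
}
```

With the response buffer from the example, json_get_string(response, "language", lang, sizeof(lang)) would fill lang with the detected language code.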
All audio input must meet these requirements:
- Sample rate: 16 kHz
- Channels: Mono (1 channel)
- Format: 16-bit signed PCM
WAV files are resampled automatically. Raw PCM buffers are passed through as-is and must already be 16 kHz mono.
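Since raw PCM buffers must already be mono, stereo capture has to be downmixed by the application first. A minimal sketch that averages the left and right channels follows; downmix_stereo_to_mono is a hypothetical helper, and it does not change the sample rate, which must still be 16 kHz.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: downmix interleaved stereo samples (L, R, L, R, ...)
// to mono by averaging each pair. `mono` must hold `frames` samples.
// Returns the number of mono samples written.
size_t downmix_stereo_to_mono(const int16_t* stereo, size_t frames,
                              int16_t* mono) {
    for (size_t i = 0; i < frames; i++) {
        // Sum in 32 bits so that two large samples cannot overflow
        int32_t sum = (int32_t)stereo[2 * i] + (int32_t)stereo[2 * i + 1];
        mono[i] = (int16_t)(sum / 2);
    }
    return frames;
}
```

The mono samples occupy 2 * frames bytes, which is the value to pass as pcm_buffer_size.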
See Also
- VAD API: Voice activity detection
- Python SDK: Python transcription API
- Transcription Guide: Speech recognition guide