
Performance Tuning

Cactus provides several configuration options to optimize inference performance for your specific use case. This guide covers KV cache management, chunked prefill, TPS throttling, memory optimization, and benchmarking.

KV Cache Configuration

The Key-Value (KV) cache stores attention states from previous tokens, enabling fast autoregressive generation. Proper cache configuration balances memory usage and context length.

Sliding Window Cache

By default, Cactus uses a sliding window cache that keeps recent tokens + a small “sink” of early tokens.
#include "cactus/engine/engine.h"

cactus::Engine* engine = /* ... load model ... */;

// Set sliding window: 2048 recent tokens + 4 sink tokens
engine->set_cache_window(2048, 4);
Parameters:
  • window_size — Maximum recent tokens to cache (e.g., 1024, 2048, 4096)
  • sink_size — Number of initial tokens to always keep (default: 4)
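The two parameters interact like this: once the sequence outgrows the window, only the first sink_size positions and the most recent window_size positions remain cached. A minimal sketch of the retained set (illustrative only, not the actual Cactus eviction code):

```cpp
#include <cstddef>
#include <vector>

// Which token positions a sliding-window-with-sink cache keeps after seeing
// total_tokens tokens. Illustrative sketch of the policy, not Cactus internals.
std::vector<std::size_t> cached_positions(std::size_t total_tokens,
                                          std::size_t window_size,
                                          std::size_t sink_size) {
    std::vector<std::size_t> kept;
    for (std::size_t p = 0; p < total_tokens; ++p) {
        const bool in_sink = p < sink_size;
        const bool in_window =
            total_tokens <= window_size || p >= total_tokens - window_size;
        if (in_sink || in_window) kept.push_back(p);
    }
    return kept;
}
```

While the sequence still fits inside the window, everything is cached; eviction only starts in the middle of the sequence once it overflows.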
Trade-offs:
| Window Size | Memory Usage | Context Length | Use Case |
| --- | --- | --- | --- |
| 512 | Low (~50MB) | Short | Chatbots, quick Q&A |
| 1024 | Medium (~100MB) | Medium | Most applications |
| 2048 | High (~200MB) | Long | Document analysis |
| 4096 | Very High (~400MB) | Very Long | RAG, long-form generation |
Memory scales with model size: The above estimates are for ~1B param models. Larger models use proportionally more cache memory.
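As a back-of-envelope check, cache memory is (window + sink) times the per-token KV cost. The layer/head/dimension defaults below are illustrative assumptions for a small model, not Cactus internals; substitute your model's real config:

```cpp
#include <cstddef>

// Rough KV cache footprint: every cached token stores one key and one value
// vector per transformer layer. Default dimensions are illustrative for a
// ~1B-parameter model; bytes_per_elem = 1 assumes an INT8 cache.
std::size_t kv_cache_bytes(std::size_t window_size, std::size_t sink_size,
                           std::size_t n_layers = 24, std::size_t n_kv_heads = 8,
                           std::size_t head_dim = 64,
                           std::size_t bytes_per_elem = 1) {
    const std::size_t per_token =
        2 /* K and V */ * n_layers * n_kv_heads * head_dim * bytes_per_elem;
    return (window_size + sink_size) * per_token;
}
```

Halving the window roughly halves cache memory, which is why shrinking the window is the first lever to pull under memory pressure.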

Resetting Cache

// Clear KV cache (e.g., between conversations)
engine->reset_cache();
Call reset_cache() when:
  • Starting a new conversation
  • Switching contexts
  • Memory pressure is high

Cache Quantization

Cactus automatically quantizes the KV cache to INT8, halving cache memory with minimal quality loss. This has been automatic since v1.7 (Oct 2025); no configuration is needed.

Chunked Prefill

Chunked prefill processes long prompts in chunks rather than all at once, reducing memory spikes and improving time-to-first-token on long contexts.

Configuring Chunk Size

// Process prefill in chunks of 256 tokens
std::vector<uint32_t> tokens = tokenizer->encode("Long prompt...");
engine->prefill(tokens, 256);  // chunk_size = 256
Default: 256 tokens (optimal for most models).

Chunk Size Guidelines:

| Chunk Size | Memory | Latency | Throughput | Best For |
| --- | --- | --- | --- | --- |
| 128 | Lowest | Higher | Lower | Budget devices, limited RAM |
| 256 | Low | Medium | High | Default — most cases |
| 512 | Medium | Low | Higher | High-end devices, speed priority |
| 1024+ | High | Lowest | Highest | Desktop/Mac only |
NPU prefill uses fixed chunk size: If NPU acceleration is enabled, the chunk size is determined by the NPU model architecture (typically 256) and cannot be changed at runtime. The chunk_size parameter is ignored.
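Conceptually, chunked prefill is just a loop over fixed-size slices of the prompt. A sketch of the idea, with process_chunk standing in for the model's forward pass (this is an illustration, not the Cactus implementation):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Feed the prompt to the model in bounded slices so peak activation memory
// stays flat regardless of prompt length. Returns the number of chunks run.
std::size_t chunked_prefill(
    const std::vector<uint32_t>& tokens, std::size_t chunk_size,
    const std::function<void(const uint32_t*, std::size_t)>& process_chunk) {
    std::size_t n_chunks = 0;
    for (std::size_t pos = 0; pos < tokens.size(); pos += chunk_size) {
        const std::size_t len = std::min(chunk_size, tokens.size() - pos);
        process_chunk(tokens.data() + pos, len);  // one forward pass per slice
        ++n_chunks;
    }
    return n_chunks;
}
```

The last chunk is simply shorter, so no padding of the prompt is required.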

Prefill Performance

From the Cactus README benchmarks: LFM 1.2B, 1k-token prefill / 100-token decode:

| Device | Prefill TPS | Decode TPS |
| --- | --- | --- |
| Mac M4 Pro | 582 tok/s | 100 tok/s |
| iPad/Mac M3 | 350 tok/s | 60 tok/s |
| iPhone 17 Pro | 327 tok/s | 48 tok/s |
| iPhone 13 Mini | 148 tok/s | 34 tok/s |
| Galaxy S25 Ultra | 255 tok/s | 37 tok/s |
| Pixel 6a | 70 tok/s | 15 tok/s |
| Raspberry Pi 5 | 69 tok/s | 11 tok/s |
Tips:
  • Prefill is bottleneck for long prompts — Use chunked prefill
  • NPU acceleration — 5-11x faster prefill on Apple devices (v1.15+)
  • Cactus Attention — Makes long prefill as fast as short (v1.9+)

TPS Throttling

Limit maximum tokens-per-second to reduce power consumption and thermal throttling.

Setting Max TPS

// Limit to 30 tokens/second
const char* options = R"({
    "max_tps": 30.0
})";

cactus_complete(model, messages, response, sizeof(response), 
                options, nullptr, nullptr, nullptr);
Use cases:
  • Streaming UI — Match typing speed (~20-30 TPS feels natural)
  • Power saving — Reduce battery drain on mobile
  • Thermal management — Prevent device overheating during long sessions
Trade-offs:
| Max TPS | Power Usage | User Experience |
| --- | --- | --- |
| Unlimited | Highest | Fastest response |
| 50 | High | Still feels instant |
| 30 | Medium | Smooth typing speed |
| 20 | Low | Comfortable reading pace |
| 10 | Lowest | Noticeably slow |
Set "max_tps": -1.0 (the default) for unlimited speed; the engine then generates as fast as the hardware allows.
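Under the hood, a TPS cap amounts to pacing token emission on a fixed schedule. A sketch of the idea (not the actual Cactus throttle):

```cpp
#include <chrono>
#include <thread>

// Target gap between tokens for a given cap; <= 0 means unlimited (no pacing).
double token_interval_s(double max_tps) {
    return max_tps > 0.0 ? 1.0 / max_tps : 0.0;
}

// Emit n_tokens on a fixed-rate schedule: sleep until each token's time slot.
// decode_one_token() is a placeholder for the real decode step.
void paced_decode(double max_tps, int n_tokens) {
    using clock = std::chrono::steady_clock;
    auto slot = clock::now();
    const auto gap = std::chrono::duration<double>(token_interval_s(max_tps));
    for (int i = 0; i < n_tokens; ++i) {
        // decode_one_token();
        slot += std::chrono::duration_cast<clock::duration>(gap);
        std::this_thread::sleep_until(slot);
    }
}
```

Sleeping until an absolute slot (rather than a fixed sleep after each token) keeps the average rate at the cap even when individual decode steps vary in cost.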

Memory Optimization

Reducing Memory Footprint

  1. Use INT4 quantization
    cactus convert Qwen/Qwen3-0.6B ./model --precision INT4
    
    • 50% smaller weights vs INT8
    • Minimal quality loss
  2. Reduce KV cache window
    engine->set_cache_window(512, 4);  // 512 tokens instead of 2048
    
    • 4x less cache memory
    • Shorter effective context
  3. Use smaller models
    • Gemma3-270m: ~120MB RAM
    • Qwen3-0.6B: ~200MB RAM
    • LFM2.5-1.2B: ~400MB RAM
  4. Free memory between sessions
    engine->reset_cache();  // Clear KV cache
    // Or destroy and recreate model
    cactus_destroy(model);
    model = cactus_init(model_path, nullptr, false);
    

Memory Benchmarks

From the Cactus README benchmarks, INT4 models:

| Model | iPhone 17 Pro | Galaxy S25 Ultra | Raspberry Pi 5 |
| --- | --- | --- | --- |
| LFM 1.2B | 108MB | 1.5GB | 869MB |
| LFMVL 1.6B | 108MB | 1.5GB | 869MB |
| Parakeet 1.1B | 108MB | 1.5GB | 869MB |
Android uses more RAM: for the same model, Android’s memory overhead is higher than iOS’s due to JNI/JVM overhead and different memory allocators. Prefer smaller models or INT4 quantization on Android.

Monitoring Memory Usage

// Memory usage included in completion response
const char* response = cactus_complete(/* ... */);

// Parse JSON response
// {
//   "ram_usage_mb": 245.67,
//   "prefill_tokens": 28,
//   "decode_tokens": 50,
//   ...
// }
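A quick way to pull ram_usage_mb (or any numeric field) out of that flat response without a JSON library; for anything nested, use a real JSON parser:

```cpp
#include <cstdlib>
#include <string>

// Minimal numeric-field extraction for a flat, well-formed JSON object like
// the completion response above. Returns -1.0 when the key is absent.
double json_number_field(const std::string& json, const std::string& key) {
    const std::string needle = "\"" + key + "\":";
    const std::size_t pos = json.find(needle);
    if (pos == std::string::npos) return -1.0;
    // strtod skips leading whitespace and stops at the trailing comma/brace.
    return std::strtod(json.c_str() + pos + needle.size(), nullptr);
}
```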

Benchmarking

Use the --benchmark flag to measure performance on your specific hardware.

Command-Line Benchmarks

# Benchmark with default model
cactus test --benchmark

# Benchmark specific model
cactus test --model ./my-model --benchmark

# Benchmark on connected iPhone
cactus test --model ./my-model --benchmark --ios

# Benchmark on connected Android
cactus test --model ./my-model --benchmark --android

Metrics Reported

{
  "success": true,
  "response": "Generated text...",
  "confidence": 0.8193,
  "time_to_first_token_ms": 45.23,
  "total_time_ms": 163.67,
  "prefill_tps": 1621.89,
  "decode_tps": 168.42,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 50,
  "total_tokens": 78
}
Key metrics:
  • time_to_first_token_ms — How long until first token appears (latency)
  • prefill_tps — Prompt processing speed (tokens/sec)
  • decode_tps — Generation speed (tokens/sec)
  • ram_usage_mb — Peak memory usage

Performance Profiling

// Profile prefill with trace file
std::vector<uint32_t> tokens = tokenizer->encode("Prompt");
engine->prefill(tokens, 256, "prefill_trace.json");

// Analyze trace file to identify bottlenecks
The trace file contains:
  • Per-layer timings
  • Memory allocations
  • NPU vs CPU split
  • Cache hit rates

Optimization Recipes

For Speed (High-End Devices)

// Large cache window for long context
engine->set_cache_window(4096, 4);

// Large prefill chunks
engine->prefill(tokens, 512);

// No TPS limit
const char* options = R"({"max_tps": -1.0})";

// Use INT8 for quality
// cactus convert <model> --precision INT8

For Memory (Budget Devices)

// Small cache window
engine->set_cache_window(512, 4);

// Small prefill chunks
engine->prefill(tokens, 128);

// Use INT4 quantization
// cactus convert <model> --precision INT4

// Consider smaller base model (Gemma3-270m, Qwen3-0.6B)

For Battery Life (Mobile)

// Moderate cache
engine->set_cache_window(1024, 4);

// Throttle TPS to reduce power
const char* options = R"({"max_tps": 30.0})";

// NPU acceleration (iOS/macOS)
engine->load_npu_prefill("./model/npu_prefill.mlmodelc");

// INT4 quantization for less memory bandwidth

For Quality (Critical Applications)

// Large cache for full context
engine->set_cache_window(4096, 4);

// INT8 or FP16 quantization
// cactus convert <model> --precision INT8

// No TPS throttling
const char* options = R"({"max_tps": -1.0})";

// Larger base model (LFM2.5-1.2B, Qwen3-1.7B)

Advanced Techniques

Cactus Attention (v1.9+)

Automatic optimization that keeps decode speed constant regardless of how long the prefill was. Enabled by default — no configuration needed. Impact: decode runs at the same speed after a 1k-token prefill as after a 10-token prefill.

Hybrid Inference (v1.15+)

Automatic blend of NPU (prefill) and CPU (decode) for optimal performance. Enabled automatically on devices with compatible NPUs (Apple Neural Engine, coming to Qualcomm/MediaTek). Impact: 5-11x faster prefill on iOS/macOS.

Lossless Quantization (v1.15+)

Advanced quantization techniques that maintain quality while improving speed. Enabled by default with INT4/INT8 quantization. Impact: 1.5x speedup vs naive quantization.

See Also