
Performance Tuning

Cactus provides several configuration options to optimize inference performance for your specific use case. This guide covers KV cache management, chunked prefill, TPS throttling, memory optimization, and benchmarking.

KV Cache Configuration

The Key-Value (KV) cache stores attention states from previous tokens, enabling fast autoregressive generation. Proper cache configuration balances memory usage and context length.

Sliding Window Cache

By default, Cactus uses a sliding window cache that keeps recent tokens + a small “sink” of early tokens.
#include "cactus/engine/engine.h"

cactus::Engine* engine = /* ... load model ... */;

// Set sliding window: 2048 recent tokens + 4 sink tokens
engine->set_cache_window(2048, 4);
Parameters:
  • window_size — Maximum recent tokens to cache (e.g., 1024, 2048, 4096)
  • sink_size — Number of initial tokens to always keep (default: 4)
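The two parameters interact like this: once the sequence outgrows the window, only the first sink_size positions and the most recent window_size positions remain cached. A minimal sketch of the retained set (illustrative only, not the actual Cactus eviction code):

```cpp
#include <cstddef>
#include <vector>

// Which token positions a sliding-window-with-sink cache keeps after seeing
// total_tokens tokens. Illustrative sketch of the policy, not Cactus internals.
std::vector<std::size_t> cached_positions(std::size_t total_tokens,
                                          std::size_t window_size,
                                          std::size_t sink_size) {
    std::vector<std::size_t> kept;
    for (std::size_t p = 0; p < total_tokens; ++p) {
        const bool in_sink = p < sink_size;
        const bool in_window =
            total_tokens <= window_size || p >= total_tokens - window_size;
        if (in_sink || in_window) kept.push_back(p);
    }
    return kept;
}
```

While the sequence still fits inside the window, everything is cached; eviction only starts in the middle of the sequence once it overflows.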
Trade-offs:
| Window Size | Memory Usage | Context Length | Use Case |
| --- | --- | --- | --- |
| 512 | Low (~50MB) | Short | Chatbots, quick Q&A |
| 1024 | Medium (~100MB) | Medium | Most applications |
| 2048 | High (~200MB) | Long | Document analysis |
| 4096 | Very High (~400MB) | Very Long | RAG, long-form generation |
Memory scales with model size: The above estimates are for ~1B param models. Larger models use proportionally more cache memory.
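As a back-of-envelope check, cache memory is (window + sink) times the per-token KV cost. The layer/head/dimension defaults below are illustrative assumptions for a small model, not Cactus internals; substitute your model's real config:

```cpp
#include <cstddef>

// Rough KV cache footprint: every cached token stores one key and one value
// vector per transformer layer. Default dimensions are illustrative for a
// ~1B-parameter model; bytes_per_elem = 1 assumes an INT8 cache.
std::size_t kv_cache_bytes(std::size_t window_size, std::size_t sink_size,
                           std::size_t n_layers = 24, std::size_t n_kv_heads = 8,
                           std::size_t head_dim = 64,
                           std::size_t bytes_per_elem = 1) {
    const std::size_t per_token =
        2 /* K and V */ * n_layers * n_kv_heads * head_dim * bytes_per_elem;
    return (window_size + sink_size) * per_token;
}
```

Halving the window roughly halves cache memory, which is why shrinking the window is the first lever to pull under memory pressure.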

Resetting Cache

// Clear KV cache (e.g., between conversations)
engine->reset_cache();
Call reset_cache() when:
  • Starting a new conversation
  • Switching contexts
  • Memory pressure is high

Cache Quantization

Cactus automatically quantizes the KV cache to INT8, halving cache memory with minimal quality loss. This has been automatic since v1.7 (Oct 2025); no configuration is needed.

Chunked Prefill

Chunked prefill processes long prompts in chunks rather than all at once, reducing memory spikes and improving time-to-first-token on long contexts.

Configuring Chunk Size

// Process prefill in chunks of 256 tokens
std::vector<uint32_t> tokens = tokenizer->encode("Long prompt...");
engine->prefill(tokens, 256);  // chunk_size = 256
Default: 256 tokens (optimal for most models).

Chunk Size Guidelines:

| Chunk Size | Memory | Latency | Throughput | Best For |
| --- | --- | --- | --- | --- |
| 128 | Lowest | Higher | Lower | Budget devices, limited RAM |
| 256 | Low | Medium | High | Default — most cases |
| 512 | Medium | Low | Higher | High-end devices, speed priority |
| 1024+ | High | Lowest | Highest | Desktop/Mac only |
NPU prefill uses fixed chunk size: If NPU acceleration is enabled, the chunk size is determined by the NPU model architecture (typically 256) and cannot be changed at runtime. The chunk_size parameter is ignored.
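Conceptually, chunked prefill is just a loop over fixed-size slices of the prompt. A sketch of the idea, with process_chunk standing in for the model's forward pass (this is an illustration, not the Cactus implementation):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Feed the prompt to the model in bounded slices so peak activation memory
// stays flat regardless of prompt length. Returns the number of chunks run.
std::size_t chunked_prefill(
    const std::vector<uint32_t>& tokens, std::size_t chunk_size,
    const std::function<void(const uint32_t*, std::size_t)>& process_chunk) {
    std::size_t n_chunks = 0;
    for (std::size_t pos = 0; pos < tokens.size(); pos += chunk_size) {
        const std::size_t len = std::min(chunk_size, tokens.size() - pos);
        process_chunk(tokens.data() + pos, len);  // one forward pass per slice
        ++n_chunks;
    }
    return n_chunks;
}
```

The last chunk is simply shorter, so no padding of the prompt is required.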

Prefill Performance

From the Cactus README benchmarks: LFM 1.2B, 1k-token prefill / 100-token decode:

| Device | Prefill TPS | Decode TPS |
| --- | --- | --- |
| Mac M4 Pro | 582 tok/s | 100 tok/s |
| iPad/Mac M3 | 350 tok/s | 60 tok/s |
| iPhone 17 Pro | 327 tok/s | 48 tok/s |
| iPhone 13 Mini | 148 tok/s | 34 tok/s |
| Galaxy S25 Ultra | 255 tok/s | 37 tok/s |
| Pixel 6a | 70 tok/s | 15 tok/s |
| Raspberry Pi 5 | 69 tok/s | 11 tok/s |
Tips:
  • Prefill is bottleneck for long prompts — Use chunked prefill
  • NPU acceleration — 5-11x faster prefill on Apple devices (v1.15+)
  • Cactus Attention — Makes long prefill as fast as short (v1.9+)

TPS Throttling

Limit maximum tokens-per-second to reduce power consumption and thermal throttling.

Setting Max TPS

// Limit to 30 tokens/second
const char* options = R"({
    "max_tps": 30.0
})";

cactus_complete(model, messages, response, sizeof(response), 
                options, nullptr, nullptr, nullptr);
Use cases:
  • Streaming UI — Match typing speed (~20-30 TPS feels natural)
  • Power saving — Reduce battery drain on mobile
  • Thermal management — Prevent device overheating during long sessions
Trade-offs:
| Max TPS | Power Usage | User Experience |
| --- | --- | --- |
| Unlimited | Highest | Fastest response |
| 50 | High | Still feels instant |
| 30 | Medium | Smooth typing speed |
| 20 | Low | Comfortable reading pace |
| 10 | Lowest | Noticeably slow |
Set "max_tps": -1.0 (the default) for unlimited speed; the engine then generates as fast as the hardware allows.
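Under the hood, a TPS cap amounts to pacing token emission on a fixed schedule. A sketch of the idea (not the actual Cactus throttle):

```cpp
#include <chrono>
#include <thread>

// Target gap between tokens for a given cap; <= 0 means unlimited (no pacing).
double token_interval_s(double max_tps) {
    return max_tps > 0.0 ? 1.0 / max_tps : 0.0;
}

// Emit n_tokens on a fixed-rate schedule: sleep until each token's time slot.
// decode_one_token() is a placeholder for the real decode step.
void paced_decode(double max_tps, int n_tokens) {
    using clock = std::chrono::steady_clock;
    auto slot = clock::now();
    const auto gap = std::chrono::duration<double>(token_interval_s(max_tps));
    for (int i = 0; i < n_tokens; ++i) {
        // decode_one_token();
        slot += std::chrono::duration_cast<clock::duration>(gap);
        std::this_thread::sleep_until(slot);
    }
}
```

Sleeping until an absolute slot (rather than a fixed sleep after each token) keeps the average rate at the cap even when individual decode steps vary in cost.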

Memory Optimization

Reducing Memory Footprint

  1. Use INT4 quantization
    cactus convert Qwen/Qwen3-0.6B ./model --precision INT4
    
    • 50% smaller weights vs INT8
    • Minimal quality loss
  2. Reduce KV cache window
    engine->set_cache_window(512, 4);  // 512 tokens instead of 2048
    
    • 4x less cache memory
    • Shorter effective context
  3. Use smaller models
    • Gemma3-270m: ~120MB RAM
    • Qwen3-0.6B: ~200MB RAM
    • LFM2.5-1.2B: ~400MB RAM
  4. Free memory between sessions
    engine->reset_cache();  // Clear KV cache
    // Or destroy and recreate model
    cactus_destroy(model);
    model = cactus_init(model_path, nullptr, false);
    

Memory Benchmarks

From the Cactus README benchmarks, INT4 models:

| Model | iPhone 17 Pro | Galaxy S25 Ultra | Raspberry Pi 5 |
| --- | --- | --- | --- |
| LFM 1.2B | 108MB | 1.5GB | 869MB |
| LFMVL 1.6B | 108MB | 1.5GB | 869MB |
| Parakeet 1.1B | 108MB | 1.5GB | 869MB |
Android uses more RAM: for the same model, Android’s memory overhead is higher than iOS’s due to JNI/JVM overhead and different memory allocators. Prefer smaller models or INT4 quantization on Android.

Monitoring Memory Usage

// Memory usage included in completion response
const char* response = cactus_complete(/* ... */);

// Parse JSON response
// {
//   "ram_usage_mb": 245.67,
//   "prefill_tokens": 28,
//   "decode_tokens": 50,
//   ...
// }
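A quick way to pull ram_usage_mb (or any numeric field) out of that flat response without a JSON library; for anything nested, use a real JSON parser:

```cpp
#include <cstdlib>
#include <string>

// Minimal numeric-field extraction for a flat, well-formed JSON object like
// the completion response above. Returns -1.0 when the key is absent.
double json_number_field(const std::string& json, const std::string& key) {
    const std::string needle = "\"" + key + "\":";
    const std::size_t pos = json.find(needle);
    if (pos == std::string::npos) return -1.0;
    // strtod skips leading whitespace and stops at the trailing comma/brace.
    return std::strtod(json.c_str() + pos + needle.size(), nullptr);
}
```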

Benchmarking

Use the --benchmark flag to measure performance on your specific hardware.

Command-Line Benchmarks

# Benchmark with default model
cactus test --benchmark

# Benchmark specific model
cactus test --model ./my-model --benchmark

# Benchmark on connected iPhone
cactus test --model ./my-model --benchmark --ios

# Benchmark on connected Android
cactus test --model ./my-model --benchmark --android

Metrics Reported

{
  "success": true,
  "response": "Generated text...",
  "confidence": 0.8193,
  "time_to_first_token_ms": 45.23,
  "total_time_ms": 163.67,
  "prefill_tps": 1621.89,
  "decode_tps": 168.42,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 50,
  "total_tokens": 78
}
Key metrics:
  • time_to_first_token_ms — How long until first token appears (latency)
  • prefill_tps — Prompt processing speed (tokens/sec)
  • decode_tps — Generation speed (tokens/sec)
  • ram_usage_mb — Peak memory usage

Performance Profiling

// Profile prefill with trace file
std::vector<uint32_t> tokens = tokenizer->encode("Prompt");
engine->prefill(tokens, 256, "prefill_trace.json");

// Analyze trace file to identify bottlenecks
The trace file contains:
  • Per-layer timings
  • Memory allocations
  • NPU vs CPU split
  • Cache hit rates

Optimization Recipes

For Speed (High-End Devices)

// Large cache window for long context
engine->set_cache_window(4096, 4);

// Large prefill chunks
engine->prefill(tokens, 512);

// No TPS limit
const char* options = R"({"max_tps": -1.0})";

// Use INT8 for quality
// cactus convert <model> --precision INT8

For Memory (Budget Devices)

// Small cache window
engine->set_cache_window(512, 4);

// Small prefill chunks
engine->prefill(tokens, 128);

// Use INT4 quantization
// cactus convert <model> --precision INT4

// Consider smaller base model (Gemma3-270m, Qwen3-0.6B)

For Battery Life (Mobile)

// Moderate cache
engine->set_cache_window(1024, 4);

// Throttle TPS to reduce power
const char* options = R"({"max_tps": 30.0})";

// NPU acceleration (iOS/macOS)
engine->load_npu_prefill("./model/npu_prefill.mlmodelc");

// INT4 quantization for less memory bandwidth

For Quality (Critical Applications)

// Large cache for full context
engine->set_cache_window(4096, 4);

// INT8 or FP16 quantization
// cactus convert <model> --precision INT8

// No TPS throttling
const char* options = R"({"max_tps": -1.0})";

// Larger base model (LFM2.5-1.2B, Qwen3-1.7B)

Advanced Techniques

Cactus Attention (v1.9+)

Automatic optimization that keeps decode speed constant regardless of how long the prefill was. Enabled by default — no configuration needed. Impact: decode runs at the same speed after a 1k-token prefill as after a 10-token prefill.

Hybrid Inference (v1.15+)

Automatic blend of NPU (prefill) and CPU (decode) for optimal performance. Enabled automatically on devices with compatible NPUs (Apple Neural Engine, coming to Qualcomm/MediaTek). Impact: 5-11x faster prefill on iOS/macOS.

Lossless Quantization (v1.15+)

Advanced quantization techniques that maintain quality while improving speed. Enabled by default with INT4/INT8 quantization. Impact: 1.5x speedup vs naive quantization.

See Also