Quantization
Quantization reduces model size and improves inference speed by representing weights and activations with lower precision. Cactus supports INT4, INT8, FP16, and FP32 precision types with hardware-optimized kernels.

Precision Types
INT4
4-bit integer
- 0.5 bytes per parameter
- 8x smaller than FP32
- Minimal quality loss
- Best for mobile
INT8
8-bit integer
- 1 byte per parameter
- 4x smaller than FP32
- Near-lossless quality
- Balanced option
FP16
16-bit float
- 2 bytes per parameter
- 2x smaller than FP32
- Lossless quality
- NPU-friendly
Precision Comparison
| Precision | Bytes/Param | 1B Model Size | Relative Speed | Quality |
|---|---|---|---|---|
| FP32 | 4 | ~4 GB | 1x (baseline) | 100% |
| FP16 | 2 | ~2 GB | 1.5-2x | 100% |
| INT8 | 1 | ~1 GB | 2-3x | 99.5% |
| INT4 | 0.5 | ~500 MB | 3-4x | 98-99% |
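The model sizes follow directly from bytes per parameter; a quick check for a 1B-parameter model:

```python
params = 1_000_000_000
for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.1f} GB")
# FP32: 4.0 GB, FP16: 2.0 GB, INT8: 1.0 GB, INT4: 0.5 GB (~500 MB)
```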
Recommendation: Use INT4 for most mobile applications. The quality difference is negligible (less than 1% on most benchmarks) while providing the best memory efficiency.
How Quantization Works
Group Quantization
Cactus uses group quantization: weights are quantized in groups of 32 elements, each group with its own FP16 scale (sketched in code after this list).

- Better accuracy than per-tensor quantization
- Captures local weight distribution
- Group size of 32 optimizes for ARM SIMD (NEON)
- Minimal overhead (one FP16 scale per 32 weights)
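A minimal NumPy sketch of the scale computation (illustrative, not the actual converter code; it assumes the weight count is a multiple of the group size):

```python
import numpy as np

def group_scales(weights: np.ndarray, group_size: int = 32) -> np.ndarray:
    """One FP16 scale per group: the group's max |w| maps to 7, the INT4
    maximum, so each group's local dynamic range is fully used."""
    groups = weights.reshape(-1, group_size)
    return (np.abs(groups).max(axis=1) / 7.0).astype(np.float16)
```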
INT4 Packing
INT4 weights are packed two values per byte (a packing sketch follows this list):

- 2 values per byte → 2x memory reduction vs INT8
- Group size still 32 for scale alignment
- Unpacking happens in SIMD kernels
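A sketch of the packing step; low-nibble-first ordering is an assumption here, and the actual kernel layout may differ:

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack signed INT4 values (range [-8, 7]) two per byte."""
    nibbles = (q.reshape(-1, 2) & 0x0F).astype(np.uint8)  # two's-complement nibbles
    return nibbles[:, 0] | (nibbles[:, 1] << 4)           # low nibble first
```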
Quantization in Practice
Choosing Precision
- INT4 (Recommended): best for most mobile applications
- INT8 (Balanced)
- FP16 (High Quality)

Use INT4 when:

- Deploying on phones/tablets
- Model size > 500M parameters
- RAM is limited (< 4GB)
- Battery life matters
- Target inference speed > 20 tokens/sec

Expected quality impact with INT4:

- Language models: 0.5-1% perplexity increase
- Vision models: 0.2-0.5% accuracy drop
- Transcription: Negligible WER increase

Benchmark (LFM2-1.2B on iPhone 17 Pro):

| Precision | RAM | Decode Speed |
|---|---|---|
| INT4 | 700 MB | 48 t/s |
| INT8 | 1.4 GB | 42 t/s |
| FP16 | 2.8 GB | 35 t/s |
Quantization at Inference Time
Cactus handles mixed precision automatically.

Weight Quantization
Weights are quantized once during model conversion; dequantization happens inside the SIMD kernels at inference time.
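In scalar form, what the kernels do per group is a single multiply by the group's scale; a sketch:

```python
import numpy as np

def dequantize(q: np.ndarray, scales: np.ndarray, group_size: int = 32) -> np.ndarray:
    """Recover FP16 weights: multiply each quantized group by its FP16 scale."""
    groups = q.astype(np.float16).reshape(-1, group_size)
    return (groups * scales[:, None]).reshape(-1)
```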
Activation Quantization
Activations are quantized dynamically at inference time using per-tensor quantization (sketched after these steps):
- Find max absolute value across tensor
- Scale = max_abs / 127.0
- Quantize: int8_val = round(fp16_val / scale)
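The same three steps in NumPy (a sketch; the guard against all-zero tensors is an added assumption):

```python
import numpy as np

def quantize_activations(x: np.ndarray):
    """Dynamic per-tensor INT8 quantization, recomputed for every tensor."""
    scale = np.abs(x).max() / 127.0                   # steps 1-2
    scale = scale if scale > 0 else np.float32(1.0)   # avoid divide-by-zero
    q = np.round(x / scale).clip(-127, 127).astype(np.int8)  # step 3
    return q, np.float16(scale)
```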
KV Cache Quantization
The key-value cache is quantized to INT8 to save memory.

Memory savings for a 28-layer model with 8 KV heads and head dimension 128 (formula check after the table):
| Context Length | FP16 Cache | INT8 Cache | Savings |
|---|---|---|---|
| 512 tokens | 64 MB | 32 MB | 50% |
| 1024 tokens | 128 MB | 64 MB | 50% |
| 2048 tokens | 256 MB | 128 MB | 50% |
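These numbers follow from the standard cache-size formula; computing the 512-token row exactly gives about 56 MB at FP16, so the table's figures look rounded up:

```python
def kv_cache_bytes(ctx_len: int, layers: int = 28, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """K + V: 2 tensors x layers x kv_heads x head_dim x context x element size."""
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem

print(kv_cache_bytes(512) / 2**20)                     # FP16: 56.0 MB
print(kv_cache_bytes(512, bytes_per_elem=1) / 2**20)   # INT8: 28.0 MB (50% savings)
```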
Optimized Kernels
Cactus provides SIMD-optimized kernels for quantized operations.

INT4 Matrix Multiplication
- Unpack INT4 → INT8 in SIMD registers
- INT8×INT8 → INT32 accumulation (16 elements at a time)
- Apply scales and convert to FP16
- ~3-4x faster than FP32 matmul (scalar reference below)
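A scalar NumPy reference of that pipeline (the real kernel does this 16 lanes at a time in NEON registers; the packing layout and names here are illustrative):

```python
import numpy as np

def int4_dot(packed: np.ndarray, w_scales: np.ndarray,
             x_q: np.ndarray, x_scale: float, group_size: int = 32) -> float:
    """Dot product of packed INT4 weights with INT8 activations."""
    lo = (packed & 0x0F).astype(np.int8)           # step 1: unpack nibbles
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)             # sign-extend two's complement
    hi = np.where(hi > 7, hi - 16, hi)
    w_q = np.stack([lo, hi], axis=1).reshape(-1)   # low nibble first
    acc = (w_q.astype(np.int32).reshape(-1, group_size)
           * x_q.astype(np.int32).reshape(-1, group_size)).sum(axis=1)  # step 2
    return float((acc * w_scales.astype(np.float32)).sum() * x_scale)   # step 3
```

The INT8 kernel described next follows the same pattern, minus the unpack and sign-extend steps.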
INT8 Matrix Multiplication
- INT8×INT8 → INT32 dot products
- 16-way SIMD parallelism
- Fused dequantization
- ~2-3x faster than FP32 matmul
Hybrid Attention (INT8 Cache + FP16 Queries)
- Fused attention over INT8 cache + FP16 new tokens
- Dequantization happens in-kernel
- Sliding window support
- No quality loss with group quantization
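A naive reference of the hybrid path (per-tensor cache scales are an assumption here; the real kernel fuses dequantization into the SIMD dot products and handles the sliding window):

```python
import numpy as np

def attention_int8_cache(q_fp16, k_q, k_scale, v_q, v_scale):
    """Attention with FP16 queries over an INT8-quantized KV cache."""
    k = k_q.astype(np.float32) * k_scale            # dequantize keys   [T, d]
    v = v_q.astype(np.float32) * v_scale            # dequantize values [T, d]
    scores = (q_fp16.astype(np.float32) @ k.T) / np.sqrt(k.shape[-1])
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))   # stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v
```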
Code Examples
Quantizing Weights (Python)
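A self-contained sketch (illustrative; the real converter lives in the Cactus tooling) that group-quantizes a weight matrix to INT4 and checks the reconstruction error:

```python
import numpy as np

GROUP_SIZE = 32

def quantize_int4(w: np.ndarray):
    """Group-quantize FP32 weights to INT4 with per-group FP16 scales."""
    g = w.reshape(-1, GROUP_SIZE)
    scales = (np.abs(g).max(axis=1) / 7.0).astype(np.float16)
    scales = np.where(scales == 0, np.float16(1.0), scales)  # all-zero groups
    q = np.round(g / scales[:, None].astype(np.float32)).clip(-8, 7).astype(np.int8)
    return q, scales

w = np.random.default_rng(0).normal(scale=0.02, size=(4096, 4096)).astype(np.float32)
q, scales = quantize_int4(w)
w_hat = (q * scales[:, None].astype(np.float32)).reshape(w.shape)
print("max abs error:", np.abs(w - w_hat).max())  # at most half a quantization step
```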
Using Graph API with Quantization (C++)
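A hypothetical sketch only: the names below (`cactus::Graph`, `graph.weight`, `graph.matmul`) are illustrative stand-ins, not the actual Cactus Graph API. It shows the intended shape of mixed-precision use: weights stay INT4, activations are quantized in-kernel, and the runtime dispatches the matching SIMD kernel.

```cpp
// Hypothetical sketch -- names are illustrative, not the real Cactus API.
#include "cactus/graph.h"  // assumed header

int main() {
    cactus::Graph graph;

    // Precision was fixed at conversion time and is read from the model file;
    // this weight arrives as INT4 with per-group FP16 scales.
    auto w = graph.weight("model.layers.0.mlp.up_proj");

    // New activations enter as FP16...
    auto x = graph.input({1, 2048}, cactus::Precision::FP16);

    // ...and are dynamically quantized to INT8 inside the matmul kernel.
    // INT4 x INT8 accumulates in INT32; the output comes back as FP16.
    auto y = graph.matmul(x, w);

    graph.execute();
    return 0;
}
```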
Model Conversion Script
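A hypothetical invocation: the script name and the `--input`/`--output`/`--precision` flags are illustrative, while `--reconvert` is the flag this page mentions under Best Practices for switching precision:

```bash
# Illustrative only -- check the Cactus tooling for the actual script and flags.
python convert_model.py --input LFM2-1.2B --output lfm2-1.2b-int4 \
    --precision int4 --reconvert
```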
Quantization Performance
Memory Usage
1.2B Parameter Model (LFM2-1.2B):

| Component | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| Weights | 4.8 GB | 2.4 GB | 1.2 GB | 600 MB |
| Activations | 128 MB | 64 MB | 32 MB | 32 MB |
| KV Cache (1K ctx) | 256 MB | 128 MB | 64 MB | 64 MB |
| Total | 5.2 GB | 2.6 GB | 1.3 GB | 700 MB |
Inference Speed
iPhone 17 Pro, LFM2-1.2B:

| Precision | Prefill | Decode | Time to First Token |
|---|---|---|---|
| FP32 | 85 t/s | 22 t/s | 380 ms |
| FP16 | 245 t/s | 35 t/s | 165 ms |
| INT8 | 310 t/s | 42 t/s | 130 ms |
| INT4 | 327 t/s | 48 t/s | 120 ms |
INT4 is fastest because:
- Smaller memory footprint → better cache utilization
- Fewer memory transfers → less bandwidth pressure
- SIMD-optimized INT4 kernels
- More values processed per SIMD instruction at lower precision
Quality Impact
Benchmark Results
LFM2-1.2B on MMLU (0-shot):

| Precision | Accuracy | Δ vs FP32 |
|---|---|---|
| FP32 | 52.3% | - |
| FP16 | 52.3% | 0.0% |
| INT8 | 52.1% | -0.2% |
| INT4 | 51.8% | -0.5% |

A second benchmark shows the same pattern:

| Precision | Accuracy | Δ vs FP32 |
|---|---|---|
| FP32 | 45.2% | - |
| FP16 | 45.2% | 0.0% |
| INT8 | 45.0% | -0.2% |
| INT4 | 44.7% | -0.5% |

Transcription (word error rate):

| Precision | WER | Δ vs FP32 |
|---|---|---|
| FP32 | 3.2% | - |
| FP16 | 3.2% | 0.0% |
| INT8 | 3.2% | 0.0% |
| INT4 | 3.3% | +0.1% |
Key Takeaway: INT4 quantization with group size 32 maintains 98-99% of original model quality while reducing memory by 8x.
Best Practices
1. Choose the Right Precision
- INT4 for models > 500M parameters on mobile
- INT8 for models < 500M parameters or quality-sensitive tasks
- FP16 for NPU execution or maximum quality
- Use `--reconvert` if changing precision
2. Use Group Quantization
- Always use group size 32 (default)
- Smaller groups (16) = higher quality, more scale overhead
- Larger groups (64) = more memory efficient, lower quality
- Group size 32 is optimal for ARM NEON
3. Quantize KV Cache
- Enable INT8 KV cache for contexts > 512 tokens
- Saves 50% memory with no quality loss
- Automatic in Cactus (no code changes needed)
4. Profile Before Optimizing
- Test INT4 quality on your specific use case
- Use perplexity or task-specific metrics
- Compare INT4 vs INT8 vs FP16 speed
5. Fine-Tuning After Quantization
- Quantization-aware training (QAT) for best quality
- Fine-tune FP32 → quantize → fine-tune INT4
- Use a larger learning rate for quantized fine-tuning

See the Fine-Tuning Guide for details.
Related Resources
Architecture
How quantization fits into Cactus’s design
Models
RAM usage for different model sizes
Graph API
Using quantization in computation graphs
Optimization Guide
Advanced performance tuning