3-Model Speculative Decoding
for OpenClaw
The Challenge
Running LLMs locally is powerful, but slow. Cloud compute is fast, but expensive. What if you could have both?
Slow Inference
Baseline autoregressive LLM inference on local hardware runs at roughly 12.5 tokens/second, too slow for real-time applications.
Cloud Costs
Cloud APIs are fast but expensive, especially for continuous or high-volume usage scenarios.
Complex Setup
Optimizing local models requires deep technical knowledge and custom implementations.
The Solution: Pyramid Speculative Decoding
momo-kibidango implements the 3-model pyramid architecture from Google Research's work on speculative decoding, achieving a 1.97x speedup with zero quality degradation.
Tier 1: Haiku 2
Ultra-fast draft model (45.6 tok/s)
Tier 2: Haiku 3
Middle verifier (30.5 tok/s)
Tier 3: Sonnet 3.5
Final authority (12.5 tok/s)
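The three-tier flow can be sketched in miniature. In this toy, each tier is just a deterministic next-token function; the function names, the greedy acceptance logic, and the parameter k are illustrative stand-ins, not the actual momo-kibidango API.

```python
# Toy, self-contained sketch of one 3-tier pyramid decoding step.
# Each "model" is a function mapping a token sequence to its next
# token; real models would return probability distributions.

def greedy_speculative_step(draft, verifier, target, context, k=4):
    # Tier 1: the fast draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Tier 2: the middle verifier trims the proposal to the longest
    # prefix it agrees with, so the expensive target sees fewer
    # doomed candidates (cheap early rejection).
    survivors, ctx = [], list(context)
    for tok in proposed:
        if verifier(ctx) != tok:
            break
        survivors.append(tok)
        ctx.append(tok)

    # Tier 3: the target checks the survivors; accepted tokens are
    # emitted, and the first disagreement is replaced by the
    # target's own token, so output always matches the target.
    accepted, ctx = [], list(context)
    for tok in survivors:
        want = target(ctx)
        if want != tok:
            accepted.append(want)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # All survivors accepted: take one more token from the target.
    accepted.append(target(ctx))
    return accepted
```

Because the target model has the final say at every position, the output is the same as running the target alone; the draft and verifier only decide how many positions get confirmed per expensive forward pass.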
1.97x Faster
Memory Efficient
Runs comfortably on M1/M2/M3/M4 Macs with just 11.6GB memory usage.
Perfect for 16GB MacBooks
Production Ready
Version 1.0.0 includes everything you need for real-world deployment.
OpenClaw Native
Seamlessly integrates with OpenClaw's subagent system. Just install and accelerate.
Full Monitoring
Prometheus metrics, Grafana dashboards, and detailed performance tracking included.
Zero Quality Loss
Mathematically guaranteed to match the target model's output distribution: acceleration never changes what the target model would have produced on its own.
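That guarantee comes from the standard speculative-sampling acceptance rule: a draft token x ~ q is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the renormalized residual max(0, p − q). The sketch below (illustrative, not the library's code) computes the resulting distribution analytically and shows it reproduces the target distribution p exactly, whatever the draft q is.

```python
# Sketch of why speculative sampling is lossless: the combined
# accept-or-resample step samples exactly from the target
# distribution p, regardless of the draft distribution q.

def combined_distribution(p, q):
    """Analytic output distribution of one accept/reject step.

    p, q: target and draft probabilities over the same vocabulary.
    """
    # Probability that token i is proposed AND accepted:
    # q[i] * min(1, p[i]/q[i]) == min(p[i], q[i]).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    # On rejection, resample from the renormalized residual.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    norm = sum(residual) or 1.0
    return [a + reject_mass * r / norm for a, r in zip(accept, residual)]
```

For any valid p and q the returned distribution equals p term by term, which is the "zero quality loss" claim in probabilistic form.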
Smart Caching
Advanced KV cache management with LRU eviction and cross-model sharing.
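A minimal sketch of the LRU half of such a cache, assuming entries are keyed by a prefix hash. The real cache manager stores attention key/value tensors and handles cross-model sharing, which this toy omits.

```python
from collections import OrderedDict

# Illustrative LRU-evicting KV-cache index. OrderedDict preserves
# insertion order, so the front is always the least recently used.

class LRUKVCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # prefix hash -> cached KV blob

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict LRU entry
```

LRU is a natural fit here because prompt prefixes that were reused recently (e.g. an active conversation) are the ones most likely to be reused again.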
Graceful Fallback
Automatically falls back to baseline when confidence is low or errors occur.
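The fallback behavior can be sketched as a thin wrapper; the function names and the default confidence threshold here are hypothetical, not the library's actual API or tuning.

```python
# Illustrative fallback wrapper: try the speculative pipeline, but
# drop to plain target-model decoding when confidence is low or the
# pipeline raises. Both callables are hypothetical stand-ins.

def generate_with_fallback(speculative_fn, baseline_fn, prompt,
                           min_confidence=0.5):
    try:
        tokens, confidence = speculative_fn(prompt)
        if confidence >= min_confidence:
            return tokens
    except Exception:
        pass  # any pipeline error falls through to the baseline
    # Low confidence or error: run the target model directly.
    return baseline_fn(prompt)
```

The key property is that the baseline path is always available, so the worst case is ordinary (unaccelerated) decoding rather than a failed request.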
MIT Licensed
Open source and free to use in your projects, commercial or otherwise.
Start Optimizing Your Inference
Join the growing community using momo-kibidango to accelerate their local LLMs.