3-Model Speculative Decoding
for OpenClaw

2x faster inference on Apple Silicon with zero quality loss

The Challenge

Running LLMs locally is powerful, but slow. Cloud compute is fast, but expensive. What if you could have both?

Slow Inference

Baseline LLM inference on local Apple Silicon hardware runs at roughly 12.5 tokens/second, too slow for real-time applications.

💸

Cloud Costs

Cloud APIs are fast but expensive, especially for continuous or high-volume usage scenarios.

🔧

Complex Setup

Optimizing local models requires deep technical knowledge and custom implementations.

The Solution: Pyramid Speculative Decoding

momo-kibidango implements Google Research's breakthrough 3-model pyramid architecture, achieving 1.97x speedup with zero quality degradation.

Tier 1: Haiku 2

Ultra-fast draft model (45.6 tok/s)

Tier 2: Haiku 3

Middle verifier (30.5 tok/s)

Tier 3: Sonnet 3.5

Final authority (12.5 tok/s)
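
The three tiers cooperate in a draft-then-verify loop: the fast model proposes tokens, the middle model cheaply filters out weak candidates, and the target model has the final say. Below is a minimal, runnable sketch of that control flow; the toy models, confidence threshold, and function names are illustrative stand-ins, not momo-kibidango's actual API.

```python
# Toy sketch of one pyramid speculative-decoding step. The three "models"
# are deterministic stand-ins so the control flow is runnable end to end.

def draft_model(prompt, k):
    # Tier 1: the fast draft model proposes k candidate tokens.
    return [f"tok{i}" for i in range(k)]

def mid_confidence(prompt, token):
    # Tier 2: the middle verifier assigns a cheap confidence score.
    return 0.9 if token != "tok2" else 0.1

def target_accepts(prompt, token):
    # Tier 3: the target model's accept/reject decision.
    return token != "tok3"

def pyramid_step(prompt, k=4):
    candidates = draft_model(prompt, k)
    # The middle model filters low-confidence drafts before the expensive pass.
    filtered = [t for t in candidates if mid_confidence(prompt, t) > 0.5]
    # The target verifies in order; the first rejection truncates the run.
    accepted = []
    for t in filtered:
        if not target_accepts(prompt, t):
            break
        accepted.append(t)
    return accepted
```

Because the target model verifies several drafted tokens in one batched pass, each expensive forward pass can emit more than one token, which is where the speedup comes from.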

1.97x Faster

Baseline: 12.5 tok/s
With momo-kibidango: 24.6 tok/s
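
The headline number is just the ratio of the two measured throughputs:

```python
# Where the 1.97x figure comes from: tokens/second with and without
# the pyramid, measured against the same target model.
baseline_tps = 12.5   # target model alone
pyramid_tps = 24.6    # with momo-kibidango
speedup = pyramid_tps / baseline_tps  # 24.6 / 12.5 = 1.968 ≈ 1.97x
```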

Memory Efficient

Runs comfortably on M1/M2/M3/M4 Macs with just 11.6GB memory usage.

< 12GB

Perfect for 16GB MacBooks

Production Ready

Version 1.0.0 includes everything you need for real-world deployment.

OpenClaw Native

Seamlessly integrates with OpenClaw's subagent system. Just install and accelerate.

Full Monitoring

Prometheus metrics, Grafana dashboards, and detailed performance tracking included.

Zero Quality Loss

Speculative sampling is mathematically guaranteed to produce the same output distribution as running the target model alone.
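
The guarantee comes from the standard speculative-sampling acceptance rule: a drafted token x is accepted with probability min(1, p_target(x) / p_draft(x)), and on rejection a replacement is sampled from the normalized residual max(0, p_target − p_draft). A minimal sketch (distributions as plain dicts; not momo-kibidango's internal API):

```python
import random

def accept_or_resample(x, p_target, p_draft, rng=random.random):
    """p_target, p_draft: dicts mapping token -> probability."""
    # Accept the drafted token with probability min(1, p_t(x) / p_d(x)).
    if rng() < min(1.0, p_target[x] / p_draft[x]):
        return x
    # Rejected: sample from the residual distribution so the combined
    # sampler matches p_target exactly.
    residual = {t: max(0.0, p_target[t] - p_draft[t]) for t in p_target}
    z = sum(residual.values())
    r, acc = rng() * z, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t
    return t
```

Tokens the draft model over-proposes are rejected at exactly the rate needed for the residual resampling to restore the target distribution, which is why the speedup is lossless.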

Smart Caching

Advanced KV cache management with LRU eviction and cross-model sharing.
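
The idea behind LRU eviction with cross-model sharing can be sketched in a few lines: key cache entries by (model, prefix) so tiers can reuse each other's work where prefixes match, and evict the least recently used entry when capacity is exceeded. The key scheme and capacity here are assumptions for illustration, not momo-kibidango's actual implementation.

```python
from collections import OrderedDict

class KVCache:
    """Toy LRU cache for KV entries keyed by (model, prefix)."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, model, prefix):
        key = (model, prefix)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, model, prefix, kv):
        key = (model, prefix)
        self._store[key] = kv
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```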

Graceful Fallback

Automatically falls back to baseline when confidence is low or errors occur.
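
In outline, the fallback is a wrapper around the speculative path: if it raises or reports low confidence, decoding degrades to the plain target model. A minimal sketch, with caller-supplied functions and an assumed confidence threshold:

```python
def decode_with_fallback(prompt, speculative_fn, baseline_fn, min_confidence=0.3):
    """Try the speculative path; fall back to baseline decoding on
    low confidence or any error. Function names are illustrative."""
    try:
        tokens, confidence = speculative_fn(prompt)
        if confidence >= min_confidence:
            return tokens
    except Exception:
        pass  # any speculative failure degrades to the safe path
    return baseline_fn(prompt)
```

Because the baseline path is always available, a misbehaving draft model costs only speed, never correctness.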

MIT Licensed

Open source and free to use in your projects, commercial or otherwise.

Start Optimizing Your Inference

Join the growing community using momo-kibidango to accelerate their local LLMs.