Discussion about this post

Pawel Jozefiak

Model routing is doing the most work in your breakdown, and that matches what I found too. I took it one step further: the routing layer itself runs locally. A 35B MoE on a $600 Mac Mini M4 pre-classifies every request before deciding whether Claude sees it at all. That step alone cuts cloud API usage by 30-40% in my stack, at least for my workload; I'm not sure how it generalizes. I just swapped the local model from Qwen 3.5 to Gemma 4, and classification time dropped from 8.5s to 1.9s.

Full writeup at https://thoughts.jock.pl/p/local-llm-35b-mac-mini-gemma-swap-production-2026 if you want the mmap config details. Have you tested routing combined with semantic caching - do they stack or overlap?
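For readers wondering what a local pre-classification gate looks like in code, here is a minimal sketch. Everything in it is illustrative, not from the writeup: `local_classify` is a crude keyword/length heuristic standing in for the on-device 35B MoE's difficulty score, and the names `Route` and `route_request` are invented for this example.

```python
# Minimal sketch of a local-first routing gate.
# local_classify() is a stand-in for the local model's difficulty
# score in [0, 1]; in the real stack this would be an on-device LLM call.
from enum import Enum


class Route(Enum):
    LOCAL = "local"  # answered by the on-device model
    CLOUD = "cloud"  # escalated to the cloud API (e.g. Claude)


def local_classify(prompt: str) -> float:
    """Toy difficulty score: long prompts or 'hard' keywords score high."""
    hard_markers = ("prove", "refactor", "architecture", "debug")
    score = min(len(prompt) / 2000, 1.0)
    if any(m in prompt.lower() for m in hard_markers):
        score = max(score, 0.8)
    return score


def route_request(prompt: str, threshold: float = 0.7) -> Route:
    """Keep easy requests local; escalate hard ones to the cloud."""
    return Route.CLOUD if local_classify(prompt) >= threshold else Route.LOCAL


print(route_request("What time zone is UTC+2?"))             # easy, stays local
print(route_request("Debug this distributed lock, please"))  # hard, escalates
```

The design point is that the classifier only has to be good enough to sort easy from hard; every request it keeps local is a cloud call avoided, which is where the 30-40% savings would come from.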
