LLM Cost Optimization
Reducing LLM Costs by 70%
I’m Amit Shekhar from Outcome School, where I teach AI and Machine Learning.
Let’s get started.
Question: You launched an LLM-powered customer support chatbot 3 months ago. It uses a GPT-4-class model for every request. The monthly API cost started at $5,000 and has now grown to $40,000 as user traffic increased. The budget for the chatbot is $10,000 per month. How do you reduce the cost without significantly degrading the user experience?
Answer: Large language models like GPT-4 charge per token - both input and output tokens. When every single user query goes through a large model with long prompts and lengthy responses, costs scale linearly with traffic. The problem is that not every query needs the most powerful and expensive model. Many queries are simple and repetitive, yet they all consume the same expensive resources.
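To see why costs grow linearly with traffic, it helps to do the arithmetic. The sketch below uses illustrative token counts and prices (assumptions, not real pricing) to show how per-request token cost multiplies with request volume:

```python
# Back-of-the-envelope cost model. All numbers are illustrative
# assumptions, not actual provider pricing.

def monthly_cost(requests_per_month, input_tokens, output_tokens,
                 price_in_per_1k, price_out_per_1k):
    # Each request pays for its input tokens and its output tokens.
    per_request = (input_tokens / 1000) * price_in_per_1k \
                + (output_tokens / 1000) * price_out_per_1k
    # Total cost scales linearly with request volume.
    return requests_per_month * per_request

# Example: 400k requests/month, 1,500 input + 500 output tokens each,
# at an assumed $0.03 / $0.06 per 1k tokens.
cost = monthly_cost(400_000, 1_500, 500, 0.03, 0.06)
print(f"${cost:,.0f} per month")
```

Doubling traffic doubles the bill, which is why per-request optimizations (routing, caching, shorter prompts) compound so effectively.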
Now, let us look at the solutions.
Solution 1 - Implement a Model Routing Strategy
We do not use the same model for every query. We build a router that classifies the query complexity and routes it to the appropriate model. Simple queries like “What are your business hours?” or “How do I reset my password?” go to a smaller, cheaper model like GPT-3.5 or a fine-tuned open-source model. Only complex queries that need deep reasoning go to GPT-4. In most customer support scenarios, 70 to 80% of queries are simple. This alone can cut costs by 60% or more.
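A minimal routing sketch is shown below. The keyword-and-length heuristic, the model names, and the `call_model` placeholder are all assumptions for illustration; in production the classifier is often a small model or an embedding-based check:

```python
# Sketch of a complexity router. SIMPLE_PATTERNS, the word-count cutoff,
# and the model names are illustrative assumptions.

SIMPLE_PATTERNS = ("business hours", "reset my password", "track my order")

def classify(query: str) -> str:
    q = query.lower()
    # Known FAQ phrases or very short queries are treated as simple.
    if any(p in q for p in SIMPLE_PATTERNS) or len(q.split()) < 8:
        return "simple"
    return "complex"

def route(query: str) -> str:
    # Simple queries go to the cheap model; only complex ones hit GPT-4.
    model = "small-cheap-model" if classify(query) == "simple" else "gpt-4-class-model"
    return model  # in production: return call_model(model, query)

print(route("What are your business hours?"))
print(route("My refund was approved but the amount charged back differs "
            "from the invoice after a partial return, why?"))
```

The router itself must be far cheaper than the savings it produces, which is why a heuristic or a small classifier model is preferred over asking a large model to do the triage.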
Solution 2 - Cache Common Responses
Many customer support queries are repetitive. We implement semantic caching - when a new query comes in, we check if a similar query was already answered recently. We use embedding similarity to find matches. If a cached response exists with high similarity, we return it directly without calling the LLM. We set a TTL on cached responses to keep them fresh. This eliminates redundant API calls entirely for common questions.
Solution 3 - Reduce Token Usage
We optimize both input and output tokens. For input: we shorten the system prompt, remove redundant instructions, and compress the conversation history (summarize older messages instead of sending the full history). For output: we instruct the model to be concise and set a max_tokens limit. We also avoid sending unnecessary context. Every token saved reduces cost directly. A well-optimized prompt can reduce token usage by 40 to 50%.
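History compression is the least obvious of these, so here is a minimal sketch. The `summarize` stub stands in for a call to a cheap summarization model, and `keep_last=4` is an assumed setting:

```python
# Sketch of conversation-history compression: keep the most recent turns
# verbatim and collapse older turns into one short summary message.
# summarize() is a stub standing in for a cheap summarization call.

def summarize(messages):
    topics = ", ".join(m["content"][:30] for m in messages)
    return f"Summary of earlier conversation: {topics}"

def compress_history(messages, keep_last=4):
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    # One summary message replaces all the older turns.
    return [{"role": "system", "content": summarize(older)}] + recent

history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
compressed = compress_history(history)
print(len(history), "->", len(compressed))  # 10 -> 5
```

The same idea applies to output: passing a `max_tokens` limit and adding "answer concisely" to the system prompt bounds the response cost per call.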
Solution 4 - Fine-Tune a Smaller Model
We take the conversation logs from the past 3 months and fine-tune a smaller, cheaper model specifically for our customer support domain. A fine-tuned GPT-3.5 or an open-source model like Llama can match GPT-4 performance for domain-specific tasks because it has learned the specific patterns, terminology, and response style. The inference cost of a fine-tuned smaller model is 10x to 20x cheaper than GPT-4.
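The main engineering work here is turning logs into training data. The sketch below converts a support log record into the chat-style JSONL format commonly used for fine-tuning (field names follow the OpenAI chat format; other trainers use similar shapes). The log record and system prompt are illustrative:

```python
import json

# Sketch: convert support logs into chat-format fine-tuning examples.
# The log record and system prompt are illustrative assumptions.

logs = [
    {"question": "How do I reset my password?",
     "answer": "Go to Settings > Security and click 'Reset password'."},
]

def to_training_example(record, system_prompt="You are a helpful support agent."):
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": record["question"]},
        {"role": "assistant", "content": record["answer"]},
    ]}

# One JSON object per line is the usual JSONL training-file layout.
lines = [json.dumps(to_training_example(r)) for r in logs]
print(len(lines), "training examples prepared")
```

Quality matters more than volume: filtering the logs to conversations that were actually resolved, and deduplicating near-identical exchanges, usually improves the fine-tuned model more than adding raw data.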
Solution 5 - Set Up Cost Monitoring and Rate Limiting
We implement real-time cost tracking per user, per conversation, and per day. We set daily and monthly budget caps. When the budget threshold is approached, we automatically switch to cheaper models or limit the number of LLM calls per conversation. We also identify and block abusive users who send an unusually high volume of requests. Cost visibility helps us understand where the money is going and make informed optimization decisions.
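A minimal version of the budget guard looks like this. The cap, the 80% downgrade threshold, and the model names are illustrative assumptions:

```python
# Sketch of a per-day budget guard: track spend, downgrade to a cheaper
# model near the cap, and block calls once the cap is hit. The cap,
# threshold, and model names are illustrative assumptions.

class BudgetGuard:
    def __init__(self, daily_cap_usd, downgrade_at=0.8):
        self.cap = daily_cap_usd
        self.downgrade_at = downgrade_at
        self.spent = 0.0

    def record(self, cost_usd):
        # Call this after every LLM request with its actual cost.
        self.spent += cost_usd

    def choose_model(self):
        if self.spent >= self.cap:
            return None                      # budget exhausted: block the call
        if self.spent >= self.downgrade_at * self.cap:
            return "small-cheap-model"       # near the cap: downgrade
        return "gpt-4-class-model"

guard = BudgetGuard(daily_cap_usd=300)
print(guard.choose_model())                  # well under budget
guard.record(250)                            # ~83% of the cap used
print(guard.choose_model())                  # downgraded
```

Per-user tracking works the same way with one guard per user, which is also where abusive high-volume senders become visible.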
This is how we bring LLM costs under control in production. The biggest impact comes from model routing and caching. These two strategies alone typically reduce costs by 70% or more for customer support applications.
If you want to learn everything about LLMs, RAG, MCP, Agents, fine-tuning, and quantization, refer to AI Engineering Explained: LLM, RAG, MCP, Agent, Fine-Tuning, Quantization.
Thanks
Amit Shekhar
Founder, Outcome School



Model routing is doing the most work in your breakdown, and that matches what I found too. I took it one step further: the routing layer itself runs locally. A 35B MoE on a $600 Mac Mini M4 pre-classifies every request before it decides whether Claude sees it at all. That step alone cuts cloud API usage by 30-40% in my stack (at least for my workload; I'm not sure how it generalizes). I just swapped the local model from Qwen 3.5 to Gemma 4, and classification latency dropped from 8.5s to 1.9s.
Full writeup at https://thoughts.jock.pl/p/local-llm-35b-mac-mini-gemma-swap-production-2026 if you want the mmap config details. Have you tested routing combined with semantic caching - do they stack or overlap?