When a generative AI bill gets uncomfortable, the reflex is to switch models. Move from the flagship to the mid-tier. Move from the mid-tier to the small one. It's a real lever, sometimes a 5x one, and it's the first thing every team tries.
It's also the lever with the lowest ceiling. Model selection saves you once. You pick the smaller model, the bill drops, and then the bill grows again as your traffic grows. The savings don't compound. The next 10x of traffic costs you 10x more.
Token optimization is the lever that compounds. Every token you don't send is a token you'll never send again, across every call your application makes for the rest of its lifetime. A team that spends two weeks restructuring prompts for cache efficiency saves on every single request: today, next quarter, three years from now. That's the difference between a tactical trim and a structural change.
The challenge is that "token optimization" is not one thing. It's at least four distinct layers, each with different effort, different risk, and different magnitude of savings. The most common mistake we see is teams jumping straight to architectural rework (RAG, distillation, decomposition) while leaving thousands of dollars on the table at the prompt layer. Below is the field guide we wish more teams had before they started.
Layer 1: Prompt-Level Optimization
The first layer is the cheapest to implement and almost always the highest ROI. It targets the input side of your LLM bill: the tokens you send, before the model has even started thinking.
Prompt caching, done right
Every major provider now offers a cached-prefix discount: tokens that match a previously seen prefix get billed at roughly 10% of the standard input rate. The math is irresistible. A system prompt and few-shot block that takes 4,000 tokens drops from full price to a tenth of it on every subsequent call.
Most teams have caching turned on. Very few have it tuned. The discount only applies when the cached prefix is byte-identical to a previous request, which means any of these patterns will silently destroy your cache hit rate:
- Dynamic content in the system prompt. A
datetime.now()stamp, a user ID, a session token, or a feature flag value (anything that changes per request) means every request is a cache miss. The fix is mechanical: move all dynamic content into the user message, where it doesn't affect the cached prefix. - Reordered few-shot examples. Random shuffling of in-context examples is a common "diversity" pattern that turns every request into a unique cache key. Fix the order, then test whether the diversity was buying you anything (usually no).
- Mid-prompt personalization. Injecting "the user's name is Sarah" between the system prompt and the actual question forces the model to re-parse from that point forward. Move personalization to the end of the user message.
The cost impact of getting this wrong is large and entirely invisible until you measure it.
# Cost math: 10M requests/day, 1,000-token system prompt
# Cached prefix: $0.0003/1K tokens
# Uncached prefix: $0.003/1K tokens
# With caching working: 10M × 1K × $0.0003/1K = $3,000/day
# With caching broken: 10M × 1K × $0.003/1K = $30,000/day
# Difference: $27,000/day
A $9.8M annual swing from a pattern most teams never audit.
System prompt minimization
The other prompt-level win is brutally simple: most system prompts are bloated. Audit yours. Look for:
- Instructions the model already follows without being told (modern models default to professional, helpful, and honest, so you don't need 200 tokens explaining this).
- Redundant constraints stated three different ways for emphasis (state once, in the most precise form).
- Outdated examples kept around for fear of breaking something (test their removal; most are decorative).
- Defensive instructions for failure modes that no longer occur (older models needed more handholding; newer ones don't).
A typical system prompt audit cuts 30–50% of input tokens with no measurable quality loss. Combined with proper caching, the compound effect is multiplicative.
Layer 2: Output-Level Optimization
Output tokens cost roughly four to five times more than input tokens at most providers. This single fact reorders most teams' optimization priorities once they see it. If your application's output is verbose, the output side is where the money is.
Structured output and tool use
The single largest output-side win is moving from natural language responses to structured output: JSON mode, function calling, or tool use. A model asked "what's the user's intent?" might respond with a 200-token explanation. The same model asked to return {"intent": "..."} via a tool call returns five tokens. The information content is identical. The cost is 40x lower.
For any LLM call whose output is consumed by code rather than read by a human, structured output should be the default. The rule of thumb: if your application immediately parses the response with a regex or JSON loader, you're paying for prose you're going to throw away.
Output token caps
The max_tokens parameter is one of the most underused cost controls in production LLM code. Most teams leave it at the provider default (often 4K or 8K), even though their actual response distribution rarely exceeds 500 tokens. The default is not a budget. It's a ceiling.
Audit the actual output token distribution of your top three highest-volume LLM calls. The 99th percentile is usually a fraction of the default. Setting max_tokens to roughly 1.5x the 99th percentile catches runaway generations (a real failure mode where models loop or repeat) without affecting normal traffic.
Streaming with early termination
For interactive applications, streaming responses lets you stop generation the moment you have enough. A summarization endpoint that streams output and terminates as soon as the user navigates away saves the remaining generation cost, often 30–60% of the per-request output bill on dashboards and chat interfaces where users skim.
Layer 3: Architectural Optimization
The third layer requires real engineering work but unlocks the largest savings: the ones that change the cost shape of your application rather than just trimming around the edges.
Model cascading
Most production LLM workloads have a long tail of easy queries and a small head of genuinely hard ones. Routing all traffic through your most capable model is paying for the head's intelligence on the tail's workload.
Cascading sends every query first to a small, cheap model (or a classifier). If the small model is confident, you return its answer. If not, you escalate to the larger model. A well-tuned cascade routes 70–90% of traffic to the cheap tier with no measurable quality impact, because most queries genuinely are easy. The cost reduction is roughly the ratio of cheap-to-expensive pricing times the cascade hit rate, which typically lands in the 50–70% range.
Retrieval-Augmented Generation (done for cost, not just quality)
RAG is usually framed as a grounding technique. It's also a token reduction technique. Sending an entire 50,000-token knowledge base into every prompt is enormously expensive and almost always unnecessary. The model only needs the specific chunks relevant to the current query.
A well-tuned retrieval layer cuts input tokens by 80–95% on context-heavy applications. The trade-off is the engineering cost of building and maintaining the retrieval pipeline, plus the quality risk if your chunking or embedding strategy misses relevant context. For document AI, customer support, and search-augmented assistants, this is almost always worth it.
Decomposition
Complex tasks ("write a market analysis with citations and a chart") that get sent to a single expensive model can often be decomposed into a sequence of smaller tasks ("plan the structure," "draft each section," "verify citations") where most steps can use cheaper models.
The cost math is favorable when the decomposed steps can use a model that's 5–10x cheaper than the monolithic call. The engineering math is favorable when the same decomposition makes the system easier to debug and improve, which it usually does. The risk is latency: a chain of sequential calls adds up, and parallelizing requires more careful state management.
Embeddings-first routing
The cheapest LLM call is the one you don't make. Many features that look like they need a language model can be solved with semantic search over precomputed embeddings: FAQ matching, intent classification, semantic deduplication, and similar-content recommendation. Embeddings cost a fraction of a generation call and have effectively zero per-query inference cost once the index is built.
The audit question is: for each LLM call in your application, could a vector similarity match against a curated corpus achieve the same outcome 80% of the time? If yes, that 80% should never reach the LLM.
Response caching
If the same query produces the same response, why pay twice? A cache keyed on normalized inputs (lowercased, whitespace-stripped, with personalization removed) often catches 20–40% of production LLM traffic on applications with repeating user patterns like documentation search, customer support, and internal tools. The cache itself costs nothing meaningful; the savings are pure.
Layer 4: Governance
The final layer isn't about reducing per-call cost. It's about preventing runaway cost from any single call, user, or feature. Without it, every optimization above is undermined by the next bug that ships.
Token budgets enforced in code
Most teams monitor token spend on a dashboard. Few enforce it programmatically. The dashboard alerts you after the fact; in-code enforcement prevents the spend in the first place. A per-request token budget that rejects oversized prompts before they're sent to the provider is one of the highest-ROI ten-line changes you can make.
Per-tenant and per-user quotas
For multi-tenant applications, a single abusive user or runaway integration can blow through a month's budget in hours. Per-tenant rate limits and token quotas, with circuit breakers that trip on anomalous usage, contain the blast radius. The rule of thumb: any quota you don't enforce in code, you'll eventually enforce after an incident.
Token-level anomaly detection
Aggregate dashboards miss the failure modes that matter. A single endpoint suddenly generating 50,000 tokens per request when its historical p99 was 500 is an incident in progress, but it'll look fine on a monthly chart. Detection at the per-endpoint, per-call distribution level is where you catch token explosions in hours rather than weeks.
Cost-per-feature attribution
You can't optimize what you can't attribute. Every LLM call in production should carry enough metadata (feature ID, team, customer tier, request type) that you can answer "what did the support copilot cost last month?" without a forensic investigation. This is the same attribution discipline that applies to GPU and SageMaker spend, applied to tokens.
The Lever Taxonomy
Putting the four layers side by side clarifies which lever to pull first. The pattern is consistent: prompt-level optimizations have the best ratio of savings to effort, output-level optimizations come next, architectural changes have the highest ceiling but require real investment, and governance prevents the savings from being undone.
| Lever | Typical Savings | Effort | Risk |
|---|---|---|---|
| Prompt caching (done right) | 50–90% on cached portions | Low | Low |
| System prompt minimization | 20–40% of input | Low | Low |
| Output token caps | 30–60% of output | Low | Medium |
| Structured output / tool use | 60–80% of output | Medium | Low |
| Streaming + early termination | 30–60% per request | Medium | Low |
| Model cascading | 50–70% overall | Medium | Medium |
| RAG (for cost) | 80–95% of context-heavy input | High | Medium |
| Decomposition | 40–60% overall | High | Medium |
| Embeddings-first routing | Up to 100% on offloaded queries | Medium | Medium |
| Response caching | 20–40% of total traffic | Low | Low |
How to Know Which Lever to Pull First
The taxonomy is general; your application is specific. The right first move depends entirely on where your tokens are actually going, and most teams cannot answer that question with precision.
Before you optimize, you need to know three things:
Input vs. output split. If 80% of your spend is on output tokens, prompt caching saves you almost nothing. Structured output is your first move. If 80% is input, the inverse.
Per-feature distribution. A single endpoint frequently accounts for 60–80% of total LLM cost. Optimizing the long tail of low-volume features is wasted engineering effort. Find the head first.
Cache hit rate by endpoint. If you think you have caching enabled but your provider dashboard shows a 12% hit rate, you have a cache-busting pattern you don't know about. Find it before doing anything else.
This is where most teams hit a wall. The data exists (provider APIs expose token-level usage data per request), but stitching it back to the feature, team, or customer that triggered the call is the work most teams never get around to doing.
The Underlying Problem: Token Spend Without Token Visibility
Every optimization above assumes you can attribute token spend down to the feature, the endpoint, and the user pattern that generated it. Without that attribution, "optimize tokens" is a directional goal, not an executable plan. You're guessing which lever to pull.
This is the part of the AI cost intelligence picture that gets the least attention, and the one most teams come to MLCostIntel for first. We attribute every token call across Bedrock, OpenAI, Anthropic, and self-hosted inference back to the feature, team, model, and customer that triggered it. Cache hit rates by endpoint. Input/output split by feature. Cost-per-customer for generative AI features. Anomaly detection at the per-call distribution level. The visibility that turns "we should optimize tokens" into "we should restructure the support copilot's system prompt, which is responsible for 41% of last month's spend and has a 14% cache hit rate."
Token optimization isn't a one-time project. It's a permanent capability you build, and the first thing it requires is visibility into where the tokens are actually going. If your generative AI bill is bigger than you'd like and you don't know exactly which feature is responsible, try MLCostIntel free. We connect to your AWS account, attribute your generative AI spend across the full stack, and show you which lever to pull first.