AI Cost Optimization: How to Reduce API Bills Without Losing Quality
A weekend project I built last spring cost me $140 in API fees before it saw a single real user. The code worked. The model responses were good. But I'd written every prompt like money was no object — long system prompts, GPT-4 for every request, no caching, no batching. The bill fixed that habit fast.
API costs aren't a reason to avoid AI. They're a reason to be deliberate about where you spend them.
What This Covers
How to cut your AI API costs by 40–70% without degrading the quality your users actually experience. Token counting, model routing, prompt compression, caching strategies, and the Batch API — with working Python code throughout.
Prerequisites
- Python 3.10+
- An OpenAI API key
- Basic familiarity with making API calls
- Some existing code to optimize (or a project you're actively building)
The Token Bill, Explained
Before you can cut costs, you need to understand what's generating them.
Every API call is priced by tokens, which average roughly 4 characters each for English text. Most providers charge separately for input tokens (what you send) and output tokens (what comes back), and output tokens are almost always more expensive. On GPT-4o, output tokens cost 4x as much as input tokens ($10 versus $2.50 per million). On Claude Sonnet, the ratio is 5:1.
Two things follow from this. Long responses are expensive. A model that explains its reasoning in 800 tokens when 150 would do costs you real money at scale. And the system prompt runs on every single request. If your system prompt is 2,000 tokens and you make 10,000 calls a day, that's 20 million input tokens before your users type a word.
Neither of these is obvious until you've seen a bill.
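To make both effects concrete, here is a minimal cost sketch. The `Pricing` class and `request_cost` function are names I made up for illustration, and the prices are assumed GPT-4o list prices ($2.50/M input, $10/M output), so swap in whatever your provider currently charges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (assumed values; check your provider)."""
    input_per_m: float
    output_per_m: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost in USD of one call, billing input and output tokens separately."""
    return (
        input_tokens / 1_000_000 * pricing.input_per_m
        + output_tokens / 1_000_000 * pricing.output_per_m
    )


# Assumed GPT-4o list prices: $2.50/M input, $10/M output.
GPT_4O = Pricing(input_per_m=2.50, output_per_m=10.00)

# A 2,000-token system prompt sent 10,000 times a day, before any user text.
daily_system_prompt_cost = 10_000 * request_cost(2_000, 0, GPT_4O)
print(f"System prompt overhead: ${daily_system_prompt_cost:.2f}/day")

# The same call with an 800-token answer versus a 150-token answer.
verbose = request_cost(2_000, 800, GPT_4O)
concise = request_cost(2_000, 150, GPT_4O)
print(f"Verbose: ${verbose:.4f} per call, concise: ${concise:.4f} per call")
```

Under those assumed prices, the system prompt alone comes to $50 a day, and the verbose answer costs exactly twice what the concise one does.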
Pick the Model That Matches the Task
Not every request needs your most powerful model. This is the single highest-leverage change most projects can make.
GPT-4o is good, but at list prices it's roughly 17x the price of GPT-4o-mini for the same token count. Claude Opus costs about 15x what Haiku costs. If you're using the top-tier model for everything (classification, extraction, routing, simple Q&A), you're overpaying by a factor that compounds quickly.
The pattern is called model routing: use a cheap, fast model for work it can handle, and escalate to a more capable model only when the task demands it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_intent(user_message: str) -> str:
    """Cheap first pass: label the message with the small model."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the user's intent as one of: question, complaint, "
                    "purchase, other. Reply with only the category word."
                ),
            },
            {"role": "user", "content": user_message},
        ],
        max_tokens=10,
        temperature=0,
    )
    # Normalize case and trailing punctuation so "Complaint." still matches.
    label = response.choices[0].message.content or ""
    return label.strip().lower().rstrip(".")


def answer_question(user_message: str, context: str) -> str:
    """Expensive path: answer from the supplied context with the larger model."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"Answer questions using this context:\n\n{context}",
            },
            {"role": "user", "content": user_message},
        ],
        max_tokens=500,
    )
    return response.choices[0].message.content or ""


def handle_message(user_message: str, context: str) -> str:
    intent = classify_intent(user_message)
    if intent == "complaint":
        # Canned handoff: no second model call needed.
        return "I'll connect you with our support team."
    # Questions, purchases, and anything unrecognized go to the capable model.
    return answer_question(user_message, context)
The classification call costs a fraction of a cent. The expensive model only runs when the task actually needs it.
Compress Your Prompts
Every unnecessary word in your system prompt costs you tokens on every single request.
I once audited a production system prompt that had grown to 3,800 tokens. It had redundant instructions ("always be helpful, always be professional, always provide accurate information"), a long preamble explaining what the product was, and three paragraphs restating the same constraint in different words. After trimming the repetition, it was 900 tokens. That's 2,900 tokens saved per request. At 500,000 requests per month, that's 1.45 billion input tokens, or around $3,625/month on GPT-4o at $2.50 per million.
Prompt compression isn't about cutting corners. It's about writing precisely.
Run this to see what you're actually sending:
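import tiktoken


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens the way the given model's tokenizer would."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Model name unknown to this tiktoken version: fall back to GPT-4o's encoding.
        enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(text))


system_prompt = "You are a helpful customer service assistant for AcmeCorp..."
token_count = count_tokens(system_prompt)

# $2.50 per million input tokens (GPT-4o), 10,000 requests per day.
daily_cost_estimate = (token_count / 1_000_000) * 2.50 * 10_000

print(f"System prompt: {token_count} tokens")
print(f"Estimated daily input cost at 10k requests: ${daily_cost_estimate:.2f}")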
If your system prompt is over 500 tokens, read it sentence by sentence. Ask whether each one changes the model's behavior in a way you'd notice. Most won't.
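Reading sentence by sentence goes faster if something flags the likely repeats first. The sketch below is one rough way to do that with the standard library: it splits a prompt into sentences and reports pairs that are nearly identical. The function names, the 0.8 similarity threshold, and the sample prompt are all my own assumptions, and character-level similarity will miss constraints that are restated in different words, so treat the output as a starting point for a manual read.

```python
import re
from difflib import SequenceMatcher


def split_sentences(prompt: str) -> list[str]:
    """Naive splitter: break after ., !, ? or at newlines, dropping empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", prompt)
    return [p.strip() for p in parts if p.strip()]


def find_near_duplicates(
    prompt: str, threshold: float = 0.8
) -> list[tuple[str, str, float]]:
    """Return sentence pairs whose character-level similarity meets the threshold."""
    sentences = split_sentences(prompt)
    pairs = []
    for i, first in enumerate(sentences):
        for second in sentences[i + 1:]:
            ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
            if ratio >= threshold:
                pairs.append((first, second, round(ratio, 2)))
    return pairs


# Hypothetical prompt with one instruction stated twice.
sample_prompt = (
    "Always respond in a professional tone. "
    "Never share internal pricing. "
    "Always respond in a professional tone with customers. "
    "Keep answers under 100 words."
)

duplicates = find_near_duplicates(sample_prompt)
for first, second, ratio in duplicates:
    print(f"{ratio}: {first!r} ~ {second!r}")
```

Each flagged pair is a candidate for merging into one sentence. Recount the tokens afterwards to see what the trim bought you.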
Use Prompt Caching
OpenAI automatically caches prompt prefixes that are at least 1,024 tokens. Cached input tokens bill at 50% of the standard rate. You don't have to enable it — it triggers when the same prefix appears repeatedly across requests.
To get cache hits consistently, put your static content first — system prompt, reference documents, examples — and keep the variable content (the user's actual message) at the end.
from openai import OpenAI

client = OpenAI()

# Static content goes first so every request shares the same cacheable prefix.
SYSTEM_PROMPT = """
You are a technical documentation assistant for AcmeCorp's developer platform.
[The rest of your long, static system prompt — needs to be 1,024+ tokens to qualify]
"""


def answer_with_docs(user_question: str, docs: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                # Variable content last: the user's question ends the prompt.
                "content": f"Documentation:\n{docs}\n\nQuestion: {user_question}",
            },
        ],
    )
    usage = response.usage
    # cached_tokens can be missing or None; treat both as zero.
    cached = getattr(usage.prompt_tokens_details, "cached_tokens", 0) or 0
    savings_pct = (
        round((cached / usage.prompt_tokens) * 100, 1) if usage.prompt_tokens else 0
    )
    return {
        "answer": response.choices[0].message.content,
        "cached_tokens": cached,
        "total_input_tokens": usage.prompt_tokens,
        # Share of input tokens served from cache, not the dollar saving.
        "cache_savings_pct": savings_pct,
    }
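The share of tokens served from cache isn't the same as the share of money saved, because cached tokens are discounted rather than free. As a rough sketch of the conversion, assuming GPT-4o's $2.50/M input price and the 50% cached-token discount described above (`cached_input_cost` is a hypothetical helper, not part of the OpenAI SDK):

```python
def cached_input_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_m: float = 2.50,      # assumed GPT-4o input price, USD per million
    cached_discount: float = 0.50,  # assumed discount on cached input tokens
) -> float:
    """Input cost in USD for one request, given how many tokens hit the cache."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    cached_price = price_per_m * (1 - cached_discount)
    return (uncached_tokens * price_per_m + cached_tokens * cached_price) / 1_000_000


# Example: a 3,000-token prompt where 2,048 tokens were served from cache.
with_cache = cached_input_cost(3_000, 2_048)
without_cache = cached_input_cost(3_000, 0)
saving_pct = (1 - with_cache / without_cache) * 100
print(f"${with_cache:.5f} vs ${without_cache:.5f} per request ({saving_pct:.1f}% saved)")
```

In this example about two thirds of the prompt is cached, but the input bill drops by roughly a third. The longer your static prefix is relative to the variable tail, the closer you get to the full 50%.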
Anthropic's prompt caching is explicit: you mark which parts of your prompt to cache with a cache_control block. Cached reads bill at 10% of the standard input rate, while the request that writes the cache pays a 25% premium on those tokens. The TTL is 5 minutes by default, extendable to 1 hour at a higher write price. For long system prompts or documents you send on every call, this cuts input costs dramatically.
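Here is a sketch of what that marking looks like. It only builds the request payload as a plain dictionary, so it runs without the SDK installed. `build_cached_request` is a hypothetical helper, and the model ID is a placeholder you would replace with a real one.

```python
def build_cached_request(
    system_prompt: str,
    user_message: str,
    model: str,
    max_tokens: int = 500,
) -> dict:
    """Build Messages API arguments with the system prompt marked as cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # The system prompt is passed as a list of blocks so one can carry cache_control.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable content stays outside the cached block.
        "messages": [{"role": "user", "content": user_message}],
    }


request = build_cached_request(
    system_prompt="[your long, static system prompt]",
    user_message="How do I rotate an API key?",
    model="your-claude-model-id",  # placeholder, not a real model ID
)
print(request["system"][0]["cache_control"])
```

With the official anthropic package installed, the dictionary can be passed straight through as `anthropic.Anthropic().messages.create(**request)`. The response's usage object reports `cache_creation_input_tokens` and `cache_read_input_tokens`, which tell you whether a given call wrote the cache or read from it.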
The Batch API
Not every call needs to complete in under a second. Background jobs, bulk processing, and scheduled reports are all good candidates for the Batch API, which cuts costs by 50% and processes requests within 24 hours.
import json

from openai import OpenAI

client = OpenAI()


def submit_batch(items: list[dict]) -> str:
    """Write one request per item to a JSONL file, upload it, and start a batch."""
    requests = [
        {
            "custom_id": f"req-{i}",  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": "Summarize this in one sentence."},
                    {"role": "user", "content": item["text"]},
                ],
                "max_tokens": 100,
            },
        }
        for i, item in enumerate(items)
    ]

    with open("batch_requests.jsonl", "w") as f:
        for req in requests:
            f.write(json.dumps(req) + "\n")

    with open("batch_requests.jsonl", "rb") as f:
        batch_file = client.files.create(file=f, purpose="batch")

    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    return batch.id


def retrieve_results(batch_id: str) -> list[dict]:
    """Return summaries for a completed batch, or [] if it isn't done yet."""
    batch = client.batches.retrieve(batch_id)
    if batch.status != "completed" or not batch.output_file_id:
        return []

    output = client.files.content(batch.output_file_id)
    results = []
    for line in output.text.strip().split("\n"):
        result = json.loads(line)
        response = result.get("response")
        if not response or response.get("status_code") != 200:
            continue  # skip requests that did not succeed
        results.append({
            "id": result["custom_id"],
            "summary": response["body"]["choices"][0]["message"]["content"],
        })
    return results
One caveat: 24 hours is the only completion window the Batch API offers, and a batch that hasn't finished by then expires. Many batches finish sooner, but nothing guarantees it. If your use case needs results on a tighter schedule, batching isn't the right tool. If you're running nightly data pipelines or generating weekly reports, the 50% discount is worth the wait.
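Because a batch can end in more than one way, it helps to poll until it reaches a terminal status instead of checking only for "completed". The helper below is a sketch under my own assumptions: `wait_for_batch` is a made-up name, the status check is injected as a callable so the loop can be tried without network access, and the terminal statuses listed are the ones I'd expect from the OpenAI Batch API.

```python
import time
from typing import Callable

# Statuses after which the batch will not change again (assumed set).
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(
    get_status: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 24 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it returns a terminal status, then return that status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Batch still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Dry run with a scripted status sequence and a no-op sleep.
fake_statuses = iter(["validating", "in_progress", "finalizing", "completed"])
final_status = wait_for_batch(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

Against the real API, the callable would be something like `lambda: client.batches.retrieve(batch_id).status`. Fetch results only when the returned status is "completed", and log anything else so failed or expired batches don't pass silently.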
When Does Model Routing Actually Break Down?
Cheaper, faster, or smarter — you're always trading something. The goal isn't to minimize cost in isolation; it's to minimize cost per useful result.
Response latency matters for a live chat interface. It doesn't matter for a nightly report. A 70% cost reduction through model routing works if your cheaper model handles 90% of the actual workload well. It erodes if the cheap model fails on 40% of requests and you end up calling the expensive one anyway: those requests now cost two calls instead of one and take longer to answer.
Test routing decisions with real traffic before you deploy. Run 1,000 representative requests through the cheap model and measure quality, error rate, and fallback rate. The math only holds if the assumptions hold.
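Once you've measured a fallback rate, the arithmetic is short. This sketch assumes a try-cheap-first design in which every request pays for the cheap model and a fraction also pays for the expensive one. The function names and the per-request costs ($0.001 and $0.015) are hypothetical placeholders for your own measurements.

```python
def routed_cost(cheap_cost: float, expensive_cost: float, fallback_rate: float) -> float:
    """Average cost per request when failed cheap calls are retried on the big model."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return cheap_cost + fallback_rate * expensive_cost


def savings_vs_expensive_only(
    cheap_cost: float, expensive_cost: float, fallback_rate: float
) -> float:
    """Fraction saved compared with sending every request to the expensive model."""
    return 1 - routed_cost(cheap_cost, expensive_cost, fallback_rate) / expensive_cost


def break_even_fallback_rate(cheap_cost: float, expensive_cost: float) -> float:
    """Fallback rate at which routing costs the same as using the expensive model alone."""
    return 1 - cheap_cost / expensive_cost


# Hypothetical average per-request costs in USD.
CHEAP, EXPENSIVE = 0.001, 0.015

for rate in (0.10, 0.40):
    saved = savings_vs_expensive_only(CHEAP, EXPENSIVE, rate)
    print(f"Fallback rate {rate:.0%}: {saved:.0%} saved")

print(f"Break-even fallback rate: {break_even_fallback_rate(CHEAP, EXPENSIVE):.0%}")
```

This only counts dollars. A fallback request also waits for two model calls in sequence, and a cheap answer that is wrong but never escalated doesn't show up in the numbers at all, which is why the quality and error-rate measurements matter as much as the fallback rate.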
Control Output Length
Set max_tokens on every call. Don't leave it open.
You pay for every token the model generates, so the cap bounds your worst-case cost per call. Left uncapped, a model given a "summarize this" prompt will often write more than you need. The cap itself is a hard cutoff, not an instruction: the model doesn't see it and will stop mid-sentence when it hits the limit. So do both. If you want a one-sentence summary, say so in the prompt and set max_tokens to 75. If you want five bullet points, ask for five and set it to 200.
On GPT-4o, output tokens cost $10/M. If your average response runs 400 tokens when 150 would do, you're spending 2.7x more than necessary on every single output.
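A cap that's too tight has its own cost: responses get cut off. With the Chat Completions API, a response that hit the limit reports a finish_reason of "length" instead of "stop", so you can track how often it happens. A minimal sketch, where `truncation_rate` and the sample data are hypothetical:

```python
from collections import Counter


def truncation_rate(finish_reasons: list[str]) -> float:
    """Fraction of responses that stopped because they hit the max_tokens cap."""
    if not finish_reasons:
        return 0.0
    counts = Counter(finish_reasons)
    return counts["length"] / len(finish_reasons)


# Hypothetical sample: finish_reason values collected from 20 logged responses,
# e.g. via response.choices[0].finish_reason on each call.
logged_reasons = ["stop"] * 18 + ["length"] * 2

rate = truncation_rate(logged_reasons)
print(f"Truncated responses: {rate:.0%}")
```

If the rate is more than a few percent, either the cap is too low for the task or the prompt isn't asking clearly enough for a short answer.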
Wrapping Up
The cheapest way to run AI at scale is to stop treating every request as equivalent. Classification isn't reasoning. Summarization isn't analysis. A question about a return policy doesn't need the same model as a nuanced complaint that requires judgment.
Start with token counting — it makes the problem visible. Add prompt caching for any system prompt over 1,024 tokens. Route by task complexity. Move batch-eligible work to the Batch API. Together, those four changes will cut most production AI bills by 40–70% without the experience degrading in any way users would notice.
Measure before and after. "Cheaper" only counts if quality holds.