Module 4 · 35 min read · Building with AI APIs
Tokens and Cost Optimization
Every word you send and every word the model returns costs money. Understanding how tokens work, what determines your costs, and how to reduce them without sacrificing quality is one of the most valuable skills for any production AI engineer. The difference between a naive implementation and an optimized one can be a 10x reduction in costs at scale.
What tokens actually are
AI language models do not process text character by character or word by word. They process tokens — subword units that are the result of a preprocessing step called tokenization. A token can be as long as a full word or as short as a single character, depending on how common that sequence is in the training data.
The intuition: common English words tend to be single tokens. Uncommon words, technical jargon, and non-English text get split into multiple tokens. Numbers and punctuation have their own tokenization patterns that can be surprising.
Common words → one token each
"cat", "the", "and", "Python", "function" — all single tokens. Common programming terms are usually single tokens too because they appear frequently in training data.
Numbers → one token per 1-3 digits
The number "42" is one token. With the tokenizers used by recent OpenAI models, longer digit strings are split into chunks of up to three digits, so "1234" is typically two tokens and "12345678" three. A long number is never a single token, which makes long numbers and arithmetic expensive in token terms.
Non-English text → much higher token count
A sentence in Chinese, Arabic, or Thai will typically use 2-4x more tokens than the same content in English, because the tokenizer was trained on predominantly English text. This dramatically affects cost for multilingual applications.
Whitespace and formatting → tokens too
Newlines, spaces, and indentation all consume tokens. JSON with lots of whitespace uses more tokens than compact JSON. Markdown formatting (##, **, ---) adds tokens without adding information.
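To make the whitespace point concrete, here is a minimal sketch that serializes the same record as pretty-printed and compact JSON and compares their sizes. The `record` contents are made up for illustration, and character count is only a proxy: tokenizers often merge runs of spaces into a single token, so the real token saving should be measured with the model's tokenizer.

```python
import json

# Hypothetical record used only for illustration.
record = {
    "id": 1024,
    "name": "Ada Lovelace",
    "tags": ["math", "computing"],
    "active": True,
}

# Pretty-printed: indentation and newlines are all extra characters.
pretty = json.dumps(record, indent=4)

# Compact: no spaces after separators, no newlines.
compact = json.dumps(record, separators=(",", ":"))

saved = len(pretty) - len(compact)
print(f"pretty:  {len(pretty)} chars")
print(f"compact: {len(compact)} chars")
print(f"saved:   {saved} chars ({saved / len(pretty):.0%})")
```

Both strings decode to the same object, so the compact form loses nothing the model needs.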
Counting tokens with tiktoken
OpenAI's tiktoken library lets you count tokens before making an API call. This is essential for cost estimation, context window management, and building systems that won't accidentally exceed limits.
pip install tiktoken
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for a text string using the model's tokenizer."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def count_messages_tokens(messages: list, model: str = "gpt-4o") -> int:
    """
    Count tokens for a messages array, including message formatting overhead.
    Each message has a small overhead (~4 tokens) for the role/content structure.
    """
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for message in messages:
        total += 4  # overhead per message (role + content + separators)
        for key, value in message.items():
            total += len(enc.encode(str(value)))
    total += 2  # priming tokens for the assistant reply
    return total

# Example usage
system_prompt = "You are a helpful assistant. Answer questions concisely."
user_message = "Explain the difference between TCP and UDP protocols."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_message},
]
prompt_tokens = count_messages_tokens(messages)
print(f"System prompt: {count_tokens(system_prompt)} tokens")
print(f"User message: {count_tokens(user_message)} tokens")
print(f"Total prompt tokens (with overhead): {prompt_tokens}")

# Tokenization quirk demonstration
enc = tiktoken.encoding_for_model("gpt-4o")  # load the encoder once, not per word
words = ["cat", "concatenate", "pneumonoultramicroscopicsilicovolcanoconiosis"]
for word in words:
    tokens = enc.encode(word)
    print(f"{word!r:50} → {len(tokens)} token(s): {tokens}")
Rule of thumb
For English text: 1 token ≈ 4 characters ≈ 0.75 words. A typical paragraph of 100 words is roughly 130 tokens. A full page (400–500 words) is roughly 500–650 tokens. These are approximations — always measure with tiktoken for precision.
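When tiktoken is not available (for example in a quick script or a non-OpenAI context), the rule of thumb can be turned into a rough estimator. The sketch below averages the two heuristics from this section; `estimate_tokens` is a hypothetical helper, it assumes English prose, and it is not a substitute for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough English-only estimate: average of the chars/4 and words/0.75 rules."""
    by_chars = len(text) / 4              # 1 token is about 4 characters
    by_words = len(text.split()) / 0.75   # 1 token is about 0.75 words
    return round((by_chars + by_words) / 2)


sample = "Explain the difference between TCP and UDP protocols."
print(f"Estimated tokens: {estimate_tokens(sample)}")

# A 100-word paragraph should land near the ~130 tokens quoted above.
paragraph = " ".join(["word"] * 100)
print(f"100-word paragraph: ~{estimate_tokens(paragraph)} tokens")
```

Use an estimate like this for budgeting and sanity checks only; anything that enforces a hard limit should count with the actual tokenizer.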
Context window limits
Every model has a maximum context window — the total number of tokens it can process in a single request (input + output combined). Exceeding this limit results in an error. Understanding these limits is critical for designing systems that handle long documents or extended conversations.
| Model | Context window | Approximate pages (at 500–650 tokens per page) |
| --- | --- | --- |
| gpt-4o-mini | 128,000 tokens | ~200–250 pages |
| gpt-4o | 128,000 tokens | ~200–250 pages |
| claude-opus-4-5 | 200,000 tokens | ~300–400 pages |
| claude-haiku-3-5 | 200,000 tokens | ~300–400 pages |
| gemini-1.5-pro | 2,000,000 tokens | ~3,000–4,000 pages |
Large context windows do not mean you should use them carelessly. Cost scales linearly with token count, and latency increases with context length. Just because you can fit an entire codebase in the context does not mean you should — targeted retrieval is usually faster and cheaper.
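Because input and output share the window, a request should be checked against the limit before it is sent. The following is a minimal sketch under the assumption that the window sizes in the table above are current; `CONTEXT_WINDOWS`, `fits_in_context`, and `max_prompt_tokens` are illustrative names, not part of any SDK.

```python
# Context window sizes in tokens, copied from the table above (may be outdated).
CONTEXT_WINDOWS = {
    "gpt-4o-mini": 128_000,
    "gpt-4o": 128_000,
    "claude-opus-4-5": 200_000,
    "claude-haiku-3-5": 200_000,
    "gemini-1.5-pro": 2_000_000,
}


def fits_in_context(prompt_tokens: int, max_output_tokens: int, model: str) -> bool:
    """Return True if the prompt plus the reserved output budget fits the window."""
    if model not in CONTEXT_WINDOWS:
        raise ValueError(f"Unknown model: {model}")
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOWS[model]


def max_prompt_tokens(max_output_tokens: int, model: str) -> int:
    """Largest prompt that still leaves room for the requested output budget."""
    if model not in CONTEXT_WINDOWS:
        raise ValueError(f"Unknown model: {model}")
    return max(CONTEXT_WINDOWS[model] - max_output_tokens, 0)


print(fits_in_context(120_000, 4_000, "gpt-4o"))   # 124,000 <= 128,000
print(fits_in_context(127_000, 4_000, "gpt-4o"))   # 131,000 > 128,000
print(max_prompt_tokens(4_000, "claude-opus-4-5"))
```

Running this check before the API call turns a runtime error into a decision point where the system can trim, summarize, or retrieve instead.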
Pricing: input vs output tokens
Most providers price input tokens and output tokens differently. Output tokens cost more because the model must generate them one at a time (autoregressive generation), while input tokens are processed in parallel. For the models listed below, output costs 4-5x more per token than input on the same model.
| Model | Input (USD per 1M tokens) | Output (USD per 1M tokens) |
| --- | --- | --- |
| gpt-4o-mini | $0.15 | $0.60 |
| gpt-4o | $2.50 | $10.00 |
| claude-haiku-3-5 | $0.80 | $4.00 |
| claude-sonnet-4-5 | $3.00 | $15.00 |
| claude-opus-4-5 | $15.00 | $75.00 |
| gemini-1.5-flash | $0.075 | $0.30 |
| gemini-1.5-pro | $1.25 | $5.00 |
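The input/output asymmetry is easier to see as a ratio. This short sketch computes the output-to-input price multiple for each row of the table above; the `PRICES` dictionary simply restates those indicative figures and should be treated as an assumption, not a live price list.

```python
# (input, output) prices in USD per million tokens, restated from the table above.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "claude-haiku-3-5": (0.80, 4.00),
    "claude-sonnet-4-5": (3.00, 15.00),
    "claude-opus-4-5": (15.00, 75.00),
    "gemini-1.5-flash": (0.075, 0.30),
    "gemini-1.5-pro": (1.25, 5.00),
}

# How many times more expensive one output token is than one input token.
ratios = {model: out / inp for model, (inp, out) in PRICES.items()}

for model, ratio in ratios.items():
    print(f"{model:18} output/input = {ratio:.1f}x")
```

The practical consequence is that trimming a verbose response usually saves more per token than trimming the prompt by the same amount.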
Prices change frequently
AI pricing is in active flux. The figures above are indicative — always check the provider's current pricing page before building cost estimates into business plans or contracts. Providers regularly reduce prices as infrastructure improves.
Cost estimation formula
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    model: str = "gpt-4o"
) -> float:
    """Estimate API call cost in USD."""
    # Prices in USD per million tokens
    pricing = {
        "gpt-4o":            {"input": 2.50,  "output": 10.00},
        "gpt-4o-mini":       {"input": 0.15,  "output": 0.60},
        "claude-opus-4-5":   {"input": 15.00, "output": 75.00},
        "claude-sonnet-4-5": {"input": 3.00,  "output": 15.00},
        "claude-haiku-3-5":  {"input": 0.80,  "output": 4.00},
        "gemini-1.5-pro":    {"input": 1.25,  "output": 5.00},
        "gemini-1.5-flash":  {"input": 0.075, "output": 0.30},
    }
    if model not in pricing:
        raise ValueError(f"Unknown model: {model}")
    p = pricing[model]
    input_cost = (prompt_tokens / 1_000_000) * p["input"]
    output_cost = (completion_tokens / 1_000_000) * p["output"]
    return input_cost + output_cost

# A document summarization call: 3000 input tokens, 400 output tokens
cost = estimate_cost(3000, 400, "gpt-4o")
print(f"Single call cost: ${cost:.6f}")  # ~$0.0115

# At scale: 1000 such calls per day
daily_cost = cost * 1000
monthly_cost = daily_cost * 30
print(f"Daily: ${daily_cost:.2f} | Monthly: ${monthly_cost:.2f}")
# Daily: $11.50 | Monthly: $345.00

# Same workload on gpt-4o-mini (30,000 calls per month)
cost_mini = estimate_cost(3000, 400, "gpt-4o-mini")
print(f"gpt-4o-mini monthly: ${cost_mini * 30000:.2f}")
# gpt-4o-mini monthly: $20.70, about 17x cheaper
Prompt caching
Both Anthropic and OpenAI offer prompt caching, a mechanism that stores the KV (key-value) cache of your prompt on their servers and reuses it for subsequent requests with the same prefix. This can cut the input cost of the cached portion by 50-90% for systems that repeat long system prompts or document contexts across many calls.
Anthropic cache_control
With Anthropic, you explicitly mark cache breakpoints using cache_control. The prompt prefix up to and including a marked block is eligible for caching, provided it meets the model's minimum cacheable length. Cached reads cost 10% of the normal input token price; cache writes (the first request) cost 125% of the normal input price, a premium that is amortized over subsequent reads.
import anthropic

client = anthropic.Anthropic()

# A 5000-token document you process many questions about
long_document = "..."  # your document content here

def ask_about_document(question: str, document: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a document analyst. Answer questions about the provided document accurately.",
                "cache_control": {"type": "ephemeral"}  # cache the system prompt
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Document:\n\n{document}",
                        "cache_control": {"type": "ephemeral"}  # cache the document
                    },
                    {
                        "type": "text",
                        "text": f"Question: {question}"
                        # question is NOT cached; it changes each time
                    }
                ]
            }
        ]
    )
    # Check cache performance
    usage = response.usage
    print(f"Cache write tokens: {getattr(usage, 'cache_creation_input_tokens', 0)}")
    print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0)}")
    return response.content[0].text

# First call: cache miss, pays 125% for the cache write
answer1 = ask_about_document("What is the main conclusion?", long_document)
# Subsequent calls: cache hit, pays only 10% for the cache read
answer2 = ask_about_document("What methodology was used?", long_document)
answer3 = ask_about_document("List the key findings.", long_document)
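The 125% write and 10% read multipliers imply a break-even point, which the following sketch works out. It assumes every call after the first arrives while the cache entry is still alive (so there is exactly one write), uses the indicative $15.00 per million input price from the pricing table, and covers only the cached prefix, not the per-question tokens. The function names are hypothetical.

```python
def cost_without_cache(prefix_tokens: int, calls: int, price_per_m: float) -> float:
    """Input cost of resending the same prefix on every call at full price."""
    return calls * prefix_tokens / 1_000_000 * price_per_m


def cost_with_cache(
    prefix_tokens: int,
    calls: int,
    price_per_m: float,
    write_mult: float = 1.25,  # first request writes the cache at 125%
    read_mult: float = 0.10,   # later requests read it at 10%
) -> float:
    """Input cost of the prefix with one cache write followed by cache reads."""
    if calls <= 0:
        return 0.0
    per_call = prefix_tokens / 1_000_000 * price_per_m
    return per_call * write_mult + per_call * read_mult * (calls - 1)


for calls in (1, 2, 3, 100):
    plain = cost_without_cache(5000, calls, 15.00)
    cached = cost_with_cache(5000, calls, 15.00)
    print(f"{calls:>3} calls: no cache ${plain:.4f} | cached ${cached:.4f}")
```

Under these assumptions caching costs slightly more for a single call and pays for itself from the second call onward, which is why it suits prompts that are reused within a short time span.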
OpenAI automatic caching
OpenAI caches automatically; you do not opt in. For prompts of 1,024 tokens or more, any prefix that has been seen recently is served from cache at 50% of the normal input token price. You can check response.usage.prompt_tokens_details.cached_tokens to see how many tokens were cached.
from openai import OpenAI

client = OpenAI()

# OpenAI caches automatically; no special parameters are needed.
# Just ensure your system prompt + repeated content comes first.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # Placeholder: a real system prompt must reach 1,024+ tokens to be cached
        {"role": "system", "content": "..." * 300},
        {"role": "user", "content": "Specific question that changes per request"}
    ]
)

# Check if caching kicked in (prompt_tokens_details can be missing or None)
details = response.usage.model_dump().get("prompt_tokens_details") or {}
cached = details.get("cached_tokens") or 0
regular = response.usage.prompt_tokens - cached
print(f"Regular input tokens: {regular} | Cached tokens: {cached}")
print(f"Effective cost savings: {cached * 0.5 / 1_000_000 * 2.50:.6f} USD")
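Because automatic caching matches on the prompt prefix, the order of static and dynamic content decides whether anything can be reused. The sketch below compares two prompt layouts by measuring how much of the text two requests share from the start. It uses characters as a stand-in for tokens, and `STATIC_INSTRUCTIONS` and both builder functions are made up for illustration.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two strings have in common."""
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n


# Hypothetical long, unchanging instructions (repeated to simulate length).
STATIC_INSTRUCTIONS = "You are a support assistant for Acme. " * 50


def static_first(question: str) -> str:
    """Cache-friendly layout: unchanging content first, the question last."""
    return STATIC_INSTRUCTIONS + "\nQuestion: " + question


def dynamic_first(question: str) -> str:
    """Cache-hostile layout: the changing question comes before everything else."""
    return "Question: " + question + "\n" + STATIC_INSTRUCTIONS


q1, q2 = "How do I reset my password?", "Where is my invoice?"
good = shared_prefix_len(static_first(q1), static_first(q2))
bad = shared_prefix_len(dynamic_first(q1), dynamic_first(q2))
print(f"static-first shared prefix:  {good} chars")
print(f"dynamic-first shared prefix: {bad} chars")
```

With the question first, the two prompts diverge almost immediately, so nothing after that point can be served from cache even though most of the text is identical.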
Context compression techniques
For long-running conversations, you must actively manage the context to avoid runaway costs and hitting context limits.
Sliding window
Keep only the N most recent turns. Simple and predictable, but loses older context entirely. Best for applications where recent context is most important (customer support, coding assistants).
Summarization
When the conversation grows long, call the API to summarize earlier turns, then replace them with the summary. Preserves semantics at much lower token cost. Adds latency at the summarization point.
Selective history
Use a retrieval step to include only the most relevant past turns, not all of them. Requires embedding the conversation turns and doing similarity search — adds complexity but scales to very long conversations.
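Selective history can be sketched without an embedding service by swapping in a much cruder similarity measure. The example below ranks past turns by word overlap (Jaccard similarity) with the new query and keeps the top k in chronological order. This is a stand-in, not the embedding-based search described above; `jaccard` and `select_relevant_turns` are hypothetical helpers, and a production system would replace the scoring function with embedding similarity.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a crude substitute for embeddings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_relevant_turns(history: list, query: str, k: int = 2) -> list:
    """Keep the k past turns most similar to the query, in original order."""
    ranked = sorted(
        enumerate(history),
        key=lambda pair: jaccard(pair[1]["content"], query),
        reverse=True,
    )[:k]
    keep = sorted(index for index, _ in ranked)  # restore chronological order
    return [history[i] for i in keep]


history = [
    {"role": "user", "content": "My invoice shows the wrong billing address"},
    {"role": "assistant", "content": "I can update the billing address on your invoice"},
    {"role": "user", "content": "Also what is the weather like today"},
    {"role": "assistant", "content": "I cannot check the weather"},
]
query = "please fix the billing address on my invoice"

for turn in select_relevant_turns(history, query, k=2):
    print(f"{turn['role']}: {turn['content']}")
```

Keeping the selected turns in their original order matters, because the model reads them as a conversation rather than as a ranked list.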
from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o-mini")
MAX_HISTORY_TOKENS = 3000

def message_tokens(msg: dict) -> int:
    """Approximate token cost of one message (content plus ~4 tokens of overhead)."""
    return len(enc.encode(msg["content"])) + 4

def trim_messages_to_budget(messages: list, max_tokens: int) -> list:
    """
    Trim conversation history to stay within a token budget.
    Always keeps: system messages + the most recent message.
    Drops oldest turns first, so the kept history stays contiguous.
    """
    if not messages:
        return messages
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]

    # Budget left for history after the required system messages
    remaining = max_tokens - sum(message_tokens(m) for m in system_msgs)

    # Walk backwards from the newest message and stop at the first one that
    # does not fit; skipping it and keeping older turns would leave a gap.
    kept = []
    for msg in reversed(other_msgs):
        cost = message_tokens(msg)
        if cost > remaining and kept:
            break
        kept.insert(0, msg)  # the newest message is kept even if over budget
        remaining -= cost
    return system_msgs + kept

def chat_with_budget(conversation: list, user_input: str) -> str:
    conversation.append({"role": "user", "content": user_input})
    trimmed = trim_messages_to_budget(conversation, MAX_HISTORY_TOKENS)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=trimmed,
        max_tokens=500
    )
    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply
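The summarization strategy can be sketched in the same spirit. In the example below, older turns are replaced by a single summary message once the history grows past a threshold. The `summarize` argument is a hypothetical callable; in a real system it would be an API call to a cheap model, and here a stub stands in so the sketch runs offline. The thresholds are arbitrary illustrative values.

```python
from typing import Callable


def compact_history(
    messages: list,
    summarize: Callable[[list], str],
    keep_recent: int = 4,
    max_messages: int = 10,
) -> list:
    """Replace older turns with one summary message once history grows too long."""
    if len(messages) <= max_messages:
        return messages
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    cut = len(turns) - keep_recent
    if cut <= 0:
        return messages  # nothing old enough to summarize
    old, recent = turns[:cut], turns[cut:]
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(old),
    }
    return system + [summary] + recent


def stub_summarize(old_turns: list) -> str:
    """Stand-in for a model call; a real version would send old_turns to an API."""
    return f"{len(old_turns)} earlier messages omitted."


conversation = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(6):
    conversation.append({"role": "user", "content": f"question {i}"})
    conversation.append({"role": "assistant", "content": f"answer {i}"})

compacted = compact_history(conversation, stub_summarize)
print(f"{len(conversation)} messages -> {len(compacted)} messages")
```

Passing the summarizer in as a function keeps the compaction logic testable and lets the summary model be changed independently of the chat model.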
The cost / quality / latency triangle
Every production AI decision involves trading off three variables. Improving one typically costs another.
- Use a smaller model when the task is simple. gpt-4o-mini often handles classification, summarization, and Q&A over well-structured text about as well as gpt-4o, at roughly 17x lower cost per token (based on the prices above).
- Shorten your prompts by removing filler and redundancy. Shorter, cleaner prompts often get better responses anyway.
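- Cache aggressively for any repeated prompt prefix. A 5000-token system prompt sent 1000 times per day is still 5000 tokens per call, but cached reads are billed as if it were about 500 tokens at Anthropic's 10% rate, or about 2500 tokens at OpenAI's 50% rate.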
- Batch where possible — some tasks (embedding generation, classification) can be processed in batches at 50% cost with OpenAI's batch API.
- Set tight max_tokens for tasks where output length is predictable. Do not give a classification task 2000 output tokens.
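Two of the tactics above, picking a smaller model for simple tasks and capping output length, can be combined in a small routing table. The sketch below is one possible shape for it; the task names, model assignments, and `max_tokens` values in `ROUTES` are illustrative assumptions that would need tuning against real quality measurements.

```python
# Hypothetical routing table: task type -> model and output cap.
ROUTES = {
    "classification":    {"model": "gpt-4o-mini", "max_tokens": 10},
    "summarization":     {"model": "gpt-4o-mini", "max_tokens": 300},
    "complex_reasoning": {"model": "gpt-4o",      "max_tokens": 1500},
}

# Unknown task types fall back to the larger model with a moderate cap.
DEFAULT_ROUTE = {"model": "gpt-4o", "max_tokens": 1000}


def route(task_type: str) -> dict:
    """Return the request settings (model and max_tokens) for a task type."""
    return dict(ROUTES.get(task_type, DEFAULT_ROUTE))


for task in ("classification", "complex_reasoning", "translation"):
    settings = route(task)
    print(f"{task:18} -> {settings['model']} (max_tokens={settings['max_tokens']})")
```

The returned dictionary can be unpacked straight into a chat completion call, which keeps cost policy in one place instead of scattered across call sites.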
The engineer's mindset on cost
Do not optimize prematurely, but do instrument from day one. Log every call's prompt_tokens, completion_tokens, model, and cost. Build a dashboard. You cannot optimize what you do not measure — and costs that look trivial in development can become thousands of dollars monthly in production.
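A minimal version of that instrumentation might look like the sketch below. It keeps records in memory and aggregates cost per model; `CallRecord` and `UsageLog` are hypothetical names, the prices are passed in by the caller rather than looked up, and a real deployment would persist the records to a database or metrics system instead of a list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    model: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float


class UsageLog:
    """In-memory call log; swap the list for durable storage in production."""

    def __init__(self) -> None:
        self.records: list = []

    def record(
        self,
        model: str,
        prompt_tokens: int,
        completion_tokens: int,
        input_price_per_m: float,
        output_price_per_m: float,
    ) -> CallRecord:
        """Store one call and its cost, given prices in USD per million tokens."""
        cost = (
            prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m
        ) / 1_000_000
        rec = CallRecord(model, prompt_tokens, completion_tokens, cost)
        self.records.append(rec)
        return rec

    def cost_by_model(self) -> dict:
        """Total spend per model across all recorded calls."""
        totals = defaultdict(float)
        for rec in self.records:
            totals[rec.model] += rec.cost_usd
        return dict(totals)


log = UsageLog()
log.record("gpt-4o", 3000, 400, 2.50, 10.00)
log.record("gpt-4o-mini", 3000, 400, 0.15, 0.60)
for model, total in log.cost_by_model().items():
    print(f"{model}: ${total:.6f}")
```

In practice the token counts would come from the `usage` field of each API response, so the log reflects what was actually billed rather than an estimate.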