Module 4 · 35 min read · Building with AI APIs

Tokens and Cost Optimization

Every word you send and every word the model returns costs money. Understanding how tokens work, what determines your costs, and how to reduce them without sacrificing quality is one of the most valuable skills for any production AI engineer. The difference between a naive implementation and an optimized one can be a 10x reduction in costs at scale.

What tokens actually are

AI language models do not process text character by character or word by word. They process tokens — subword units that are the result of a preprocessing step called tokenization. A token can be as long as a full word or as short as a single character, depending on how common that sequence is in the training data.

The intuition: common English words tend to be single tokens. Uncommon words, technical jargon, and non-English text get split into multiple tokens. Numbers and punctuation have their own tokenization patterns that can be surprising.

Common words → one token each

"cat", "the", "and", "Python", "function" — all single tokens. Common programming terms are usually single tokens too because they appear frequently in training data.

Numbers → one token per 1-3 digits

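The number "42" is one token. With OpenAI's current encodings, digits are typically grouped in chunks of up to three, so "1234" is usually two tokens and "12345678" three. Other tokenizers split digits differently, sometimes one digit per token, which makes long numbers and arithmetic expensive in token terms.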

Non-English text → much higher token count

A sentence in Chinese, Arabic, or Thai will typically use 2-4x more tokens than the same content in English, because the tokenizer was trained on predominantly English text. This dramatically affects cost for multilingual applications.

Whitespace and formatting → tokens too

Newlines, spaces, and indentation all consume tokens. JSON with lots of whitespace uses more tokens than compact JSON. Markdown formatting (##, **, ---) adds tokens without adding information.
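To make the whitespace point concrete, here is a minimal sketch that serializes the same record as indented and as compact JSON and compares their sizes. The record is made up for illustration, and the comparison is in characters rather than tokens. Tokenizers often merge runs of spaces into a single token, so the token saving is usually smaller than the character saving; measure with a real tokenizer for exact numbers.

```python
import json

# Hypothetical record, used only for illustration
record = {
    "user": {"id": 1042, "name": "Ada", "roles": ["admin", "editor"]},
    "active": True,
    "tags": ["billing", "priority"],
}

pretty = json.dumps(record, indent=4)                # human-friendly, whitespace-heavy
compact = json.dumps(record, separators=(",", ":"))  # no spaces or newlines between items

def whitespace_share(text: str) -> float:
    """Fraction of characters that are spaces, tabs, or newlines."""
    return sum(ch.isspace() for ch in text) / len(text)

print(f"pretty:  {len(pretty)} chars, {whitespace_share(pretty):.0%} whitespace")
print(f"compact: {len(compact)} chars, {whitespace_share(compact):.0%} whitespace")

# Both forms decode to exactly the same data
assert json.loads(pretty) == json.loads(compact) == record
```

Compact output is a reasonable default for data sent to the model; keep the indented form for logs that people read.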

Counting tokens with tiktoken

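OpenAI's tiktoken library lets you count tokens locally before making an API call. This is essential for cost estimation, context window management, and building systems that won't accidentally exceed limits. Its counts are exact only for OpenAI models; other providers use their own tokenizers, so treat tiktoken numbers as an estimate there.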

pip install tiktoken

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for a text string using the model's tokenizer."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def count_messages_tokens(messages: list, model: str = "gpt-4o") -> int:
    """
    Estimate tokens for a messages array, including message formatting overhead.
    Each message adds a few tokens (about 3 on recent chat models) for the
    role/content wrapper. Treat the result as an approximation.
    """
    enc = tiktoken.encoding_for_model(model)
    total = 0

    for message in messages:
        total += 3  # approximate per-message overhead (role + content separators)
        for value in message.values():
            total += len(enc.encode(str(value)))

    total += 3  # the reply is primed with a few assistant tokens
    return total

# Example usage
system_prompt = "You are a helpful assistant. Answer questions concisely."
user_message = "Explain the difference between TCP and UDP protocols."

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_message}
]

prompt_tokens = count_messages_tokens(messages)
print(f"System prompt: {count_tokens(system_prompt)} tokens")
print(f"User message: {count_tokens(user_message)} tokens")
print(f"Total prompt tokens (with overhead): {prompt_tokens}")

# Tokenization quirk demonstration
enc = tiktoken.encoding_for_model("gpt-4o")  # build the encoder once, not once per word
words = ["cat", "concatenate", "pneumonoultramicroscopicsilicovolcanoconiosis"]
for word in words:
    tokens = enc.encode(word)
    print(f"{word!r:50} → {len(tokens)} token(s): {tokens}")

Rule of thumb

For English text: 1 token ≈ 4 characters ≈ 0.75 words. A typical paragraph of 100 words is roughly 130 tokens. A full page (400–500 words) is roughly 500–650 tokens. These are approximations — always measure with tiktoken for precision.
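When no tokenizer is available, the rule of thumb can be turned into a quick estimator. This is a minimal sketch: `estimate_tokens` is a hypothetical helper that averages the two heuristics above, and it is only meaningful for ordinary English prose.

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text: the average of two heuristics."""
    by_chars = len(text) / 4             # 1 token is about 4 characters
    by_words = len(text.split()) / 0.75  # 1 token is about 0.75 words
    return math.ceil((by_chars + by_words) / 2)

# A 100-word sample built by repeating a 10-word sentence
sample = "Tokens are the unit that language model APIs bill for. " * 10
print(f"{len(sample.split())} words -> about {estimate_tokens(sample)} tokens")
```

An estimate like this is fine for budgeting and dashboards, but not for deciding whether a request fits under a hard limit.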

Context window limits

Every model has a maximum context window — the total number of tokens it can process in a single request (input + output combined). Exceeding this limit results in an error. Understanding these limits is critical for designing systems that handle long documents or extended conversations.

Model	Context window	Approximate pages
gpt-4o-mini	128,000 tokens	~200-250 pages
gpt-4o	128,000 tokens	~200-250 pages
claude-opus-4-5	200,000 tokens	~300-400 pages
claude-haiku-3-5	200,000 tokens	~300-400 pages
gemini-1.5-pro	2,000,000 tokens	~3,000-4,000 pages

Large context windows do not mean you should use them carelessly. Cost scales linearly with token count, and latency increases with context length. Just because you can fit an entire codebase in the context does not mean you should — targeted retrieval is usually faster and cheaper.
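Because input and output share the same window, a pre-flight check has to reserve room for the reply as well as the prompt. A minimal sketch, using the window sizes from the table above; `fits_in_context` and its `safety_margin` parameter are hypothetical names, and the margin is an arbitrary cushion for counting error.

```python
# Context windows from the table above, in tokens
CONTEXT_WINDOWS = {
    "gpt-4o-mini": 128_000,
    "gpt-4o": 128_000,
    "claude-opus-4-5": 200_000,
    "claude-haiku-3-5": 200_000,
    "gemini-1.5-pro": 2_000_000,
}

def fits_in_context(model: str, prompt_tokens: int, max_output_tokens: int,
                    safety_margin: int = 500) -> bool:
    """True if prompt + reserved output + margin fit in the model's window."""
    window = CONTEXT_WINDOWS[model]
    return prompt_tokens + max_output_tokens + safety_margin <= window

print(fits_in_context("gpt-4o", 120_000, 4_000))  # True: 124,500 <= 128,000
print(fits_in_context("gpt-4o", 126_000, 4_000))  # False: 130,500 > 128,000
```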

Pricing: input vs output tokens

Most providers price input and output tokens differently. Output tokens cost more because the model must generate them one at a time (autoregressive generation), while input tokens are processed in parallel. For the models listed below, output costs 4-5x more per token than input on the same model.

Model	Input (per M tokens)	Output (per M tokens)
gpt-4o-mini	$0.15	$0.60
gpt-4o	$2.50	$10.00
claude-haiku-3-5	$0.80	$4.00
claude-sonnet-4-5	$3.00	$15.00
claude-opus-4-5	$15.00	$75.00
gemini-1.5-flash	$0.075	$0.30
gemini-1.5-pro	$1.25	$5.00

Prices change frequently

AI pricing is in active flux. The figures above are indicative — always check the provider's current pricing page before building cost estimates into business plans or contracts. Providers regularly reduce prices as infrastructure improves.

Cost estimation formula

def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    model: str = "gpt-4o"
) -> float:
    """Estimate API call cost in USD."""

    pricing = {
        "gpt-4o":          {"input": 2.50,  "output": 10.00},
        "gpt-4o-mini":     {"input": 0.15,  "output": 0.60},
        "claude-opus-4-5": {"input": 15.00, "output": 75.00},
        "claude-sonnet-4-5":{"input": 3.00, "output": 15.00},
        "claude-haiku-3-5":{"input": 0.80,  "output": 4.00},
        "gemini-1.5-pro":  {"input": 1.25,  "output": 5.00},
        "gemini-1.5-flash":{"input": 0.075, "output": 0.30},
    }

    if model not in pricing:
        raise ValueError(f"Unknown model: {model}")

    p = pricing[model]
    input_cost  = (prompt_tokens     / 1_000_000) * p["input"]
    output_cost = (completion_tokens / 1_000_000) * p["output"]
    return input_cost + output_cost

# A document summarization call: 3000 input tokens, 400 output tokens
cost = estimate_cost(3000, 400, "gpt-4o")
print(f"Single call cost: ${cost:.6f}")  # ~$0.0115

# At scale: 1000 such calls per day
daily_cost = cost * 1000
monthly_cost = daily_cost * 30
print(f"Daily: ${daily_cost:.2f} | Monthly: ${monthly_cost:.2f}")
# Daily: $11.50 | Monthly: $345.00

# Same workload on gpt-4o-mini
cost_mini = estimate_cost(3000, 400, "gpt-4o-mini")
print(f"gpt-4o-mini monthly: ${cost_mini * 30000:.2f}")
# gpt-4o-mini monthly: $20.70, roughly 17x cheaper

Prompt caching

Both Anthropic and OpenAI offer prompt caching: the provider stores the processed form of your prompt prefix (the KV, or key-value, cache) on its servers and reuses it for subsequent requests that start with the same prefix. This can cut the price of the cached input tokens by 50-90% for systems that repeat long system prompts or document contexts across many calls.

Anthropic cache_control

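With Anthropic, you explicitly mark cache breakpoints using cache_control. Everything in the prompt up to and including a breakpoint is eligible for caching, provided the prefix meets the model's minimum cacheable length. Cached reads cost 10% of the normal input token price; writes (the first request) cost 125% of the normal input price, a premium that is amortized over subsequent reads. The default "ephemeral" cache expires after about five minutes without a hit.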

import anthropic
client = anthropic.Anthropic()

# A 5000-token document you process many questions about
long_document = "..." # your document content here

def ask_about_document(question: str, document: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a document analyst. Answer questions about the provided document accurately.",
                "cache_control": {"type": "ephemeral"}  # cache the system prompt
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Document:\n\n{document}",
                        "cache_control": {"type": "ephemeral"}  # cache the document
                    },
                    {
                        "type": "text",
                        "text": f"Question: {question}"
                        # question is NOT cached — it changes each time
                    }
                ]
            }
        ]
    )

    # Check cache performance
    usage = response.usage
    print(f"Cache write tokens: {getattr(usage, 'cache_creation_input_tokens', 0)}")
    print(f"Cache read tokens:  {getattr(usage, 'cache_read_input_tokens', 0)}")

    return response.content[0].text

# First call: cache miss — pays 125% for cache write
answer1 = ask_about_document("What is the main conclusion?", long_document)

# Subsequent calls: cache hit — pays only 10% for cache read
answer2 = ask_about_document("What methodology was used?", long_document)
answer3 = ask_about_document("List the key findings.", long_document)
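The 125% write and 10% read multipliers described above imply that caching loses money on a single call and wins from the second call onward. The sketch below works through that arithmetic for the shared prefix only. It assumes every call after the first is a cache hit, which holds only if the calls arrive before the cache expires; the function names are hypothetical.

```python
def cached_input_cost(prefix_tokens: int, calls: int, price_per_m: float,
                      write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost in USD for a shared prefix: one cache write, then cache reads."""
    if calls < 1:
        return 0.0
    per_call = prefix_tokens / 1_000_000 * price_per_m
    return per_call * write_mult + per_call * read_mult * (calls - 1)

def uncached_input_cost(prefix_tokens: int, calls: int, price_per_m: float) -> float:
    """Input cost in USD if the prefix is sent at full price every time."""
    return prefix_tokens / 1_000_000 * price_per_m * calls

# 5000-token document at $15.00 per million input tokens (the claude-opus-4-5 row above)
for n in (1, 2, 10):
    with_cache = cached_input_cost(5000, n, 15.00)
    without = uncached_input_cost(5000, n, 15.00)
    print(f"{n:>2} calls: cached ${with_cache:.4f} vs uncached ${without:.4f}")
```

With these multipliers the break-even point is the second call, so caching is worth enabling for any prefix that is reused at all within the cache lifetime.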

OpenAI automatic caching

OpenAI caches automatically — you do not opt in. Any prompt prefix longer than 1024 tokens that has been seen recently will be served from cache at 50% of the normal input token price. You can check response.usage.prompt_tokens_details.cached_tokens to see how many tokens were cached.

from openai import OpenAI
client = OpenAI()

# OpenAI caches automatically — no special parameters needed
# Just ensure your system prompt + repeated content comes first
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "..." * 300},  # placeholder for a long system prompt; a real prefix needs 1024+ tokens to be cached
        {"role": "user", "content": "Specific question that changes per request"}
    ]
)

# Check if caching kicked in (the details field can be missing or None on some responses)
details = response.usage.model_dump().get("prompt_tokens_details") or {}
cached = details.get("cached_tokens") or 0
regular = response.usage.prompt_tokens - cached
print(f"Regular input tokens: {regular} | Cached tokens: {cached}")
# Cached tokens are billed at 50% of the gpt-4o input price ($2.50 per million)
print(f"Effective cost savings: {cached * 0.5 / 1_000_000 * 2.50:.6f} USD")
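Automatic caching only helps if the start of the prompt is identical from one request to the next, so anything that varies (timestamps, user names, the question itself) belongs at the end. A minimal sketch of that ordering follows. The prompt text and helper names are hypothetical, and `shared_prefix_len` compares whole messages as a rough proxy, whereas the provider matches on the token prefix.

```python
from datetime import datetime, timezone

STATIC_SYSTEM_PROMPT = "You are a support assistant for the Acme billing product."  # hypothetical

def build_messages(reference_text: str, user_question: str) -> list[dict]:
    """Put content that never changes first and per-request content last."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},             # identical on every call
        {"role": "system", "content": f"Reference:\n{reference_text}"},  # identical per document
        {"role": "user", "content": f"[{now}] {user_question}"},         # varies per request
    ]

def shared_prefix_len(a: list[dict], b: list[dict]) -> int:
    """Number of leading messages that are identical in both requests."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

first = build_messages("Refund policy text", "Can I get a refund?")
second = build_messages("Refund policy text", "How long do refunds take?")
print(shared_prefix_len(first, second))  # 2: both system messages match
```

Had the timestamp been placed in the first system message, the shared prefix would drop to zero and every request would be billed at full price.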

Context compression techniques

For long-running conversations, you must actively manage the context to avoid runaway costs and hitting context limits.

Sliding window

Keep only the N most recent turns. Simple and predictable, but loses older context entirely. Best for applications where recent context is most important (customer support, coding assistants).

Summarization

When the conversation grows long, call the API to summarize earlier turns, then replace them with the summary. Preserves semantics at much lower token cost. Adds latency at the summarization point.

Selective history

Use a retrieval step to include only the most relevant past turns, not all of them. Requires embedding the conversation turns and doing similarity search — adds complexity but scales to very long conversations.
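Selective history can be sketched without any external service. The example below ranks past turns by cosine similarity to the new query and keeps the top few in their original order. `toy_embed` is a stand-in bag-of-words vector, not a real embedding model; in practice you would pass an embedding API call as the `embed` argument. All names here are hypothetical.

```python
import math
from collections import Counter
from typing import Callable

Vector = dict[str, float]

def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return dict(Counter(text.lower().split()))

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_relevant(history: list[dict], query: str, top_k: int = 2,
                    embed: Callable[[str], Vector] = toy_embed) -> list[dict]:
    """Return the top_k past turns most similar to the query, in original order."""
    q = embed(query)
    scored = [(cosine(embed(m["content"]), q), i) for i, m in enumerate(history)]
    best = sorted(scored, reverse=True)[:top_k]
    return [history[i] for _, i in sorted(best, key=lambda s: s[1])]

history = [
    {"role": "user", "content": "my invoice total looks wrong"},
    {"role": "assistant", "content": "which invoice number is it"},
    {"role": "user", "content": "also how do I reset my password"},
    {"role": "assistant", "content": "use the reset link on the login page"},
]
picked = select_relevant(history, "why is the invoice total wrong", top_k=2)
print([m["content"] for m in picked])
```

A production version would cache the embedding of each turn instead of recomputing it on every request.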

from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o-mini")

MAX_HISTORY_TOKENS = 3000

def trim_messages_to_budget(messages: list, max_tokens: int) -> list:
    """
    Trim conversation history to stay within a token budget.
    Always keeps: system message + last user message.
    Drops oldest turns first.
    """
    if not messages:
        return messages

    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs  = [m for m in messages if m["role"] != "system"]

    # Count tokens for required messages
    def token_count(msgs):
        return sum(len(enc.encode(m["content"])) + 4 for m in msgs)

    system_tokens = token_count(system_msgs)
    budget = max_tokens - system_tokens

    # Walk backwards from the newest message and keep a contiguous run that fits.
    # Stopping at the first message that does not fit prevents gaps in the history.
    kept = []
    remaining = budget
    for msg in reversed(other_msgs):
        cost = len(enc.encode(msg["content"])) + 4
        if kept and cost > remaining:
            break
        # The most recent message is always kept, even if it alone exceeds the budget
        kept.insert(0, msg)
        remaining -= cost

    return system_msgs + kept

def chat_with_budget(conversation: list, user_input: str) -> str:
    conversation.append({"role": "user", "content": user_input})
    trimmed = trim_messages_to_budget(conversation, MAX_HISTORY_TOKENS)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=trimmed,
        max_tokens=500
    )
    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply
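The trimming code above implements a token-budgeted sliding window. The summarization strategy can be sketched in a similar shape: older turns are collapsed into one summary message and recent turns are kept verbatim. In this sketch `compress_history` is a hypothetical helper and `fake_summarize` is a stand-in so the example runs offline; in a real system the `summarize` callable would be an API call to a cheap model.

```python
from typing import Callable

def compress_history(messages: list[dict], keep_recent: int,
                     summarize: Callable[[str], str]) -> list[dict]:
    """Replace all but the last keep_recent non-system turns with one summary message."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep_recent:
        return messages  # nothing old enough to summarize

    split = len(turns) - keep_recent
    old, recent = turns[:split], turns[split:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    summary = {"role": "system",
               "content": f"Summary of earlier conversation: {summarize(transcript)}"}
    return system + [summary] + recent

# Stand-in summarizer so the sketch runs without network access
def fake_summarize(transcript: str) -> str:
    return f"{len(transcript.splitlines())} earlier turns omitted"

chat = [{"role": "system", "content": "Be concise."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(6)
]
compact = compress_history(chat, keep_recent=2, summarize=fake_summarize)
print(len(chat), "->", len(compact))  # 7 -> 4
```

Triggering the summary only when the history crosses a token threshold, rather than on every turn, keeps the extra latency rare.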

The cost / quality / latency triangle

Every production AI decision involves trading off three variables: cost, quality, and latency. Improving one typically costs another. These levers reduce cost with the least damage to the other two:

Use a smaller model when the task is simple. gpt-4o-mini often handles classification, summarization, and Q&A over well-structured text nearly as well as gpt-4o, at roughly 17x lower cost at the prices listed above.
Shorten your prompts by removing filler and redundancy. Shorter, cleaner prompts often get better responses anyway.
Cache aggressively for any repeated prompt prefix. On a cache hit, a 5000-token system prompt sent 1000 times per day is billed like roughly 500 tokens per call with Anthropic (10% of the input price) or 2,500 tokens with OpenAI (50%).
Batch where possible: some tasks (embedding generation, classification) do not need an immediate answer and can be submitted through OpenAI's Batch API at 50% of the normal cost, with results returned asynchronously.
Set tight max_tokens for tasks where output length is predictable. Do not give a classification task 2000 output tokens.
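Two of the levers above, model choice and tight output caps, can be applied together with a small routing table. This is an illustrative sketch only: the task names, the caps, and the `request_settings` helper are assumptions, and the right values depend on your own quality measurements.

```python
# Hypothetical routing table: task type -> (model, max output tokens)
ROUTES = {
    "classification":    ("gpt-4o-mini", 10),
    "summarization":     ("gpt-4o-mini", 300),
    "complex_reasoning": ("gpt-4o", 1500),
}
DEFAULT_ROUTE = ("gpt-4o", 1000)  # fall back to the stronger model for unknown tasks

def request_settings(task_type: str) -> dict:
    """Pick a model and an output cap for a task type."""
    model, max_tokens = ROUTES.get(task_type, DEFAULT_ROUTE)
    return {"model": model, "max_tokens": max_tokens}

print(request_settings("classification"))  # {'model': 'gpt-4o-mini', 'max_tokens': 10}
print(request_settings("unknown_task"))    # {'model': 'gpt-4o', 'max_tokens': 1000}
```

The returned dictionary can be unpacked into the keyword arguments of a chat completion call alongside the messages.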

The engineer's mindset on cost

Do not optimize prematurely, but do instrument from day one. Log every call's prompt_tokens, completion_tokens, model, and cost. Build a dashboard. You cannot optimize what you do not measure — and costs that look trivial in development can become thousands of dollars monthly in production.
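Instrumentation does not need to start as a full dashboard. A minimal in-memory sketch is shown below: each call's token counts, model, and estimated cost are recorded and then totalled per model. `UsageLog` is a hypothetical class, and the prices are the indicative figures from the pricing table in this section; a real system would write these records to a database or metrics service.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Indicative prices in USD per million tokens: (input, output)
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

@dataclass
class UsageLog:
    records: list[dict] = field(default_factory=list)

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> float:
        """Store one call and return its estimated cost in USD."""
        in_price, out_price = PRICES[model]
        cost = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
        self.records.append({"model": model, "prompt_tokens": prompt_tokens,
                             "completion_tokens": completion_tokens, "cost": cost})
        return cost

    def cost_by_model(self) -> dict[str, float]:
        """Total estimated spend per model across all recorded calls."""
        totals: dict[str, float] = defaultdict(float)
        for r in self.records:
            totals[r["model"]] += r["cost"]
        return dict(totals)

log = UsageLog()
log.record("gpt-4o", 3000, 400)       # token counts would come from response.usage
log.record("gpt-4o-mini", 3000, 400)
for model, total in log.cost_by_model().items():
    print(f"{model}: ${total:.6f}")
```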