Module 934 min read · Building with AI APIs

Production-Ready AI Apps

A working prototype and a production application are separated by a long list of concerns that have nothing to do with model quality: rate limits, authentication, cost controls, error recovery, latency, caching, content moderation, and observability. Developers who have only called AI APIs in notebooks are often surprised by how much engineering separates a working demo from a reliable service. This module covers the production layer — the problems you will hit when real users arrive, and the patterns that actually solve them.

Rate limits and exponential backoff

Every AI API enforces rate limits on two dimensions: requests per minute (RPM) and tokens per minute (TPM). In a prototype you make calls sequentially and rarely approach either limit. With concurrent users in production, you can hit both limits simultaneously and unpredictably. The correct response to a 429 Too Many Requests is exponential backoff with jitter — waiting progressively longer between retries while adding randomness to prevent synchronized retry storms from multiple clients.

The OpenAI and Anthropic Python SDKs both include built-in retry logic you can configure with max_retries on the client constructor. For custom retry logic, the pattern is: catch the rate limit exception, compute a wait time of (2 ** attempt) + random.uniform(0, 1) seconds, sleep, and retry up to a maximum attempt count. Never retry immediately on a 429 — you will just hit the limit again.

Beyond reactive retries, proactive rate limit management involves tracking your remaining rate limit headroom using the headers returned with each response (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens) and slowing down proactively before you hit the limit rather than after.

Caching to reduce cost and latency

Many production AI API calls are identical or near-identical. A customer support bot receives the same questions repeatedly. A document analysis pipeline processes the same document multiple times. Caching eliminates redundant API calls entirely.

Exact-match caching

Hash the complete request — system prompt plus user message plus model and key parameters — and store the response in Redis or Memcached with a TTL appropriate for how frequently the underlying data changes. Cache hits skip the API call entirely, returning stored responses in under 1ms. For applications with repetitive queries and stable knowledge bases, exact-match cache hit rates of 40-70% are achievable, which translates directly to cost and latency reductions.

Prompt caching (provider-side)

Anthropic's Claude API supports explicit prompt caching: if you mark a portion of your prompt with a cache_control parameter, the API caches the KV computation for that prefix and charges reduced rates for subsequent calls that share the same prefix. This is particularly valuable for applications with long, stable system prompts — a legal AI assistant with a 10,000-token policy document in its system prompt can have 90% of that computation cached, dramatically reducing per-call cost. OpenAI similarly caches system prompts automatically for prompts above 1,024 tokens.

API key security

API keys are credentials with spending authority. Exposed keys result in unauthorized charges — attackers find exposed keys systematically through GitHub scanning and exploit them within minutes of exposure. The rules are non-negotiable in production.

Never commit keys to version control
Store API keys in environment variables loaded from a .env file that is gitignored, or in a secrets management service (AWS Secrets Manager, Google Secret Manager, HashiCorp Vault). If you need to share secrets across a team, use a secrets manager — never a shared .env file in a repo.
Never expose keys to the browser
Client-side JavaScript that calls AI APIs directly necessarily exposes your API key to any user who opens the browser developer tools. All AI API calls must go through your backend. Your frontend calls your server; your server calls the AI API with the secret key. This is not optional — there is no safe way to make direct browser-to-AI-API calls with a secret key.
Set spending limits on your account
All major AI API providers allow you to set hard monthly spending caps. Set one. A runaway bug, an unexpected traffic spike, or a malicious user exploiting an endpoint without rate limiting can generate an unbounded bill in hours. A spending cap is the last line of defense.
Use per-environment keys
Separate API keys for development, staging, and production. Apply lower rate limits and spending caps to development keys. If a development key is leaked or a developer's machine is compromised, the blast radius is limited to the development key's spending cap rather than your production account.

Observability: logging and tracing

In a production AI application, something will go wrong — a model will produce an incorrect output, a user will report that "the AI said something weird," a cost spike will appear in your billing dashboard. Without structured logging of every API call, debugging these problems requires guesswork. With it, you can reconstruct exactly what happened.

At minimum, log for every API call: the timestamp, the model, the full prompt (system + messages), the full response, the token counts (prompt and completion), the latency, and any error information. Store these logs with a unique request ID that you can correlate with user sessions and backend traces. For longer-running tasks, log at each step so you can identify where a failure occurred.

Purpose-built AI observability tools — LangSmith (for LangChain-based apps), Arize Phoenix, Weights and Biases Weave, and Helicone — provide dashboards on top of this structured logging: cost over time, token usage by endpoint, latency percentiles, error rates, and the ability to search and replay past conversations. They integrate with most AI SDKs and add minimal overhead. For anything beyond a toy project, using one of these tools is worth the setup time — debugging without them is substantially more painful.

Content moderation and output validation

AI models produce outputs that can be incorrect, harmful, or outside your application's intended use case. A customer support bot can be prompted by a user to discuss topics unrelated to your product. A code generation tool can produce syntactically valid but functionally incorrect code. A document summarization tool can hallucinate details not present in the source.

Input moderation

OpenAI provides a free moderation endpoint (client.moderations.create) that classifies text along categories including hate speech, harassment, self-harm, sexual content, and violence. Run user inputs through the moderation endpoint before passing them to your main model. This catches the most obvious misuse patterns at minimal cost (the moderation endpoint is free and fast). It is not a complete solution — it catches high-confidence policy violations but misses subtle prompt injection and jailbreak attempts.

Output validation

For applications where output correctness is critical, validate model outputs programmatically before displaying them to users. If your application expects structured JSON, validate the schema. If it expects code, attempt to parse it. If it expects factual claims about your product, check them against a ground truth knowledge base. Structured outputs (from Module 5) make this validation tractable by constraining model outputs to valid JSON schemas — invalid outputs are caught at the API level rather than in your application code.

Per-user rate limiting

Without per-user rate limiting, a single user can exhaust your API quota for all users. Implement a token bucket or leaky bucket rate limiter at the user level, keyed to user ID or IP address, stored in Redis. Reject requests that exceed per-user limits with a 429 response and a Retry-After header. This prevents any single user from degrading the experience for others and limits the damage from prompt injection attacks that try to generate enormous amounts of tokens.

The cost accounting problem

If your application is multi-tenant and you want to charge users or customers for AI usage, you need per-request cost accounting from day one — not retrofitted later. The inputs are available in the API response: prompt tokens, completion tokens, and model. The output is a cost figure computed from published per-token pricing. Log this per request, aggregate by user and billing period, and build the billing UI on top. Retrofitting cost accounting into an application that was not designed for it is a significant engineering project; building it in from the start is an afternoon of work.

In the final module, we examine multi-provider strategies — how to architect AI applications that can use multiple model providers, and why that flexibility is more valuable than single-provider lock-in.