Production-Ready AI Apps
A working prototype and a production application are separated by a long list of concerns that have nothing to do with model quality: rate limits, authentication, cost controls, error recovery, latency, caching, content moderation, and observability. Developers who have only called AI APIs in notebooks are often surprised by how much engineering separates a working demo from a reliable service. This module covers the production layer — the problems you will hit when real users arrive, and the patterns that actually solve them.
Rate limits and exponential backoff
Every AI API enforces rate limits on two dimensions: requests per minute (RPM) and tokens per minute (TPM). In a prototype you make calls sequentially and rarely approach either limit. With concurrent users in production, you can hit both limits simultaneously and unpredictably. The correct response to a 429 Too Many Requests is exponential backoff with jitter — waiting progressively longer between retries while adding randomness to prevent synchronized retry storms from multiple clients.
The OpenAI and Anthropic Python SDKs both include built-in retry logic you can configure with max_retries on the client constructor. For custom retry logic, the pattern is: catch the rate limit exception, compute a wait time of (2 ** attempt) + random.uniform(0, 1) seconds, sleep, and retry up to a maximum attempt count. Never retry immediately on a 429 — you will just hit the limit again.
Beyond reactive retries, proactive rate limit management involves tracking your remaining rate limit headroom using the headers returned with each response (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens) and slowing down proactively before you hit the limit rather than after.
Caching to reduce cost and latency
Many production AI API calls are identical or near-identical. A customer support bot receives the same questions repeatedly. A document analysis pipeline processes the same document multiple times. Caching eliminates redundant API calls entirely.
Exact-match caching
Hash the complete request — system prompt plus user message plus model and key parameters — and store the response in Redis or Memcached with a TTL appropriate for how frequently the underlying data changes. Cache hits skip the API call entirely, returning stored responses in under 1ms. For applications with repetitive queries and stable knowledge bases, exact-match cache hit rates of 40-70% are achievable, which translates directly to cost and latency reductions.
Prompt caching (provider-side)
Anthropic's Claude API supports explicit prompt caching: if you mark a portion of your prompt with a cache_control parameter, the API caches the KV computation for that prefix and charges reduced rates for subsequent calls that share the same prefix. This is particularly valuable for applications with long, stable system prompts — a legal AI assistant with a 10,000-token policy document in its system prompt can have 90% of that computation cached, dramatically reducing per-call cost. OpenAI similarly caches system prompts automatically for prompts above 1,024 tokens.
API key security
API keys are credentials with spending authority. Exposed keys result in unauthorized charges — attackers find exposed keys systematically through GitHub scanning and exploit them within minutes of exposure. The rules are non-negotiable in production.
Observability: logging and tracing
In a production AI application, something will go wrong — a model will produce an incorrect output, a user will report that "the AI said something weird," a cost spike will appear in your billing dashboard. Without structured logging of every API call, debugging these problems requires guesswork. With it, you can reconstruct exactly what happened.
At minimum, log for every API call: the timestamp, the model, the full prompt (system + messages), the full response, the token counts (prompt and completion), the latency, and any error information. Store these logs with a unique request ID that you can correlate with user sessions and backend traces. For longer-running tasks, log at each step so you can identify where a failure occurred.
Purpose-built AI observability tools — LangSmith (for LangChain-based apps), Arize Phoenix, Weights and Biases Weave, and Helicone — provide dashboards on top of this structured logging: cost over time, token usage by endpoint, latency percentiles, error rates, and the ability to search and replay past conversations. They integrate with most AI SDKs and add minimal overhead. For anything beyond a toy project, using one of these tools is worth the setup time — debugging without them is substantially more painful.
Content moderation and output validation
AI models produce outputs that can be incorrect, harmful, or outside your application's intended use case. A customer support bot can be prompted by a user to discuss topics unrelated to your product. A code generation tool can produce syntactically valid but functionally incorrect code. A document summarization tool can hallucinate details not present in the source.
Input moderation
OpenAI provides a free moderation endpoint (client.moderations.create) that classifies text along categories including hate speech, harassment, self-harm, sexual content, and violence. Run user inputs through the moderation endpoint before passing them to your main model. This catches the most obvious misuse patterns at minimal cost (the moderation endpoint is free and fast). It is not a complete solution — it catches high-confidence policy violations but misses subtle prompt injection and jailbreak attempts.
Output validation
For applications where output correctness is critical, validate model outputs programmatically before displaying them to users. If your application expects structured JSON, validate the schema. If it expects code, attempt to parse it. If it expects factual claims about your product, check them against a ground truth knowledge base. Structured outputs (from Module 5) make this validation tractable by constraining model outputs to valid JSON schemas — invalid outputs are caught at the API level rather than in your application code.
Per-user rate limiting
Without per-user rate limiting, a single user can exhaust your API quota for all users. Implement a token bucket or leaky bucket rate limiter at the user level, keyed to user ID or IP address, stored in Redis. Reject requests that exceed per-user limits with a 429 response and a Retry-After header. This prevents any single user from degrading the experience for others and limits the damage from prompt injection attacks that try to generate enormous amounts of tokens.
If your application is multi-tenant and you want to charge users or customers for AI usage, you need per-request cost accounting from day one — not retrofitted later. The inputs are available in the API response: prompt tokens, completion tokens, and model. The output is a cost figure computed from published per-token pricing. Log this per request, aggregate by user and billing period, and build the billing UI on top. Retrofitting cost accounting into an application that was not designed for it is a significant engineering project; building it in from the start is an afternoon of work.
In the final module, we examine multi-provider strategies — how to architect AI applications that can use multiple model providers, and why that flexibility is more valuable than single-provider lock-in.