Module 3 · 35 min read · Building with AI APIs
Chat Completions in Depth
The chat completions endpoint is the most important API surface in modern AI development. Understanding how it actually works — the roles system, how state is managed, the sampling parameters that control output behavior, and the full response structure — is the difference between building brittle demos and robust production systems.
The messages array: roles and their purposes
Every request to a chat completions API centers on the messages array. This is an ordered list of conversation turns, each with a role and content. Three core roles cover ordinary conversations, and each serves a distinct function (tool calling adds its own message type; see Module 8):
system — the instructions layer
Sets the persistent instructions, persona, and constraints for the model. This is where you define who the assistant is, what it can and cannot do, what format to use, and any background context it needs. System messages appear at the start of the conversation. Think of this as your developer-controlled layer: models are trained to prioritize it over user turns, so users should not be able to override it through conversation. That priority is not a hard guarantee, so enforce critical constraints in application code as well.
user — the human turn
Represents input from the human side of the conversation. This is the actual question, instruction, or content the end user submitted. In automated pipelines, the "user" message might be machine-generated — it is simply the turn that the model responds to.
assistant — the model turn
Represents a previous response from the model. When you are building multi-turn conversations, you include previous assistant responses in the messages array so the model has context for what it has already said. You can also inject an assistant message yourself to "prime" the model's response style or pick up mid-sentence, a technique called assistant prefilling. Support for prefilling varies by provider and model, so check the documentation before relying on it.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # System: the instructions layer, controls behavior globally
        {
            "role": "system",
            "content": (
                "You are a senior Python engineer at a fintech company. "
                "Give precise, production-quality answers. "
                "Always include error handling in code examples. "
                "Format code in Python 3.11+."
            ),
        },
        # User: the human turn
        {
            "role": "user",
            "content": "How should I handle database connection pooling in a FastAPI app?",
        },
    ],
)

print(response.choices[0].message.content)
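Assistant prefilling, mentioned under the assistant role, can be sketched without calling any API. The helper names `with_prefill` and `join_prefill` are hypothetical, and the sketch assumes a provider that continues a trailing assistant message and returns only the continuation. Trimming trailing whitespace is a precaution, since some providers reject a prefill that ends in whitespace.

```python
from copy import deepcopy


def with_prefill(messages: list, prefix: str) -> list:
    """Return a copy of `messages` that ends in a partial assistant turn."""
    primed = deepcopy(messages)  # leave the caller's history untouched
    primed.append({"role": "assistant", "content": prefix.rstrip()})
    return primed


def join_prefill(prefix: str, continuation: str) -> str:
    """Rebuild the full reply from the prefix we sent and the text returned.

    Assumption: the provider returns only the continuation, not the prefix.
    """
    return prefix.rstrip() + continuation


history = [{"role": "user", "content": "List three HTTP methods as a JSON array."}]
primed = with_prefill(history, '["GET",')

# `primed` is what you would send as the messages array; the string below
# stands in for whatever continuation the model returns.
full_reply = join_prefill('["GET",', ' "POST", "PUT"]')
print(primed[-1])
print(full_reply)
```

Starting the reply with `["GET",` nudges the model toward a bare JSON array instead of a sentence of preamble.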
System message design patterns
Your system message is the most important prompt engineering lever you have. A well-designed system message can eliminate entire categories of incorrect behavior. Include:
- Persona — who the assistant is and what expertise it has
- Scope — what topics are in and out of bounds
- Format instructions — response length, use of markdown, language
- Background context — information the model needs that won't be in every user message
- Safety rails — what the model should do when asked to go out of scope
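One way to keep those five parts from drifting apart is to assemble the system message from named pieces. This is a minimal sketch: `build_system_message` and its section titles are hypothetical names, not part of any SDK, and the example text is invented.

```python
def build_system_message(
    persona: str,
    scope: str,
    format_rules: str,
    background: str = "",
    out_of_scope_reply: str = "",
) -> dict:
    """Assemble a system message from the five parts in the checklist above."""
    sections = [
        ("Persona", persona),
        ("Scope", scope),
        ("Format", format_rules),
        ("Background", background),
        ("When asked to go out of scope", out_of_scope_reply),
    ]
    # Skip empty sections so optional parts do not leave blank headings.
    body = "\n\n".join(
        f"{title}:\n{text.strip()}" for title, text in sections if text.strip()
    )
    return {"role": "system", "content": body}


system_message = build_system_message(
    persona="You are a support assistant for an invoicing product.",
    scope="Answer questions about invoices and billing only.",
    format_rules="Reply in plain text, at most five sentences.",
    out_of_scope_reply="Say you can only help with billing and suggest contacting support.",
)
print(system_message["content"])
```

The returned dict can go straight in as the first entry of the messages array.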
Multi-turn conversation state management
This is the most important architectural fact about chat completion APIs: they are stateless. The API has no memory between calls. Every request is completely independent. When you want the model to remember previous turns, you must send the full conversation history yourself in the messages array.
This is not a limitation — it is a deliberate design choice that gives you complete control over what the model sees. But it means your application is responsible for maintaining conversation state.
from openai import OpenAI

client = OpenAI()


def run_conversation():
    """A multi-turn conversation loop with manual state management."""
    # The conversation history starts with just the system message
    messages = [
        {
            "role": "system",
            "content": "You are a helpful Python tutor. Be encouraging and clear.",
        }
    ]
    print("Python Tutor (type 'quit' to exit)")
    print("-" * 40)

    while True:
        user_input = input("\nYou: ").strip()
        if user_input.lower() in ("quit", "exit"):
            break
        if not user_input:
            continue

        # Add the user's message to history
        messages.append({"role": "user", "content": user_input})

        # Send the FULL history: this is what creates the appearance of memory
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            max_tokens=1000,
        )
        assistant_reply = response.choices[0].message.content
        print(f"\nTutor: {assistant_reply}")

        # CRITICAL: add the assistant reply to history so future turns see it
        messages.append({"role": "assistant", "content": assistant_reply})

        # Optional: print token usage to monitor context growth
        total = response.usage.total_tokens
        print(f" [Tokens used this turn: {total}]")


if __name__ == "__main__":
    run_conversation()
Context windows fill up
Because you send the full history each time, longer conversations use more and more tokens — and cost more. A 20-turn conversation with detailed responses can easily consume 10,000–30,000 tokens per call. You must implement context management for production systems: truncating old turns, summarizing history, or using a sliding window. Module 4 covers this in detail.
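As a preview of the sliding-window idea, here is a minimal sketch that keeps the system message plus the most recent turns. The function name `trim_history` is hypothetical, and it counts messages instead of tokens to stay dependency-free; a production version would budget by tokens.

```python
def trim_history(messages: list, max_turn_messages: int) -> list:
    """Keep every system message plus the most recent `max_turn_messages` turns."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    recent = turns[-max_turn_messages:] if max_turn_messages > 0 else []
    # Drop leading assistant messages so the window starts on a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


history = [
    {"role": "system", "content": "You are a helpful Python tutor."},
    {"role": "user", "content": "What is a list?"},
    {"role": "assistant", "content": "An ordered, mutable sequence."},
    {"role": "user", "content": "And a tuple?"},
    {"role": "assistant", "content": "An ordered, immutable sequence."},
    {"role": "user", "content": "Which is faster?"},
]

trimmed = trim_history(history, max_turn_messages=3)
print([m["content"] for m in trimmed])
```

The function returns a new list, so you can keep the full history for logging and send only the trimmed view to the API.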
Sampling parameters: controlling output behavior
The model does not mechanically produce the "best" next word — it generates a probability distribution over its entire vocabulary and samples from it. The sampling parameters let you control that sampling process.
temperature
Temperature controls how concentrated or spread out the probability distribution is before sampling.
temperature = 0 — deterministic
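Greedy decoding: the model always picks the highest-probability token, so identical input gives the same output as far as the provider allows. In practice, hosted APIs can still return small differences between runs at temperature 0 (for example from floating-point and batching effects on the server), so treat it as near-deterministic, not guaranteed. Use this for tasks where consistency matters: data extraction, classification, code generation, structured outputs. The output can still be inventive; it just follows the single most likely continuation instead of varying between runs.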
temperature = 0.7–1.0 — balanced
The standard range for conversational AI and general use. The model explores a wider set of options, producing varied, natural-sounding text. Most chat applications use 0.7–0.9. OpenAI and Anthropic both default to 1.0 when you omit the parameter.
temperature > 1.0 — chaotic
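Flattens the probability distribution, making low-probability tokens much more likely to be sampled. Output becomes unpredictable and often incoherent. Occasionally useful for creative tasks that need extreme novelty, but rarely appropriate for production systems. The allowed range differs by provider: OpenAI accepts values up to 2, while Anthropic caps temperature at 1.0.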
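The effect of temperature is easier to see with numbers. The sketch below applies the standard softmax-with-temperature formula to three made-up scores. It illustrates the math only, not any provider's actual sampler, and treating a temperature of 0 as a pure argmax is a simplifying assumption.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures pile probability onto the top token; higher temperatures move it toward the tail.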
top_p (nucleus sampling)
An alternative to temperature. Instead of scaling all probabilities, top_p restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. With top_p=0.9, the model only considers the top tokens that together account for 90% of the probability mass — ignoring the long tail of unlikely tokens.
OpenAI recommends changing one of temperature or top_p but not both. Most practitioners use temperature and leave top_p at its default of 1.0.
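Nucleus sampling can be sketched in a few lines. The function name `nucleus_filter` and the toy distribution are illustrative assumptions. The cutoff here keeps tokens until the running total reaches `top_p`; implementations differ on whether the boundary is inclusive.

```python
def nucleus_filter(probs: dict, top_p: float) -> dict:
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
filtered = nucleus_filter(probs, top_p=0.9)
print(filtered)  # "zebra" falls outside the nucleus and can no longer be sampled
```

With `top_p=1.0` nothing is filtered, which is why leaving it at the default lets temperature do all the work.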
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about APIs."}],
    # Sampling parameters: typically change one or the other, not both
    temperature=0.9,  # Creative range: 0 (near-deterministic) to 2 (chaotic)
    top_p=1.0,        # Nucleus sampling: default 1.0 (consider all tokens)
    # Length control
    max_tokens=200,   # Hard cap on output tokens
    # Stop sequences: stop generating when any of these appear
    stop=["\n\n", "###"],
)
max_tokens vs max_completion_tokens
This is a common source of confusion caused by a naming inconsistency between providers:
| Provider | Parameter name | Behavior |
| --- | --- | --- |
| OpenAI (legacy) | max_tokens | Hard limit on output tokens. Deprecated in the Chat Completions API and rejected by o-series reasoning models. |
| OpenAI (current) | max_completion_tokens | Replaces max_tokens. For o-series models it covers both reasoning tokens and visible output tokens. |
| Anthropic | max_tokens | Required field. Hard limit on output tokens. |
| Google Gemini | maxOutputTokens | Same concept, different naming convention. |
For OpenAI's o1/o3 reasoning models, max_completion_tokens covers both the hidden chain-of-thought reasoning tokens and the visible output — which is why the distinction matters. If you set a limit that is too low, the model may run out of "thinking budget" before producing an answer.
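If your code targets more than one provider, a small lookup keeps the naming difference in one place. This is a sketch: the provider keys and the helper `output_limit_kwargs` are hypothetical, and the parameter names are the ones from the table above. Gemini's REST API nests its limit inside a generation config object, so treat that entry as the field name only.

```python
# Parameter names from the table above; the dictionary keys are hypothetical labels.
OUTPUT_LIMIT_PARAM = {
    "openai": "max_completion_tokens",
    "openai_legacy": "max_tokens",
    "anthropic": "max_tokens",
    "gemini": "maxOutputTokens",  # field name only; nested in a config object
}


def output_limit_kwargs(provider: str, limit: int) -> dict:
    """Return the provider-specific keyword for capping output length."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    if provider not in OUTPUT_LIMIT_PARAM:
        raise ValueError(f"unknown provider: {provider!r}")
    return {OUTPUT_LIMIT_PARAM[provider]: limit}


print(output_limit_kwargs("openai", 500))
print(output_limit_kwargs("anthropic", 1024))
```

The result can be unpacked into a request call with `**output_limit_kwargs(...)`, so the rest of the request code stays provider-agnostic.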
Stop sequences
The stop parameter accepts a string or list of strings. When the model generates any of these sequences, it stops — even if max_tokens has not been reached. The stop sequence itself is not included in the output.
This is useful when you want the model to fill in a template and stop at a delimiter, or when you are building a few-shot prompting system where you want the model to stop after one example rather than generating more.
# Example: structured extraction with stop sequences
# The model generates JSON and stops at the closing delimiter
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            # Ask for raw JSON: a reply that OPENS with a markdown fence would hit
            # the "```" stop sequence immediately and come back empty.
            "content": "Extract entities. Return raw JSON only, with no markdown fences. Stop after the JSON object.",
        },
        {"role": "user", "content": "Apple Inc reported $94.9B in revenue for Q1 2025."},
    ],
    temperature=0,  # near-deterministic for extraction
    stop=["```", "---"],  # stop at markdown fence or separator
)
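To see what a stop sequence does to the text you receive, you can simulate the truncation locally. This sketch mimics the documented behavior (cut at the earliest match, exclude the match itself). The function name `apply_stop` is hypothetical; the real cut happens on the provider's side during generation.

```python
def apply_stop(text: str, stop: list) -> str:
    """Truncate `text` at the earliest occurrence of any stop sequence.

    The stop sequence itself is not included in the result.
    """
    cut = len(text)
    for seq in stop:
        index = text.find(seq)
        if seq and index != -1:
            cut = min(cut, index)
    return text[:cut]


raw = '{"company": "Apple Inc"}\n---\nAnything after the separator is dropped.'
fenced = '```json\n{"company": "Apple Inc"}\n```'

print(repr(apply_stop(raw, ["```", "---"])))
print(repr(apply_stop(fenced, ["```", "---"])))  # empty: the reply opened with a stop sequence
```

The second case is the pitfall to watch for: a delimiter the model might emit first makes a poor stop sequence.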
The full response object structure
Understanding the complete response object — not just the content field — is essential for building reliable systems.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name two planets."}],
    logprobs=True,   # enable token log probabilities
    top_logprobs=3,  # return top 3 candidates at each position
)

# Top-level fields
print("id:", response.id)            # unique request ID for debugging
print("model:", response.model)      # exact model version served
print("created:", response.created)  # unix timestamp

# The choices array: usually just one choice
choice = response.choices[0]
print("finish_reason:", choice.finish_reason)  # "stop", "length", "content_filter", "tool_calls"
print("content:", choice.message.content)
print("role:", choice.message.role)  # always "assistant"

# Usage: for cost tracking and monitoring
usage = response.usage
print("prompt_tokens:", usage.prompt_tokens)
print("completion_tokens:", usage.completion_tokens)
print("total_tokens:", usage.total_tokens)
# Newer OpenAI models may include:
# usage.prompt_tokens_details.cached_tokens (prompt cache hits)
# usage.completion_tokens_details.reasoning_tokens (o-series hidden reasoning)

# Log probabilities (when logprobs=True)
if choice.logprobs and choice.logprobs.content:
    for token_info in choice.logprobs.content[:3]:  # first 3 tokens
        # !r shows whitespace inside the token; the original applied repr() twice
        print(f"Token: {token_info.token!r:20} logprob: {token_info.logprob:.3f}")
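The usage fields are what make cost tracking possible. Here is a minimal sketch of turning token counts into a dollar estimate. The `Pricing` class, `estimate_cost`, and the prices are placeholders invented for illustration; substitute your provider's current per-token rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens. Placeholder values, not real prices."""
    input_per_million: float
    output_per_million: float


def estimate_cost(prompt_tokens: int, completion_tokens: int, pricing: Pricing) -> float:
    """Estimate the cost of one call from its usage counts."""
    return (
        prompt_tokens * pricing.input_per_million
        + completion_tokens * pricing.output_per_million
    ) / 1_000_000


placeholder = Pricing(input_per_million=1.00, output_per_million=4.00)

# In real code these numbers come from response.usage.
cost = estimate_cost(prompt_tokens=12_000, completion_tokens=800, pricing=placeholder)
print(f"Estimated cost: ${cost:.4f}")
```

Input and output tokens are priced separately, so a long conversation history can dominate the bill even when replies are short.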
finish_reason values and what they mean
| Value | Meaning | Action |
| --- | --- | --- |
| stop | Model finished naturally | Normal — use the output. |
| length | Hit max_tokens limit | Output may be truncated. Increase max_tokens or handle incomplete output. |
| content_filter | Provider policy blocked output | content field may be null. Inform the user appropriately. |
| tool_calls | Model wants to call a function | Execute the tool and send result back. See Module 8. |
| null | In-progress (streaming) | Final chunk will have a real finish_reason. |
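In application code, the table above usually becomes a small dispatch function. The sketch below works on plain values so it runs without an API call. The name `interpret_finish` and the action labels are hypothetical, and the right response to each case depends on your application.

```python
from typing import Optional


def interpret_finish(finish_reason: Optional[str], content: Optional[str]) -> dict:
    """Map a finish_reason to a suggested action, following the table above."""
    text = content or ""  # content can be None, e.g. when output was filtered
    if finish_reason == "stop":
        return {"action": "use_output", "text": text}
    if finish_reason == "length":
        return {"action": "handle_truncation", "text": text}
    if finish_reason == "content_filter":
        return {"action": "inform_user", "text": ""}
    if finish_reason == "tool_calls":
        return {"action": "run_tools", "text": text}
    if finish_reason is None:
        return {"action": "keep_streaming", "text": text}
    # Unknown values should fail loudly, not be treated as success.
    raise ValueError(f"unexpected finish_reason: {finish_reason!r}")


print(interpret_finish("stop", "Mercury and Venus."))
print(interpret_finish("length", "Mercury and"))
print(interpret_finish("content_filter", None))
```

With a real response you would pass `choice.finish_reason` and `choice.message.content`.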
Logprobs for uncertainty estimation
When you enable logprobs=True, each output token comes with its log probability: the natural log of the probability the model assigned to that token. This is a useful signal for understanding model certainty and building confidence-aware systems.
A logprob of 0 means the token had probability 1 (log(1) = 0). A logprob of -0.1 corresponds to about 90% (e^-0.1 ≈ 0.905), which is very confident. A logprob of -5 corresponds to under 1% (e^-5 ≈ 0.0067), meaning the chosen token was an unlikely pick and other candidates held most of the probability mass. By examining logprobs on key tokens (like "Yes"/"No" for classification tasks), you can estimate confidence without asking the model to describe its own uncertainty.
import math
from openai import OpenAI

client = OpenAI()


def classify_with_confidence(text: str) -> dict:
    """Classify sentiment and return confidence score from logprobs."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify sentiment. Reply with only one word: Positive, Negative, or Neutral."},
            {"role": "user", "content": text},
        ],
        temperature=0,
        max_tokens=1,  # only need the first token (assumes each label starts with a distinct token)
        logprobs=True,
        top_logprobs=3,
    )
    choice = response.choices[0]
    label = (choice.message.content or "").strip()

    # Convert logprob to probability
    first_token = choice.logprobs.content[0]
    confidence = math.exp(first_token.logprob)

    return {
        "label": label,
        "confidence": round(confidence, 4),
        "top_alternatives": [
            {"token": t.token, "prob": round(math.exp(t.logprob), 4)}
            for t in first_token.top_logprobs
        ],
    }


result = classify_with_confidence("The product exceeded all my expectations!")
print(result)
# Illustrative output (exact values vary by model and run):
# {'label': 'Positive', 'confidence': 0.9823, 'top_alternatives': [...]}
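The top alternatives often contain several spellings of the same label (different casing or a leading space), which splits its probability across entries. A possible refinement, sketched here on made-up logprobs, is to merge those variants and renormalize over the labels you care about. The function name `label_probabilities` is hypothetical, and the sketch assumes each label is a single token.

```python
import math
from collections import defaultdict


def label_probabilities(top_logprobs: list, labels: list) -> dict:
    """Merge token variants into per-label probabilities that sum to 1.

    `top_logprobs` is a list of (token, logprob) pairs for one position.
    """
    lookup = {label.lower(): label for label in labels}
    merged = defaultdict(float)
    for token, logprob in top_logprobs:
        key = token.strip().lower()
        if key in lookup:  # ignore tokens that are not one of our labels
            merged[lookup[key]] += math.exp(logprob)
    total = sum(merged.values())
    if total == 0:
        return {}
    return {label: p / total for label, p in merged.items()}


# Made-up values standing in for choice.logprobs.content[0].top_logprobs
candidates = [("Positive", -0.05), (" positive", -3.5), ("Neutral", -4.0)]
distribution = label_probabilities(candidates, ["Positive", "Negative", "Neutral"])
print({label: round(p, 4) for label, p in distribution.items()})
```

These numbers are the model's token probabilities, not calibrated accuracy estimates, so validate any threshold against labeled data before acting on it.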
Provider differences: a practical reference
The conceptual model is the same across providers, but field names and behavior differ enough to cause bugs when switching.
Anthropic: max_tokens is required
Unlike OpenAI, Anthropic requires max_tokens on every request — there is no default. The system prompt is a separate top-level parameter, not a messages entry. Response access is via message.content[0].text. Token usage uses input_tokens/output_tokens instead of prompt_tokens/completion_tokens.
OpenAI: max_completion_tokens on newer models
For gpt-4o and later models, max_completion_tokens is preferred over max_tokens. For o-series reasoning models, this is critical because it covers both reasoning tokens and output tokens. Always check which parameter the current model accepts.
Google Gemini: different message schema
Gemini uses "model" instead of "assistant" for the model's role. System instructions go in a separate system_instruction field. The response schema differs significantly. Gemini also has a native multimodal API where images, audio, and video are first-class inputs.
# Anthropic equivalent: same concepts, different API shape
import anthropic

client = anthropic.Anthropic()

# Multi-turn conversation with Anthropic
conversation_history = []


def chat_anthropic(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,  # REQUIRED: no default
        system="You are a helpful coding assistant.",  # separate param, not a message
        messages=conversation_history,
    )
    assistant_text = response.content[0].text
    conversation_history.append({"role": "assistant", "content": assistant_text})

    # Anthropic usage fields
    print(f"Input: {response.usage.input_tokens} | Output: {response.usage.output_tokens}")
    return assistant_text


print(chat_anthropic("What is a decorator in Python?"))
print(chat_anthropic("Can you show me an example with functools.wraps?"))
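The schema differences described above can be captured in two small converters that start from an OpenAI-style messages list. This is a sketch under assumptions: `to_anthropic` and `to_gemini` are hypothetical helpers, they handle text-only messages, and the `contents`/`parts` layout for Gemini reflects its REST schema as I understand it, so verify it against the current documentation.

```python
def _system_text(messages: list) -> str:
    """Join all system messages into one string."""
    return "\n\n".join(m["content"] for m in messages if m["role"] == "system")


def to_anthropic(messages: list) -> dict:
    """Move system text to a top-level `system` argument; keep the other turns."""
    payload = {"messages": [dict(m) for m in messages if m["role"] != "system"]}
    system = _system_text(messages)
    if system:
        payload["system"] = system
    return payload


def to_gemini(messages: list) -> dict:
    """Rename the assistant role to "model" and wrap text in `parts`."""
    contents = [
        {
            "role": "model" if m["role"] == "assistant" else "user",
            "parts": [{"text": m["content"]}],
        }
        for m in messages
        if m["role"] != "system"
    ]
    payload = {"contents": contents}
    system = _system_text(messages)
    if system:
        payload["system_instruction"] = {"parts": [{"text": system}]}
    return payload


openai_style = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "What is a decorator?"},
    {"role": "assistant", "content": "A function that wraps another function."},
]
anthropic_payload = to_anthropic(openai_style)
gemini_payload = to_gemini(openai_style)
print(anthropic_payload)
print(gemini_payload)
```

Keeping one internal history format and converting at the boundary means switching providers touches one function instead of every call site.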
The complete mental model
Every chat completion request is: a system message defining behavior + an ordered list of alternating user/assistant turns + sampling parameters controlling randomness. The API is stateless — you own the history. The response gives you content, a finish reason, token counts, and optionally log probabilities. With this model internalized, you can understand any chat completions API regardless of provider.