Module 3 · 35 min read · Building with AI APIs

Chat Completions in Depth

The chat completions endpoint is the API surface that most AI applications are built on. Understanding how it actually works (the roles system, how state is managed, the sampling parameters that control output behavior, and the full response structure) is the difference between building brittle demos and robust production systems.

The messages array: roles and their purposes

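Every request to a chat completions API centers on the messages array. This is an ordered list of conversation turns, each with a role and content. Three core roles cover most requests, and each serves a distinct function (a fourth role, tool, carries function-call results and belongs with the tool-calling material in Module 8):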

system — the instructions layer

Sets the persistent instructions, persona, and constraints for the model. This is where you define who the assistant is, what it can and cannot do, what format to use, and any background context it needs. System messages appear at the start of the conversation. Think of this as your developer-controlled layer: design it so that the user cannot override it through conversation, and remember that models weight system instructions heavily but do not follow them infallibly.

user — the human turn

Represents input from the human side of the conversation. This is the actual question, instruction, or content the end user submitted. In automated pipelines, the "user" message might be machine-generated — it is simply the turn that the model responds to.

assistant — the model turn

Represents a previous response from the model. When you are building multi-turn conversations, you include previous assistant responses in the messages array so the model has context of what it has already said. You can also inject an assistant message yourself to "prime" the model's response style or have it pick up mid-sentence. This technique is called assistant prefilling. Support for it varies by provider, so check the documentation before relying on it.

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # System: the instructions layer — controls behavior globally
        {
            "role": "system",
            "content": (
                "You are a senior Python engineer at a fintech company. "
                "Give precise, production-quality answers. "
                "Always include error handling in code examples. "
                "Format code in Python 3.11+."
            )
        },
        # User: the human turn
        {
            "role": "user",
            "content": "How should I handle database connection pooling in a FastAPI app?"
        }
    ]
)

print(response.choices[0].message.content)

System message design patterns

Your system message is the most important prompt engineering lever you have. A well-designed system message can eliminate entire categories of incorrect behavior. Include:

Persona — who the assistant is and what expertise it has
Scope — what topics are in and out of bounds
Format instructions — response length, use of markdown, language
Background context — information the model needs that won't be in every user message
Safety rails — what the model should do when asked to go out of scope
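As a rough illustration of the five components listed above, the sketch below assembles them into a single system message. The helper name build_system_message, its argument names, and the section labels are assumptions made for this example; they are not part of any provider's API.

```python
def build_system_message(persona, scope, format_rules, context="", out_of_scope_reply=""):
    """Assemble a system message from the components listed above."""
    sections = [
        ("Persona", persona),
        ("Scope", scope),
        ("Format", format_rules),
        ("Background", context),
        ("If asked about anything out of scope", out_of_scope_reply),
    ]
    # Skip empty sections so the prompt stays compact
    body = "\n".join(f"{label}: {text}" for label, text in sections if text)
    return {"role": "system", "content": body}


system_message = build_system_message(
    persona="You are a support assistant for a billing product.",
    scope="Answer questions about invoices and payments only.",
    format_rules="Reply in under 150 words, in plain text.",
    out_of_scope_reply="Say you can only help with billing and suggest contacting support.",
)
print(system_message["content"])
```

The returned dict can be placed first in the messages array. Keeping each component as a separate argument makes it easier to test and version them independently.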

Multi-turn conversation state management

This is the most important architectural fact about chat completion APIs: they are stateless. The API has no memory between calls. Every request is completely independent. When you want the model to remember previous turns, you must send the full conversation history yourself in the messages array.

This is not a limitation — it is a deliberate design choice that gives you complete control over what the model sees. But it means your application is responsible for maintaining conversation state.

from openai import OpenAI
client = OpenAI()

def run_conversation():
    """A multi-turn conversation loop with manual state management."""

    # The conversation history — starts with just the system message
    messages = [
        {
            "role": "system",
            "content": "You are a helpful Python tutor. Be encouraging and clear."
        }
    ]

    print("Python Tutor (type 'quit' to exit)")
    print("-" * 40)

    while True:
        user_input = input("\nYou: ").strip()
        if user_input.lower() in ("quit", "exit"):
            break
        if not user_input:
            continue

        # Add the user's message to history
        messages.append({"role": "user", "content": user_input})

        # Send the FULL history — this is what creates the appearance of memory
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            max_tokens=1000
        )

        assistant_reply = response.choices[0].message.content
        print(f"\nTutor: {assistant_reply}")

        # CRITICAL: add the assistant reply to history so future turns see it
        messages.append({"role": "assistant", "content": assistant_reply})

        # Optional: print token usage to monitor context growth
        total = response.usage.total_tokens
        print(f"  [Tokens used this turn: {total}]")

run_conversation()

Context windows fill up

Because you send the full history each time, longer conversations use more and more tokens — and cost more. A 20-turn conversation with detailed responses can easily consume 10,000–30,000 tokens per call. You must implement context management for production systems: truncating old turns, summarizing history, or using a sliding window. Module 4 covers this in detail.
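A minimal sketch of the sliding-window option, assuming a window measured in messages rather than tokens (a production version would more likely count tokens). The function name trim_history and the max_turns parameter are hypothetical.

```python
def trim_history(messages, max_turns=6):
    """Keep system messages plus the most recent `max_turns` non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_turns:] if max_turns > 0 else []
    # Drop leading assistant messages so the window starts on a user turn
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


history = [
    {"role": "system", "content": "sys"},
    {"role": "user", "content": "u1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"},
]
trimmed = trim_history(history, max_turns=4)
print([m["content"] for m in trimmed])
```

The function returns a new list and leaves the full history untouched, so the application can still store or summarize the older turns.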

Sampling parameters: controlling output behavior

The model does not mechanically produce the "best" next word — it generates a probability distribution over its entire vocabulary and samples from it. The sampling parameters let you control that sampling process.

temperature

Temperature controls how concentrated or spread out the probability distribution is before sampling.

temperature = 0 — deterministic

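Always picks the highest-probability token (greedy decoding). Given identical input, you get nearly identical output on every run. Hosted APIs do not guarantee bit-for-bit determinism, because floating-point and infrastructure effects can still change a close call between two tokens. Use this for tasks where consistency matters: data extraction, classification, code generation, structured outputs. Output quality does not suffer; the model just won't vary its choices much between runs.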

temperature = 0.7–1.0 — balanced

The standard range for conversational AI and general use. The model explores a wider set of options, producing varied, natural-sounding text. Most chat applications use 0.7–0.9. OpenAI and Anthropic both default to 1.0, so set the value explicitly if you want something lower.

temperature > 1.0 — chaotic

Flattens the probability distribution, so low-probability tokens are sampled far more often. Output becomes unpredictable and often incoherent. Occasionally useful for creative tasks that need extreme novelty, but rarely appropriate for production systems.
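The effect of temperature can be sketched with a small softmax over made-up scores. This illustrates the standard technique (divide the raw scores by the temperature before normalizing) and is not any provider's actual implementation. The logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all of the probability lands on the top token. At a high temperature the three candidates move closer together, which is why sampling becomes less predictable.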

top_p (nucleus sampling)

An alternative to temperature. Instead of scaling all probabilities, top_p restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. With top_p=0.9, the model only considers the top tokens that together account for 90% of the probability mass — ignoring the long tail of unlikely tokens.

OpenAI recommends changing one of temperature or top_p but not both. Most practitioners use temperature and leave top_p at its default of 1.0.
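Nucleus sampling can be sketched in a few lines, assuming we already have a token-to-probability mapping. The function name nucleus_filter and the toy distribution are invented for illustration.

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches
    top_p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"cat": 0.5, "dog": 0.3, "bird": 0.15, "axolotl": 0.05}
nucleus = nucleus_filter(probs, top_p=0.9)
print(nucleus)
```

With top_p=0.9 the tail token "axolotl" is dropped and sampling happens only among the remaining three. With top_p=1.0 nothing is filtered, which matches the default behavior described above.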

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about APIs."}],

    # Sampling parameters — typically set one or the other
    temperature=0.9,   # Creative range: 0 (deterministic) to 2 (chaotic)
    top_p=1.0,         # Nucleus sampling: default 1.0 (consider all tokens)

    # Length control
    max_tokens=200,    # Hard cap on output tokens

    # Stop sequences: stop generating when any of these appear
    stop=["\n\n", "###"],
)

max_tokens vs max_completion_tokens

This is a common source of confusion caused by a naming inconsistency between providers:

Provider	Parameter name	Behavior
OpenAI (legacy)	max_tokens	Hard limit on output tokens. Deprecated in favor of max_completion_tokens and not accepted by o-series models.
OpenAI (o-series + gpt-4o-2024+)	max_completion_tokens	Replaces max_tokens. Covers both reasoning tokens and output tokens for o-series models.
Anthropic	max_tokens	Required field. Hard limit on output. No confusion — always max_tokens.
Google Gemini	maxOutputTokens	Same concept, different naming convention.

For OpenAI's o1/o3 reasoning models, max_completion_tokens covers both the hidden chain-of-thought reasoning tokens and the visible output — which is why the distinction matters. If you set a limit that is too low, the model may run out of "thinking budget" before producing an answer.
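One way to keep these names straight in multi-provider code is a small lookup, sketched below. The parameter names come from the table above; the provider keys and the helper token_limit_kwargs are hypothetical labels for this example.

```python
# Parameter names come from the table above. The provider keys are
# hypothetical labels chosen for this sketch, not official identifiers.
TOKEN_LIMIT_PARAM = {
    "openai_legacy": "max_tokens",
    "openai": "max_completion_tokens",
    "anthropic": "max_tokens",
    "gemini": "maxOutputTokens",
}


def token_limit_kwargs(provider, limit):
    """Return the output-limit argument under the name `provider` expects."""
    if provider not in TOKEN_LIMIT_PARAM:
        raise ValueError(f"Unknown provider: {provider!r}")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return {TOKEN_LIMIT_PARAM[provider]: limit}


print(token_limit_kwargs("openai", 500))
print(token_limit_kwargs("anthropic", 500))
```

The returned dict can be unpacked into a request with `**`. For Gemini the limit usually sits inside a generation config object rather than at the top level of the request, so treat that entry as a naming reminder only.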

Stop sequences

The stop parameter accepts a string or a list of strings (OpenAI allows up to four). When the model generates any of these sequences, it stops, even if max_tokens has not been reached. The stop sequence itself is not included in the output.

This is useful when you want the model to fill in a template and stop at a delimiter, or when you are building a few-shot prompting system where you want the model to stop after one example rather than generating more.
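The truncation behavior can be simulated locally to see what a stop sequence does to the text. This is only a sketch: the real cut happens on the provider's side during generation, and apply_stop is a hypothetical helper.

```python
def apply_stop(text, stop):
    """Cut `text` at the earliest occurrence of any stop sequence.
    The stop sequence itself is excluded from the result."""
    sequences = [stop] if isinstance(stop, str) else list(stop)
    cut = len(text)
    for seq in sequences:
        if not seq:
            continue  # ignore empty strings, which would match at index 0
        index = text.find(seq)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# A few-shot style output where the model kept going past the first answer
raw = "Q: 2+2?\nA: 4\n\nQ: 3+3?\nA: 6"
print(apply_stop(raw, ["\n\n", "###"]))
```

Here the blank line acts as the delimiter, so only the first question and answer survive. Running candidate stop sequences through a simulation like this against sample outputs is a cheap way to check that they do not also occur inside the content you want to keep.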

# Example: structured extraction with stop sequences
# The model generates JSON and stops at the closing delimiter
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract entities. Return JSON only. Stop after the JSON object."},
        {"role": "user", "content": "Apple Inc reported $94.9B in revenue for Q1 2025."}
    ],
    temperature=0,      # deterministic for extraction
    stop=["```", "---"]  # stop at markdown fence or separator
)

The full response object structure

Understanding the complete response object — not just the content field — is essential for building reliable systems.

import json
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name two planets."}],
    logprobs=True,         # enable token log probabilities
    top_logprobs=3         # return top 3 candidates at each position
)

# Top-level fields
print("id:", response.id)            # unique request ID for debugging
print("model:", response.model)      # exact model version served
print("created:", response.created)  # unix timestamp

# The choices array — usually just one choice
choice = response.choices[0]
print("finish_reason:", choice.finish_reason)  # "stop", "length", "content_filter", "tool_calls"
print("content:", choice.message.content)
print("role:", choice.message.role)  # always "assistant"

# Usage — for cost tracking and monitoring
usage = response.usage
print("prompt_tokens:", usage.prompt_tokens)
print("completion_tokens:", usage.completion_tokens)
print("total_tokens:", usage.total_tokens)
# Newer OpenAI models may include:
# usage.prompt_tokens_details.cached_tokens  (prompt cache hits)
# usage.completion_tokens_details.reasoning_tokens  (o-series hidden reasoning)

# Log probabilities (when logprobs=True)
if choice.logprobs:
    for token_info in choice.logprobs.content[:3]:  # first 3 tokens
        print(f"Token: {token_info.token!r:20} logprob: {token_info.logprob:.3f}")

finish_reason values and what they mean

Value	Meaning	Action
stop	Model finished naturally	Normal — use the output.
length	Hit max_tokens limit	Output may be truncated. Increase max_tokens or handle incomplete output.
content_filter	Provider policy blocked output	content field may be null. Inform the user appropriately.
tool_calls	Model wants to call a function	Execute the tool and send result back. See Module 8.
null	In-progress (streaming)	Final chunk will have a real finish_reason.
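The table above translates naturally into a small dispatch function. The status strings and the name handle_choice below are assumptions for this sketch; only the finish_reason values come from the table.

```python
def handle_choice(finish_reason, content):
    """Map a finish_reason to a (status, payload) pair the application can act on."""
    if finish_reason == "stop":
        return ("ok", content)             # normal completion
    if finish_reason == "length":
        return ("truncated", content)      # caller may retry with a higher limit
    if finish_reason == "content_filter":
        return ("blocked", None)           # content may be null; show a fallback
    if finish_reason == "tool_calls":
        return ("needs_tool", None)        # run the tool, then send the result back
    if finish_reason is None:
        return ("streaming", content)      # intermediate chunk, keep reading
    raise ValueError(f"Unexpected finish_reason: {finish_reason!r}")


print(handle_choice("length", "The answer is"))
```

With a real response you would pass choice.finish_reason and choice.message.content. Raising on unknown values is a deliberate choice here, so that a finish reason a provider adds later surfaces as an error instead of being treated as success.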

Logprobs for uncertainty estimation

When you enable logprobs=True, each output token comes with its log probability: the natural log of the probability the model assigned to that token. This is a powerful signal for understanding model certainty and building confidence-aware systems.

A logprob of 0 means 100% confident (log(1) = 0). A logprob of -0.1 corresponds to a probability of about 0.90, so the model was very confident. A logprob of -5 corresponds to a probability of about 0.007, meaning the chosen token was unlikely and most of the probability mass sat on alternatives. By examining logprobs on key tokens (like "Yes"/"No" for classification tasks), you can estimate confidence without asking the model to describe its own uncertainty.

import math
from openai import OpenAI
client = OpenAI()

def classify_with_confidence(text: str) -> dict:
    """Classify sentiment and return confidence score from logprobs."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify sentiment. Reply with only one word: Positive, Negative, or Neutral."},
            {"role": "user", "content": text}
        ],
        temperature=0,
        max_tokens=1,   # only need the first token
        logprobs=True,
        top_logprobs=3
    )

    choice = response.choices[0]
    label = choice.message.content.strip()

    # Convert logprob to probability
    logprob = choice.logprobs.content[0].logprob
    confidence = math.exp(logprob)

    return {
        "label": label,
        "confidence": round(confidence, 4),
        "top_alternatives": [
            {"token": t.token, "prob": round(math.exp(t.logprob), 4)}
            for t in choice.logprobs.content[0].top_logprobs
        ]
    }

result = classify_with_confidence("The product exceeded all my expectations!")
print(result)
# Example output (exact values will vary):
# {'label': 'Positive', 'confidence': 0.9823, 'top_alternatives': [...]}
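The example above relies on the label fitting in a single token. For longer outputs, one common approach is to sum the per-token logprobs to get the probability of the whole sequence. The sketch below uses made-up tokens and logprobs in place of choice.logprobs.content, and the function names are hypothetical.

```python
import math


def sequence_confidence(token_logprobs):
    """Probability of the whole output: exp of the sum of per-token logprobs."""
    return math.exp(sum(token_logprobs))


def least_confident(tokens, token_logprobs):
    """Return the (token, probability) pair the model was least sure about."""
    index = min(range(len(tokens)), key=lambda i: token_logprobs[i])
    return tokens[index], math.exp(token_logprobs[index])


# Hypothetical values standing in for choice.logprobs.content
tokens = ["Very", " positive", "."]
logprobs = [-0.10, -0.20, -0.05]

print(round(sequence_confidence(logprobs), 4))
print(least_confident(tokens, logprobs))
```

Because probabilities multiply, sequence confidence shrinks as the output gets longer. It is mainly useful for comparing outputs of similar length or for flagging the single weakest token.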

Provider differences: a practical reference

The conceptual model is the same across providers, but field names and behavior differ enough to cause bugs when switching.

Anthropic: max_tokens is required

Unlike OpenAI, Anthropic requires max_tokens on every request — there is no default. The system prompt is a separate top-level parameter, not a messages entry. Response access is via message.content[0].text. Token usage uses input_tokens/output_tokens instead of prompt_tokens/completion_tokens.

OpenAI: max_completion_tokens on newer models

For gpt-4o and later models, max_completion_tokens is preferred over max_tokens. For o-series reasoning models, this is critical because it covers both reasoning tokens and output tokens. Always check which parameter the current model accepts.

Google Gemini: different message schema

Gemini uses "model" instead of "assistant" for the model's role. System instructions go in a separate system_instruction field. The response schema differs significantly. Gemini also has a native multimodal API where images, audio, and video are first-class inputs.

# Anthropic equivalent — same concepts, different API shape
import anthropic
client = anthropic.Anthropic()

# Multi-turn conversation with Anthropic
conversation_history = []

def chat_anthropic(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,          # REQUIRED — no default
        system="You are a helpful coding assistant.",  # separate param, not a message
        messages=conversation_history
    )

    assistant_text = response.content[0].text
    conversation_history.append({"role": "assistant", "content": assistant_text})

    # Anthropic usage fields
    print(f"Input: {response.usage.input_tokens} | Output: {response.usage.output_tokens}")

    return assistant_text

print(chat_anthropic("What is a decorator in Python?"))
print(chat_anthropic("Can you show me an example with functools.wraps?"))
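To make the schema differences concrete, the sketch below converts an OpenAI-style messages list into the shapes described above for Anthropic and Gemini. The function names are hypothetical. The Gemini layout (a contents list whose entries hold parts) reflects my understanding of its request format and should be verified against the current documentation.

```python
def to_anthropic(messages):
    """Split OpenAI-style messages into a system string plus user/assistant turns."""
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    turns = [m for m in messages if m["role"] != "system"]
    return {"system": system, "messages": turns}


def to_gemini(messages):
    """Rename 'assistant' to 'model' and move system text to its own field."""
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    contents = [
        {
            "role": "model" if m["role"] == "assistant" else "user",
            "parts": [{"text": m["content"]}],  # assumed shape, see lead-in
        }
        for m in messages
        if m["role"] != "system"
    ]
    return {"system_instruction": system, "contents": contents}


openai_messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "What is a decorator?"},
]
print(to_anthropic(openai_messages)["system"])
print([c["role"] for c in to_gemini(openai_messages)["contents"]])
```

Keeping one internal message format and converting at the edge like this means conversation state management does not have to change when you switch providers.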

The complete mental model

Every chat completion request is: a system message defining behavior + an ordered list of alternating user/assistant turns + sampling parameters controlling randomness. The API is stateless — you own the history. The response gives you content, a finish reason, token counts, and optionally log probabilities. With this model internalized, you can understand any chat completions API regardless of provider.