ChatGPT API Guide in 2026 (Developers' Practical Reference)

A practical guide — pricing, models, function calling, structured outputs, streaming, and prompt caching — for developers actually shipping with the OpenAI API.

  • The OpenAI API bills per token, not per seat — so a heavy ChatGPT Plus user paying $20/month often costs $1–$3/month on the API for the same volume, and a power user spending $300/month on stacked Plus seats can usually drop to $30 with a thin wrapper.
  • GPT-5 is the cheapest capable default in the lineup — $1.25 per million input tokens and $10 per million output, beating both GPT-4o and Claude Sonnet on price-per-quality for everyday tasks.
  • Function calling (now called "tools") is how you give the model real capabilities — database queries, API calls, code execution — without building a brittle parser around its prose output.
  • Structured outputs guarantee the model returns valid JSON matching your schema, eliminating the "please return valid JSON" prompting dance that defined 2023 codebases.
  • Prompt caching cuts input cost by 50% for any prefix you reuse — system prompts, RAG context, few-shot examples — and is automatic once you put the static content first.

If you are paying $300/month on stacked ChatGPT Plus seats so a small team can churn through long-context tasks, you are funding the wrong meter. A weekend project — a 200-line Python script wired to the OpenAI API with prompt caching, structured outputs, and a basic rate limiter — usually replaces that bill with a $20–$40 monthly invoice and gives you everything Plus users complain about losing: longer effective context, no message caps, no random throttling at 4 PM, and full control over which model handles which task.

This guide is the version a senior engineer would write for a junior teammate on day one — the actual mechanics of shipping with the OpenAI API in 2026. Models and pricing, the request shapes you will use ninety percent of the time, function calling, structured outputs, streaming, prompt caching, the Realtime API for voice, and the four mistakes every team makes on their first production deploy. No "AI is transforming software." Just the stuff you wish someone had told you before you billed your first dollar of tokens.

The 2026 context — what changed since GPT-4

Three shifts matter if your last hands-on with the API was during the GPT-4 Turbo era. First, GPT-5 is the default capable model and it is dramatically cheaper than the model it replaced — input tokens dropped from $5 per million on GPT-4o to $1.25 on GPT-5, and the model is materially better at multi-step reasoning and tool use. The old advice of "use GPT-3.5 for cheap stuff, GPT-4 for hard stuff" is obsolete; GPT-5 is now both the cheap default and the hard-stuff default.

Second, prompt caching went from beta curiosity to the single biggest cost-control lever in the API. Any prefix of 1,024 tokens or more that you reuse across requests is automatically cached at 50% off input price for cache hits. This means a RAG pipeline that crams 30,000 tokens of context into every call now bills the context once per five-minute window instead of once per call — which moves serious workloads from "we cannot afford this" to "this is cheaper than the database query."

Third, the Realtime API and Assistants v2 closed the gap with bespoke voice and agent stacks. You can stream bidirectional audio with sub-300ms latency over a single WebSocket, and Assistants v2 manages threads, file search, and code interpreter so you do not have to reinvent retrieval and tool orchestration on every project. For ninety percent of agent use cases, build-on-API now beats build-from-scratch.

Setup — keys, billing, and rate limits

You will not build anything until the boring infrastructure is correct. Setup comes down to three concrete steps, each of which trips up first-time developers, so handle them in order before you write a single line of code.

Step 1 — Generate an API key the right way

Go to platform.openai.com/api-keys and create a project-scoped key, not a user-scoped one. Project keys can be revoked, rotated, and have their spend capped per project, which becomes critical the first time a misconfigured loop burns through $400 overnight. Never commit the key — put it in .env, add .env to .gitignore, and reference it as process.env.OPENAI_API_KEY or os.environ["OPENAI_API_KEY"]. The official SDKs read this variable automatically.
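
A minimal sketch of the local-dev loading pattern, assuming python-dotenv; in production you would inject the variable through your secret manager instead.

# .env — listed in .gitignore, never committed
# OPENAI_API_KEY=sk-proj-...

from dotenv import load_dotenv   # pip install python-dotenv
from openai import OpenAI

load_dotenv()       # pulls OPENAI_API_KEY into the process environment
client = OpenAI()   # the SDK reads OPENAI_API_KEY on its own — no key argument needed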

Step 2 — Set a hard spend limit before your first request

In Settings → Billing → Limits, set a usage limit ($50 is fine for prototyping) and a soft notification threshold at half that. Without this, a single accidental infinite loop in a streaming handler can run up four-figure bills before anyone notices. The limit is org-wide and OpenAI cuts you off when hit — annoying in production, life-saving in development.

Step 3 — Understand your rate-limit tier

New accounts start at Tier 1 — 500 requests per minute and 200,000 tokens per minute on GPT-5. You move up tiers automatically as you spend, capping at Tier 5 (10,000 RPM, 30M TPM). For most apps Tier 1 is plenty; if you are running batch processing, request a Tier upgrade after you hit $50 in spend. Hitting a rate limit returns a 429 with a Retry-After header — your code must respect it or you will be throttled harder.
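
A minimal sketch of respecting that header, relying on the SDK's built-in max_retries plus a last-resort wrapper; the attempt count and fallback waits are arbitrary choices, not a recommendation.

import time
from openai import OpenAI, RateLimitError

client = OpenAI(max_retries=3)   # the SDK already retries 429s and 5xx with backoff

def call_with_backoff(**kwargs):
    # Last line of defense for sustained throttling after the SDK gives up.
    for attempt in range(5):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError as e:
            # Respect the server's Retry-After hint when present.
            wait = float(e.response.headers.get("retry-after", 2 ** attempt))
            time.sleep(wait)
    raise RuntimeError("still rate limited after 5 attempts")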

Models and pricing — picking the right one

OpenAI publishes a dozen models but you will use three or four in practice. The table below cuts through the catalog and shows the price-per-quality picture as of 2026.

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Best for | What you give up |
|---|---|---|---|---|
| GPT-5 | $1.25 | $10.00 | Default for nearly everything — reasoning, tool use, RAG, writing | Slightly slower than mini variants on simple classification tasks |
| GPT-5-mini | $0.25 | $2.00 | Classification, extraction, routing, low-stakes drafts at high volume | Weaker on multi-hop reasoning; do not use for agent loops |
| GPT-5-nano | $0.05 | $0.40 | Embedding-style classification, simple parsing, smoke-test traffic | Reasoning quality drops sharply; treat it like a fast regex replacement |
| o3 | $2.00 | $8.00 | Hard reasoning — math, debugging, multi-step planning where quality matters more than latency | Slow (10–60s typical), and overkill for 95% of requests |
| GPT-4o | $2.50 | $10.00 | Legacy code paths, multimodal vision tasks where GPT-5 does not improve outcomes | More expensive than GPT-5 with no quality advantage outside niche cases |

The rule of thumb in 2026: start with GPT-5 for everything, profile your costs after two weeks, and only then route specific high-volume routes to mini or nano. The temptation to over-optimize for the cheapest model on day one always backfires — you ship a worse product and lose more in user friction than you save in tokens.
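
If it helps to picture what that routing eventually looks like, here is a sketch — the route names and model assignments are invented for illustration, not recommendations.

# Hypothetical routing table — fill it in from two weeks of real cost data.
MODEL_BY_ROUTE = {
    "chat":           "gpt-5",       # default: quality matters
    "ticket_triage":  "gpt-5-mini",  # high volume, low stakes
    "spam_filter":    "gpt-5-nano",  # near-regex classification
    "offline_debug":  "o3",          # hard reasoning, latency irrelevant
}

def pick_model(route: str) -> str:
    return MODEL_BY_ROUTE.get(route, "gpt-5")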

Basic completion — the request shape you will use most

Ninety percent of the API calls you write will look almost identical: a system message setting context, a user message with the request, and a response. Master this shape and the rest of the API is variations on the theme.

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "Explain prompt caching in two sentences."}
    ],
    temperature=0.3,
    max_tokens=200,
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Three details worth internalizing. temperature at 0.0 is as close to deterministic as the API gets and is the right choice for extraction, classification, and code generation; 0.7 is the right default for prose. max_tokens is a safety belt — set it even when you think you do not need it, because a runaway model continuing past your expected output is the most common cause of surprise bills. And always read response.usage; it is the only honest record of what the call cost.

Function calling (tools) — giving the model real capabilities

A naked language model is a fancy autocomplete. Function calling — now branded "tools" in the API — is what turns it into something useful. You declare the functions your code can run (with JSON Schema for arguments), the model decides when to call them and with what arguments, your code executes them, and the model uses the results to compose its final answer.

import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
# get_weather is your own Python function — the model only picks the arguments
result = get_weather(args["city"], args.get("unit", "celsius"))

The trap nearly every team falls into is treating tools as one-shot calls. They are not — the natural shape is a loop. The model calls a tool, you append the result to messages as a tool-role message, you call the API again, and the model either calls another tool or produces the final answer. Most production agent code is fifty lines of loop and four hundred lines of tool definitions. Start with one tool, prove the loop, then add the rest.
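
A minimal sketch of that loop, continuing the get_weather example above; error handling and a cap on iterations are left out for brevity.

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

while True:
    response = client.chat.completions.create(model="gpt-5", messages=messages, tools=tools)
    msg = response.choices[0].message

    if not msg.tool_calls:
        print(msg.content)        # no more tool calls — this is the final answer
        break

    messages.append(msg)          # keep the assistant's tool-call turn in the history
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(args["city"], args.get("unit", "celsius"))
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })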

Structured outputs — guaranteed JSON that matches your schema

For years, getting reliable JSON out of an LLM meant pleading with it in the system prompt and parsing defensively. Structured outputs solved this with a constraint-decoding feature: the model is forced at the token level to emit only output valid against your JSON Schema. No more retry loops, no more "remove the markdown fence around the JSON" hacks.

from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    total_usd: float
    line_items: list[str]
    due_date: str

response = client.chat.completions.parse(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "Extract invoice data from the user message."},
        {"role": "user", "content": invoice_text}
    ],
    response_format=Invoice,
)

invoice = response.choices[0].message.parsed
print(invoice.total_usd)

The Python SDK auto-generates the schema from a Pydantic model, which is why this snippet is six lines instead of forty. In TypeScript, use Zod with the helper from the official SDK and you get the same ergonomics. Once you have used structured outputs you will never write a JSON-extraction prompt again.

Streaming — making the UI feel fast

A user staring at a blank screen for eight seconds while the model finishes thinks your product is broken. Streaming fixes this by sending tokens as they are generated, so the first words appear within a couple hundred milliseconds — the total wait is the same, but the perceived wait is far shorter. Pass stream=True and iterate over the response — the SDK handles the server-sent events plumbing.

stream = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Write a short poem about caching."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

The catch nobody warns you about: when streaming, you do not get a final usage object by default. Pass stream_options={"include_usage": True} to get token counts in the last chunk, otherwise your cost telemetry will silently drop streaming requests and you will spend a week wondering why your dashboard says nothing happened.
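
The same snippet with the usage option wired in — the final chunk arrives with an empty choices list and the token totals attached.

stream = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Write a short poem about caching."}],
    stream=True,
    stream_options={"include_usage": True},    # ask for a final usage chunk
)

usage = None
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage is not None:                # only set on the very last chunk
        usage = chunk.usage

print(f"\nTokens used: {usage.total_tokens}")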

Prompt caching — the 50% discount nobody activates

This is the biggest "free money" feature in the API and most teams ignore it because it is invisible. Any prefix of your messages array that is at least 1,024 tokens long and identical to a previous request within five minutes hits the cache and bills at 50% of the input price. The cache is automatic — you do not call any "cache" API. You simply put your stable content first.

Concretely: if your system prompt is 800 tokens of instructions plus 5,000 tokens of retrieved RAG context plus 200 tokens of few-shot examples, that 6,000-token prefix is identical across requests in a session. Send the user message last, and from the second request onward the 6,000 tokens cost half. For a chatbot with 10,000 sessions a day each making five turns, that is the difference between a $300/day bill and a $170/day bill — same product, same model, same answers, no code change beyond message order.
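
A minimal sketch of that ordering, reusing the client from earlier — retrieve_context, query, and user_message stand in for your own RAG plumbing, and the cached-token readout assumes the usage field names from the current API.

STATIC_INSTRUCTIONS = "..."                 # ~800 tokens, identical on every request
rag_context = retrieve_context(query)       # ~5,000 tokens, identical within a session

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": STATIC_INSTRUCTIONS + "\n\n" + rag_context},
        {"role": "user", "content": user_message},   # the only part that changes per turn
    ],
)

# From the second request in the five-minute window onward, this should be > 0.
print(response.usage.prompt_tokens_details.cached_tokens)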

Realtime API — voice in production

The Realtime API is a WebSocket-based duplex audio channel that takes user speech in and emits model speech out with sub-300ms latency. It runs on a dedicated voice-tuned variant of GPT-5 and replaces the brittle three-stage pipeline (Whisper → GPT → TTS) most teams ran in 2024. You connect, you stream audio frames, you receive audio frames, and the model handles barge-in, interruption, and turn-taking natively.

It is priced higher than text — roughly $32 per million input audio tokens and $64 per million output — but the latency and quality are hard to match with anything you can assemble yourself. Use it for voice agents, accessibility features, and live translation. Do not use it for non-realtime transcription; Whisper is twenty times cheaper for that.

Common mistakes — what burns every team in production

  • No error handling around the API call. The OpenAI API returns 429 (rate limit), 500 (server error), and 503 (overloaded) regularly under load. Wrap every call in retry logic with exponential backoff and respect the Retry-After header. The official SDKs have max_retries built in — set it to 3 and move on. Code without retries will fail in production within the first week.
  • No rate-limit handling at the application layer. SDK retries handle transient bursts, but if your app sends 1,000 concurrent requests and you are on Tier 1 with a 500 RPM limit, you will spend ten minutes in 429 hell. Use a semaphore, a token-bucket queue, or a job system like Celery/BullMQ to cap in-flight requests below your tier limit — see the sketch after this list.
  • Not counting tokens before sending. A request that exceeds the model's 400K context window does not gracefully truncate — it 400s. Use tiktoken to count tokens client-side before the call, and either truncate, summarize, or split the input. The five lines of token-counting you skip on day one cause the production incident on day forty.
  • Ignoring the context window in long conversations. Chat history grows linearly. By turn fifty you may be sending 50K tokens of history on every request, paying for it every time, and the model is paying attention to less than half of it. Implement a sliding window or a rolling summary by turn ten — your costs and quality will both improve.
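
A minimal sketch of the application-level cap from the second bullet, using the async client and a semaphore; the limit of 20 concurrent requests is illustrative — derive yours from your tier.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
sem = asyncio.Semaphore(20)        # illustrative — keep well under your RPM/TPM tier

async def bounded_call(messages: list[dict]):
    async with sem:                # never more than 20 requests in flight
        return await client.chat.completions.create(model="gpt-5", messages=messages)

async def run_batch(all_messages: list[list[dict]]):
    return await asyncio.gather(*(bounded_call(m) for m in all_messages))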

FAQ

Is the OpenAI API cheaper than ChatGPT Plus?

For most individual users, yes — significantly. A heavy Plus user sending the equivalent of 200 long messages a day costs roughly $1–$3/month on GPT-5 via the API versus $20/month for Plus. The break-even sits well above what individuals consume — at GPT-5 pricing, $20 buys roughly 8 million input tokens plus a million output tokens a month. Plus is worth it for the official UI, voice mode, and integrations like Operator and DALL-E; the API is worth it for everything you build yourself.

Should I use the Assistants API or build my own loop on Chat Completions?

Build your own loop unless you specifically need file search, code interpreter, or persistent threads. Chat Completions with structured outputs and tool calling covers ninety percent of agent use cases, gives you full control over state, and is faster to debug. Assistants API is great for "I need an in-process retrieval engine and do not want to set up a vector DB" situations and not much else.

How do I count tokens before making a request?

Use the tiktoken library in Python or js-tiktoken in Node. Get the encoding for your model (tiktoken.encoding_for_model("gpt-5")) and call .encode(text) — the length of the result is the token count. For chat-style messages, add ~4 tokens of overhead per message and 3 tokens for the response priming. The exact formula is in the OpenAI cookbook.
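
A counting helper along those lines — the fallback encoding is an assumption for the case where the installed tiktoken release does not know the model name yet.

import tiktoken

def count_tokens(text: str, model: str = "gpt-5") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("o200k_base")   # assumption: reasonable fallback
    return len(enc.encode(text))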

What is the actual difference between GPT-5 and o3?

GPT-5 is a fast general-purpose model that produces output in 1–4 seconds. o3 is a reasoning model that "thinks" before answering — it generates internal reasoning tokens (which you pay for but do not see) before emitting the visible answer, making it slower (10–60s) but much better on hard math, code debugging, and multi-step planning. Use GPT-5 for production user-facing requests and o3 for offline analysis or hard problems where latency does not matter.

Can I fine-tune GPT-5?

Fine-tuning is available on GPT-5-mini and GPT-4o, not on the full GPT-5 model. In practice, ninety percent of fine-tuning use cases — voice and tone matching, format conformance, domain vocabulary — are now better solved with structured outputs and a strong system prompt. Fine-tune only if you have shipped a base-model version, profiled real failures, and have a labeled dataset of at least 500 examples that addresses those specific failures.

How do I keep API costs predictable in production?

Four levers, in order of impact: (1) put your stable content first to enable prompt caching; (2) route low-stakes calls to GPT-5-mini or nano; (3) cap max_tokens on every call; (4) implement a per-user rate limit so a single misbehaving client cannot run up the bill. Set OpenAI dashboard alerts at 50% and 90% of your monthly budget, and review the usage breakdown weekly during the first month — you will spot a wasteful pattern in the first review.

The bottom line

The OpenAI API in 2026 is mature in a way it was not in 2024. Pricing is low enough that hobby projects are essentially free, structured outputs eliminate the JSON-parsing tax, prompt caching makes long-context applications affordable, and the Realtime API closes the voice gap. The remaining differentiator is engineering discipline — the teams shipping faster are not the ones with the cleverest prompts but the ones with retries, token counting, structured outputs, and per-route model selection wired in from day one.

If you are still on stacked Plus subscriptions for things that should be code, or running 2023-era prompt engineering rituals to coerce JSON out of the model, you are leaving real money and real reliability on the table. The week you spend porting to the API pays back forever.

Key takeaways

  • Default to GPT-5 — it is the cheapest capable model and the right choice for most requests.
  • Use structured outputs instead of prompting for JSON — it is faster, cheaper, and never fails parsing.
  • Order your messages with stable content first to get automatic 50% prompt caching discounts.
  • Wrap every call in retries with exponential backoff and respect Retry-After headers.
  • Count tokens before sending and set max_tokens on every call to prevent surprise bills.
  • Use the Realtime API for live voice — the legacy Whisper → GPT → TTS pipeline is obsolete for real-time use.

Ship your AI product the right way

Built something on top of the OpenAI API and need a place to send users? UniLink gives you a one-link landing page with payments, lead capture, and analytics in five minutes — so the API key you just generated turns into a real product, not a localhost demo.

Start free on UniLink →