IONIAN Blog — AI Engineering

How Claude and Codex Actually Bill Tokens in 2026 (And Why Your AI Bill Changed)

The headline prices barely moved, but the meter changed: Claude's new tokenizer and flat 1M context, and Codex's switch to per-token credit metering with visible reasoning tokens. Here is the new math, cross-checked against the official docs.

How Claude and Codex Actually Bill Tokens in 2026 (And Why Your AI Bill Changed)

If you ship anything built on Claude or OpenAI Codex, your bill probably moved in 2026 — and not always for an obvious reason. The headline prices barely changed, yet two quiet shifts rewired how the meter actually runs: Claude switched to a new tokenizer and made its full 1M context standard, and Codex moved from per-message caps to per-token credit metering with reasoning tokens you can finally see.

This is the founder-and-engineer version of how token billing really works right now. The numbers below were cross-checked against the official Anthropic and OpenAI pricing and docs in June 2026. Pricing changes often — treat this as a working map, not a permanent contract, and confirm the live rate cards before you sign off on a budget.

The one mental model that survives every pricing change

Both providers bill the same fundamental way: per token, with input and output priced separately, and output costing far more than input. Across the current lineups, output runs roughly 5x the input rate. A "token" is about 4 characters of English, but the exact count depends on the model's tokenizer — which is exactly where 2026 got interesting.

Everything else — caching, reasoning, long context, tools — is a modifier on top of that base.

Claude: the price held, but the token count grew

Here is the current first-party Claude lineup (per million tokens, input / output):

Opus 4.8 / 4.7 / 4.6 — $5 in / $25 out
Sonnet 4.6 — $3 in / $15 out
Haiku 4.5 — $1 in / $5 out

Current Opus is about 3x cheaper than the old Opus 4.1 ($15 / $75), which is a real win. But there is a catch that surprises teams after they upgrade:

Opus 4.7 and later use a new tokenizer that can use up to ~35% more tokens for the same text. The per-token price did not rise, but the token count did — so the same prompt can cost more on 4.8 than it did on an older model. Do not paper over this with a blanket multiplier. Re-baseline real prompts with the count_tokens endpoint and budget from the actual number.

The good news that offsets it: the full 1M-token context window is now standard pricing. On Opus 4.8/4.7/4.6 and Sonnet 4.6 there is no long-context premium — a 900k-token request bills at the same per-token rate as a 9k-token one. The old "premium above 200K" tier is gone. For retrieval-heavy and long-document apps, that single change quietly reset what is affordable. (Haiku 4.5 stays at a 200K window.)

Caching is the lever that actually moves your Claude bill

Prompt caching is where the real savings live, and most teams under-use it. Relative to the base input rate:

Cache write (5-minute TTL): 1.25x
Cache write (1-hour TTL): 2x
Cache read (a hit): 0.1x

A cache read costs one tenth of normal input. On Opus 4.8 that is $0.50 per million cached-read tokens instead of $5. With the 5-minute cache you break even after roughly a single re-read, so any prompt with a stable prefix — a long system prompt, a knowledge base, a tool schema — should be cached. The minimum cacheable prefix is 1,024 tokens on Opus 4.8 and Sonnet 4.6 (4,096 on Haiku 4.5). Your usage response splits cache_creation and cache_read; the input_tokens field is only the uncached remainder.

The line items people forget

Thinking is billed as output — even when it's hidden. Extended/adaptive thinking tokens count as output tokens. They are the single biggest "where did my output bill come from" surprise.
Count tokens with the right tokenizer. The free count_tokens endpoint is model-specific. OpenAI's tiktoken is not built for Claude and noticeably under-counts it — do not budget Claude with it.
Tools add overhead. Tool definitions and the tool-use system prompt add a few hundred tokens of input before your content.
Images and search cost tokens (and dollars). An image runs about width × height / 750 tokens. Server-side web search is billed separately (around $10 per 1,000 searches), and code execution has a monthly free allotment before a per-hour container charge.

Codex: from message caps to a real token meter

The biggest conceptual change on the OpenAI side: "Codex" is no longer one model. It is OpenAI's coding agent — CLI, cloud agent, and IDE extension — running on a family of GPT-5.x models (GPT-5.5 as the flagship, plus GPT-5.4, GPT-5.4-mini, and GPT-5.3-Codex). And in 2026 it moved off vague per-message limits onto per-token credit metering, so usage finally maps to tokens the way the raw API does.

You pay for Codex in one of two ways:

ChatGPT plan credits (Plus, Pro, Business). Your Codex usage draws from plan limits — no separate per-token invoice — and the credit cost of a turn is derived from the underlying token usage.
API-key billing, at standard API token rates.

Current API rates (per million, input / cached input / output):

GPT-5.5 — $5 / $0.50 / $30
GPT-5.4 — $2.50 / $0.25 / $15
GPT-5.4-mini — $0.75 / $0.075 / $4.50
GPT-5.3-Codex — $1.75 / $0.175 / $14

Three things that decide your Codex bill

1. Caching is automatic and huge. Cached input is about a 90% discount (one tenth of the input rate), it turns on automatically once a prompt passes ~1,024 tokens, and it keys off a matching prefix — so put the stable, reusable context first and let repeated runs ride the cache. No code change required.

2. Reasoning tokens are hidden but billed as output. GPT-5.x reasoning tokens do not show up in the response text, but they occupy context and are billed as output, reported under reasoning_tokens. The reasoning-effort setting — none, minimal, low, medium, high, xhigh — directly scales how many of those tokens you generate. Medium is the GPT-5.5 default and a sane starting point; high/xhigh can multiply the cost of a single task. Leave roughly 25,000 tokens of headroom for reasoning plus output, or long tasks can truncate.

3. There's a long-context cliff at 272K. Push a GPT-5.5 prompt past 272,000 input tokens and the entire session reprices to 2x input and 1.5x output — effectively $10 in / $45 out. The window goes to ~1.05M tokens with 128K max output, but everything above that threshold is premium. Keep prompts lean, and lean on the stateful Responses API and Codex's automatic context compaction so you are not re-sending (and re-paying for) the whole history every turn. One gotcha: chaining responses still re-bills the prior input as input.

Old way vs. new way, in one breath

Claude, then: a long-context premium above 200K and fixed thinking budgets. Now: flat 1M-context pricing, adaptive thinking billed as output, and a model-specific tokenizer you should measure with count_tokens.
Codex, then: opaque per-message caps, invisible reasoning, one coding model. Now: per-token credit metering, reasoning billed as output with an effort dial, automatic ~90% input caching, and a whole GPT-5.x family.

How to actually spend less

Cache the stable prefix. On both platforms, putting your system prompt, schemas, and reference context first and caching it is the highest-leverage cost move there is.
Right-size reasoning effort. Don't run high reasoning on tasks that don't need it. The effort dial is a cost dial.
Re-baseline after every model upgrade. Especially on Claude, where the tokenizer changed. Measure real prompts; don't assume.
Respect the 272K cliff and the output multiplier. Trim context, compact history, and remember output (including hidden reasoning/thinking) is the expensive half.
Track cost per successful task, not cost per token. A model that's twice the token price but finishes the job in one pass is often cheaper in practice.

This is exactly the math we run before we quote an AI build. If you're trying to size what an AI feature or MVP will actually cost to operate, our MVP cost estimator is a fast first pass, and our services page covers how we design AI products to be cheap to run, not just impressive in a demo. Want a second set of eyes on your token spend or your build plan? Talk to us.

Figures verified against Anthropic and OpenAI pricing and documentation as of June 2026. Always confirm the current rate cards before committing a budget.