Token Pricing / Reference

LLM pricing comparison

Per-million-token prices across the major models. Prices change every few months; most recently verified 2026-05-01. Always confirm against the provider's pricing page for production budgeting.

| Model | Provider | Input / 1M | Cached / 1M | Output / 1M | Context | Tier |
|---|---|---|---|---|---|---|
| Ministral 3B [1] | Mistral | $0.04 | n/a | $0.04 | 128K | fast |
| GPT-5 Nano | OpenAI | $0.05 | $0.005 | $0.40 | 400K | fast |
| GPT-4.1 Nano | OpenAI | $0.05 | $0.025 | $0.20 | 1M | fast |
| Llama 3.1 8B Instant (Groq) [2] | Groq | $0.05 | n/a | $0.08 | 128K | fast |
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.025 | $0.40 | 1M | fast |
| Mistral Small 3 | Mistral | $0.10 | n/a | $0.30 | 128K | fast |
| Llama 4 Scout (Groq) [3] | Groq | $0.11 | n/a | $0.34 | 128K | balanced |
| GPT-5 Mini | OpenAI | $0.125 | $0.025 | $1.00 | 400K | fast |
| DeepSeek V4 Flash [4] | DeepSeek | $0.14 | $0.028 | $0.28 | 1M | balanced |
| GPT-4o Mini | OpenAI | $0.15 | $0.075 | $0.60 | 128K | fast |
| Grok 4 Fast [5] | xAI | $0.20 | $0.05 | $0.50 | 2M | fast |
| Llama 4 Maverick (Together) [6] | Meta (via Together / Replicate) | $0.27 | n/a | $0.85 | 1M | balanced |
| Gemini 2.5 Flash [7] | Google | $0.30 | $0.075 | $2.50 | 1M | balanced |
| Codestral [8] | Mistral | $0.30 | n/a | $0.90 | 256K | fast |
| GPT-4.1 Mini | OpenAI | $0.40 | $0.10 | $1.60 | 1M | fast |
| Qwen 3.5 Plus | Alibaba (Qwen) | $0.40 | n/a | $2.40 | 128K | balanced |
| GPT-3.5 Turbo [9] | OpenAI | $0.50 | n/a | $1.00 | 16K | fast |
| Gemini 3 Flash Preview | Google | $0.50 | $0.05 | $3.00 | 1M | balanced |
| Mistral Large 3 [10] | Mistral | $0.50 | n/a | $1.50 | 128K | balanced |
| Command R (Cohere) | Cohere | $0.50 | n/a | $1.50 | 128K | fast |
| o3-mini (reasoning) [11] | OpenAI | $0.55 | $0.55 | $2.20 | 200K | reasoning |
| Llama 3.3 70B (Groq) [12] | Groq | $0.59 | n/a | $0.79 | 128K | fast |
| Llama 3.3 70B (Together) | Meta (via Together / Replicate) | $0.88 | n/a | $0.88 | 128K | balanced |
| Claude Haiku 4.5 | Anthropic | $1.00 | $0.10 | $5.00 | 200K | fast |
| Qwen 3 Max [13] | Alibaba (Qwen) | $1.04 | n/a | $4.16 | 256K | flagship |
| o4-mini (reasoning) [14] | OpenAI | $1.10 | $0.275 | $4.40 | 200K | reasoning |
| GPT-5 | OpenAI | $1.25 | $0.125 | $10.00 | 400K | flagship |
| Gemini 2.5 Pro [15] | Google | $1.25 | $0.125 | $10.00 | 2M | flagship |
| DeepSeek V4 Pro (reasoning) [16] | DeepSeek | $1.74 | $0.145 | $3.48 | 1M | reasoning |
| GPT-4.1 | OpenAI | $2.00 | $0.50 | $8.00 | 1M | balanced |
| o3 (reasoning) [17] | OpenAI | $2.00 | $0.50 | $8.00 | 200K | reasoning |
| Gemini 3.1 Pro Preview [18] | Google | $2.00 | $0.20 | $12.00 | 1M | flagship |
| GPT-5.4 | OpenAI | $2.50 | $0.25 | $15.00 | 1M | flagship |
| GPT-4o | OpenAI | $2.50 | $1.25 | $10.00 | 128K | balanced |
| Command A (Cohere) [19] | Cohere | $2.50 | n/a | $10.00 | 256K | flagship |
| Command R+ (Cohere) | Cohere | $2.50 | n/a | $10.00 | 128K | balanced |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $0.30 | $15.00 | 1M | balanced |
| Grok 4 (reasoning) [20] | xAI | $3.00 | $0.75 | $15.00 | 256K | reasoning |
| Grok 3 | xAI | $3.00 | $0.75 | $15.00 | 128K | balanced |
| Llama 3.1 405B (Together) [21] | Meta (via Together / Replicate) | $3.50 | n/a | $3.50 | 128K | flagship |
| Claude Opus 4.7 [22] | Anthropic | $5.00 | $0.50 | $25.00 | 1M | flagship |
| GPT-5.5 (reasoning) [23] | OpenAI | $5.00 | $0.50 | $30.00 | 1M | reasoning |
| Llama 3.1 405B (Cerebras) [24] | Cerebras | $6.00 | n/a | $12.00 | 128K | flagship |

Table notes:

[1] Cheapest Mistral model; viable for edge / on-device deployment.
[2] Cheapest small-model option; very fast on Groq.
[3] Llama 4 mixture-of-experts; cheap and fast on Groq.
[4] Replaces the deprecated DeepSeek V3. Automatic disk-based context cache with no cache-write fee; cache-hit price reduced to ~10% of the cache-miss price as of 2026-04-26.
[5] Strong price/performance: 2M context at $0.20/M input.
[6] Larger Llama 4 mixture-of-experts; 1M context.
[7] Audio input is priced higher, at $1.00/M.
[8] Code-specialized; 256K context for whole-repo prompts.
[9] Legacy model with no prompt-caching support; still widely used for cost-sensitive routing.
[10] Replaces Mistral Large 2 with a significant price drop ($2/$6 → $0.50/$1.50).
[11] Cached input is priced the same as base input (no caching discount). Output includes hidden reasoning tokens.
[12] Same model as the Together listing, hosted on Groq's LPU chips: roughly 2× faster and slightly cheaper.
[13] Strong open-weights alternative, widely deployed in 2026.
[14] Output includes hidden reasoning tokens.
[15] Tiered pricing: $2.50/M input and $15/M output above 200K tokens.
[16] Replaces the deprecated DeepSeek R1. Currently 75% off through 2026-05-31, for an effective $0.435/$0.87. Output includes hidden reasoning tokens.
[17] OpenAI cut o3 prices substantially in 2025. Output includes hidden reasoning tokens; actual cost is typically 4–10× the output sticker price.
[18] Tiered pricing: $4/M input and $18/M output above 200K tokens. Thinking tokens are included by default.
[19] Cohere's 2026 flagship at the same price as Command R+; positioned for enterprise/RAG rather than raw benchmark wins.
[20] xAI's 2026 flagship. Reasons by default; output includes hidden reasoning tokens.
[21] Open-weights flagship, though it falls behind the 2026 frontier on benchmarks.
[22] Adaptive thinking adds hidden reasoning tokens to output. A new tokenizer uses ~35% more tokens than older Claude models, so actual API cost runs ~35% higher than shown.
[23] Tops the Artificial Analysis Intelligence Index v4. Output includes hidden reasoning tokens; real cost is typically 4–10× the output sticker price.
[24] Cerebras serves the 405B at ~970 tok/s vs Together's ~50, about 19× faster on the same model; a premium price for premium speed.

How to read this table

Prices are USD per 1,000,000 tokens. Most modern APIs bill in fractions of a cent; a typical chat-completion call costs $0.0005–$0.05 depending on length and model.
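The arithmetic for a single call is two multiplications; a minimal sketch in Python, using illustrative rates from the table above:

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_per_m: float, output_per_m: float) -> float:
    """USD cost of one API call, given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# A typical chat completion: 1,500 prompt tokens, 400 response tokens
# on GPT-4o Mini ($0.15 input / $0.60 output per the table).
print(f"${call_cost(1_500, 400, 0.15, 0.60):.6f}")  # $0.000465
```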

Input

What you pay per token of context sent to the model: system prompt, user message, conversation history, embedded documents, tool definitions. Anything that enters the prompt window is billed as input.

Cached input

When the same prefix (system prompt, document, etc.) is reused across requests, providers offer large discounts on the cached portion. Anthropic and Google charge ~10% of the input price for cached reads; OpenAI applies its discount automatically, ranging from 50% to 90% depending on the model (compare the Cached column above). Caching is the biggest cost lever in any production app — see the deep guide.
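To see how much the discount moves the bill, here is a sketch of the cached-read math (it ignores any one-time cache-write fee a provider may charge; the prompt sizes are assumptions):

```python
def prompt_cost(total_input: int, cached: int,
                input_per_m: float, cached_per_m: float) -> float:
    """Input-side USD cost when `cached` of the tokens hit the prompt cache."""
    fresh = total_input - cached
    return (fresh * input_per_m + cached * cached_per_m) / 1_000_000

# 20K-token prompt where an 18K prefix (system prompt + docs) is cached,
# at Claude Sonnet 4.6 rates from the table ($3.00 input, $0.30 cached).
cold = prompt_cost(20_000, 0, 3.00, 0.30)       # first request: $0.06
warm = prompt_cost(20_000, 18_000, 3.00, 0.30)  # cache hit: $0.0114
print(f"savings: {1 - warm / cold:.0%}")        # savings: 81%
```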

Output

Per token of model response. Typically 3–5× the input price. Reasoning models (o3, Grok 4, DeepSeek V4 Pro) bill hidden reasoning tokens in this bucket — the visible response is only a fraction of what you pay for.
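A rough way to budget for hidden reasoning tokens is to apply a total-to-visible multiplier you calibrate from your own traffic; the 6× used here is an assumption, not a quoted rate:

```python
def reasoning_output_cost(visible_tokens: int, total_to_visible: float,
                          output_per_m: float) -> float:
    """Output USD cost when billed tokens = visible tokens x an estimated
    total-to-visible ratio (hidden reasoning tokens included)."""
    billed = visible_tokens * total_to_visible
    return billed * output_per_m / 1_000_000

# 500 visible tokens on o3 ($8.00/M output), assuming ~6x total billed
# tokens, inside the 4-10x range quoted for reasoning models.
print(f"${reasoning_output_cost(500, 6.0, 8.00):.4f}")  # $0.0240
```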

Context window

Maximum tokens the model can handle in a single request. Bigger is not always better: most models suffer "lost in the middle" degradation beyond ~50K tokens, recalling material from the middle of a long context less reliably than from the start or end. Plan for the smallest window that fits your task; you'll get better quality and lower latency.

Pricing tiers explained

fast: the cheapest, lowest-latency models, suited to high-volume or simple tasks.
balanced: mid-price workhorses that cover most production traffic.
flagship: a provider's most capable general-purpose model, at the top of its price range.
reasoning: models that spend hidden reasoning tokens before answering; those tokens are billed as output.

The 4 cost-cutting moves that actually work

  1. Pick the right tier. A surprising amount of production traffic doesn't need a flagship model. Run an A/B on fast vs balanced before scaling.
  2. Cache the static prefix. System prompts, document context, and tool schemas don't change between turns — cache them. Caching is the single biggest lever in most apps.
  3. Constrain output. Use structured output (JSON mode, Zod schema, tool-call format) and ask for the shortest useful response. Output tokens are typically 3-5× input.
  4. Avoid reasoning models when you don't need them. The hidden token cost is real. For tasks that don't require step-by-step working, a non-reasoning model is 5-15× cheaper.
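Move 1 is easy to quantify before running the A/B: project monthly spend for the same traffic profile at each tier. A sketch with rates from the table; the traffic profile itself is an assumption:

```python
def monthly_cost(requests: int, in_tok: int, out_tok: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Projected monthly USD spend for a fixed traffic profile."""
    per_call = (in_tok * input_per_m + out_tok * output_per_m) / 1_000_000
    return requests * per_call

# 1M requests/month at 2K input / 300 output tokens each:
tiers = {
    "fast (GPT-5 Mini)":            (0.125, 1.00),
    "balanced (Claude Sonnet 4.6)": (3.00, 15.00),
    "flagship (GPT-5)":             (1.25, 10.00),
}
for name, (inp, out) in tiers.items():
    print(f"{name}: ${monthly_cost(1_000_000, 2_000, 300, inp, out):,.0f}/mo")
```

With these assumed rates the fast tier comes out around $550/month versus $5,500 for GPT-5 and $10,500 for Sonnet on identical traffic — a 10–20× spread that an A/B on output quality either justifies or doesn't.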

Provider pricing pages (verify before you commit)
