The workload we're costing
Calculations below assume a typical chatbot use case based on public production data (Discord bots, customer-support copilots, consumer chatbots):
| Variable | Value |
|---|---|
| Messages per active user per day | 5 |
| System prompt size | 1,500 tokens |
| Conversation history (avg, in window) | 1,000 tokens |
| User input per message | 80 tokens |
| Response per message | 300 tokens |
| Days/month per active user | 30 |
Adjust these in the calculator for your specific workload — most apps' values are within 30% of these defaults.
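If you want to sanity-check the table below or plug in your own workload, the math is just token counts times price per million. Here is a minimal Python sketch of the uncached case; the function, its defaults, and the example prices are illustrative placeholders, not any provider's actual rates:

```python
# Per-user monthly cost for the workload above, uncached.
# The prices passed in are placeholders; use your provider's current rates.

def monthly_cost_per_user(
    input_price_per_m: float,        # $ per 1M input tokens
    output_price_per_m: float,       # $ per 1M output tokens
    messages_per_day: int = 5,
    system_prompt_tokens: int = 1_500,
    history_tokens: int = 1_000,
    user_input_tokens: int = 80,
    response_tokens: int = 300,
    days_per_month: int = 30,
) -> float:
    messages = messages_per_day * days_per_month
    input_tokens = messages * (system_prompt_tokens + history_tokens + user_input_tokens)
    output_tokens = messages * response_tokens
    return (input_tokens / 1e6) * input_price_per_m + (output_tokens / 1e6) * output_price_per_m

# Example with made-up prices of $0.10/M input and $0.40/M output:
print(f"${monthly_cost_per_user(0.10, 0.40):.4f}/user/month")   # -> $0.0567/user/month
```

Multiply the result by your monthly active user count to get the total bill.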
Cost ranking, with and without caching
Sorted by per-user-per-month cost with system prompt cached (which is what you should actually do):
| Model | Uncached ($/user/mo) | Cached ($/user/mo) | 10K users/mo |
|---|---|---|---|
| Ministral 3B | $0.0173 | $0.0173 | $172.80 |
| GPT-4.1 Nano | $0.0283 | $0.0227 | $227.25 |
| Llama 3.1 8B Instant (Groq) | $0.0229 | $0.0229 | $229.50 |
| GPT-5 Nano | $0.0373 | $0.0272 | $272.25 |
| Gemini 2.5 Flash-Lite | $0.0567 | $0.0398 | $398.25 |
| DeepSeek V4 Flash | $0.0668 | $0.0416 | $415.80 |
| Mistral Small 3 | $0.0522 | $0.0522 | $522.00 |
| GPT-4o Mini | $0.0851 | $0.0682 | $681.75 |
| GPT-5 Mini | $0.0934 | $0.0709 | $708.75 |
| Gemini 2.5 Flash | $0.229 | $0.178 | $1780 |
| Llama 3.3 70B (Groq) | $0.264 | $0.264 | $2639 |
| Claude Haiku 4.5 | $0.826 | $0.553 | $5528 |
The Uncached and Cached columns are dollars per user per month. The "10K users/mo" column is the total monthly bill for a chatbot serving 10,000 active users at this workload.
Verdict by scale
Solo project / under 1,000 users
Pick on capability, not cost — the bill is too small to matter. At 1,000 users/month a Claude Sonnet 4.6 chatbot with caching runs roughly $1,660/month total. Use the model your evals say is best.
1,000–10,000 users
Cost starts to matter, but quality still matters more. Sweet spot: Gemini 2.5 Flash-Lite, GPT-5 Nano, or Claude Haiku 4.5. All three handle a routine chatbot well and all three are caching-friendly; Flash-Lite and GPT-5 Nano stay well under $1,000/month even at 10,000 users, while Haiku only stays under $1,000 toward the low end of this range (about $5,500/month at 10,000 users).
10,000–100,000 users
Cost dominates. Optimize hard: cache aggressively, cap output length, route obvious queries to small models, escalate only complex queries to the bigger ones. Top picks: GPT-5 Nano ($2723/mo for 100K users) or Gemini Flash-Lite.
100,000+ users
At this scale every cost decision compounds. Consider:
- Multi-model routing — cheap model for FAQ-style queries, escalate to GPT-5 Mini / Claude Sonnet for complex ones (see the routing sketch after this list)
- Volume-tier negotiation — Anthropic, OpenAI, and Google all do enterprise discounts above ~$10K/month spend
- Self-hosted Llama on dedicated GPUs becomes economically viable above ~$30K/month API spend
- Inference-side optimization — drop conversation history older than 10 turns, use smaller responses, structured outputs only when needed
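As a concrete (if toy) illustration of the routing bullet above, here is a sketch of a rule-based router. The model identifiers, thresholds, and keywords are placeholders; many teams use a small classifier model instead of keyword rules:

```python
# Toy multi-model router: send routine queries to a cheap model, escalate the rest.
# Model identifiers and the heuristic below are illustrative, not recommendations.

CHEAP_MODEL = "cheap-small-model"        # e.g. a Nano / Flash-Lite tier model
FLAGSHIP_MODEL = "flagship-model"        # e.g. a Mini / Sonnet tier model

ESCALATION_KEYWORDS = ("refund", "legal", "complaint", "cancel my account")

def pick_model(user_message: str, history_turns: int) -> str:
    looks_complex = (
        len(user_message) > 500                      # long, detailed requests
        or history_turns > 10                        # long back-and-forth threads
        or any(kw in user_message.lower() for kw in ESCALATION_KEYWORDS)
    )
    return FLAGSHIP_MODEL if looks_complex else CHEAP_MODEL
```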
The caching effect (why "cheapest" depends on what you cache)
For chatbots specifically, caching is the single biggest cost lever. The 1.5K-token system prompt is identical for every conversation, which means it's the perfect caching target. Anthropic gives 90% off cached reads; OpenAI gives 50% automatically.
Without caching, Claude Haiku 4.5 with the workload above costs about $0.826/user/month. With Anthropic 5-min caching: $0.553/user/month — a 33% cost reduction. Caching deep dive →
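As a rough sketch of how the cached column above is derived: split the input into the cached system prompt and the fresh per-turn tokens, then bill the cached portion at the discounted read rate. This is an approximation that ignores cache-write surcharges and assumes the cache stays warm between messages:

```python
# Approximate cached per-user monthly cost. Discount and prices are inputs;
# cache-write surcharges are ignored and the cache is assumed to stay warm.

def monthly_cost_cached(
    input_price_per_m: float,
    output_price_per_m: float,
    cached_read_discount: float,     # 0.90 for Anthropic-style, 0.50 for OpenAI-style
    messages_per_day: int = 5,
    system_prompt_tokens: int = 1_500,
    history_tokens: int = 1_000,
    user_input_tokens: int = 80,
    response_tokens: int = 300,
    days_per_month: int = 30,
) -> float:
    messages = messages_per_day * days_per_month
    cached_tokens = messages * system_prompt_tokens                   # billed at the discounted rate
    fresh_tokens = messages * (history_tokens + user_input_tokens)    # billed at the full input rate
    output_tokens = messages * response_tokens
    return (
        (cached_tokens / 1e6) * input_price_per_m * (1 - cached_read_discount)
        + (fresh_tokens / 1e6) * input_price_per_m
        + (output_tokens / 1e6) * output_price_per_m
    )
```

With a 90% read discount, the 1,500-token system prompt (roughly 58% of each request's input) is billed at a tenth of list price, which is where most of the one-third saving on Claude Haiku comes from.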
Common mistakes that blow up chatbot costs
- Sending the entire conversation history on every turn, unbounded. After 50 messages, you're sending a 10K-token prompt on every reply. Set a sliding-window limit (last 10 messages, or last 5K tokens of history); a trimming sketch follows this list.
- Not caching the system prompt. See above — single biggest cost lever.
- Using a flagship model for FAQ-style queries. 90% of customer-support chatbot queries are routine. Route them to a small model and escalate the 10% that need it.
- No output length cap. Without max_tokens, a verbose model can write 4,000 tokens when you wanted 200. Set the cap.
- No daily/monthly per-user quota. A handful of users can run up your bill. Cap at e.g. 100 messages/day per user.
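Here is a minimal sketch of the sliding-window trim from the first bullet. It assumes an OpenAI-style list of role/content message dicts and a crude 4-characters-per-token estimate; in production, count tokens with your model's actual tokenizer, and keep the system prompt out of the window so it stays cacheable:

```python
# Keep only the most recent turns, bounded by both message count and rough token size.
# The 4-chars-per-token estimate is a crude stand-in for a real tokenizer.

def trim_history(messages: list[dict], max_messages: int = 10, max_tokens: int = 5_000) -> list[dict]:
    trimmed, total = [], 0
    for msg in reversed(messages):               # walk newest-to-oldest
        est_tokens = len(msg["content"]) // 4
        if len(trimmed) >= max_messages or total + est_tokens > max_tokens:
            break
        trimmed.append(msg)
        total += est_tokens
    return list(reversed(trimmed))               # restore chronological order
```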
Ranking summary (cached cost, 5 messages/day workload)
The clear winners for chatbot use cases at scale:
- Ministral 3B — $0.0173/user/month
- GPT-4.1 Nano — $0.0227/user/month
- Llama 3.1 8B Instant (Groq) — $0.0229/user/month
- GPT-5 Nano — $0.0272/user/month
- Gemini 2.5 Flash-Lite — $0.0398/user/month
Don't pick on price alone — run a quality eval on your specific chatbot's expected conversations before committing. Even a 10% quality drop in a customer-support context costs more in escalations than it saves in API fees.
FAQ
What's the absolute cheapest LLM for a chatbot in 2026?
Per-user-per-month for a typical chatbot workload (5 messages/day, system prompt cached): Mistral's Ministral 3B is the cheapest outright at about $0.017/user/month even though it lacks caching, with GPT-4.1 Nano, Llama 3.1 8B on Groq, and GPT-5 Nano close behind in the $0.02–0.03 range. For most apps, Gemini 2.5 Flash-Lite or DeepSeek V4 Flash hit the sweet spot of cheap + capable + cached.
How much does ChatGPT cost per user per month if I built it?
Depending on model: roughly $0.03–$5/user/month for active users sending ~5 messages/day. Free OpenAI ChatGPT users likely cost OpenAI ~$0.20–$1.00/month (heavily subsidized by Plus subscribers). Self-built with GPT-5 Nano + caching: ~$0.03/user. Self-built with Claude Sonnet 4.6 + caching: ~$1.65/user.
Why does cached pricing matter so much for chatbots?
Chatbots have very stable system prompts (the personality, behavior rules, and tool definitions don't change between users). With caching, that 1.5K-token system prompt costs 10% of the normal input price on Anthropic (50% on OpenAI) after the first call. Multiply across millions of conversations and it's the difference between profitable and unprofitable.
Can I run a chatbot on Llama 3.1 8B for cheaper?
Yes, on Groq it's ~$0.05/M input and $0.08/M output — among the cheapest options. Quality is meaningfully lower than GPT-5 Mini or Gemini Flash, but for narrow chatbots (FAQ answering, simple lookup), the gap may be acceptable. Always run a quality eval on your actual conversations before deciding on cost grounds.
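Plugging those Groq rates into the same per-user arithmetic used for the table (token counts taken from the workload at the top of the page):

```python
# Llama 3.1 8B on Groq at this page's workload:
# 150 messages/month, 387,000 input tokens and 45,000 output tokens per user.
input_millions, output_millions = 0.387, 0.045
cost = input_millions * 0.05 + output_millions * 0.08   # $/M prices quoted above
print(f"${cost:.4f} per user per month")                # roughly $0.023, matching the table row
```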
How do I budget for unexpected user behavior?
Three buffers: (1) cap conversation length — drop oldest messages once context exceeds 5K tokens; (2) hard token limit on output (max_tokens=500 for chat); (3) per-user daily quota — most users send 0-5 messages/day; outliers send 200+. A 100-message/day cap protects you from the long tail.
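As an illustration of the third buffer, here is a minimal in-memory quota check; the cap value is just an example, and a real deployment would keep the counter in Redis or a database rather than in process memory:

```python
from collections import defaultdict
from datetime import date

DAILY_MESSAGE_CAP = 100   # protects against the long tail of heavy users

_daily_counts: dict[tuple[str, date], int] = defaultdict(int)

def allow_message(user_id: str) -> bool:
    """Return True if the user is still under today's quota, and count the message."""
    key = (user_id, date.today())
    if _daily_counts[key] >= DAILY_MESSAGE_CAP:
        return False          # reject, or fall back to a canned response
    _daily_counts[key] += 1
    return True
```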