What Are Input vs Output Tokens, and Why Are They Priced Differently?
Every AI API invoice has two line items: input tokens and output tokens. Output almost always costs more per token than input, typically 4 to 8 times as much. Here's the plain-English explanation of what tokens are, how the two types differ, and what it means for your monthly AI bill.
Input tokens are the words you send to the model (your prompt, context, documents). Output tokens are the words the model generates back. Output costs more because the model generates each output token sequentially, one full computation per token, while all input tokens are processed together in a single parallel pass.
What Is a Token?
Before exploring input vs output, it helps to understand what a token is. A token is the fundamental unit of data that Large Language Models process. Models don't read text letter by letter as humans do; they read sequences of tokens.
As a general rule of thumb for English text, 1 token is approximately 0.75 words (or about 4 characters). This means a 1,500-word blog post translates to roughly 2,000 tokens. Punctuation, spaces, and special coding characters often count as their own distinct tokens. Every API request is billed purely on how many tokens it consumes.
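A back-of-the-envelope sketch of the rule of thumb above (real counts depend on the model's tokenizer; this is only the ~0.75 words-per-token heuristic):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text: ~0.75 words per token."""
    words = len(text.split())
    return round(words / 0.75)

# A 1,500-word blog post lands near 2,000 tokens.
post = "word " * 1500
print(estimate_tokens(post))  # → 2000
```

For production billing estimates you would use the provider's actual tokenizer, since punctuation and code can shift the ratio considerably.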
Input Tokens Explained
Input tokens (sometimes called prompt tokens) are everything you send to the AI model in your API request.
This total includes your system prompt, the conversation history you resend with each turn, any Retrieval-Augmented Generation (RAG) documents you attach, and the user's query itself. When you make an API call, you pay for the model to "read" all of these input tokens in order to understand the context before it replies.
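As an illustration, here is how those pieces add up on a single request. The message contents are invented, and the counter uses the rough 4-characters-per-token rule rather than a real tokenizer:

```python
def rough_tokens(text: str) -> int:
    # ~4 characters per English token, per the rule of thumb above
    return max(1, len(text) // 4)

# Everything below is billed as input on this one request.
system_prompt = "You are a concise support assistant."
history = [
    "User: My invoice doubled last month.",
    "Assistant: Let's look at your usage together.",
]
rag_docs = ["Billing policy: overages are billed at the end of the cycle."]
user_query = "Why did my bill go up?"

parts = [system_prompt, *history, *rag_docs, user_query]
input_tokens = sum(rough_tokens(p) for p in parts)
print(input_tokens)
```

Note that the history is re-sent (and re-billed) on every turn, which is why long conversations get progressively more expensive on the input side.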
Output Tokens Explained
Output tokens (sometimes called completion tokens) are everything the AI model generates and sends back to you.
This is the actual answer, the summarised text, the generated code block, or the JSON object. You pay for the model to "write" these tokens. As you likely noticed on pricing pages, this "writing" process is significantly more expensive than the "reading" process.
Why Output Costs More: The Technical Reason
Processing input tokens is essentially a highly parallelised "reading" operation. GPU architectures can ingest and map the relationships of thousands of tokens simultaneously. This parallel efficiency is exactly why input tokens are cheap.
The key difference is that generation is an autoregressive process. The model must predict the next token, append it to the context, and then predict the next one.
Because output generation is sequential, it cannot be parallelised the way reading input can. The GPU must wait for token N before it can compute token N+1. This bottleneck demands significantly more compute time and energy per token, which explains the typical 4–8x price premium.
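The loop below is a toy sketch of that autoregressive process, with a stand-in `predict_next` function in place of a real model forward pass: the whole prompt is ingested at once, but each new token requires its own full computation over everything generated so far.

```python
def predict_next(context: list[str]) -> str:
    # Stand-in for one full forward pass of the model.
    return "tok" + str(len(context))

def generate(prompt_tokens: list[str], max_new: int) -> list[str]:
    context = list(prompt_tokens)      # entire prompt read in one parallel pass
    for _ in range(max_new):
        nxt = predict_next(context)    # one full computation per output token
        context.append(nxt)            # token N must exist before token N+1
    return context[len(prompt_tokens):]

print(generate(["Hello", ","], 3))  # three sequential passes for three tokens
```

The `for` loop is the cost driver: there is no way to compute the third token before the second exists, so generation time (and price) scales with output length.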
The Input-to-Output Price Ratio Across Models
Every major AI provider prices output tokens at a premium. The ratio ranges from 1.5x (DeepSeek V3.2) to 8x (GPT-5). See our full GPT-5 vs DeepSeek V3 pricing comparison for model-by-model details.
DeepSeek V3.2's unusually low output:input ratio (1.5x vs the industry norm of 4–8x) makes it especially cost-effective for generation-heavy tasks like writing, code generation, and long-form summarisation.
What This Means for Your Costs
The input/output split in your workload has a larger impact on your bill than the model you choose. A workflow that reads 10,000 tokens and outputs 50 words is almost entirely input-driven; one that outputs long essays from short prompts is almost entirely output-driven.
Typical Input:Output Ratios by Task Type
- Classification / sentiment tagging: ~100:1 (read 1,000 tokens, output a one-word label)
- Document summarisation: ~10:1 (read 1,000 tokens, output 100)
- RAG-based Q&A: ~4:1 (800 tokens of retrieved context, ~200 tokens of output)
- Customer support chatbot: ~2:1 (short user message plus history, short reply)
- Code generation / creative writing: ~1:3 or 1:4 (short prompt, long generated output)
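To see how the split drives cost, here is a sketch using illustrative prices ($1/M input, $5/M output are placeholders, not any provider's actual rates). Two requests with the same total token count produce very different bills:

```python
PRICE_IN, PRICE_OUT = 1.00, 5.00  # illustrative $ per million tokens

def request_cost(input_toks: int, output_toks: int) -> float:
    """Cost of one API request at the illustrative rates above."""
    return input_toks / 1e6 * PRICE_IN + output_toks / 1e6 * PRICE_OUT

# 1,200 total tokens either way, but the split changes the bill.
print(f"classification:  ${request_cost(1190, 10):.6f}")
print(f"code generation: ${request_cost(300, 900):.6f}")
```

The generation-heavy request costs several times more despite touching the same number of tokens, which is the practical takeaway from the ratios above.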
Since output tokens dominate cost in generation-heavy workflows, adding output-length constraints to your prompts directly reduces your bill. Instructions like "Reply in under 150 words" or "Use bullet points only" can cut output token volume by 40–60%. On GPT-5 at $10/M output tokens, trimming 1,000 output tokens per run across 10,000 monthly runs saves $100/month. Use the Burn Rate Calculator to model the impact.
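The savings arithmetic is simple enough to fold into a one-line helper (a sketch; plug in your own run counts and your model's actual output rate):

```python
def monthly_savings(tokens_saved_per_run: int, runs_per_month: int,
                    out_price_per_m: float) -> float:
    """Dollars saved per month by trimming output tokens from each run."""
    return tokens_saved_per_run * runs_per_month / 1e6 * out_price_per_m

# 1,000 output tokens trimmed per run, 10,000 runs/month, $10/M output
print(monthly_savings(1_000, 10_000, 10.0))  # → 100.0
```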
Context Caching: Reducing Input Costs
Context caching stores a repeated prompt prefix at a heavily discounted rate. Both DeepSeek and OpenAI apply it automatically when they detect a repeated prefix; no SDK changes are required. For apps that prepend the same 2,000-token system prompt to every API call, caching can reduce effective input costs by 75–90%.
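A sketch of that effect, assuming a hypothetical 90% discount on cached prefix tokens (actual discount rates and cache rules vary by provider):

```python
def effective_input_cost(prefix_toks: int, fresh_toks: int,
                         price_per_m: float, cache_discount: float = 0.90) -> float:
    """Input cost when the prompt prefix is served from cache."""
    cached = prefix_toks / 1e6 * price_per_m * (1 - cache_discount)
    fresh = fresh_toks / 1e6 * price_per_m
    return cached + fresh

# 2,000-token system prompt cached, 200-token user message fresh, $1/M input
full = (2000 + 200) / 1e6 * 1.0
hit = effective_input_cost(2000, 200, 1.0)
print(f"cached: ${hit:.6f} vs uncached: ${full:.6f}")
print(f"input cost reduced by {1 - hit / full:.0%}")
```

The bigger the shared prefix relative to the per-request suffix, the closer the saving gets to the discount rate itself.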
See Tokens in Context
Visualise what 2,000 tokens looks like in real-world terms (tweets, books, codebases) and compare live costs across models for your exact workload.
Open Token Visualizer →