Fundamentals · Feb 2026 · 6 min read

What Are Input vs Output Tokens — and Why Are They Priced Differently?

Every AI API invoice has two line items: input tokens and output tokens. Output almost always costs more per token than input, typically 4–8x. Here's the plain-English explanation of what tokens are, how the two types differ, and what it means for your monthly AI bill.

⚡ Quick Answer

Input tokens are the words you send to the model (your prompt, context, documents). Output tokens are the words the model generates back. Output costs more because the model generates each output token sequentially — one full computation per token — while all input tokens are processed together in a single parallel pass.

What Is a Token?

A token is the smallest unit of text that an AI language model processes. It's not exactly a word — it's closer to a syllable or word fragment. The word "unhelpful" might split into un, help, and ful — three tokens. As a practical rule for English text: 1,000 tokens ≈ 750 words ≈ 2 A4 pages of dense prose.
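
You can check real token counts yourself. Below is a minimal sketch using OpenAI's open-source tiktoken library; cl100k_base is one common encoding, but each model ships its own tokenizer, so the exact split of "unhelpful" may differ from the un/help/ful illustration above.

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one common OpenAI encoding

ids = enc.encode("unhelpful")
pieces = [enc.decode([i]) for i in ids]

print(ids)     # the integer token ids the model actually sees
print(pieces)  # the text fragments those ids map back to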

Common objects mapped to approximate token counts:

| Object | Approx. Tokens | Notes |
|---|---|---|
| Single tweet (280 chars) | ~70 | Varies by content |
| 1 A4 page of text | ~500 | Dense prose |
| 1,500-word blog post | ~2,000 | Typical article |
| 10,000 lines of Python code | ~80,000 | Medium codebase |
| Harry Potter Book 1 | ~128,000 | ~77K words |
| GPT-5 full context window | 400,000 | ~3 full novels |

Input Tokens Explained

Input tokens are everything you send to the model before it starts generating a reply: your system prompt (200–2,000 tokens typically), conversation history, injected context from RAG, and the user's message. The model reads all input tokens in parallel — a single forward pass processes every input token simultaneously, which is why they're priced lower. Use the AISpend Burn Rate Calculator to see how input volume affects your monthly bill.
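
As a concrete picture, here is a sketch of what one chat-style request body looks like; everything in it bills as input. The field names follow the common OpenAI-style chat format, and the token figures are assumptions for illustration.

# Everything sent to the model bills as INPUT tokens.
# (OpenAI-style chat format; token counts are illustrative assumptions.)
system_prompt = "You are a support assistant. Follow policy X..."  # ~500 tokens, re-sent on every call
history = [
    {"role": "user", "content": "Earlier question"},     # prior turns are
    {"role": "assistant", "content": "Earlier answer"},  # re-billed as input too
]
rag_context = "Top 3 documents retrieved for this query..."  # injected RAG context

messages = [{"role": "system", "content": system_prompt}] + history + [
    {"role": "user", "content": rag_context + "\n\nWhat is your refund policy?"}
]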

Output Tokens Explained

Output tokens are the model's reply. Unlike input, output generation is fundamentally sequential and autoregressive: the model generates token 2 only after token 1 exists, and so on until it produces an end-of-sequence signal. A response of 500 tokens requires 500 separate model computations — each depending on the previous one, so they cannot be parallelised. That sequential GPU demand is why output tokens cost more.

Why Output Costs More: The Technical Reason

During input processing (the "prefill" phase), all tokens are handled in one parallelised matrix operation. During output generation (the "decode" phase), each token requires its own full model pass — the autoregressive decode loop.

# INPUT ("prefill"): all 2,000 prompt tokens go through ONE parallel pass
context = model.prefill(prompt_tokens)            # ~1 forward pass total

# OUTPUT ("decode"): each of the 500 reply tokens needs its OWN pass
token_1 = model.decode(context)
token_2 = model.decode(context + [token_1])
token_3 = model.decode(context + [token_1, token_2])
# ... ~500 sequential passes, until the model emits an end-of-sequence token

The practical result: 500 output tokens require roughly as much GPU time as several thousand input tokens, which is why providers price output at a 4–8x premium.

The Input-to-Output Price Ratio Across Models

Every major AI provider prices output tokens at a premium. The ratio ranges from 1.5x (DeepSeek V3.2) to 8x (GPT-5). See our full GPT-5 vs DeepSeek V3 pricing comparison for model-by-model details.

| Model | Input / 1M | Output / 1M | Output:Input Ratio | Best For |
|---|---|---|---|---|
| DeepSeek V3.2 | $0.28 | $0.42 | 1.5x | Generation-heavy tasks |
| GPT-5 Mini | $0.40 | $1.60 | 4x | Balanced workloads |
| GPT-5 | $1.25 | $10.00 | 8x | Long context, agents |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 5x | Complex reasoning |
| GPT-5.2 Pro | $21.00 | $168.00 | 8x | Flagship / research |

DeepSeek V3.2's unusually low output:input ratio (1.5x vs the industry norm of 4–8x) makes it especially cost-effective for generation-heavy tasks like writing, code generation, and long-form summarisation.

What This Means for Your Costs

The input/output split in your workload often has a larger impact on your bill than the model you choose. A workflow that reads 10,000 tokens and emits a 50-token answer is almost entirely input-driven; one that writes long essays from short prompts is almost entirely output-driven.
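
A small worked example makes the split concrete. The sketch below uses the GPT-5 list prices from the table above; the two workloads are hypothetical.

# GPT-5 list prices from the table above (USD per 1M tokens)
INPUT_PRICE, OUTPUT_PRICE = 1.25, 10.00

def cost(input_tokens, output_tokens):
    return input_tokens / 1e6 * INPUT_PRICE + output_tokens / 1e6 * OUTPUT_PRICE

# Input-heavy: read a 10,000-token document, emit a 50-token answer
print(f"${cost(10_000, 50):.4f}")   # ~$0.0130, almost all of it input

# Output-heavy: 100-token prompt, 2,000-token essay
print(f"${cost(100, 2_000):.4f}")   # ~$0.0201, almost all of it output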

Typical Input:Output Ratios by Task Type

  • Classification / sentiment tagging — ~100:1 (read 1,000 tokens, output a one-word label)
  • Document summarisation — ~10:1 (read 1,000 tokens, output 100)
  • RAG-based Q&A — ~4:1 (800 tokens of retrieved context, 200 tokens of answer)
  • Customer support chatbot — ~2:1 (short user message plus history, medium-length reply)
  • Code generation / creative writing — ~1:3 or 1:4 (short prompt, long generated output)

🔴 Cost Optimisation Tip

Since output tokens dominate cost in generation-heavy workflows, adding output length constraints to your prompts directly reduces your bill. Instructions like "Reply in under 150 words" or "Use bullet points only" can cut output token volume by 40–60%. On GPT-5 at $10/M output tokens, saving 1,000 tokens on each of 10,000 monthly runs saves 10M tokens, or $100/month. Use the Burn Rate Calculator to model the impact.
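
In practice that means pairing a prompt-level instruction with the API's hard output cap. A sketch, assuming an OpenAI-style request shape; the cap parameter is named max_tokens in some APIs and max_output_tokens in others, so check your provider's docs.

# Two complementary ways to cap output spend
request = {
    "model": "gpt-5",
    "messages": [
        {"role": "system", "content": "Reply in under 150 words. Use bullet points only."},
        {"role": "user", "content": "Summarise our Q3 incident report."},
    ],
    "max_tokens": 300,  # hard server-side ceiling; name varies by provider/API version
}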

Context Caching: Reducing Input Costs

Context caching stores a repeated prompt prefix at a heavily discounted rate. Both DeepSeek and OpenAI apply it automatically when they detect a repeated prefix — no SDK changes required. For apps that prepend the same 2,000-token system prompt to every API call, caching can reduce effective input costs by 75–90%.

| Provider | Standard Input / 1M | Cached Input / 1M | Cache Discount |
|---|---|---|---|
| DeepSeek V3.2 | $0.28 | $0.028 | 10x cheaper |
| OpenAI GPT-5 | $1.25 | $0.31 | 4x cheaper |
| OpenAI GPT-5 Mini | $0.40 | $0.10 | 4x cheaper |
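
The effective saving depends on your cache hit rate. A back-of-envelope sketch using the GPT-5 figures from the table above; the 90% hit rate is an assumption for illustration.

# Effective input price under caching (GPT-5 prices from the table above)
standard, cached = 1.25, 0.31      # USD per 1M input tokens
hit_rate = 0.90                    # assumed fraction of input tokens served from cache

effective = hit_rate * cached + (1 - hit_rate) * standard
print(f"${effective:.3f}/M input")  # ~$0.404/M, roughly 68% below the standard rate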

Frequently Asked Questions

What is the difference between input tokens and output tokens?

Input tokens are the text you send to the model — system prompt, conversation history, injected context, and the user message. Output tokens are the text the model generates in reply. Both are billed separately at different rates, with output typically 4–8x more expensive due to sequential computation requirements.

Why do output tokens cost more than input tokens?

Output tokens are generated sequentially — the model runs a full neural network forward pass for each token, and each pass depends on the previous one, so they cannot be parallelised. Input tokens are processed in one parallel pass. Generating 500 output tokens requires 500 separate GPU computations vs roughly 1 for 500 input tokens — hence the 4–8x premium.

How many tokens is 1,000 words?

Approximately 1,333 tokens for 1,000 English words, based on the rule of 1 token ≈ 0.75 words (~4 characters). A 1,500-word blog post is roughly 2,000 tokens. Actual counts depend on the model's tokenizer, language, and formatting.

What is context caching in AI APIs?

Context caching stores a repeated prompt prefix (system prompt, document, standard instructions) so subsequent API calls reuse it at a much lower rate. DeepSeek V3.2: $0.028/M cached input (10x cheaper than standard). OpenAI GPT-5: $0.31/M cached (4x cheaper). Caching only reduces input costs, not output.

How can I reduce my output token costs?

Add output length constraints to your prompts ("Reply in under 150 words", "Use bullet points only"). Choose models with a lower output:input ratio — DeepSeek V3.2's output rate is only 1.5x its input rate vs GPT-5's 8x. Route classification and short-answer tasks to smaller, cheaper models.

🧮 See Tokens in Context

Visualise what 2,000 tokens looks like in real-world terms — tweets, books, codebases — and compare live costs across models for your exact workload.

Open Token Visualizer →

Pricing data as of 19 Feb 2026. Token-to-word ratios are approximations for English text.