Why Claude Code Costs So Much
You're probably spending 2-3x more than you need to. Here's where the money goes and what to do about it. Special section for Claude Code Max & Pro subscribers.
~8 min readThe $600/month Problem
If you're using Claude Code daily, you've probably noticed: the bill adds up fast. A normal day of coding with Sonnet can cost $4-12. Heavy refactoring or agent-heavy sessions can hit $25+. Multiply by a month, and you're looking at $100-600 depending on how you work.
The frustrating part? Most of that cost is waste. You're paying full price for the same text — your project context, your settings, your tool definitions — on every single request. The API supports ways to avoid this, but they require specific headers and configurations that Claude Code doesn't set up for you.
| Scenario | What you pay today | What you should pay |
|---|---|---|
| Light day (50 requests) | ~$2.00 | ~$0.70 |
| Normal day (150 requests) | ~$4.50 | ~$1.50 |
| Heavy refactoring | ~$12.00 | ~$4.00 |
| Agent-heavy session | ~$25.00 | ~$10.00 |
The gap between those two columns? That's money you don't have to spend.
Where It Actually Goes
Every time you send a message in Claude Code, the API receives much more than your message. It receives your entire project context, conversation history, tool definitions, and any files Claude has read — all over again. You pay for every token, every time.
The biggest cost drivers
- Repeated project context — your CLAUDE.md and settings are sent with every request. If your project context is 2,000 tokens and you make 200 requests/day, that's 400,000 tokens of identical text you're paying full price for.
- Conversation history snowball — by turn 20, you're sending 50,000+ tokens of old messages. By turn 30, it's 140,000+. Each turn is more expensive than the last.
- MCP tool definitions — every MCP server you have installed contributes its full schema on every request. A few servers can add 50,000 tokens before Claude reads a single word you typed.
- Agent multiplication — sub-agents each start fresh with the full context. Three agents doing research = 3x the cost of one request.
Up to 80% of your input tokens on every request are identical repeated text. The API has ways to handle this cheaply, but the setup is complex and error-prone. Most developers never bother — and overpay by 2-3x as a result.
What You Can Do Today (Free)
These habits reduce your bill immediately, no tools needed:
- Use /clear between tasks — the single biggest saving. A fresh conversation costs a fraction of a 30-turn one. Switch topics? Clear first.
- Point Claude to specific files — "look at src/auth.ts" costs one file. "look at my code" costs ten. Be specific.
- Keep your CLAUDE.md lean — every word is sent on every request. If Claude would behave the same without a line, delete it.
- Remove MCP servers you don't use daily — they add token overhead to every request even when you don't call them.
- Ask for a plan before asking for code — reviewing a $0.05 plan is cheaper than undoing a $2.00 implementation that went the wrong way.
Good habits can save 15-25%. But you have to remember to do them every time, and they don't address the core issue: repeated context being billed at full price.
What You Can't Do Manually
The API supports optimizations that can cut your bill by 40-70%. But they require intercepting every request, modifying headers, analyzing prompt content, and making routing decisions in real time. This isn't something you can do from Claude Code's settings.
Anthropic's prompt cache is real and powerful: cached tokens cost 0.1× the normal rate. But to use it, every request must include explicit cache_control markers in the right places. Without them, nothing gets cached. And even when headers are present, the cache expires after 5 minutes of idle and resets whenever you switch models — including switching between Claude 3 tiers. In practice this means every coffee break, every /model command, and every new terminal session starts from a cold cache, billing full price for your CLAUDE.md and system prompt again.
Prefex injects the headers, extends cache lifetime to 1 hour, keeps the cache warm between your requests, and routes with the active model in mind so a cheap-task shortcut doesn't blow up an expensive cached prefix.
The optimization stack
| Tier | Feature | What it does | Typical savings |
|---|---|---|---|
| Cost Reduction | Prompt Cache | Anthropic has built-in prompt caching, but it requires explicit cache_control headers and expires after 5 min of idle. Prefex injects the headers, extends TTL to 1 hour, and keeps the cache warm automatically. | 30–70% |
| Smart Router | Sends simple requests to Haiku automatically; routes complex ones to Sonnet/Opus | 10–30% | |
| Compression | Strips verbose CLI output (git status, ls, stack traces) before it hits the API | 5–20% | |
| Context Management | Context Recall | Saves conversation history locally; resume sessions without re-sending full context | 5–15% |
| Write Gate | Prevents expensive turns from being written to session memory | varies | |
| Context Guard | Auto-prunes old messages when context approaches 75% fill | 5–15% | |
| Checkpoint | Summarizes and evicts old task brackets on long sessions | 30–50% on long sessions | |
| Guardrails | Loop Guard | Alerts when session spend is high; blocks requests at a configurable threshold | prevents runaway spend |
| Token Trap | Detects loop patterns and repeated errors before they compound | prevents runaway spend | |
| Think Cap | Controls extended thinking budget — enables deeper reasoning at 3× cost when useful | — | |
| Lean Reasoning | Instructs the model to reason briefly; saves 10–20% on explanation tokens | 10–20% |
These optimizations compound. Prompt caching alone saves 30–70%. Add routing and compression and the cost-reduction tier reaches 40–70% on its own. Context management extends how long sessions run within subscription limits. Guardrails prevent runaway spend on loops and expensive tasks.
Setting this up yourself means building a local proxy, parsing every request body, detecting cacheable prefixes, injecting the right headers, maintaining a session store, and building routing logic — all while ensuring zero downtime and fail-open behavior. If any of it breaks, your API calls fail. That's a full engineering project, not a weekend hack.
Claude Code Max & Pro: Extended Session Runtime
If you're on Claude Code Max or Pro (subscription plans with fixed token limits), Prefex solves a different but equally critical problem: extending how long your sessions can run.
The subscription token limit problem
- Max 5 — ~88,000 tokens per 5-hour window
- Max 20 — ~220,000 tokens per 5-hour window
- Weekly limits — cap resets weekly (exact number undisclosed)
- Everything counts — Claude Code + claude.ai + Claude Desktop all share the same pool
The problem: your CLAUDE.md and project context get counted fresh on every request. If your system prompt is 100K tokens and you make 50 requests in a session, you've already used 5M tokens just on repeated context — well over your 5-hour window before you've written a line of code.
How Prefex extends your runway
By caching your system prompt and routing simple requests to cheaper inference paths, Prefex dramatically reduces tokens billed against your subscription quota:
| Scenario | Without Prefex | With Prefex | Extra requests |
|---|---|---|---|
| 50 requests (100K system prompt) | 5M tokens (blocked after 10 min) | 0.6M tokens (full day possible) | +40 extra requests |
| 200 requests (large CLAUDE.md) | 20M tokens (blocked after 2 min) | 1.8M tokens (full session) | +180 extra requests |
Real example from a beta tester: After adding Prefex, this developer could sustain an 8-hour coding session without hitting the weekly limit. Previously, they hit the ceiling after 2 hours. Same subscription tier, same work — just smarter token usage.
Why subscription users benefit most
Unlike API plan users (who pay per token), subscription users have a fixed budget that runs out. Cost optimization isn't about saving money — it's about staying productive within your allocated window.
- Prompt cache — Anthropic's cache saves 90% on re-sent context, but expires after 5 minutes idle and resets if you switch models. Prefex keeps it perpetually warm.
- Routing — simple tasks (formatting, refactoring, questions) route to cheaper inference paths automatically
- Session management — redundant conversation history is trimmed, reducing bloat over long sessions
Combined, these optimizations typically extend session runtime by 300-500% before hitting limits.
You don't care about saving money — your fee is fixed. You care about how much work you can get done before hitting your quota. Prefex takes you from 1-2 hour sessions to full 8-hour days.
Real Before/After Numbers
These are from actual Claude Code sessions running through Prefex, measured via the local dashboard:
Example: 8-hour coding day
| Metric | Without Prefex | With Prefex |
|---|---|---|
| Requests | 180 | 180 |
| Input tokens | 2.4M | 2.4M (sent) / 0.8M (billed) |
| Cache hit rate | 0% | 88% |
| Routed to cheaper model | 0% | 34% |
| Total cost | $8.40 | $2.90 |
| Saved | $5.50 (65%) |
Monthly impact
| Developer profile | Monthly without | Monthly with | Annual savings |
|---|---|---|---|
| Part-time (3 days/week) | $60 | $21 | ~$470/year |
| Full-time solo dev | $200 | $70 | ~$1,560/year |
| Team of 5 | $1,000 | $350 | ~$7,800/year |
How Prefex Works
Prefex is a ~20MB binary that runs on your machine. You change one setting in Claude Code to route through it. From that point on, every API request passes through Prefex before reaching Anthropic.
// ~/.claude/settings.json — the only change
{ "env": { "ANTHROPIC_BASE_URL": "http://localhost:8019" } }
Prefex inspects each request, applies optimizations, and forwards it to Anthropic. If anything goes wrong internally, the request goes straight to Anthropic unchanged — your work is never interrupted.
What happens on each request
- Detects the client type (Claude Code, API, agent) and applies the right feature set
- Loads any prior session context from the local store
- Injects
cache_controlheaders so Anthropic caches the prefix at 1h TTL (vs the default 5-minute idle expiry); routes requests to preserve the cached model rather than invalidate it on a model switch — 0.1× cost on a hit - Evaluates complexity and routes simple requests to the cheaper model
- Compresses verbose CLI output before it reaches the model
- Forwards the request to Anthropic and streams the response back
- Logs tokens, cost, savings, and routing decision to local SQLite
Total overhead: 8–12ms. Your workflow doesn't change. Your API keys stay on your machine. Nothing is sent anywhere except Anthropic's API.
The dashboard
Open localhost:8019/dashboard to see real-time savings, per-project costs, cache hit rates, and routing decisions. It's the fastest way to understand where your money goes and confirm optimizations are working.
Try It (60 Seconds)
One script. Downloads the binary, starts it, connects Claude Code. Savings begin on the next request.
curl -fsSL https://prefex.vercel.app/install.sh -o install.sh bash install.sh
Then open your dashboard:
open http://localhost:8019/dashboard
- Free to use — 30-day license, renewable online. No credit card.
- Privacy — runs on localhost, no telemetry, prompts never logged. Full details →
- Reversible — uninstall removes everything and restores your original settings.
After installing, make a few Claude Code requests and check your dashboard. You'll see cache hits climbing and cost per request dropping within minutes. The community leaderboard shows what other developers are saving.