Prefex — Why Claude Code Costs So Much (and How to Fix It)

01

The $600/month Problem

If you're using Claude Code daily, you've probably noticed: the bill adds up fast. A normal day of coding with Sonnet can cost $4-12. Heavy refactoring or agent-heavy sessions can hit $25+. Multiply by a month, and you're looking at $100-600 depending on how you work.

The frustrating part? Most of that cost is waste. You're paying full price for the same text — your project context, your settings, your tool definitions — on every single request. The API supports ways to avoid this, but they require specific headers and configurations that Claude Code doesn't set up for you.

Scenario	What you pay today	What you should pay
Light day (50 requests)	~$2.00	~$0.70
Normal day (150 requests)	~$4.50	~$1.50
Heavy refactoring	~$12.00	~$4.00
Agent-heavy session	~$25.00	~$10.00

The gap between those two columns? That's money you don't have to spend.

02

Where It Actually Goes

Every time you send a message in Claude Code, the API receives much more than your message. It receives your entire project context, conversation history, tool definitions, and any files Claude has read — all over again. You pay for every token, every time.

The biggest cost drivers

Repeated project context — your CLAUDE.md and settings are sent with every request. If your project context is 2,000 tokens and you make 200 requests/day, that's 400,000 tokens of identical text you're paying full price for.
Conversation history snowball — by turn 20, you're sending 50,000+ tokens of old messages. By turn 30, it's 140,000+. Each turn is more expensive than the last.
MCP tool definitions — every MCP server you have installed contributes its full schema on every request. A few servers can add 50,000 tokens before Claude reads a single word you typed.
Agent multiplication — sub-agents each start fresh with the full context. Three agents doing research = 3x the cost of one request.

The real issue

Up to 80% of your input tokens on every request are identical repeated text. The API has ways to handle this cheaply, but the setup is complex and error-prone. Most developers never bother — and overpay by 2-3x as a result.

03

What You Can Do Today (Free)

These habits reduce your bill immediately, no tools needed:

Use /clear between tasks — the single biggest saving. A fresh conversation costs a fraction of a 30-turn one. Switch topics? Clear first.
Point Claude to specific files — "look at src/auth.ts" costs one file. "look at my code" costs ten. Be specific.
Keep your CLAUDE.md lean — every word is sent on every request. If Claude would behave the same without a line, delete it.
Remove MCP servers you don't use daily — they add token overhead to every request even when you don't call them.
Ask for a plan before asking for code — reviewing a $0.05 plan is cheaper than undoing a $2.00 implementation that went the wrong way.

These help, but they're manual

Good habits can save 15-25%. But you have to remember to do them every time, and they don't address the core issue: repeated context being billed at full price.

04

What You Can't Do Manually

The API supports optimizations that can cut your bill by 40-70%. But they require intercepting every request, modifying headers, analyzing prompt content, and making routing decisions in real time. This isn't something you can do from Claude Code's settings.

Claude has a built-in cache — but it needs active management

Anthropic's prompt cache is real and powerful: cached tokens cost 0.1× the normal rate. But to use it, every request must include explicit cache_control markers in the right places. Without them, nothing gets cached. And even when headers are present, the cache expires after 5 minutes of idle and resets whenever you switch models — including switching between Claude 3 tiers. In practice this means every coffee break, every /model command, and every new terminal session starts from a cold cache, billing full price for your CLAUDE.md and system prompt again.

Prefex injects the headers, extends cache lifetime to 1 hour, keeps the cache warm between your requests, and routes with the active model in mind so a cheap-task shortcut doesn't blow up an expensive cached prefix.

The optimization stack

Tier	Feature	What it does	Typical savings
Cost Reduction	Prompt Cache	Anthropic has built-in prompt caching, but it requires explicit `cache_control` headers and expires after 5 min of idle. Prefex injects the headers, extends TTL to 1 hour, and keeps the cache warm automatically.	30–70%
	Smart Router	Sends simple requests to Haiku automatically; routes complex ones to Sonnet/Opus	10–30%
	Compression	Strips verbose CLI output (git status, ls, stack traces) before it hits the API	5–20%
Context Management	Context Recall	Saves conversation history locally; resume sessions without re-sending full context	5–15%
	Write Gate	Prevents expensive turns from being written to session memory	varies
	Context Guard	Auto-prunes old messages when context approaches 75% fill	5–15%
	Checkpoint	Summarizes and evicts old task brackets on long sessions	30–50% on long sessions
Guardrails	Loop Guard	Alerts when session spend is high; blocks requests at a configurable threshold	prevents runaway spend
	Token Trap	Detects loop patterns and repeated errors before they compound	prevents runaway spend
	Think Cap	Controls extended thinking budget — enables deeper reasoning at 3× cost when useful	—
	Lean Reasoning	Instructs the model to reason briefly; saves 10–20% on explanation tokens	10–20%

These optimizations compound. Prompt caching alone saves 30–70%. Add routing and compression and the cost-reduction tier reaches 40–70% on its own. Context management extends how long sessions run within subscription limits. Guardrails prevent runaway spend on loops and expensive tasks.

Why this is hard to DIY

Setting this up yourself means building a local proxy, parsing every request body, detecting cacheable prefixes, injecting the right headers, maintaining a session store, and building routing logic — all while ensuring zero downtime and fail-open behavior. If any of it breaks, your API calls fail. That's a full engineering project, not a weekend hack.

05

Claude Code Max & Pro: Extended Session Runtime

If you're on Claude Code Max or Pro (subscription plans with fixed token limits), Prefex solves a different but equally critical problem: extending how long your sessions can run.

The subscription token limit problem

Max 5 — ~88,000 tokens per 5-hour window
Max 20 — ~220,000 tokens per 5-hour window
Weekly limits — cap resets weekly (exact number undisclosed)
Everything counts — Claude Code + claude.ai + Claude Desktop all share the same pool

The problem: your CLAUDE.md and project context get counted fresh on every request. If your system prompt is 100K tokens and you make 50 requests in a session, you've already used 5M tokens just on repeated context — well over your 5-hour window before you've written a line of code.

How Prefex extends your runway

By caching your system prompt and routing simple requests to cheaper inference paths, Prefex dramatically reduces tokens billed against your subscription quota:

Scenario	Without Prefex	With Prefex	Extra requests
50 requests (100K system prompt)	5M tokens (blocked after 10 min)	0.6M tokens (full day possible)	+40 extra requests
200 requests (large CLAUDE.md)	20M tokens (blocked after 2 min)	1.8M tokens (full session)	+180 extra requests

Real example from a beta tester: After adding Prefex, this developer could sustain an 8-hour coding session without hitting the weekly limit. Previously, they hit the ceiling after 2 hours. Same subscription tier, same work — just smarter token usage.

Why subscription users benefit most

Unlike API plan users (who pay per token), subscription users have a fixed budget that runs out. Cost optimization isn't about saving money — it's about staying productive within your allocated window.

Prompt cache — Anthropic's cache saves 90% on re-sent context, but expires after 5 minutes idle and resets if you switch models. Prefex keeps it perpetually warm.
Routing — simple tasks (formatting, refactoring, questions) route to cheaper inference paths automatically
Session management — redundant conversation history is trimmed, reducing bloat over long sessions

Combined, these optimizations typically extend session runtime by 300-500% before hitting limits.

Subscription users: this is your killer feature

You don't care about saving money — your fee is fixed. You care about how much work you can get done before hitting your quota. Prefex takes you from 1-2 hour sessions to full 8-hour days.

06

Real Before/After Numbers

These are from actual Claude Code sessions running through Prefex, measured via the local dashboard:

Example: 8-hour coding day

Metric	Without Prefex	With Prefex
Requests	180	180
Input tokens	2.4M	2.4M (sent) / 0.8M (billed)
Cache hit rate	0%	88%
Routed to cheaper model	0%	34%
Total cost	$8.40	$2.90
Saved		$5.50 (65%)

Monthly impact

Developer profile	Monthly without	Monthly with	Annual savings
Part-time (3 days/week)	$60	$21	~$470/year
Full-time solo dev	$200	$70	~$1,560/year
Team of 5	$1,000	$350	~$7,800/year

07

How Prefex Works

Prefex is a ~20MB binary that runs on your machine. You change one setting in Claude Code to route through it. From that point on, every API request passes through Prefex before reaching Anthropic.

// ~/.claude/settings.json — the only change
{ "env": { "ANTHROPIC_BASE_URL": "http://localhost:8019" } }

Prefex inspects each request, applies optimizations, and forwards it to Anthropic. If anything goes wrong internally, the request goes straight to Anthropic unchanged — your work is never interrupted.

What happens on each request

Detects the client type (Claude Code, API, agent) and applies the right feature set
Loads any prior session context from the local store
Injects cache_control headers so Anthropic caches the prefix at 1h TTL (vs the default 5-minute idle expiry); routes requests to preserve the cached model rather than invalidate it on a model switch — 0.1× cost on a hit
Evaluates complexity and routes simple requests to the cheaper model
Compresses verbose CLI output before it reaches the model
Forwards the request to Anthropic and streams the response back
Logs tokens, cost, savings, and routing decision to local SQLite

Total overhead: 8–12ms. Your workflow doesn't change. Your API keys stay on your machine. Nothing is sent anywhere except Anthropic's API.

The dashboard

Open localhost:8019/dashboard to see real-time savings, per-project costs, cache hit rates, and routing decisions. It's the fastest way to understand where your money goes and confirm optimizations are working.

08

Try It (60 Seconds)

One script. Downloads the binary, starts it, connects Claude Code. Savings begin on the next request.

curl -fsSL https://prefex.vercel.app/install.sh -o install.sh
bash install.sh

Then open your dashboard:

open http://localhost:8019/dashboard

Free to use — 30-day license, renewable online. No credit card.
Privacy — runs on localhost, no telemetry, prompts never logged. Full details →
Reversible — uninstall removes everything and restores your original settings.

See it working

After installing, make a few Claude Code requests and check your dashboard. You'll see cache hits climbing and cost per request dropping within minutes. The community leaderboard shows what other developers are saving.