Around March 22–23, reports started flooding GitHub and Reddit: Claude Code was eating through Max plan session limits at a pace that made no sense. One issue thread described Max 5x users hitting their 5-hour session window in 90 minutes with the same workload that had been fine the day before. A Max 20x subscriber watched their usage counter jump from 21% to 100% on a single prompt. Anthropic eventually responded — but not with a bug fix. Team member Thariq Shihipar posted on X that the company was adjusting peak-hour session limits to manage demand, with about 7% of users seeing real impact. A reasonable explanation, except it didn’t fully account for the magnitude and suddenness of what people were experiencing.
The fuller explanation emerged a week later, in a GitHub issue filed March 29 by a developer named jmarianski who had done something unusual: he instrumented his Claude Code sessions to log every API call’s token breakdown — cache_read, cache_creation, input, output — timestamped to the second.
What he found is worth understanding in detail.
How Claude Code’s caching is supposed to work
Anthropic’s prompt caching works by letting you mark stable content with a cache_control breakpoint. On subsequent requests, if the prefix up to that breakpoint matches a previously cached version (within a 5-minute TTL by default, or 1-hour if you pay for it), the API serves those tokens from cache at roughly 10% of the normal input token cost. Cache reads are also excluded from ITPM rate limit calculations on most models, which compounds the benefit.
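A minimal sketch of what a cache-enabled request body looks like (the model name is assumed for illustration; the `cache_control` block shape follows Anthropic's documented Messages API):

```python
# Sketch of a Messages API request body with a prompt-cache breakpoint.
# The block carrying "cache_control" marks the end of the cacheable prefix;
# everything up to and including it is hashed and can be served from cache.
request_body = {
    "model": "claude-opus-4-6",  # model name assumed for illustration
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a coding agent...",  # large, stable system prompt
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL by default
        }
    ],
    "messages": [{"role": "user", "content": "Refactor utils.py"}],
}

# On a cache hit, the response's usage object reports the split that
# jmarianski logged: cache_read_input_tokens (served at ~10% price) vs
# cache_creation_input_tokens (newly written to cache).
```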
Claude Code uses this heavily. It caches its system prompt (~11–14k tokens), tool definitions, and — crucially — the growing conversation history. In a healthy long-running session, you’d expect to see something like this in the token logs: cache_read growing with each turn (the accumulated conversation), and cache_creation staying small (just the delta from the last turn). The pattern looks like an efficiently building wall.
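The healthy pattern can be sketched as a toy accounting model (an illustration, not Claude Code's actual implementation; it assumes the system prompt was cached by the session's opening request):

```python
# Toy model of healthy per-turn token accounting. Each request reads the
# whole cached prefix (cache_read) and writes only the new turn's delta
# (cache_creation), so cache_read grows monotonically.

def simulate_healthy_session(turn_deltas, system_prompt=12_000):
    cached = system_prompt          # tokens already in cache before turn 1
    log = []
    for delta in turn_deltas:
        log.append({"cache_read": cached, "cache_creation": delta})
        cached += delta             # the new turn joins the cached prefix
    return log

for row in simulate_healthy_session([8_000, 15_000, 5_000]):
    print(row)
# cache_read climbs (12000 -> 20000 -> 35000) while cache_creation stays
# small: the "efficiently building wall" pattern.
```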
What jmarianski observed instead was periodic collapse. At certain points in a session — reliably after a --resume operation, or seemingly at random after longer runs — cache_read would crater back down to ~11,000 tokens (just the system prompt), while cache_creation would spike to 300,000+ tokens. The entire conversation history was being rewritten to cache from scratch, on every turn, for the rest of the session. Here’s a stripped excerpt from his logs that shows the cliff:
time     cache_read cache_creation
08:18:54 216204 10504 ... <-- cache starts breaking down
08:19:07 216204 11815 ...
...
08:23:22 11428 224502 ... <-- full cache write, history gone
08:24:47 11428 224953 ...
Those second-column numbers (cache_read) drop from 216k to 11k and stay there. Every subsequent API call now pays full price for a ~225k-token conversation that the server already had cached. With Opus 4.6 pricing, each of those turns costs around 25–50× more than it should.
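As a back-of-envelope illustration of what the collapse does to the bill (the per-million-token prices below are assumptions for the sketch, not Anthropic's actual Opus 4.6 rates), consider just the history portion of one turn, billed at the cache-write rate instead of the cache-read rate:

```python
# Hypothetical prices in $/MTok, for illustration only:
INPUT = 15.00                 # assumed base input rate
CACHE_READ = INPUT * 0.10     # ~10% of base, per the caching behavior above
CACHE_WRITE = INPUT * 1.25    # cache writes carry a premium, assumed 1.25x

HISTORY = 224_502  # conversation tokens from the log excerpt above

healthy = HISTORY * CACHE_READ / 1e6   # history served from cache
broken = HISTORY * CACHE_WRITE / 1e6   # history rewritten on this turn

print(f"history cost this turn, cached: ${healthy:.2f}")
print(f"history cost this turn, broken: ${broken:.2f}")
```

Under these assumed rates the history portion alone goes from roughly $0.34 to roughly $4.21 per turn, and that overhead recurs on every turn for the rest of the session.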
Why the cache was dropping
The specific fix shipped in version 2.1.88 was: “Fixed prompt cache misses in long sessions caused by tool schema bytes changing mid-session.” That’s a compact description of a subtle and painful bug.
Here’s the mechanism. Anthropic’s cache lookup works by checking whether the content up to a breakpoint matches what was cached previously — the hash covers everything in that prefix: system prompt, tool definitions, and conversation turns. Claude Code places cache breakpoints strategically throughout the conversation to maximize hits as the history grows.
But if anything in the middle of that prefix changes between requests, the hash changes, and the lookup fails. The server doesn’t find an entry matching the new hash, so it falls back to writing a fresh cache entry. If it can’t even regenerate the cache for the full conversation at that point (perhaps due to some secondary constraint), you end up stuck: every subsequent turn pays full input token cost for the entire conversation.
The culprit turned out to be tool schema bytes changing mid-session. Claude Code’s tool definitions — the JSON schemas for all its available tools — apparently contained some dynamic content that could differ between requests, even within the same session. When that content changed, any cache entry that included tool definitions in its prefix became invalid. The longer the session (and thus the larger the cached conversation), the worse the financial damage.
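The failure mode can be modeled in a few lines (a toy model of the server-side lookup, not Anthropic's actual implementation; the timestamped description field is a hypothetical stand-in for whatever byte actually varied):

```python
import hashlib
import json
import time

def prefix_hash(system_prompt: str, tools: list) -> str:
    # Toy model of the cache lookup key: an entry is found only if every
    # byte of the prefix (system prompt + tool schemas) matches exactly.
    blob = system_prompt + json.dumps(tools, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

system = "You are a coding agent..."
stable_tool = {"name": "read_file", "input_schema": {"type": "object"}}

# Hypothetical dynamic field standing in for the real varying bytes:
dynamic_tool = dict(stable_tool, description=f"generated at {time.time()}")

h1 = prefix_hash(system, [stable_tool])
h2 = prefix_hash(system, [stable_tool])
h3 = prefix_hash(system, [dynamic_tool])

print(h1 == h2)  # True  -> identical prefix bytes, cache hit
print(h1 == h3)  # False -> one changed field invalidates the whole prefix
```

Because the conversation history sits after the tool definitions in the prefix, a single changed schema byte invalidates every cached turn downstream of it, which is why the damage scaled with session length.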
There’s a related clue in the v2.1.86 release, which included: “Improved prompt cache hit rate for Bedrock, Vertex, and Foundry users by removing dynamic content from tool descriptions.” The same class of problem, partially addressed one version earlier for third-party platforms, but apparently not fully closed for the main product.
It’s also worth noting the earlier sibling issue, #34629, which reported that --print --resume sessions stopped caching conversation turns entirely starting around v2.1.69 — cache_read stuck at the 14.5k system prompt level and never growing, resulting in a ~20x cost increase per message for anyone building bots or automation on top of the CLI. That issue was open for weeks before the broader pattern became visible.
The opacity problem
What makes this class of bug uniquely harmful for Max plan subscribers is how Anthropic’s subscription pricing works. The Max plan charges a flat monthly fee ($100 or $200) in exchange for a session usage quota expressed as a multiple of Pro. But the actual token budget underlying that quota is opaque — Anthropic doesn’t publish it, and the usage dashboard just shows a percentage. You don’t get a breakdown of why you consumed that percentage.
Anthropic’s documentation explains only that “your usage is affected by several factors, including the length and complexity of your conversations, the features you use, and which Claude model you’re chatting with.” That’s true as far as it goes, but it means there’s no way to distinguish “I used a lot of tokens because I did a lot of work” from “I used a lot of tokens because a cache bug force-rewrote 300k tokens twelve times.”
The person who actually figured this out had to write a custom diagnostic tool and spend hours correlating timestamped API logs. That’s a reasonable thing for an engineer investigating a bug to do — but it shouldn’t be necessary for users who are just trying to understand why their $200/month plan ran out before lunch.
The broader context here isn’t great. This is at least the third notable cache-related regression in Claude Code since January 2026. The January incident around the Opus 4.5 launch involved similar symptoms: limits hit much faster than expected, community outcry, Anthropic investigating, eventually rolling back or patching something server-side. The March 22–23 wave coincided with yet another round of these reports and got partially attributed to deliberate peak-hour limit adjustments — which may have been true simultaneously, further muddying the diagnosis.
What 2.1.88 fixes
The fix is straightforward in concept: make tool schema bytes stable across requests within the same session. If the tool definitions don’t change between turns, the cache prefix they’re part of remains valid, the lookup finds the existing cache entry, and subsequent turns pay only for the incremental new content. For a long session on Opus 4.6, this is the difference between each turn costing a few cents and each turn costing several dollars.
The release also fixes a cluster of other issues that, taken together, paint a picture of a codebase that’s been moving very fast: nested CLAUDE.md files being re-injected dozens of times in long sessions (another source of bloated token counts), a StructuredOutput schema cache bug causing ~50% failure rates in multi-schema workflows, and a memory leak in the LRU cache for large JSON inputs.
If you’re on Linux and using a version between roughly 2.1.69 and 2.1.87, the cost anomalies jmarianski documented are likely real and reproducible. Updating to 2.1.88 should fix the cache invalidation. If you want to verify rather than trust, the verification script from the original bug report is a clean way to check: start a session, write a unique string, and confirm that cache_read grows on subsequent turns rather than flatlining at ~11k.
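A minimal detector in the spirit of that verification (the row format mirrors the log excerpt earlier in this article; the thresholds are illustrative assumptions, not values from the bug report):

```python
SYSTEM_PROMPT_TOKENS = 12_000  # approximate cached system prompt size, assumed

def find_cache_collapse(rows):
    """Return the index of the first row where cache_read craters back to
    roughly system-prompt level after having grown, while cache_creation
    spikes to rewrite the history -- or None if the session looks healthy."""
    peak = 0
    for i, (_, cache_read, cache_creation) in enumerate(rows):
        peak = max(peak, cache_read)
        if (peak > 2 * SYSTEM_PROMPT_TOKENS
                and cache_read <= SYSTEM_PROMPT_TOKENS
                and cache_creation > peak / 2):
            return i
    return None

# Rows from the log excerpt: (timestamp, cache_read, cache_creation)
rows = [
    ("08:18:54", 216204, 10504),
    ("08:19:07", 216204, 11815),
    ("08:23:22", 11428, 224502),   # full cache rewrite, history gone
    ("08:24:47", 11428, 224953),
]
print(find_cache_collapse(rows))  # -> 2, the turn where the cliff appears
```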
The deeper fix — giving paying subscribers enough visibility into their token usage to actually diagnose when something is wrong — is a product decision, not a bug. But it’d go a long way.