
Why Anthropic banned third-party tools — it's all about the KV cache


01

A lot of people think Anthropic banning subscription credits for 3rd party tools is about throttling competition, but the real answer is clear if you're paying attention. It all comes down to KV caching.

02

LLMs are transformer models, and transformer models quite literally transform token inputs through attention layers. In every layer, each token "attends" to the tokens before it. That attention computes a Key vector and a Value vector for every token in the sequence (stacked across the sequence, these form the K and V matrices). KV caching stores them so the model doesn't have to recompute them from scratch every time.
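As a concrete sketch of what "Keys and Values" means here, this is a minimal single-head causal attention in NumPy. The dimensions and weights are made up for illustration; real models do this in every layer, across many heads.

```python
# Toy single-head causal attention (NumPy). Illustrative only -- shapes and
# weight matrices are random placeholders, not a real transformer layer.
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 16, 4

x = rng.normal(size=(seq_len, d_model))     # one embedding per token
W_q = rng.normal(size=(d_model, d_model))   # learned projection matrices
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# Each token gets its own Key and Value vector -- this is what a KV cache stores.
Q = x @ W_q                                 # (seq_len, d_model)
K = x @ W_k
V = x @ W_v

scores = Q @ K.T / np.sqrt(d_model)         # every token scores every token
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores[mask] = -np.inf                      # causal: only attend to earlier tokens
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                           # attended output, (seq_len, d_model)
```

Note that K and V depend only on the tokens themselves, not on what comes after them, which is exactly why they can be cached and reused.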

03

Caching becomes particularly important because transformers generate tokens one at a time. Without caching, every new token requires reprocessing the entire conversation history. With caching, you store all previous Key/Value states and just append the new token's computations. It's the difference between rereading an entire book vs. picking up where you left off.
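The "rereading the book" difference can be sketched as a toy decode loop. The per-token K/V "computation" is faked with strings, since only the operation counts matter:

```python
# Toy illustration of why a KV cache turns quadratic re-work into an append.
# "Computing" K/V for a token is faked here; only the counts matter.

def decode(num_steps, use_cache):
    work = 0        # how many per-token K/V computations we perform
    cache = []
    for step in range(1, num_steps + 1):
        if use_cache:
            cache.append(f"kv_{step}")   # compute K/V for the new token only
            work += 1
        else:
            # recompute K/V for the entire history, every single step
            cache = [f"kv_{i}" for i in range(1, step + 1)]
            work += step
    return work

print(decode(100, use_cache=False))  # 5050 token computations (quadratic)
print(decode(100, use_cache=True))   # 100 token computations (linear)
```

For a 100-step generation the uncached loop does 50x the work, and the gap widens with sequence length.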

04

The scale of this is not trivial. For a frontier-scale model processing a 128k token context window, the KV cache alone can be 30-40GB of data. That's 30-40GB that either gets loaded from memory in milliseconds or recomputed from scratch on every single request. Multiply that by millions of users and you start to see why this is an infrastructure problem, not just an optimization.
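Back-of-envelope arithmetic lands in that range. All the architecture numbers below are hypothetical (frontier labs don't publish theirs), but they're plausible for a large model using grouped-query attention:

```python
# Rough KV cache sizing. Every model parameter here is an assumption for
# illustration -- the actual architectures of frontier models are not public.

layers = 80          # transformer layers (assumed)
kv_heads = 8         # key/value heads, assuming grouped-query attention
head_dim = 128       # dimension per head (assumed)
bytes_per_val = 2    # fp16/bf16 storage
context = 128_000    # tokens in the window

# 2x because both Keys and Values are cached, per layer, per KV head.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
total_gb = per_token * context / 1e9
print(f"{per_token / 1024:.0f} KiB per token, {total_gb:.1f} GB for the full window")
```

Under these assumptions that's about 320 KiB per token and roughly 42 GB for a full 128k window, right in the ballpark of the figure above.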

05

This doesn't just apply within a single request. In a multi-turn conversation, all previous inputs and outputs become part of the context for the next message. So by turn 20 of a coding session, the model is processing the full history of everything you and it have said. Without cross-request caching, that entire context gets recomputed from scratch every single turn.

06

But KV caching isn't automatic. You have to explicitly declare cache breakpoints in your requests and structure follow-up messages so that the prefix of the conversation stays identical. Change anything in the cached portion and the whole thing gets invalidated. The cache has a TTL (5 min default, 1 hour if you pay more) and you get a maximum of 4 breakpoints. It's an engineering discipline, not a checkbox.
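Here's a sketch of what those breakpoints look like in an Anthropic Messages API request body, based on their published prompt-caching scheme. The model name and prompt text are placeholders; a `cache_control` marker caches everything up to and including that block, so the next request must keep that prefix byte-identical to get a hit.

```python
# Sketch of a Messages API payload with cache breakpoints (placeholder content).

request = {
    "model": "claude-sonnet-example",   # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a coding assistant. <long stable instructions...>",
            # Breakpoint 1: the system prompt never changes, so cache it.
            "cache_control": {"type": "ephemeral"},  # ~5 min TTL by default
        }
    ],
    "messages": [
        {"role": "user", "content": "Earlier conversation turns..."},
        {"role": "assistant", "content": "...assistant replies..."},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Latest user turn",
                    # Breakpoint 2: cache the conversation up to here, so the
                    # next turn only pays full price for what comes after it.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
    ],
}
```

Edit anything before a breakpoint (reorder a message, tweak the system prompt) and that cached prefix no longer matches, so the whole thing gets recomputed at full price.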

07

When you do it well, the savings are massive. Cached token reads cost 10% of base input price. For a multi-turn coding session where the system prompt and conversation history stay stable across turns, good caching can cut your effective token costs by 80-90%. That's the difference between a $200/mo subscription being viable and it being a money pit.
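The arithmetic behind that 80-90% figure is simple. Using the cached-read price of 10% of base input (and ignoring the one-time cache-write surcharge), the effective cost is just a blend of hits and misses:

```python
# Effective-cost math for cached reads. Prices are in arbitrary units;
# cached reads at 10% of base input price, per the post.

base = 1.0            # cost per input token, arbitrary units
cached = 0.10 * base  # cached-read price

def effective_cost(total_tokens, hit_rate):
    """Blend of cache hits (cheap reads) and misses (full price)."""
    return total_tokens * (hit_rate * cached + (1 - hit_rate) * base)

full = effective_cost(1_000_000, hit_rate=0.0)
good = effective_cost(1_000_000, hit_rate=0.9)   # well-structured session
print(f"savings: {1 - good / full:.0%}")          # -> 81%
```

A 90% hit rate cuts costs by roughly 81%; push the hit rate higher (long stable system prompts dominate the context) and you approach the full 90% ceiling.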

08

This is one of the core engineering priorities for the Claude Code team. They control the harness, they control how requests are structured, they control the cache breakpoints. Every interaction is optimized to maximize cache hits and minimize redundant computation. It's not incidental, it's a fundamental part of how they make the subscription economics work. It's so important, they declare SEVs when cache hit rates get too low.

09

Third party tools that aren't paying for their own tokens don't worry about this. It's not their tokens so they don't optimize for Anthropic's caching infrastructure. They structure requests differently, break cache prefixes, miss breakpoint opportunities. The result is they churn through tokens at dramatically higher rates for the same work. A coding session that costs X tokens through Claude Code could cost 5-10x through an unoptimized third party harness.
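A toy multi-turn model shows where a 5-10x multiple can come from. A harness that keeps the prefix stable pays cached-read prices on the growing history; one that breaks the prefix pays full price on all of it, every turn. The per-turn token count is an assumption:

```python
# Why a cache-breaking harness burns budget faster in multi-turn sessions.
# Illustrative numbers only.

cached_rate, base_rate = 0.1, 1.0   # cached reads at 10% of base price
turn_tokens = 2_000                 # new tokens added per turn (assumed)
turns = 20

def session_cost(breaks_cache):
    cost, history = 0.0, 0
    for _ in range(turns):
        rate = base_rate if breaks_cache else cached_rate
        cost += history * rate            # re-reading the conversation so far
        cost += turn_tokens * base_rate   # the new turn is always full price
        history += turn_tokens
    return cost

ratio = session_cost(True) / session_cost(False)
print(f"{ratio:.1f}x more expensive with a broken cache prefix")  # -> 5.4x
```

Under these assumptions the broken-prefix session costs about 5.4x as much, and the multiple grows with session length since the history term dominates.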

10

This strains Anthropic's infrastructure and breaks the economics of flat-rate subscriptions. It's also (I suspect) a big reason people have been complaining about subscription rate limits getting "nerfed." When a chunk of your user base is burning through tokens 5x faster than expected through tools you don't control, the math stops working, your infrastructure gets overwhelmed, and everyone's experience degrades.

11

Are competitive dynamics a factor? Probably. Anthropic doesn't want to become a commoditized backend for someone else's product. But the infrastructure argument is almost certainly the primary driver. When you have $200/mo subscribers generating $1,000+ in compute costs while your servers are melting, it stops being just about competitive strategy and becomes a survival problem.

Two days later, several variations of the same question rolled in — if Anthropic just wants their economics to work, why not bill me accordingly and let me burn through my allotment? — so I posted a follow-up digging into the subscription math and the infra context behind that decision.

12

I got several versions of this question on my KV caching thread and it's a legitimate one. The question is effectively "why do they care if you use your tokens inefficiently? just bill me accordingly and let me exhaust my allotment quickly." There's a few things going on beyond just caching.

13

The first major reason is the subscription math. My best estimate is that Anthropic needs average subscription usage to be ~30% of the limit for the economics to work. Which raises an obvious question: why not just set the limit at 30%? There's an inherent tension there around user experience, and I think there are two reasons the limit is set ~3x higher than what they need the average to be.

14
  1. they're trying to maximize perceived value. The subscription is built assuming episodic spikes where you're working intensely for a few hours and then cooling off. A higher ceiling accommodates that without most users actually hitting it consistently.
15
  2. they're subsidizing their heaviest users on purpose, because your super users are your evangelists. They're the ones building workflows, writing about Claude, pulling other people into the ecosystem. That subsidy is a growth investment.
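A toy model shows how a flat-rate plan can tolerate subsidized heavy users as long as the average stays low. Every number here is invented to illustrate the ~30% estimate above:

```python
# Toy flat-rate subscription economics. All numbers are hypothetical.

price = 200.0           # $/month subscription
cost_at_limit = 667.0   # hypothetical compute cost of maxing out the limit

# (share of user base, fraction of the limit they actually use)
users = {"light": (0.70, 0.15), "medium": (0.25, 0.50), "heavy": (0.05, 1.00)}

avg_util = sum(share * util for share, util in users.values())
avg_cost = avg_util * cost_at_limit
print(f"average utilization {avg_util:.0%}, margin per user ${price - avg_cost:.0f}")
```

In this made-up mix, 5% of users max out the limit and lose money individually, but the blended utilization sits near 28% and the plan stays barely profitable. Shift enough users into the "heavy" bucket and the margin flips negative, which is the OpenClaw problem in miniature.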
16

The problem with OpenClaw is that neither of those dynamics applies. When you're burning tokens inefficiently and indiscriminately, you max out your limits every single week. There's no spiky usage pattern to average down.

17

And instead of contributing to the ecosystem, OpenClaw is trying to abstract it away. In practice, it was really an API use case taking advantage of the subscription to get access to discounted rates.

18

The second major reason is infrastructure. Anthropic has strained capacity right now. They're seeing massive growth while still underallocated on inference. As a result, they're dealing with regular outages and struggling to keep pace with demand.

19

In that context, the trade-off becomes pretty clear. Instead of continuing to subsidize OpenClaw users who are burning through compute disproportionately, they decided to cut them off and reallocate that capacity towards users who are either already profitable or who they think will build the ecosystem long term.

20

There's one genuine tension here though. OpenClaw users are also likely to be the ones experimenting with new AI tools. They're the vanguard of adoption. Those are users you want in your ecosystem if you can keep them.

21

I think if the user base had been smaller, or if Anthropic wasn't dealing with severe infra constraints, they would have made that trade-off differently. But adoption got too widespread. There were too many playbooks, guides, and wrappers, so it stopped being just early adopters and became a substantial drain on resources. I would still guess it was a tough call, but ultimately Anthropic needed to prioritize keeping the rest of their subscribers happy.

22

But, that's also why I think they ended up extending an expensive olive branch. By giving everyone their subscription's worth in extra usage and providing discounted extra usage to subscribers, they're trying to strike a balance point. It's their way of trying to keep those users in the ecosystem without subsidizing them as heavily as they were.

Originally on Threads