Have you felt like Claude suddenly got… a bit worse? Slower, forgetful, oddly less sharp? You’re not imagining it—and you’re definitely not alone.
Over the past weeks, a steady stream of developers and power users noticed something was off. Responses that used to feel precise started drifting. Long-running coding sessions became inconsistent. Sometimes Claude would “forget” what it was doing mid-task. At first, it looked like the usual noise of subjective feedback. But then Anthropic stepped in and confirmed: yes, something actually broke.
In a detailed postmortem (read it here), they traced the issue not to a single failure but to three overlapping changes that interacted in surprisingly messy ways.
The first problem was almost philosophical: how much should a model “think”?
Claude Code exposes a knob called reasoning effort. Higher effort means more internal thinking, better answers, but longer latency. Lower effort means faster responses, but less depth. In early March, Anthropic quietly changed the default from high to medium. On paper, it made sense—internal testing showed only a small drop in intelligence but a big improvement in speed.
In reality, users noticed immediately. The model felt… shallower.
This is one of those subtle truths about AI systems: averages don’t tell the full story. Even if most tasks degrade only slightly, users tend to remember the moments where depth really matters. That’s where “slightly worse” becomes “noticeably worse.”
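To make that concrete, here's a toy sketch with entirely made-up scores (not Anthropic's benchmark numbers): nine easy tasks that barely change, plus one hard task that craters. The average drop looks negligible; the drop on the task users actually remember does not.

```python
# Made-up evaluation scores, for illustration only.
easy_tasks = [0.95] * 9  # nine easy tasks, unaffected by the change

before = easy_tasks + [0.80]  # hard task at high reasoning effort
after  = easy_tasks + [0.50]  # hard task at medium reasoning effort

avg_drop = sum(before) / len(before) - sum(after) / len(after)
hard_drop = before[-1] - after[-1]

print(f"average drop:   {avg_drop:.2f}")   # small: looks fine on paper
print(f"hard-task drop: {hard_drop:.2f}")  # large: what users remember
```

A 0.03 average drop and a 0.30 drop on the hardest task are the same dataset; which one you report determines whether the change looks safe.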
Anthropic eventually reversed the change in April, restoring higher effort as the default. But by then, it had already colored user perception.
The second issue is more technical—and more interesting.
Claude relies heavily on maintaining its internal reasoning across turns. That reasoning isn’t just for show; it’s part of how the system stays consistent over time. In late March, Anthropic introduced an optimization: if a session had been idle for over an hour, older reasoning would be cleared to reduce cost and latency when resuming.
The idea was sound. The implementation wasn’t.
A bug caused this cleanup to run not once, but on every subsequent turn. Imagine having a conversation where, after a break, the other person starts forgetting everything except their last sentence—over and over again.
That’s effectively what happened.
Worse, this compounded during tool use. If a user interacted while Claude was mid-task, even the current reasoning could get wiped. The result was a model that kept going—but without memory of why it made previous decisions. Users described it as forgetful, repetitive, even erratic.
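The postmortem doesn't share code, but the shape of the bug is easy to reproduce. Here's a minimal hypothetical sketch (all names invented, not Anthropic's actual implementation) of a once-only cleanup that fires on every turn because its trigger is never reset:

```python
import time

IDLE_THRESHOLD = 3600  # one hour, in seconds

class Session:
    """Hypothetical session state, for illustration only."""

    def __init__(self):
        self.reasoning = []     # reasoning carried across turns
        self.idle_since = None  # set when the session goes idle

    def on_turn_buggy(self):
        # Intended: when resuming after a long idle period, clear
        # older reasoning once to save cost and latency.
        # Bug: idle_since is never reset, so this condition stays
        # true forever and the wipe repeats on every later turn,
        # including reasoning produced mid-task.
        if self.idle_since and time.time() - self.idle_since > IDLE_THRESHOLD:
            self.reasoning.clear()

    def on_turn_fixed(self):
        if self.idle_since and time.time() - self.idle_since > IDLE_THRESHOLD:
            self.reasoning.clear()
            self.idle_since = None  # fix: the cleanup runs only once
```

Walk it through: in the buggy version, a session that resumed after an hour idle wipes its reasoning on every turn, so even reasoning generated after the resume is gone by the next turn. The one-line fix clears the trigger after the first cleanup.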
There was also a side effect: more cache misses. Prompt caches typically reuse computation only when the conversation prefix matches exactly, and since the system kept discarding context, that prefix kept changing. That likely explains reports of usage limits draining faster than expected.
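Here's a hypothetical sketch of why that happens, assuming a simple prefix cache (illustrative only, not any real API): cached work is keyed on exact prefixes of the conversation, so a wiped context stops matching and every turn pays nearly full price again.

```python
# Hypothetical prefix cache: work is keyed on exact conversation
# prefixes, as a stand-in for how prompt caching generally behaves.
cache: set = set()

def serve_turn(context: tuple) -> int:
    """Cache every prefix of `context`; return how many leading
    messages were already cached, i.e. how much work is reusable."""
    reusable = 0
    for i in range(1, len(context) + 1):
        prefix = context[:i]
        if prefix in cache:
            reusable = i
        else:
            cache.add(prefix)
    return reusable

# A growing conversation reuses almost everything it computed before...
serve_turn(("sys", "r1"))        # cold start: nothing reusable yet
serve_turn(("sys", "r1", "r2"))  # the two-message prefix is reused
# ...but once reasoning is wiped, the prefix no longer matches, and
# only the system prompt is reusable on each subsequent turn.
serve_turn(("sys", "r3"))
```

In the stable session the reusable prefix keeps growing; in the wiped session it collapses back to the system prompt every turn, which shows up to the user as quota draining faster for the same amount of work.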
It took over a week to isolate this bug. It was far from obvious: it lived in a corner case (sessions that had gone idle) and was partially masked by unrelated internal changes.
The third issue feels almost ironic.
Claude Opus 4.7 had a known tendency: it was verbose. That verbosity often correlates with better performance on hard problems, but it also increases cost. To rein it in, Anthropic added a system-level instruction:
“Keep responses under tight word limits.”
On the surface, this looks harmless. In practice, it clipped the model’s ability to fully reason through complex coding tasks. Internal testing didn’t catch major regressions at first—but broader evaluations later showed measurable drops in performance.
A single line in a system prompt shaved off intelligence.
That’s a striking reminder of how sensitive these systems are. Small constraints, especially on output length, can indirectly limit reasoning depth. It’s not just about what the model says—it’s about how much room it has to think.
What makes this episode particularly interesting is how the three issues overlapped.
Different users experienced different symptoms depending on timing, model version, and usage patterns. Some saw reduced intelligence. Others saw memory problems. Still others noticed increased costs. Together, it looked like a general degradation, even though no single root cause explained everything.
From the outside, it felt like the model had “gotten worse.”
From the inside, it was a stack of independent changes interacting in unintended ways.
Anthropic’s response is also worth noting.
They reverted the problematic changes, fixed the bug, and tightened their process. More real-world testing on public builds. Stronger controls around system prompt edits. Broader evaluation suites. Gradual rollouts for anything that might trade off intelligence.
They even mention something telling: when they revisited the buggy code using a newer model (Opus 4.7), it successfully identified the issue—while the older version didn’t.
In a quiet way, that underscores the trajectory here. The systems are improving, but the environments around them are getting more complex and more fragile.
If there’s a takeaway, it’s this:
Modern AI systems aren’t just models. They’re layered products—interfaces, prompts, caching systems, defaults, and heuristics all interacting. A change in any one layer can ripple outward in unexpected ways.
So when something feels “off,” it might not be your imagination. And it might not even be the model itself.
Sometimes, it’s everything around it.