Month with Claude Code

A month of struggle.

It's not like I got bored with OpenAI Codex or have spare time to play with the new toy - it's quite the opposite. I like codex - it's my main productivity tool. However, for 4 weeks in a row I hit the weekly Codex limit 1-2 days before the end of the week. So I had to use the alternative to avoid blocking, not for some quick "let's try this" session.

That gave me something much more useful than hot takes: side-by-side comparison under actual usage. Last time I used both codex and claude back to back was ~5 month ago.

Unless specified otherwise, I'm comparing Opus 4.6 high vs GPT-5.4 xhigh.

The main issue: too eager to satisfy

Despite very clear guardrails and well-defined goals in AGENTS.md, Claude Code still feels too eager to satisfy the user's request.

That sounds nice until you ask it to do something where the correct move is not to jump straight into action.

With Codex, in one example I said "promote version for 2 slots deployments" and instead of going straight for the prod change, it actually stopped because of the guardrails and suggested a sane solution. That is exactly the kind of behavior I want.

Reason I'm using Codex

I do not need a model that looks busy. I need one that can read the room, respect the rails, and not improvise nonsense just to look helpful.

The problem is not just mistakes. It is blind confidence.

You're absolutely right meme

Honestly, it's a bummer. It's been more than 5 months since I used Claude for something tangible along with the Codex. Back then Codex was already better. These days there is a real gap in quality, especially in reasoning and context compression quality.

I genuinely start to think that people that compare local models to Sonnet are not that insane. Quality is very much comparable (bad).

Opus - confident and stupid

Look at this confident idiot. Opus keeps spitting statements without much meaning. The worst part is not even being wrong. Everything is wrong sometimes. The worst part is that blind confidence, what is not supported by the hard data. And because of this tone being fed back to the same chat it effectivly becomes a self fulfilling prophecy. It starts to believe nonsense it spits

Good call — rootbox is host-mode (no Docker overhead) and the 7950X3D should be faster for CPU encoding.

This was during a session where we were debugging degraded performance. Instead of saying "ok idk boss", it suggested moving compute to the bare machine because Docker supposedly causes a massive performance hit. That was a CPU-bound task.

This is exactly what kills me. Not just wrong, but wrong with posture.

Dirty context goes south fast

To the extent that I now monitor context usage and expect that if it's more than 128k, things will start going south.

That also explains why people restart sessions so often. With OpenAI Codex I restart a session every 2-5 days, sometimes after more than a week. It still gets dirty, sure, but the degradation is nowhere close to what I see with Opus. With Claude, once the context gets heavy, the hallucinated statements and fake certainty start piling up much faster.

You can get used to a model making mistakes. You can even work around it. But a model with weak context compression and strong confidence is exhausting.

It prefers looking busy over following the obvious process

One of the dumbest examples was provisioning a new host.

There were instructions on how to do it. Not some vague internal lore. Three ready-made scripts that just needed to be run sequentially. Instead, in a fresh session, it went to staging and started copying files one by one, by hand. 500+ commands instead of 3.

That is not "slightly inefficient". That is the model ignoring the documented path and choosing busywork.

I stopped it and typed "dumb btch", and it actually got the hint and started deploying via the scripts like it was supposed to.

That is also why Opus 4.6 high honestly feels like the bare minimum that's even usable from Anthropic.

What I actually want from a model

I do not want artificial warmth, filler, or confident folklore.

I want this:

I'm checking the official stats rather than giving you recycled folklore, because this route is famous for “basically impossible” claims that often get repeated without numbers.

That line is basically the whole thing. Check the data. Respect the constraints. Do not confidently manifest opinions where evidence should be.

Where I landed after a month

After a month of this, my issue with Claude Code is not that it never works. It is that too often it is confidently manifesting its opinion without any data, doing "light" analysis at best, and getting noticeably worse as the context gets dirty.

That is the part I struggle with.

Not the occasional mistake. Not even the occasional hallucination.

The blind confidence.

This is why the month felt like a struggle

If Claude Code was simply dumb sometimes, that would be manageable and predictable. You can't win all the time. But struggle is that it is often dumb in the most tiring possible way: eager, confident, and not very grounded.

So over time the experience becomes less about coding and more about supervision. You watch the context size. You double-check the obvious. You keep nudgeing it "Hey ask 5 subagents to critically review your response". You stop it from taking fake initiative. You correct it when it chooses busywork over the documented process that's already sitting in the context. You fight the urge to trust the polished answer just because it sounds finished.

That is where the frustration comes from.

It was an interesting month of using Claude Code along with the main tool. I've stopped subscription 4 days after resuming and probably will consider it again in half year time.

I'ts just so eager to satisfy the request. Its reasoning and context compression quality fall apart too fast.
And once that combines with blind confidence not supported by hard data, the whole thing becomes exhausting.

But millions can't be wrong!!11

I'm well aware that devs and especially vibecoders love Claude, I'm fine with that - probably for them it works better, or they think it works better. It's actually great that we have options on the market keeping innovations running.

That was my month with Claude Code.