I've found Opus 4.6 to be smarter than 4.5, at least in some ways. There's a bug I'd been trying to solve for a decade (as had other humans), and I've been giving it to each new model to try, including in interactive sessions. Each model got closer, but none of them actually solved it, until Opus 4.6 got it on the first go (I probably used Ultrathink). This was before the 1M context was available.
I'd agree that 4.6 and 4.5 are different, but I don't think it's fair to say 4.6 is just reduced and benchmaxxed. It genuinely solved problems for me that no other model has.
I also would have liked to see the 4.6 benchmarks compared against Qwen.