
4.5 is better than 4.6 in practice, though. 4.6 was purely a cost-saving change, with enough benchmark gaming to look better on paper.


I've found Opus 4.6 to be smarter than 4.5, at least in some ways. There's a bug I'd been trying to solve for a decade (as had other humans), and I've been giving it to each new model to try to solve, including in interactive sessions. Each model got closer, but none of them actually solved it, until Opus 4.6 got it on the first go (I probably used Ultrathink). This was before the 1M context was available.

I'd agree that 4.6 and 4.5 are different, but I don't think it's correct that 4.6 is just cost-reduced and benchmaxxed. It genuinely solved problems for me that no other model has been able to.

I think I'd like to have seen the 4.6 benchmarks also included against Qwen.


Exactly. In the exact same coding-agent harness, 3.6 plus is notably worse than 3.5 plus in all of my testing.

The former gets stuck in ridiculous thought loops on the exact same tasks I'm testing. Fascinating, really; I expected more for some reason.




