I've found Opus 4.6 to be smarter than 4.5, at least in some ways. There's a bug I'd been trying to solve for a decade (as had other humans), and I've been giving it to each new model to try, including in interactive sessions. Each model got closer, but none of them actually solved it, until Opus 4.6 got it on the first go (I probably used Ultrathink). This was before the 1M context was available.
I'd agree that 4.6 and 4.5 are different, but I don't think it's fair to say 4.6 is just reduced and benchmaxxed. It genuinely solved problems for me that no other model has.
I also would have liked to see the 4.6 benchmarks compared against Qwen.