If it's relevant to the discussion, I hope not. I've spent probably over100 hour...

jaggs · 2026-04-08T10:10:16 1775643016

It's a great benchmark. Don't listen to the haters. This one is especially interesting.

https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

BoorishBears · 2026-04-08T18:16:58 1775672218

This one's even more interesting

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Who knew Anthropic was this far behind???

jaggs · 2026-04-08T19:03:23 1775675003

Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock solid opus experience.

BoorishBears · 2026-04-09T08:10:19 1775722219

Their benchmark is chock-full of things like that: It's deeply flawed and is essentially rating how LLMs perform if you exert yourself trying to hold them entirely the wrong way.