
If it's relevant to the discussion, I hope not.

I've spent probably over 100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others who reached out to me) it isn't useless either. I use it myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.

Let me know if you know of a better platform for comparing models; I built this one because I didn't find any with good enough UX.



It's a great benchmark. Don't listen to the haters. This one is especially interesting.

https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...


This one's even more interesting:

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Who knew Anthropic was this far behind???


Yeah, but actually that's not a good look. Anyone who's used Gemini knows how random it is at getting anything serious done, compared to the rock-solid Opus experience.


Their benchmark is chock-full of things like that: it's deeply flawed and essentially rates how LLMs perform when you go out of your way to hold them entirely the wrong way.



