I specifically tested on tasks I designed because I know every modern model, not only local ones, is benchmaxxed. The common benchmarks most labs use are very likely in their training data to some degree (unintentionally, I assume, but still highly probable), and there was a recent report from people at UC Berkeley on how easy it is to actually game them: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
That is precisely why my testing has been daily driving the model for everything, plus 8 tasks in a domain I care about. Could there be something very similar in their training data? Of course, at least for most of the tasks, but if that led to the good performance and results I'm getting, I'm personally OK with that. I don't care how high the numbers are on the common benchmarks, only whether it works well enough for me.
And if this model doesn't work for you, that's perfectly ok. Everyone has different needs from models. I was just impressed that it did for me, as it was a first from a local model.