I feel like we are getting closer and closer to a finding the Goldilocks Llm, that with some smarter training and the right set of parameters, will get us close to got 3.5 turbo performance but at a size, cost, and time effort that is significantly lower and that is runnable locally.
Combine that with what seems like every chip adding a neural engine and it feels like we are in the early days of high performance graphics again. Right now we are in the unreal engine voodoo era, where graphics cards/neural engines are expensive/rare. But give it a few generations and soon we can assume that even standard computers will have pretty decent NPUs and developers will be able to rely on that for models
Is this the level of performance people are relying upon? While I've always been impressed with the technology itself, it's only starting with GPT 4 that I think it approaches adequate performance.
For the work that I do (which is mostly rag with a little bit of content generation) GPT 3.5-turbo-0125 with a 16k context window is the sweet spot for me. I started using the api when it was only a 4k context window, so the extra breathing room provided by the 16k context window feels cavernous. Plus the fact that it's $0.50 per 1Million tokens means that I can augment my software with LLM capabilities at a cost that is attractive to me as a small time developer.
The way I rationalize it is that using 3.5-turbo is like programming on an 8-bit computer with Kilobytes of Ram, and gpt-4o is like programming on a 64bit computer with a 4080 ti and 32gb of ram. If I can make things work on the 8-bit system, they will work nicely on the more powerful system.
3.5-turbo performance wasn't very good though, and according to API statistic analysis it's a Nx7B model so it's already rather small. Ultimately Llama-3-8B is already better in all measurable metrics except multilingual translation, but that's not saying much.
It's not called anything until the lmsys leaderboard ranks it. Microsoft's blatant benchmark overfitting on Phi-2 makes for very little trust in what they say about performance. As a man once said, fool me once, shame on you, fool me twice-can't get fooled again.
Combine that with what seems like every chip adding a neural engine and it feels like we are in the early days of high performance graphics again. Right now we are in the unreal engine voodoo era, where graphics cards/neural engines are expensive/rare. But give it a few generations and soon we can assume that even standard computers will have pretty decent NPUs and developers will be able to rely on that for models