Hello everyone! I'm writing a blog post about my experience training a minimal DDPM and just want to share what I've learned so far. Feel free to read it and discuss with me.
This question keeps popping up but I don't get it. Everyone and their dog has an OpenAI-compatible API. Why not just serve a local LLM and put 127.0.0.1 api.openai.com in your hosts file?
I mean, why is that even a question? Is there some fundamental difference between the black box that is GPT-* and, say, LLaMA, that I don't grok?
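For what it's worth, most local servers (vLLM, llama.cpp's server, Ollama, etc.) expose the same chat-completions route, so you often don't even need the hosts-file trick: you can just point the client's base URL at the local server. A minimal sketch with the official openai Python client, assuming a local server on port 8000 and a model registered as "local-model" (both are placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at a local OpenAI-compatible server.
# base_url, api_key, and model name are assumptions; use whatever your server exposes.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder for the model your local server serves
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```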
I think it can't surpass SOTA on some LM evaluation sets, but keep in mind that achieving better results requires a very good training dataset, which not everyone can afford.
On the other hand, the main selling points of Zamba/Mamba are low latency, fast generation, and efficient memory usage. If that holds up, LLMs could become much easier for everyone to run. All we need to do is wait for someone with a good training dataset to train a SOTA Mamba.
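For intuition on the memory claim: a state-space model like Mamba carries a fixed-size hidden state from one token to the next, while a transformer's KV cache grows linearly with the sequence. A toy sketch of that difference (a plain linear recurrence, not Mamba's actual selective-scan kernel; all sizes are made up for illustration):

```python
import numpy as np

d_state, d_model = 16, 64            # illustrative sizes, not a real Mamba config
A = 0.01 * np.random.randn(d_state, d_state)
B = 0.01 * np.random.randn(d_state, d_model)
C = 0.01 * np.random.randn(d_model, d_state)

h = np.zeros(d_state)                # fixed-size state: memory stays constant during decoding
kv_cache = []                        # a transformer would append K/V entries for every token

for t in range(1000):                # pretend we decode 1000 tokens
    x = np.random.randn(d_model)     # stand-in for the current token's embedding
    h = A @ h + B @ x                # recurrent update: same memory cost at every step
    y = C @ h                        # this token's output
    kv_cache.append(x)               # transformer-style cache keeps growing

print("SSM state:", h.size, "floats (constant)")
print("KV cache:", len(kv_cache) * d_model, "floats (grows with sequence length)")
```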
How can you retrieve the latent representations from the candidate LLMs? Some models do not have open weights (GPT-4, for example), which means that, AFAIK, it is impossible to directly access the hidden latent space through their API.
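For open-weight models this part is straightforward: Hugging Face transformers will return the per-layer hidden states if you ask for them. A minimal sketch, using gpt2 purely as a small stand-in for whatever candidate model you have in mind:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# gpt2 is just a placeholder; substitute any open-weight candidate model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("An example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Tuple of (num_layers + 1) tensors, each [batch, seq_len, hidden_dim];
# the last entry is the final-layer latent representation of every token.
hidden_states = outputs.hidden_states
print(len(hidden_states), hidden_states[-1].shape)
```

For API-only models like GPT-4 you're right: the API gives you back text (and, on some endpoints, token logprobs or embeddings from a separate model), not the internal activations.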
Does this mean LLMs can generate text from an empty context? How can an LLM choose the first token without any previous tokens? My understanding is that to compute the logits for the next token, an LLM needs all previous tokens as input. Am I correct?
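If it helps, the usual answer is that the context is never truly empty: decoding starts from a special beginning-of-sequence (BOS) token, and the logits for the first "real" token are conditioned on that. A minimal sketch with GPT-2 (chosen only because it's small; its BOS and EOS tokens happen to share the same id):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The "empty" context is really just the BOS token.
input_ids = torch.tensor([[tokenizer.bos_token_id]])

with torch.no_grad():
    logits = model(input_ids).logits[:, -1, :]   # distribution over the first real token

first_token_id = int(torch.argmax(logits, dim=-1))
print(tokenizer.decode([first_token_id]))
```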