Transformers' greatest improvement over RNN/LSTM was to enable better parallelization of large-scale training. This is what enabled language models to become "large". But when controlling for overall size, more RNN/LSTM-like approaches seem to be more efficient, as seen e.g. in state space models. The transformer architecture does add some notable capabilities in accounting for long-range dependencies and "needle in a haystack" scenarios, but these are not a silver bullet; they matter in very specific circumstances.
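To make the parallelization point concrete, here's a toy sketch (hypothetical code, not any real model): an RNN's hidden state at step t needs step t-1's result, while each attention output depends only on the raw inputs, so all positions can be computed at once during training.

```python
import math

def rnn_hidden_states(xs, h0=0.0, w=0.5):
    # Each step needs the previous step's result: inherently sequential.
    h, out = h0, []
    for x in xs:
        h = math.tanh(w * h + x)  # stand-in for a non-linear cell
        out.append(h)
    return out

def attention_outputs(xs):
    # Every position's output depends only on the full input sequence,
    # not on other positions' *outputs*, so each iteration of the outer
    # loop is independent and could run in parallel (scalar toy case).
    out = []
    for q in xs:
        scores = [q * k for k in xs]          # dot-product scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append(sum(e / z * v for e, v in zip(exps, xs)))
    return out
```

Each attention output here is a softmax-weighted (convex) combination of the inputs; the point is only the dependency structure, not the modelling quality.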
With modern training techniques, RNNs (not just linear SSMs, potentially even vanilla LSTMs) can scale just as well as transformers, or even better when it comes to enormous context lengths. Dot-product attention still performs better in a number of domains, however (especially exact retrieval), so the best architectures are likely to remain hybrid for now.
>With modern training techniques, RNNs (not just linear SSMs, potentially even vanilla LSTMs) can scale just as well as transformers or even better when it comes to enormous context lengths.
That's not true. Modern training techniques aren't enough. Vanilla RNNs with modern training techniques still scale poorly. You have to make some pretty big architectural divergences (throwing away recurrence during training) to get an RNN to scale well. None of the big labs seem to be bothered with hybrid approaches.
> That's not true. Modern training techniques aren't enough. Vanilla RNNs with modern training techniques still scale poorly. You have to make some pretty big architectural divergences (throwing away recurrence during training) to get an RNN to scale well.
SSMs move the non-linearity outside of the recurrence, which enables parallelisation during training. It is trivial to make this architectural change to an LSTM (see the xLSTM paper). Linear RNNs are still RNNs.
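A toy sketch of why this works (illustrative code, no real library): once the recurrence is linear, h_t = a_t * h_{t-1} + b_t, every step is an affine map, and composing affine maps is associative, so the whole sequence can be computed with a parallel prefix scan instead of a step-by-step loop.

```python
def combine(first, second):
    # Apply `first` then `second`: composing two affine maps h -> a*h + b
    # gives (a2, b2) o (a1, b1) = (a2*a1, a2*b1 + b2), still affine.
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

def sequential_scan(a, b, h0=0.0):
    # Reference: the ordinary one-step-at-a-time recurrence.
    h, out = h0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out

def scan_via_composition(a, b, h0=0.0):
    # Inclusive scan over the associative `combine`. On an accelerator
    # this runs as a tree-structured (Blelloch) scan in O(log T) depth;
    # it is written serially here only to keep the sketch short.
    acc, out = (1.0, 0.0), []   # identity map: h -> 1*h + 0
    for step in zip(a, b):
        acc = combine(acc, step)
        out.append(acc[0] * h0 + acc[1])
    return out
```

The non-linearity (applied to the scan outputs afterwards) doesn't break this, because it sits outside the recurrence; a non-linearity *inside* the step function would destroy the associativity the scan relies on.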
But you can still keep the non-linearity by training with parallel Newton methods, which work on vanilla LSTMs and scale to billions of parameters.
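The parallel-in-time idea can be sketched with a plain fixed-point (Jacobi/Picard) iteration; real parallel Newton methods linearise the cell each sweep to converge far faster, and the cell and names below are toy stand-ins, not any published method's code.

```python
import math

def cell(h_prev, x, w=0.5):
    # Toy non-linear recurrence step (stand-in for an LSTM cell).
    return math.tanh(w * h_prev + x)

def sequential_unroll(xs, h0=0.0):
    # Reference: the ordinary sequential forward pass.
    h, out = h0, []
    for x in xs:
        h = cell(h, x)
        out.append(h)
    return out

def parallel_style_solve(xs, h0=0.0, max_iters=100, tol=1e-12):
    # Treat all hidden states h_1..h_T as unknowns of the system
    # h_t = cell(h_{t-1}, x_t) and iterate towards its fixed point.
    # Each sweep updates every t from the *previous* iterate, so the
    # inner comprehension is parallel over the whole sequence. A
    # Newton method would solve a linearised system per sweep to cut
    # the iteration count; plain iteration is shown for clarity.
    T = len(xs)
    h = [0.0] * T
    for _ in range(max_iters):
        shifted = [h0] + h[:-1]
        new = [cell(p, x) for p, x in zip(shifted, xs)]
        if max(abs(n - o) for n, o in zip(new, h)) < tol:
            return new
        h = new
    return h
```

Plain iteration propagates information one step per sweep (so it needs up to T sweeps); the whole appeal of the Newton variants is collapsing that to a handful of sweeps.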
> None of the big labs seem to be bothered with hybrid approaches.
Does Alibaba not count? Qwen3.5 models are the top performers in terms of small models as far as my tests and online benchmarks go.
>SSMs move the non-linearity outside of the recurrence which enables parallelisation during training. It is trivial to do this architectural change with an LSTM (see the xLSTM paper). Linear RNNs are still RNNs.
Removing the non-linearity from the recurrence path is exactly what constitutes a "pretty big architectural divergence." A linear RNN is an RNN in a structural sense, certainly, but functionally it strips out the non-linear state transitions that made traditional LSTMs so expressive, entirely to enable associative scans. The inductive bias is fundamentally altered. Calling that simply "modern training techniques" is disingenuous at best.
>But you can still keep the non-linearity by training with parallel Newton methods, which work on vanilla LSTMs and scale to billions of parameters.
That does not scale anywhere near as well as Transformers in compute spend. It's a paper/research novelty. Nobody will be doing this for production.
>Does Alibaba not count? Qwen3.5 models are the top performers in terms of small models as far as my tests and online benchmarks go.
I guess there's some misunderstanding here because Qwen is 100% a transformer, not a hybrid RNN/LSTM whatever.
> That does not scale anywhere near as well as Transformers in compute spend. It's paper/research novelty. Nobody will be doing this for production.
What exactly makes you so confident?
The world is not just labs that can afford billion dollar datacentres and selling access to SOTA LLMs at $30/Mtokens. Transformers are highly unsuitable for many applications for a variety of reasons and non-linear RNNs trained via parallel methods are an extremely attractive value proposition and will likely feature in production in the next products I work on.
> I guess there's some misunderstanding here because Qwen is 100% a transformer, not a hybrid RNN/LSTM whatever.
See the Qwen3.5 Huggingface description: https://huggingface.co/Qwen/Qwen3.5-27B
> Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
Existing research? If you want something that scales as well as transformers you have to make the divergences I was talking about. If you don't then it scales a lot worse. The Newton methods don't match transformer efficiency at scale. That's just a fact.
>The world is not just labs that can afford billion dollar datacentres and selling access to SOTA LLMs at $30/Mtokens.
Billion-dollar labs want to save money too. If modern RNNs were a massive, unanimous win, they and everyone else would switch in a heartbeat, just like they did for transformers. The reason they don't is that these architectures at best merely match transformers, while introducing architectural issues of their own.