In the dropdown, which defaults to DeepSeek-R1, switch to the LIMO model (which apparently has a high frequency of language switching).
I'm not sure about examples of gibberish or totally illegible reasoning. My guess is that since R1-Zero still had the KL penalty, it should all stay somewhat legible - the KL penalty discourages the model from drifting too far from what the base model would say in any given context.
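Concretely, the shaping looks something like this - a minimal sketch, where beta=0.05 and the plain-list shapes are just placeholders, not anything from the R1 paper:

```python
def kl_shaped_rewards(task_rewards, logp_policy, logp_ref, beta=0.05):
    # Per-token reward with a KL penalty toward a frozen reference model.
    # logp_policy / logp_ref are per-token log-probs under the trained
    # policy and under the base model. (lp - lr) is the usual one-sample
    # KL estimate; subtracting it pushes the policy back toward text the
    # base model would plausibly produce.
    return [r - beta * (lp - lr)
            for r, lp, lr in zip(task_rewards, logp_policy, logp_ref)]
```

Larger beta means more conservative (and more legible) text; iirc GRPO actually folds the KL term straight into the loss rather than the per-token reward, but it's the same kind of anchoring to the base model either way.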
Seems like if you want to stay in the same language, you could just add a verifiable-reward term for that w/o taking on the full baggage of a base-model KL penalty.
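Something like the sketch below, maybe - the Latin-script heuristic and the 0.1 weight are made up for illustration; a real setup would presumably run a proper language-ID model over the chain of thought:

```python
import re

def language_consistency_reward(text, weight=0.1):
    # Crude proxy for "stayed in English": fraction of word characters
    # that are ASCII/Latin. A real pipeline would likely swap this for a
    # language-ID model scored over the whole chain of thought.
    chars = re.findall(r"\w", text)
    if not chars:
        return 0.0
    latin_frac = sum(c.isascii() for c in chars) / len(chars)
    return weight * latin_frac
```

On an all-English CoT this returns the full weight and it decays as the trace mixes scripts, though word-level language ID would also catch Latin-to-Latin switches (say, English to French) that this misses.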
Yep. And tbh you probably don't even have to do this; the R1 paper found that just running SFT on the base model with a relatively small number of monolingual reasoning traces was enough for it to get the idea, and iirc they didn't even bother selecting for language specifically in the RL training loop itself.
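If anyone wants to poke at the SFT side, the data prep is basically just a filter over candidate traces - a toy sketch reusing language_consistency_reward from above, with an invented 0.95 threshold and a made-up {"cot": ...} record shape:

```python
def filter_monolingual(traces, threshold=0.95):
    # Keep only traces whose chain of thought is (heuristically) in one
    # language, for use as cold-start SFT data.
    keep = []
    for trace in traces:
        # weight=1.0 so the score is just the raw language fraction.
        if language_consistency_reward(trace["cot"], weight=1.0) >= threshold:
            keep.append(trace)
    return keep
```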
https://gr.inc/question/although-a-few-years-ago-the-fundame...