Eh, I can see their point, I think. The models can restate the rules differently, I'm sure, but it sounds like the GP is saying that LLMs can't tell whether the rules are well-balanced.
It would be interesting to see some example problems along those lines. Design some games with complex rules, including one or two of the most subtle game-wrecking bugs you can think of, and ask the models if they can spot them.
In fact that sounds more interesting the more I think about it. Intensive RL on that sort of thing might generalize in... let's say useful ways.
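To make concrete what I mean by a game-wrecking bug: here's a completely made-up two-rule interaction of the sort a designer might miss but a playtester (or model) should catch. Rule A taxes an action; Rule B rewards the same action more than the tax, so the pair forms an unbounded engine:

```python
# Hypothetical mini-game rules (invented for illustration):
#   Rule A: discarding a card costs 1 coin.
#   Rule B: whenever you discard, gain 2 coins.
# Each rule looks fine alone; together they net +1 coin per discard,
# so a player can loop the action for arbitrary wealth.

def exploit_loop(coins: int, iterations: int) -> int:
    """Run the discard loop: each cycle nets +1 coin."""
    for _ in range(iterations):
        coins -= 1  # Rule A: pay 1 coin to discard
        coins += 2  # Rule B: discarding grants 2 coins
    return coins

print(exploit_loop(coins=3, iterations=100))  # 103: unbounded coin engine
```

The test would be whether a model, given only the full rulebook, flags that these two rules combine into a degenerate strategy, rather than just paraphrasing each rule in isolation.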
I would love to see examples, but I suspect we won't. I'd be happy to be proven wrong, but I expect an LLM will do worse than a fairly smart human (one without prior experience of the board game).