Eh, I can see their point, I think. The models can restate the rules differently, I'm sure, but it sounds like the GP is saying that LLMs can't tell whether the rules are well-balanced.
It would be interesting to see some example problems along those lines. Design some games with complex rules, including one or two of the most subtle game-wrecking bugs you can think of, and ask the models if they can spot them.
In fact that sounds more interesting the more I think about it. Intensive RL on that sort of thing might generalize in... let's say useful ways.
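To make concrete what I mean by a game-wrecking bug: here's a completely made-up two-rule interaction of the sort a designer might miss but a playtester (or model) should catch. Rule A taxes an action; Rule B rewards the same action more than the tax, so the pair forms an unbounded engine:

```python
# Hypothetical mini-game rules (invented for illustration):
#   Rule A: discarding a card costs 1 coin.
#   Rule B: whenever you discard, gain 2 coins.
# Each rule looks fine alone; together they net +1 coin per discard,
# so a player can loop the action for arbitrary wealth.

def exploit_loop(coins: int, iterations: int) -> int:
    """Run the discard loop: each cycle nets +1 coin."""
    for _ in range(iterations):
        coins -= 1  # Rule A: pay 1 coin to discard
        coins += 2  # Rule B: discarding grants 2 coins
    return coins

print(exploit_loop(coins=3, iterations=100))  # 103: unbounded coin engine
```

The test would be whether a model, given only the full rulebook, flags that these two rules combine into a degenerate strategy, rather than just paraphrasing each rule in isolation.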
I would love to see examples, but I suspect we won't. I'd be happy to be proven wrong, but I expect an LLM will do worse than a fairly smart human (one without prior experience of the board game).