The red and blue agents are effectively unlimited sources of true and false exam...

blueblaze0 · on July 18, 2024

This is impressive, but what prevents the blue agent from generating an incorrect proof of a "true example"? What prevents the red agent from generating a correct disproof of a "false example"? I'm curious how they managed to generate a truly unlimited source of correctly labeled examples.

HanClinto · on July 18, 2024

> but what prevents the blue agent from generating an incorrect proof of a "true example"?

That's the role of the Verifier. It's not going to be perfect, and I'm sure some incorrect proofs of true examples slip through, but it's good enough to increase the quality of the model overall.

> What prevents the red agent from generating a correct disproof of a "false example"?

And on the other side, it's counterbalanced by the rules engine (math) that can determine absolutely whether or not the right answer is given at the end.

The Red and the Blue agents are held in check by the tension between the math engine and the verifier, and they are free to fight back and forth within those parameters for as long as they are able. Eventually, I think the Red agent loses the ability to attack effectively, and that's the big limit on OpenAI's arrangement. This particular game isn't balanced enough for this training loop to continue indefinitely.

Natsu · on July 18, 2024

But how do we know the answer you gave us wasn't generated by the sneaky prover? :)

HanClinto · on July 18, 2024

At least in the context of this game, we essentially check the answer with a calculator (which the Verifier program doesn't have access to).
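
To make the "calculator" idea concrete, here is a minimal Python sketch of how a final answer could be labeled independently of how convincing the written solution looks. This is only an illustration under my own assumptions, not OpenAI's actual code: the names `calculator` and `label_example` are hypothetical, and the problems are assumed to be plain integer expressions using `+`, `-`, and `*`.

```python
import ast
import operator

# Hypothetical "math engine": only exact integer +, -, * are supported.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def calculator(expr: str) -> int:
    """Exactly evaluate a small integer expression without using eval()."""

    def ev(node: ast.AST) -> int:
        # Plain integer literal (bools are excluded on purpose).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        # Binary operation with a supported operator.
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")

    return ev(ast.parse(expr, mode="eval").body)


def label_example(expr: str, claimed_answer: int) -> str:
    """Label a solution by its final answer only.

    The reasoning text is never inspected here, so a persuasive but wrong
    solution is still labeled false, and a correct one is labeled true.
    """
    if calculator(expr) == claimed_answer:
        return "true example"
    return "false example"


if __name__ == "__main__":
    problem = "17 * 23 + 4"
    print(label_example(problem, 395))  # answer matches the calculator
    print(label_example(problem, 385))  # answer does not match
```

The point of the design is that the label comes from the exact check alone, while the Verifier has to judge the written solution without seeing that check.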