
The red and blue agents are effectively unlimited sources of true and false examples, so you can scale far more efficiently than you can by pretraining on labelled inputs. It's also targeted much more directly at correct vs. incorrect, rather than at a notion of answer quality, which doesn't directly get at hallucination vs. reality.


This is impressive, but what prevents the blue agent from generating an incorrect proof of a "true example"? What prevents the red agent from generating a correct disproof of a "false example"? I'm curious how they managed to generate a truly unlimited source of correctly labeled examples.


> but what prevents the blue agent from generating an incorrect proof of a "true example"?

That's the role of the Verifier. It's not going to be perfect, and I'm sure some incorrect proofs of true examples slip through, but it's good enough to increase the quality of the model overall.

> What prevents the red agent from generating a correct disproof of a "false example"?

And on the other side, it's counterbalanced by the rules engine (math) that can determine absolutely whether or not the right answer is given at the end.

The Red and the Blue agents are held in check by the tension between the math engine and the verifier, and they are free to fight back and forth within those parameters for as long as they are able. Eventually, I think the Red agent loses the ability to attack effectively, and that's the big limit on OpenAI's arrangement. This particular game isn't balanced enough for this training loop to continue indefinitely.
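To make the labeling point concrete, here is a rough sketch (not OpenAI's actual code) of how the math check could decide the label independently of which agent wrote the solution. Everything in it is a made-up illustration: the `Problem` and `Solution` types, the toy multiplication task, and the `label` and `build_verifier_examples` helpers are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    # Hypothetical toy task: multiply two integers.
    a: int
    b: int

    def ground_truth(self) -> int:
        # Stands in for the "rules engine (math)" / calculator check.
        return self.a * self.b


@dataclass
class Solution:
    role: str          # "blue" (tries to be right) or "red" (tries to mislead)
    reasoning: str     # the text the verifier would actually read
    final_answer: int


def label(problem: Problem, solution: Solution) -> bool:
    # The label depends only on the final answer, never on the agent's role.
    return solution.final_answer == problem.ground_truth()


def build_verifier_examples(problem, solutions):
    # Pairs of (text shown to the verifier, ground-truth label).
    return [(s.reasoning, label(problem, s)) for s in solutions]


problem = Problem(12, 13)
solutions = [
    Solution("blue", "12*13 = 120 + 36 = 156", 156),  # honest and correct
    Solution("blue", "12*13 = 120 + 26 = 146", 146),  # honest but wrong
    Solution("red", "12*13 = 130 + 36 = 166", 166),   # deliberately wrong
    Solution("red", "12*13 = 130 + 26 = 156", 156),   # red lands on the right answer
]
examples = build_verifier_examples(problem, solutions)
```

Under these assumptions, a blue slip is labeled false and a red solution that happens to be right is labeled true, so neither case poisons the verifier's training data. The check only looks at the final answer, though, so flawed reasoning that ends on the right number would still get through.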


But how do we know the answer you gave us wasn't generated by the sneaky prover? :)


At least in the context of this game, we essentially check the answer with a calculator (which the Verifier program doesn't have access to).



