During training they put a lot of guardrails around the format of the reasoning-token output. They don't just reward getting the correct answer; they also reward human-readable output. If they didn't, the reasoning tokens that most efficiently reach the final correct answer would most likely look like a lot of gibberish.
The most important thing is the relationship between the output tokens in the model's vector space, and that is something hidden we will never see.
The agent makes a copy of itself in /tmp/. Runs. Evaluates. Updates itself. Makes a copy of itself. Runs. Evaluates. Updates itself. Makes a ...... you get the idea.
They will not stop if the recursion's termination condition is hard to meet. And if the agent can cheat to satisfy the termination condition, it will.
The red team adversaries are so effective. If Claude is blind to a bug, it won't surface by using the same model in a red-team adversary role; it requires a different model, which gpt-5.5 is great for. Yesterday I tried for the first time using gpt-5.5 as an adversary against the tests themselves. Later I thought it would be interesting to create a trickster agent that breaks the code after copying the entire project into /tmp/ in order to control every aspect of it. Claude insists this is called mutation testing. It would create regressions and then run all the tests. Eventually it was able, unsupervised, to create an effective test harness.
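A toy version of that loop, stripped to its essence: mutate the code, then check whether the test suite catches the regression. Everything here (the inline "project", the `mutate` and `run_tests` helpers) is a hypothetical sketch, not the actual trickster agent.

```python
def mutate(source: str) -> str:
    """Introduce a single regression by swapping one operator."""
    swaps = [("+", "-"), ("==", "!="), ("<", ">=")]
    for old, new in swaps:
        if old in source:
            return source.replace(old, new, 1)
    return source

def run_tests(source: str) -> bool:
    """Exec the module and run its embedded check; failure kills the mutant."""
    namespace = {}
    try:
        exec(source, namespace)
        return namespace["test_add"]()
    except Exception:
        return False

# A tiny stand-in "project": one function plus one test.
project = """
def add(a, b):
    return a + b

def test_add():
    return add(2, 3) == 5
"""

mutant = mutate(project)  # "a + b" becomes "a - b"
print("original passes:", run_tests(project))
print("mutant killed:  ", not run_tests(mutant))
```

A mutant that survives (tests still pass on broken code) is exactly the kind of bug-blindness a same-model red team would miss.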
I don't think they understand how LLMs work. To truly benchmark a non-deterministic, probabilistic model, they would need to run each one about 45 times. LLMs are distributions and behave accordingly.
The better story is how do the models behave on the same problem after 5 samples, 15 samples, and 45 samples.
That said, using lambda calculus is a brilliant subject for benchmarking.
Why 45 times in particular? If you want 80% power to distinguish a model at 50% from a model at 51%, you need 39,440 samples per model, or 329 samples per question per model. But that would just give you a more precise estimate of how well the model does on those 120 questions in particular. If you want a more precise estimate of how well the model might do on future questions you come up with, you'll need to test more questions, not just test the same question more times.
I made flame charts of sonnet thinking. [0] You can see there is a lot of variance over 5 runs. They all passed but there was one that struggled with errors. How many trials are needed to clamp to ceiling or floor? ~30?
How many samples you need depends on the difference you want to be able to measure (0% to 1% is different from 50% to 51% is different from 0% to 10% is different from 50% to 60%), the significance level at which you will declare a difference (conventionally, p < 0.05) and how likely you want this to happen when there is indeed such a difference (statistical power, conventionally 80%). Of course you can also just sample an arbitrary number of times and compute confidence intervals after the fact, but doing a statistical power computation helps clarify what it is you want to know, how certain you want to be, and whether you can realistically achieve such knowledge with the budget you have.
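That power computation can be done in a few lines with the standard two-proportion z-test formula (normal approximation, two-sided alpha = 0.05, power = 80%, critical values hard-coded to avoid a scipy dependency). This is a sketch of the textbook formula, which lands close to the 39,440 figure above:

```python
import math

def n_per_group(p1: float, p2: float) -> int:
    """Per-group sample size for a two-proportion z-test,
    alpha = 0.05 two-sided, power = 80%."""
    z_alpha = 1.959964  # Phi^-1(1 - 0.05/2)
    z_beta = 0.841621   # Phi^-1(0.80)
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

print(n_per_group(0.50, 0.51))  # ~39k samples to tell 50% from 51%
print(n_per_group(0.50, 0.60))  # two orders of magnitude fewer for 50% vs 60%
```

The second call shows the inverse-quadratic relationship: a 10x larger difference needs roughly 100x fewer samples.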
To solve the lambda calculus problem, Sonnet burns between 8,163 and 17,334 tokens across 5 runs.
If I want to engineer a prompt, starting from the tokens that are clearly doing the work in the 8,163-token run should yield a better agent.
If I build an agent that does something arbitrary, like reverse engineering any website or multiplying 2 large numbers without a tool that lets it use code, the mechanics of the reasoning work the same as for an agent solving lambda calculus. Running 39,440 trials is prohibitively expensive. Nonetheless, without perfect proof, I suspect that running an agent several times and then taking any generalizable output from the fastest runs yields a much faster generalized agent for that specific task given different parameters.
That is something I really want to know. If I have an agent that reverse engineers websites, can I take the thinking output from the best run and use it to seed a better agent? I don't know how to set up the experiment. Asking ChatGPT has been futile, and running it is very expensive. How do I set up that experiment?
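One way to frame it is a plain two-condition experiment: baseline prompt vs. baseline-plus-excerpts-from-the-best-run's-thinking, n fresh runs each, same task. The sketch below stubs out `run_agent` with made-up success rates just so the harness shape is runnable; in a real experiment that function would invoke the model.

```python
import random

def run_agent(condition: str) -> bool:
    """Placeholder: run the agent once on the task, return pass/fail.
    Stubbed with hypothetical success rates so the harness is runnable."""
    rates = {"baseline": 0.55, "seeded": 0.70}
    return random.random() < rates[condition]

def experiment(n: int, seed: int = 0) -> dict:
    """n independent fresh-context runs per condition, same task each time."""
    random.seed(seed)
    return {cond: sum(run_agent(cond) for _ in range(n)) / n
            for cond in ("baseline", "seeded")}

# "seeded" = baseline prompt plus excerpts from the best prior run's thinking.
print(experiment(n=45))
```

Then compare the two observed rates with a two-proportion test; the power math earlier in the thread tells you whether 45 runs per condition can detect the gap you expect.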
You could try a sequential testing setup, which can let you stop the experiment earlier if the difference is larger than expected. But if the difference is small, there's no way around the fact that reliably detecting small differences requires large sample sizes, and the relationship is inverse quadratic (halving the smallest detectable difference quadruples the sample size you need).
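A minimal sketch of that idea is Wald's sequential probability ratio test on a stream of pass/fail results: keep a running log-likelihood ratio and stop as soon as the evidence crosses either threshold. Everything below is a textbook SPRT, not tuned for any particular benchmark.

```python
import math

def sprt(samples, p0, p1, alpha=0.05, beta=0.20):
    """Wald SPRT: decide between success rates p0 and p1 from a stream of
    pass/fail results, stopping early when the evidence is strong enough."""
    upper = math.log((1 - beta) / alpha)  # accept p1 at or above this
    lower = math.log(beta / (1 - alpha))  # accept p0 at or below this
    llr = 0.0
    for i, passed in enumerate(samples, 1):
        llr += math.log(p1 / p0) if passed else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return ("p1", i)
        if llr <= lower:
            return ("p0", i)
    return ("undecided", len(samples))

# A large gap (50% vs 70%) resolves in a handful of runs:
print(sprt([True] * 20, 0.5, 0.7))   # stops after 9 straight passes
print(sprt([False] * 20, 0.5, 0.7))  # stops after 4 straight failures
```

Shrink the gap to 50% vs 51% and the same test needs hundreds of samples before it can stop, which is the inverse-quadratic cost in action.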
Well, to be fair, people cheat by remembering what they did last time. I think the idea here is to run the models from a "clean slate" and see how often they succeed/fail.
They are, like people, non-deterministic, so giving them several "fair" trials makes sense to me.
> there is no quantitative measure of performance here
Have them do multiplication or other complicated arithmetic. You say that isn't difficult. Then why do they burn 200k tokens in 20 minutes without converging? I did a deep exploration to help myself understand here [0].
LLMs, and the agents that use them, are probabilistic, not deterministic. They accomplish something a percentage of the time, never every time.
That means the longer an agent runs on a task, the more likely it is to fail the task. Running agents like this will always fail eventually and burn a ton of token cash in the process.
One thing that LLM agents are good at is writing their own instructions. The trick is to limit the time and thinking steps in a thinking model then evaluate, update, and run again. A good metaphor is that agents trip. Don't let them run long enough to trip. It is better to let them run twice for 5 minutes than once for 10 minutes.
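The arithmetic behind "twice for 5 minutes beats once for 10" is a back-of-the-envelope model that assumes each 5-minute segment succeeds independently with probability p. One 10-minute run is two segments in sequence with no chance to recover; two separate 5-minute runs are two independent chances:

```python
# Assumed: each 5-minute segment succeeds independently with probability p.
p = 0.7
one_long_run = p ** 2              # both segments must survive
two_short_runs = 1 - (1 - p) ** 2  # at least one attempt succeeds

print(f"one 10-minute run : {one_long_run:.0%}")   # 49%
print(f"two 5-minute runs : {two_short_runs:.0%}") # 91%
```

The independence assumption is doing a lot of work here, but it captures why long runs "trip": failure compounds, while restarts reset it.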
Give it a few weeks and self-referencing agents are going to be at the top of everybody's twitter feed.
> "The LLM model's attention doesn't distinguish between "instructions I'm writing" and "instructions I'm following" -- they're both just tokens in context."
That means all these SOTA models are very capable of updating their own prompts. Update prompt. Copy entire repository in 1ms into /tmp/*. Run again. Evaluate. Update prompt. Copy entire repository ....
That is recursion. Like Karpathy's autoresearch, it requires a deterministic termination condition.
Or have the prompt / agent make 5 copies of itself and solve for 5 different situations to ensure the update didn't introduce any regressions.
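The copy/run/evaluate/update loop with a deterministic termination condition can be sketched like this. `evaluate` and `update_prompt` are hypothetical stand-ins (stubbed so the loop is runnable); the real point is the two exit paths: a score threshold for success, plus a hard iteration cap so a hard-to-meet target cannot spin forever.

```python
import shutil
import tempfile
from pathlib import Path

def evaluate(workdir: str, iteration: int) -> float:
    """Hypothetical stand-in: run the copied agent and score it."""
    return 0.3 + 0.2 * iteration  # stub: score improves each round

def update_prompt(workdir: str, score: float) -> None:
    """Hypothetical stand-in: the agent rewrites its own instructions."""
    Path(workdir, "PROMPT.md").write_text(f"last score: {score}\n")

def recursive_agent(repo: str, max_iters: int = 10, target: float = 0.9) -> float:
    score = 0.0
    for i in range(max_iters):                       # hard cap: always terminates
        workdir = tempfile.mkdtemp(prefix="agent-")  # fresh copy under /tmp
        shutil.copytree(repo, workdir, dirs_exist_ok=True)
        score = evaluate(workdir, i)
        if score >= target:                          # soft goal: stop on success
            break
        update_prompt(workdir, score)                # update, then go again
    return score

repo = tempfile.mkdtemp(prefix="repo-")
Path(repo, "agent.py").write_text("print('hello')\n")
print(recursive_agent(repo))
```

Without the `max_iters` cap, a cheatable or unreachable `target` is exactly the failure mode described above.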
> reach local maxima unless external feedback is given
The agents can update themselves with human permission, so the external feedback is another agent plus the selection bias of a human. That is close to the right idea. I, however, am having huge success with the external feedback being the agent itself. The big difference is that a recursive agent can evaluate performance within a confidence interval rather than chaos.
You understand the irony here? Is the issue that other people are in flame wars or that you open with negative sentiment comments and people respond in kind?
I appreciate your comment because it is an opportunity for me to test that the extension correctly surfaces your comment with a little notification. Thank you. :)
> Should I update the instructions so something is more clear?
I believe they are referring to the "threads" link in the header of HN pages for an authenticated request. Can't say what the 20% is, of course. As for the 80%, it's a pull instead of a push but that page shows logged-in users their recent comments and the reply threads for them.
Sorry for any ambiguity, I meant the HackerNews header link, and for me the 20 is that I'm not sure if that link shows replies to submissions I make. Your project looks awesome, and like a great solution for people looking for push notifications.
If one other person uses it, it is good enough for me.
If you use it and want more features, post in the issue queue or respond to one of my comments -- I'll get it.
The engine that makes the requests and does the logic is agnostic and could probably be copy-pasted into your project. The one thing I do have is all the tests and red-team adversary agents, which do very well at surfacing bugs.