
During training they put a lot of guardrails around the format of the reasoning token output. They don't just use a reward for getting the correct answer during training; they also reward human-readable output. That said, if they didn't, the reasoning tokens that are most efficient at getting to the final correct answer would most likely look like a lot of gibberish.

What matters most is the relationship between the output tokens in the model's vector space, and that is something hidden we will never see.


You can configure this to start tracking replies from the past 90 days or less if you enable the 'catch up on replies from past' setting.

Have you tried recursive self-reflective agents?

The agent makes a copy of itself in /tmp/. Runs. Evaluates. Updates itself. Makes a copy of itself. Runs. Evaluates. Updates itself. Makes a ...... you get the idea.

They will not stop if the recursion is given a hard-to-meet termination condition. Also, if they can cheat to satisfy the termination condition, they will.
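
A minimal sketch of that loop, with all helper names (evaluate, update, the agent.py entry point) hypothetical; the hard iteration cap is there so a hard-to-meet termination condition can't spin forever:

    import shutil, subprocess, tempfile

    MAX_ITERATIONS = 20                        # hard stop, independent of the success check

    def recurse(agent_dir, evaluate, update):
        for _ in range(MAX_ITERATIONS):
            workdir = tempfile.mkdtemp(prefix="agent-gen-", dir="/tmp")
            shutil.copytree(agent_dir, workdir, dirs_exist_ok=True)
            result = subprocess.run(["python", "agent.py"], cwd=workdir,
                                    capture_output=True, text=True)
            score = evaluate(result)           # must be verifiable, not self-reported
            if score >= 1.0:                   # termination condition
                return workdir
            agent_dir = update(workdir, score) # the agent rewrites its own prompt/code
        return agent_dir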


I have not run one personally, but I love the idea. Reminds me of yoyo-evolve. My friend made this repo: https://github.com/dwolner/cosmic-insight

The red team adversaries are so effective. If Claude is blind to a bug, that bug won't surface by running the same model in a red team adversary role; it requires a different model, which gpt-5.5 is great for. Yesterday I tried for the first time using gpt-5.5 as an adversary against the tests themselves. Later I thought it would be interesting to create a trickster agent that breaks the code after copying the entire project into /tmp/ in order to control every aspect of it. Claude insists this is called mutation testing. It would create regressions and then run all the tests. In the end it was able, unsupervised, to create an effective test harness.
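
A rough sketch of one such mutation round, assuming a Python driver; the mutate callback and the npm test command are illustrative assumptions, not what the agent actually ran:

    import shutil, subprocess, tempfile

    def mutation_round(project_dir, mutate, test_cmd=("npm", "test")):
        # copy the whole project into /tmp so the original is never touched
        sandbox = tempfile.mkdtemp(prefix="mutant-", dir="/tmp")
        shutil.copytree(project_dir, sandbox, dirs_exist_ok=True)
        mutate(sandbox)                        # e.g. flip a comparison operator somewhere
        result = subprocess.run(list(test_cmd), cwd=sandbox)
        return result.returncode != 0          # True = the test suite caught the break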

lambench is single-attempt, one shot per problem.

I don't think they understand how LLMs work. To truly benchmark a non-deterministic, probabilistic model, they are going to need to run each problem about 45 times. LLMs are distributions and behave accordingly.

The better story is how the models behave on the same problem after 5 samples, 15 samples, and 45 samples.
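
Reporting that per problem takes very little; a minimal sketch, assuming statsmodels is available and run_model is a hypothetical pass/fail call:

    from statsmodels.stats.proportion import proportion_confint

    def pass_rate(run_model, problem, n_samples=45):
        # run_model(problem) is a hypothetical call returning True on a pass
        passes = sum(run_model(problem) for _ in range(n_samples))
        low, high = proportion_confint(passes, n_samples, alpha=0.05, method="wilson")
        return passes / n_samples, (low, high)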

That said, lambda calculus is a brilliant subject for benchmarking.

The models are reliably incorrect. [0]

[0] https://adamsohn.com/reliably-incorrect/


Why 45 times in particular? If you want 80% power to distinguish a model at 50% from a model at 51%, you need 39,440 samples per model, or 329 samples per question per model. But that would just give you a more precise estimate of how well the model does on those 120 questions in particular. If you want a more precise estimate of how well the model might do on future questions you come up with, you'll need to test more questions, not just test the same question more times.
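
For reference, a back-of-the-envelope version of that calculation using the standard two-proportion approximation (scipy assumed available); the exact total varies a little by formula but lands near those figures:

    from scipy.stats import norm

    p1, p2 = 0.50, 0.51                # the two pass rates to tell apart
    alpha, power = 0.05, 0.80
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2

    n = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
         + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2 / (p1 - p2) ** 2
    print(round(n))        # ~39,000 samples per model with this approximation
    print(round(n / 120))  # ~330 samples per question on a 120-question benchmark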

I made flame charts of Sonnet thinking. [0] You can see there is a lot of variance over 5 runs. They all passed, but one struggled with errors. How many trials are needed to clamp to the ceiling or floor? ~30?

[0] https://adamsohn.com/lambda-variance/


How many samples you need depends on the difference you want to be able to measure (0% to 1% is different from 50% to 51% is different from 0% to 10% is different from 50% to 60%), the significance level at which you will declare a difference (conventionally, p < 0.05) and how likely you want this to happen when there is indeed such a difference (statistical power, conventionally 80%). Of course you can also just sample an arbitrary number of times and compute confidence intervals after the fact, but doing a statistical power computation helps clarify what it is you want to know, how certain you want to be, and whether you can realistically achieve such knowledge with the budget you have.

To solve the lambda calculus problem, Sonnet burns 8,163 to 17,334 tokens across 5 runs.

If I want to engineer a prompt, starting from the tokens in the 8,163-token run, which are clearly better, will yield a better agent.

If I build an agent that does something arbitrary, like reverse engineering any website or multiplying two large numbers without a tool that lets it use code, the mechanics of the reasoning work the same as for an agent solving lambda calculus. Running 39,440 trials is prohibitively expensive. Nonetheless, without perfect proof, I want to say that running an agent several times and then taking any generalized output from the fastest runs yields a much faster generalized agent that solves that specific task given different parameters.

That is something I really want to know. If I have an agent that reverse engineers websites, can I take the thinking output from the best runs and use it to seed a better agent? I don't know how to set up the experiment. Asking ChatGPT has been futile, and actually running the experiment is very expensive. How do I set it up?


You could try a sequential testing setup, which can let you stop the experiment earlier if the difference is larger than expected. But if the difference is small, there's no way around the fact that reliably detecting small differences requires large sample sizes, and the relationship is inverse quadratic (halving the smallest detectable difference quadruples the sample size you need).
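
As a sketch, Wald's SPRT on a stream of pass/fail trials looks roughly like this; run_trial is a hypothetical pass/fail call and the p0/p1 gap is illustrative (the smaller the gap, the longer it runs before stopping):

    from math import log

    def sprt(run_trial, p0=0.50, p1=0.60, alpha=0.05, beta=0.20, max_trials=5000):
        upper = log((1 - beta) / alpha)      # cross this: conclude the pass rate is p1
        lower = log(beta / (1 - alpha))      # cross this: conclude the pass rate is p0
        llr = 0.0
        for n in range(1, max_trials + 1):
            success = run_trial()
            llr += log(p1 / p0) if success else log((1 - p1) / (1 - p0))
            if llr >= upper:
                return "p1", n
            if llr <= lower:
                return "p0", n
        return "inconclusive", max_trials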

Even people benefit from multiple tries over time.

Well, to be fair, people cheat by remembering what they did last time. I think the idea here is to run the models from a "clean slate" and see how often they succeed/fail.

They are, like people, non-deterministic, so giving them several "fair" trials makes sense to me.


> there is no quantitative measure of performance here

Have them do multiplication or other complicated arithmetic. You say that isn't difficult. Then why do they burn 200k tokens in 20 minutes without converging? I did a deep exploration to help myself understand here [0].

[0] https://adamsohn.com/reliably-incorrect/


LLMs and the agents that use them are probabilistic, not deterministic. They accomplish something a percentage of the time, never every time.

That means the longer an agent runs on a task, the more likely it is to fail the task. Running agents like this will always fail eventually and burn a ton of tokens and cash in the process.

One thing that LLM agents are good at is writing their own instructions. The trick is to limit the time and thinking steps in a thinking model, then evaluate, update, and run again. A good metaphor is that agents trip. Don't let them run long enough to trip. It is better to let them run twice for 5 minutes than once for 10 minutes.
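
Something like this sketch, with run_agent, evaluate, and revise_prompt as hypothetical helpers:

    import time

    def bounded_attempts(prompt, run_agent, evaluate, revise_prompt,
                         attempts=2, budget_seconds=300):
        for _ in range(attempts):
            deadline = time.monotonic() + budget_seconds
            transcript = run_agent(prompt, deadline=deadline)  # the agent must respect the deadline
            ok, feedback = evaluate(transcript)
            if ok:
                return transcript
            prompt = revise_prompt(prompt, feedback)           # the agent writes its own next instructions
        return None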

Give it a few weeks and self-referencing agents are going to be at the top of everybody's twitter feed.


It's also that agents and ML reach local maxima unless external feedback is given. So your wiki will reach a state and get stuck there.

Here is an interesting thing.

> "The LLM model's attention doesn't distinguish between "instructions I'm writing" and "instructions I'm following" -- they're both just tokens in context."

That means all these SOTA models are very capable of updating their own prompts. Update prompt. Copy entire repository in 1ms into /tmp/*. Run again. Evaluate. Update prompt. Copy entire repository ....

That is recursion; like Karpathy's autoresearch, it requires a deterministic termination condition.

Or have the prompt / agent make 5 copies of itself and solve for 5 different situations to ensure the update didn't introduce any regressions.
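
Roughly, as a hypothetical regression gate (run_copy and score are assumed helpers, and the scenarios are whatever fixed cases the agent already solves):

    def accept_update(new_prompt, scenarios, baseline_scores, run_copy, score):
        # only keep the self-update if no previously solved scenario regresses
        for scenario, baseline in zip(scenarios, baseline_scores):
            if score(run_copy(new_prompt, scenario)) < baseline:
                return False
        return True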

> reach local maxima unless external feedback is given

The agents can update themselves with human permission, so the external feedback is another agent plus the selection bias of a human. That is close to the right idea. I, however, am having huge success with the external feedback being the agent itself. The big difference is that a recursive agent can evaluate performance within a confidence interval rather than chaos.


> Negative value.

Do you understand the irony here? Is the issue that other people are in flame wars, or that you open with negative-sentiment comments and people respond in kind?

I appreciate your comment because it is an opportunity for me to test that the extension works correctly, surfacing your comment with a little notification. Thank you. :)


I don't understand what you mean.

Should I update the instructions so something is more clear?

Also what is the other 20%?


> Should I update the instructions so something is more clear?

I believe they are referring to the "threads" link in the header of HN pages for an authenticated request. Can't say what the 20% is, of course. As for the 80%, it's a pull instead of a push but that page shows logged-in users their recent comments and the reply threads for them.


Sorry for any ambiguity; I meant the HackerNews header link, and for me the 20 is that I'm not sure whether that link shows replies to submissions I make. Your project looks awesome, and like a great solution for people looking for push notifications.

Yes, the badge on the icon was counting the replies to my comments and posts, as that was important to me. What is the other 20%?

I don't mean the 80/20 of your project, I mean of my use case. It looks like HNswered would be able to handle pretty close to 100% of my use case.

But the nature of the 80/20 analogy is that oftentimes the 80 is good enough for most.


Awesome!

If one other person uses it, it is good enough for me.

If you use it and want more features, post in the issue queue or respond to one of my comments -- I'll get it.

The engine that makes the requests and does the logic is agnostic and could probably be copied and pasted into your project. The one thing I do have is all the tests and red team adversary agents, which do very well at surfacing bugs.

