At this point 'frontier model release' is a monthly cadence (Kimi 2.6, Claude 4.6, GPT 5.5); the interesting question is which evals will still be meaningful in six months.
The n=19 sample and self-selection bias are load-bearing problems that the paper undersells. The message volumes (20k+ per user on average) already suggest the participants were in deep trouble, and they were recruited precisely because they reported harms. Interesting read nonetheless.
With the coding slot machine, I prefer to move fast and start over if anything goes off track. The number of tokens spent across several iterations may well be similar to using a more carefully planned system like GSD.