yeah but do we really need some trash reality-TV for a "shared social experience"? most of TV's programming was garbage anyway and contributed to a lot of what was/is wrong with society
this is literally just “leave a child at the work computer with a real doc open playing office”. otoh it is good to design benchmarks to ground these things.
on the flip side if you’re literally just using a bare bones harness on top of a stochastic parrot, of course stochastic errors accumulate.
theres a lot of ways to improve text faithfulness through harness tool design, and my incremental experiments seem promising.
but unless work is gated on shit like “the script used must type-check as ghc haskell or lean4”, unsupervised stuff is gonna decay
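to make that concrete, here's a minimal sketch of that kind of gate: accept a model-generated script only if an external type checker exits cleanly. the `gate_on_typecheck` name is hypothetical (not from any existing harness), and the demo uses python's own byte-compiler as a stand-in checker since ghc or lean4 may not be installed; in practice the checker would be something like `["ghc", "-fno-code"]`.

```python
import os
import subprocess
import sys
import tempfile

def gate_on_typecheck(path: str, checker: list[str]) -> bool:
    """Run an external checker (e.g. ["ghc", "-fno-code"] or a lean4
    build command) on the generated file; accept only on exit code 0.
    `gate_on_typecheck` is a hypothetical hook name for illustration."""
    result = subprocess.run(checker + [path], capture_output=True)
    return result.returncode == 0

# Demo: python's byte-compiler stands in for a real type checker.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("x = 1\n")
    path = f.name
print(gate_on_typecheck(path, [sys.executable, "-m", "py_compile"]))  # True
os.unlink(path)
```

the point of the gate is that unsupervised generation only lands when a checker the model can't talk its way past says yes.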
i mean of course. ive been working on this the past few months and ive got a bunch of tech towards this in flight, including some harness forks to layer my ideas in. eg my oh punkin pi test bed on my github.com/cartazio page, theres some shockingly obvious once-you-see-it tricks that i think i can stack into a really nice harness product for just doing hard real work with these models more easily
the funny thing is, once the llms got mostly good enough for me in november 2025, it was mind boggling how much they helped me get stuff out of my head with ease.
its easier for me to code now, because its like i have a 24/7 insane intern that needs to be supervised via pair programming but also understands most topics enough to be useful/dangerous.
ironically ive been spending much of my time iterating on ways to improve model reasoning and reliability and aside from the challenge of benchmark design, ive had some pretty good success!!
my fork of omp: https://github.com/cartazio/oh-punkin-pi has a bunch of my ideas layered on top. ultimately its just a bridge till i’ve finished the build of the proper 2nd gen harness with some other really cool stuff folded in. not sure if theres a bizop in a hosted version of what ive got planned, but the changes ive done in my forks have made enough difference that i can see the difference in per-model reasoning
im def working on benchmarks for how my own general harness improves task performance vs the same model in a commodity setup. its hard to do!
i will say that my current harness: https://github.com/cartazio/oh-punkin-pi is a testbed for a bunch of 2nd gen harness tech, largely optimized for reasoning llms only. the next one after this harness is gonna be epicccc
That (and oh-my-pi) seem like an excessive swing in the other direction. Im all for the simplicity and minimalism of pi. There are just a few fundamental things that need updating (mainly subagent context and the open-by-default security model).
yup thats mine. :)
i actually had some stuff layered into mono pi, and i frankly hit my limit in terms of architecture issues in monopi. omp aka oh my pi is frankly better architected. if you pared back the feature set to be minimal, you would full stop have a better designed minimal harness.
i do have a proper next gen no-slop harness in the works.
amusingly, dogfooding existing tools with my improvements layered in has repeatedly validated my design choices, and if anything has reduced my tolerance for the errors that seem to happen in vanilla or first party harnesses
more than that, its pretty clear that there is an insane underinvestment in the harness layer. ive been iterating on my own ideas in that area through the lens of increasing reliability. and holy crap is there so much low hanging fruit. i literally can’t figure out a sustainable way to do the work without commercializing at that layer