simonw · 2026-06-10T18:20:38 1781115638

NVIDIA are hosting a free endpoint for this one, details at https://build.nvidia.com/google/diffusiongemma-26b-a4b-it - you have to create an account and (I think) verify a phone number too.

(I got it to draw a pelican: https://tools.simonwillison.net/markdown-svg-renderer#url=ht... )

alfirous · 2026-06-10T18:26:37 1781115997

I registered a few weeks ago and the account is still not verified, despite following the procedure. You can't use the API if the account isn't verified.

simonw · 2026-06-10T14:58:30 1781103510

Blaming the head of infrastructure for distillation doesn't make sense to me.

simonw · 2026-06-10T14:38:24 1781102304

Suggestion: have more than just helm and Docker in your quickstart documentation. I'd like to try this out just to see what it can do, but not quite enough to fire up one of those systems for it.

Is there a binary I can run directly?

e12e · 2026-06-10T15:00:39 1781103639

In addition - the docker compose example doesn't set up any data volumes for the postgres instances - that might be considered a bug?

Then again, sharding on a single host probably isn't very useful anyway - but it might work with docker in swarm mode?

levkk · 2026-06-10T15:33:46 1781105626

The docker compose example is just a demo. I don't know anyone who runs Postgres with docker compose / swarm in prod :) But yes, happy to add volumes so it seems more real.

levkk · 2026-06-10T14:49:37 1781102977

We should add it to brew/apt/etc for sure. Also, we could add it to crates.io so you could do something like `cargo install pgdog`. Distribution, distribution, distribution.

simonw · 2026-06-10T15:03:31 1781103811

I also appreciate GitHub releases with pre-compiled binaries for different platforms. The more options the better!

simonw · 2026-06-10T14:03:09 1781100189

The US economy has survived 40+ years of buggy, no-automated-tests, no-version-control Excel spreadsheets. I think it will survive this too.

lionkor · 2026-06-10T16:50:37 1781110237

The difference is that bad untested excel spreadsheets didn't get trillion dollar valuation.

simonw · 2026-06-10T13:36:15 1781098575

The full context of that quote makes it clear that it's meant more as a wry joke:

> A venerable web application pattern that has had a small modern renaissance thanks to Remix, form submissions and redirects took a while to explain to my colleagues, on account of everyone being used to heavily client-side web applications.

(Although it's not really a joke, it's pretty amazing how many professional web developers these days don't know how to use forms without JavaScript.)
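For readers who have not met the pattern the quote describes, here is a minimal sketch of a form submission followed by a redirect (often called Post/Redirect/Get), with no JavaScript involved. It is a toy WSGI app using only the Python standard library; the `/messages` path, the `MESSAGES` list and the `app` name are made up for illustration and have nothing to do with Remix.

```python
import html
from urllib.parse import parse_qs

# In-memory store standing in for a database (hypothetical, illustration only).
MESSAGES = []

FORM_PAGE = """<!doctype html>
<form method="post" action="/messages">
  <input name="text" required>
  <button>Send</button>
</form>
<ul>{items}</ul>"""


def app(environ, start_response):
    """Tiny WSGI app: GET renders the form, POST saves and redirects."""
    method = environ["REQUEST_METHOD"]
    if method == "POST" and environ["PATH_INFO"] == "/messages":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(size).decode("utf-8")
        # Browsers send plain forms as application/x-www-form-urlencoded.
        text = parse_qs(body).get("text", [""])[0]
        MESSAGES.append(text)
        # 303 See Other: the browser follows up with a GET, so refreshing
        # the resulting page does not resubmit the form.
        start_response("303 See Other", [("Location", "/")])
        return [b""]
    items = "".join(f"<li>{html.escape(m)}</li>" for m in MESSAGES)
    page = FORM_PAGE.format(items=items).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [page]
```

To try it in a browser, serve `app` with `wsgiref.simple_server.make_server("127.0.0.1", 8000, app).serve_forever()`. Refreshing after a submit does not post the form again, because the browser's last request was the GET that followed the 303.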

someonebaggy · 2026-06-10T13:46:29 1781099189

The opposite is why I'd never be a good web developer. I grew up messing around with PHP and if I spent the time to learn the modern stack, I'd constantly be thinking it's stupid.

dormento · 2026-06-10T14:33:59 1781102039

I can relate to that.

I recently had to intervene during the latest office holy war to explain that you don't need JS for file uploads.

It was eye opening.

simonw · 2026-06-09T20:50:29 1781038229

I've spent enough time with this now in Claude Code (and Claude.ai and Claude Code for web) to have an opinion on Fable 5: it's a beast. I'm throwing some VERY difficult problems at it - things I've been dragging my heels on for months - and it's crunching through them very happily.

One that I'm willing to share (albeit from just a week ago) - I built a Python library last week that bundles MicroPython compiled to WASM to create a sandboxed code execution library: https://github.com/simonw/micropython-wasm

I just told Claude.ai (not even Claude Code - this was the standard Claude chat interface) running Fable 5:

  Clone simonw/micropython-wasm from GitHub
  and research how this could use a full
  Python as opposed to MicroPython

A few prompts later (and I uploaded the zip files from https://github.com/brettcannon/cpython-wasi-build/releases/t... because Claude chat can't access those files itself) and I have a wheel file that bundles Python itself, compiled to WASM:

  uv run --with https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl \
    cpython-wasm -c 'print(45 ** 56)'

Here's the transcript: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35

(It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.)

teiferer · 2026-06-10T06:06:41 1781071601

> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.

And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.

zylepe · 2026-06-10T11:39:10 1781091550

Vibes are all that matter. As soon as you start measuring it, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.

aspenmartin · 2026-06-10T13:04:04 1781096644

If benchmarks didn’t exist we would have to invent them because “vibes” is a ridiculous idea: oh I know I’ll be super unscientific and horrendously biased and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm all point to the same thing.

You can’t benchmaxx an eval that comes after your model release.

Consider also that benchmaxxing makes no sense from an incentive standpoint: the quality of these models is directly correlated with how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.

Remember the famous case of asserted benchmaxxing from llama 4? The entire org was gutted and the ceo spent billions hiring better people. Every lab takes evaluations extremely seriously.

ElevenLathe · 2026-06-10T16:33:25 1781109205

> You can’t benchmaxx an eval that comes after your model release

Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

aspenmartin · 2026-06-10T17:37:17 1781113037

> Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which zero labs are doing because it's idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascension of performance on, say, Humanity's Last Exam as one benchmark example? You could trivially get those numbers to close to 100% if you wanted to.

andai · 2026-06-10T18:32:02 1781116322

Yeah, nobody's ever silently changed a model while it was deployed. That would be illegal!

aspenmartin · 2026-06-10T19:12:57 1781118777

Why does this have anything to do with what I’m saying, of course the models are updated. I’m saying a new benchmark isn’t public and the model wouldn’t know they are being evaluated on a new benchmark.

Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.

bcrosby95 · 2026-06-10T15:20:55 1781104855

Vibes is just UX. There's whole careers, teams, and even industries dedicated to it, and yeah it isn't easy because you need aggregate data from people.

aspenmartin · 2026-06-10T15:42:53 1781106173

Um kind of but not really, it’s a mix of UX and actual measurements of what tasks it can do. Also UX is virtually the same thing: scaled quantitative surveys and preference metrics. It’s again, just benchmarking, and it’s done carefully and with best practices.

joquarky · 2026-06-10T16:32:27 1781109147

Imagine unironically starting your comment with "Um" in 2026.

aspenmartin · 2026-06-10T16:35:26 1781109326

You don't have to imagine!

andai · 2026-06-10T18:31:14 1781116274

I've been testing some models that score higher than Opus 4.6.

They:

- hallucinate constantly

- can't follow basic instructions

- think they're Claude for some reason ;)

p-e-w · 2026-06-10T12:54:36 1781096076

Benchmaxxing isn’t the only problem. Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.

bluGill · 2026-06-10T13:06:53 1781096813

> students are evaluated by teachers with more knowledge and experience than them

This starts to break down in college when the professors often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.

JadeNB · 2026-06-10T15:00:27 1781103627

> This starts to break down in college when the professors often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration)

I can't speak to the humanities, but this estimation is just not true at most universities in the sciences. (EDIT: As cycomanic emphasizes below (https://news.ycombinator.com/item?id=48477683), the part of the original comment pertaining to graduate education is more reasonable. I am speaking here only of undergraduate education.)

cycomanic · 2026-06-10T15:19:53 1781104793

It certainly is true in physics and engineering that a PhD student at least halfway through their PhD should know more than their supervisor about their topic (and usually much earlier). Even a Masters thesis project student should understand the intricacies of their project better than their supervisor. I'm speaking as someone who has supervised a significant number of both PhD and Masters students.

camdenreslink · 2026-06-10T15:52:29 1781106749

The original post said “in college”. It might be true for PhD candidates halfway through their program, but that’s like 0.5% of college students. The vast majority of students are leagues behind their instructors in domain knowledge.

bluGill · 2026-06-10T18:19:05 1781115545

I wouldn't say leagues behind, but otherwise I think we are on the same page, though I guess I worded it wrong. It is common for a couple students in any class to know more than the instructor in some niche part of the field even though the instructor has much more knowledge overall.

JadeNB · 2026-06-10T15:59:30 1781107170

Yes, I intentionally left out the next part of the quote about graduate school, since that seems more accurate. I was disputing only the part that I took to be pertaining to undergraduate education. The full quote is:

> This starts to break down in college when the professors often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.

aspenmartin · 2026-06-10T14:08:41 1781100521

> Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??
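A minimal sketch of what a "verifiable task you can't do" can look like: checking a factorisation is a few multiplications, while finding one is the hard part. The function names and the trial-division stand-in for a model are my own illustration, not taken from any real benchmark.

```python
from math import prod


def is_valid_factorisation(n, factors):
    """Cheap check: at least two factors, all > 1, multiplying to n."""
    return len(factors) >= 2 and all(f > 1 for f in factors) and prod(factors) == n


def pass_rate(solver, numbers):
    """Fraction of tasks where the solver's answer passes the checker."""
    passed = sum(is_valid_factorisation(n, solver(n)) for n in numbers)
    return passed / len(numbers)


def trial_division(n):
    """Hypothetical stand-in for the system under test (slow for big n)."""
    for p in range(2, int(n ** 0.5) + 1):
        if n % p == 0:
            return [p, n // p]
    return [n]  # no non-trivial split found; the checker rejects this
```

The grader here never needs to be able to factor anything itself; it only needs the checker, which is the sense in which the evaluator can be weaker than the thing being evaluated.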

Jensson · 2026-06-10T15:33:43 1781105623

> How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

That is what benchmarks and intelligence tests are, which are vulnerable to benchmaxxing etc. You won't be able to do this by gut feel though, you can create a personal benchmark though.

But point was that personal judgement of intelligence requires high intelligence. Creating a benchmark doesn't require as much but is more vulnerable.

aspenmartin · 2026-06-10T15:45:19 1781106319

Yet human judgement isn’t subject to side effects like fluency and persuasiveness? It’s like everyone in this thread dismisses benchmarks and then…describes a crappy benchmark.

Sure you can create a personal benchmark. Who will evaluate it, you? How many tasks will it have? How will you evaluate success? Will you know which model is which or will you be blind? Which one will you do first? Ah right, benchmarking.

Also, benchmaxxing isn’t possible when the benchmark and measurements come after the model is released, right?
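The questions above (how many tasks, who judges, blind or not, which goes first) map fairly directly onto code. This is a rough sketch under stated assumptions: `models` maps a name to a callable that returns text, and `judge` is whatever picks a winner; both are hypothetical stand-ins for real API clients and real human or automated graders.

```python
import random
from collections import Counter


def blind_trials(prompts, models, seed=0):
    """Yield (prompt, anonymous outputs, hidden answer key) per prompt.

    `models` maps a model name to a callable taking a prompt and
    returning text (hypothetical stand-ins for real clients).
    """
    rng = random.Random(seed)
    for prompt in prompts:
        names = list(models)
        rng.shuffle(names)  # randomise presentation order for each prompt
        outputs = [models[name](prompt) for name in names]
        labels = [chr(ord("A") + i) for i in range(len(names))]
        key = dict(zip(labels, names))  # never shown to the judge
        yield prompt, dict(zip(labels, outputs)), key


def run(prompts, models, judge, seed=0):
    """Tally wins; `judge` sees only labels like "A" and the outputs."""
    wins = Counter()
    for prompt, shown, key in blind_trials(prompts, models, seed):
        choice = judge(prompt, shown)  # must return one of the labels
        wins[key[choice]] += 1
    return wins
```

The point of the hidden `key` is that the judge cannot favour a model by name, and the per-prompt shuffle removes the "which one do you try first" effect.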

naikrovek · 2026-06-10T12:30:25 1781094625

ya gotta have a vibe for everything if you want to compare vibes, though. you can't just have a vibe for fable 5 alone AND say that it's better than anything out there. there's no weight in that verdict at all, no meaning. it's like reviewing a book without reading it.

throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.

aspenmartin · 2026-06-10T14:07:50 1781100470

You are literally describing a benchmark

Wowfunhappy · 2026-06-10T11:26:02 1781090762

Lots of things in life are gut feelings. It would be really great if we could determine quantitatively forever whether Rust is a superior programming language to Go, but real life resists those kinds of measurements.

tsss · 2026-06-10T13:24:58 1781097898

> determine quantitatively forever whether Rust is a superior programming language to Go

Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.

lukan · 2026-06-10T16:58:35 1781110715

So .. where can we read about the results?

naikrovek · 2026-06-10T12:35:40 1781094940

> real life resists those kinds of measurements

no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.

Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.

"better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.

zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.

andai · 2026-06-10T18:28:29 1781116109

I added "you can do anything if you believe" to my agent and it went from not even attempting things to just doing them effortlessly.

I know how stupid that sounds but it's true.

Well what do they say... "If it sounds stupid but it works, then it's not stupid!"

johnisgood · 2026-06-10T07:28:54 1781076534

Yes, these are gut feelings. That said, I have lots of experiences with Opus and I have lots of projects and contributions (all reviewed and tested) made with the help of it. Definitely useful, to me and to people whose project matters to them. :P

Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should rather be more specific about a thing rather than as broad as "do not make mistakes" is. It just does not work that way.

lanstin · 2026-06-10T16:33:50 1781109230

"Check your work for mistakes after the first draft" maybe :)

Certhas · 2026-06-10T06:48:58 1781074138

There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.

So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...

hardwaregeek · 2026-06-10T10:23:39 1781087019

Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.

AlecSchueler · 2026-06-10T10:42:30 1781088150

No, relative performance between Python and Java can absolutely be measured.

skywhopper · 2026-06-10T11:01:00 1781089260

Yes, but performance is not the only factor in whether a specific language is better than another for a specific project.

bfrog · 2026-06-10T13:09:22 1781096962

How do you measure the performance of people? This is subjective and biased every time.

tezza · 2026-06-10T06:54:06 1781074446

It is possible to check for improvements. See for yourself:

https://generative-ai.review/2026/06/claude-fable-rush-test-...

As mentioned in another HN thread, I've done qualitative side-by-side comparisons of Claude Fable vs Opus 4.8 vs ChatGPT 5.5.

Anyone is able to check the output for themselves and form a judgement.

Large visible improvements for Fable over Opus 4.8 and ChatGPT 5.5.

I recently did the same to show the progress from Opus 3.4/ChatGPT o3pro one calendar year ago.

1dom · 2026-06-10T07:14:31 1781075671

Sorry, this post gets me irrationally irritated and makes me want to shake you and shout.

That website is 95% not you, it's AI, and I feel that's causing you to way over-represent the value of it in your response here, or you're completely misunderstanding what the person you're responding to is asking. If you put all of your effort into that site, without AI, it would be infinitely more valuable and useful.

The person you responded to asked for specific things, including:

- objective, unbiased measurements, but all that page has is side by side visual comparison of outputs.

- their different generations, but all you included was the outputs

- details on the prompts and little things people are adding because they feel they need to, but you didn't include any of that

This is slop, it's the exact sort of self confirming fluffy AI stuff that other either inexperienced or over-invested-in-AI engineers will look at briefly, skim, see quick visual validation, and nod, noting down how much better Fable must be without getting any actual data.

Sorry, it's early, and maybe this is a misplaced rant, but the person you responded to specifically asked for precise, quantitative things precisely because everything else is fluffy slop like this, and people don't even recognise they're doing it any more.

tezza · 2026-06-10T07:22:44 1781076164

check the backlinks[1][2] in the article before you start throwing around accusations. I am not (yet) a person that has advance notice and access to models.

Fable just got announced and I did a rush out article because people are curious. I released the post mere hours afterwards and it takes time to create the output, slice into videos, make a wordpress article on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.

If you check the links my previous articles have all the juicy stuff you are criticising me for not having with little preparation.

How is a side by side direct comparison NOT precise?

[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix.

[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours advance warning.

1dom · 2026-06-10T07:53:43 1781078023

I did browse and check the links. This was the first link I went to: https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... as it's the main one on the page, and I saw more qualitative stuff without quantitative stuff.

I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.

I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.

Your entire site is obsessed and fixated on the impressive looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.

tezza · 2026-06-10T08:13:33 1781079213

There are benchmarks if you want quantitative results. Mine is qualitative, and clearly billed as such. Comparison and contrast still possible.

lionkor · 2026-06-10T08:45:49 1781081149

This is NOT a misplaced rant, this is a very good description of what I feel as well. You've put it very well.

user43928 · 2026-06-10T09:53:17 1781085197

It reads like an unhinged rant about AI and the engineers who use it, with the entitled tone of people who think they have permission to insult someone's competence and work because AI was used.

In my opinion, if one cannot express themselves civilly, they should refrain from commenting.

1dom · 2026-06-10T11:16:17 1781090177

I disagree. I wouldn't consider it unhinged. I'm clearly aware of my own frustration. It's also relatively civil, since I was able to temper it with appropriate apologies and acknowledgements. Many other people agree and support the sentiment of what I'm saying.

AI is a powerful tool and very capable of - amongst other things - making something look far more valuable than it actually is, and that is a huge waste of time that costs us all. We all have a responsibility to call this out when we see it.

It looks like you've just implied I'm entitled, unhinged, uncivil and that I shouldn't have contributed at all, whilst thinking you've elevated yourself above that behaviour by saying "in my opinion" and "one should...". I think that's an unhinged, insulting and uncivil way to express yourself.

user43928 · 2026-06-10T11:42:29 1781091749

I found the website you ranted about interesting, comparing the quality of the visualization between the different models.

I don't think it was "a huge waste of time" or needed your rant.

You called it slop and questioned the competence of the author, as if he made grand claims about the objectivity of his comparison.

What I see often is that people assume others are incompetent just because they used AI, when in reality they are engineers no less competent or experienced than others on this website.

1dom · 2026-06-10T15:22:02 1781104922

This is slop, in the sense that it looks like a lot of useful work and effort, and AI is heavily involved, and it was offered up when the opposite was requested, meaning it's not at all helpful in this context.

I raised this in a harsh, but repeatedly apologetic way. The person then responded telling me to "get my facts straight" and doubled down with more weak, qualitative outputs of LLMs.

I don't assume the person is incompetent because they used LLMs. I use them daily. I'm a firm believer everyone is an idiot, just in a different subject.

The issue here I feel is that LLMs are increasingly leading people to think that they're not an idiot in any subject at all, and when real humans question it, they double down with more AI stuff.

jgilias · 2026-06-10T18:41:21 1781116881

Oh boy. I see this so much.

leodavi · 2026-06-10T11:48:06 1781092086

How is this meaningfully different than simonw's pelicans riding a bicycle? If anything, this seems to be of a higher caliber?

thewhitetulip · 2026-06-10T07:25:06 1781076306

It feels like hand written software will now be "bespoke"

disgruntledphd2 · 2026-06-10T13:23:30 1781097810

artisanal, hand-crafted software.

contextfree · 2026-06-10T06:57:33 1781074653

fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)

thewhitetulip · 2026-06-10T07:26:05 1781076365

How many $ do you guys spend when your session runs for 30min? What's the total budget?

ElFitz · 2026-06-10T07:37:32 1781077052

That’s what evals are for.

And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.

farley13 · 2026-06-10T12:14:00 1781093640

I think (related to the threads below) properly running evals on the state of the art models is likely outside the budget for most individuals. It's undoubtedly the right thing.

It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.

torginus · 2026-06-10T11:17:22 1781090242

Yeah, if the jump is big, then we should be able to see the qualitative improvements, or see where Opus was tripped up in a task and Fable did succeed

lqstuart · 2026-06-10T12:54:06 1781096046

It’s almost like they’re interchangeable. We need to start asking these models to solve extremely difficult, contrived DSA coding questions before deciding which ones we employ

vonneumannstan · 2026-06-10T17:40:50 1781113250

The first thing in the release page is benchmark results...

https://www.anthropic.com/news/claude-fable-5-mythos-5

kmacdough · 2026-06-10T09:47:31 1781084851

I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.

"Don't make mistakes" does seem dumb. It's not guidance.

solumunus · 2026-06-10T13:39:05 1781098745

Just treat it like an employee with infinite energy. You can never really measure the productivity or ability of employees, it’s just pretty obvious when one is better than another. You’re asking them to do things and they’re either coming up with the goods or they aren’t. You can’t really expect much more from agents either but I’m not sure why you need anything more.

alecco · 2026-06-10T11:57:01 1781092621

> These comparisons are all gut feelings.

https://simonwillison.net/about/#disclosures

"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."

But I'm totally unbiased on my gut-feeling posts, trust me bro.

-- AI influencers.

simonw · 2026-06-10T12:59:14 1781096354

Anthropic didn't give me early access to this model, shouldn't that bias me against it?

deagle50 · 2026-06-10T14:46:28 1781102788

You kinda proved the point...

simonw · 2026-06-10T15:05:27 1781103927

deagle50 · 2026-06-10T16:18:45 1781108325

If you're that easily biased then why trust your assessment?

simonw · 2026-06-10T16:27:00 1781108820

Where did I say I was biased?

deagle50 · 2026-06-10T17:49:41 1781113781

the hypothetical you presented above

simonw · 2026-06-10T17:51:53 1781113913

It was a hypothetical. How does presenting a hypothetical equate to proving anyone's point here?

alias_neo · 2026-06-10T12:25:24 1781094324

This isn't some random dipshit, this is Simon Willison[1]. He has a bit more cred than some "AI influencer".

[1]https://en.wikipedia.org/wiki/Simon_Willison

kansface · 2026-06-10T01:57:26 1781056646

Yes, exactly this. If I didn't care about price at all, I'd exclusively use this model. It functions more like an actual engineer. I'm in the midst of a DB migration, and eg 5.5 continually suggests stuff like "use DB X instead of DB Y for task Z because it's 30% faster" which is an impossibility of reality, given we are migrating DBs. Fable jumped in, reduced allocs by literally 46x, found multiple bugs 4.8 and 5.5 created (max file system usage, correctness issues, etc), and continually suggested awesome improvements unprompted. As in, it would finish a task and then suggest we tackle this other existing problem I didn't know about in a very specific manner... this is the first model that feels like it's coming for my job.

josephg · 2026-06-10T02:43:57 1781059437

I'm having the same experience. I'm in the process of implementing a new CRDT for realtime collaborative editing. There just aren't a lot of implementations of CRDTs kicking around online for opus or any of the other models to have good design instincts.

Fable is doing - so far - a great job. I just had one big question around how part of it should work. I had a design sketch, but with some big unknowns. I asked fable to figure it out via reasoning and prototyping, and it did - it even, under its own initiative, wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it. And it found, and fixed, a couple bugs that I'd missed.

I'm sure its weaknesses will become apparent in time. But, wow this thing is a beast. It's the first time I'm reading the work of an LLM without spotting obvious weaknesses in its reasoning and code. I'm really impressed.

infinitebit · 2026-06-10T05:15:13 1781068513

I was about to ask where you work that you’re implementing new CRDTs and then I noticed your username! Thanks for all that you do!

I work on the live collab at my company, and using AI while coding has only recently sort of “clicked” for me. We use an (I’m pretty sure) unheard of algorithm for collaborative editing, and I’ve had a long term goal of turning it into an implementation of EG Walker, but our document model is very complex and most out of the box CRDTs don’t quite fit. Maybe Fable will be what gets me over the hump.

aquariusDue · 2026-06-10T09:09:19 1781082559

Long shot here because I'm not knowledgeable enough about CRDTs but maybe something like DSON would help? I saw a talk about it a while ago and it might be useful.

https://blog.helsing.ai/posts/dson-a-delta-state-crdt-for-re...

https://www.youtube.com/watch?v=4QkLD7JhD_I&pp=ygUJZHNvbiBjc...

infinitebit · 2026-06-10T18:27:37 1781116057

Ty, checking this out!

josephg · 2026-06-10T06:14:27 1781072067

I’d be fascinated to hear more if you’re willing to share. What is special about your document model which makes existing tools like automerge a bad fit?

infinitebit · 2026-06-10T18:24:57 1781115897

We have cross-field invariants that merging at the data structure level can't ensure (in an obvious way, at least), and such merges "lose the semantic meaning of a conflict". The main idea behind their approach is that certain parts of the model can have custom "mergers" that are able to run business logic to maintain these invariants.

Worth noting, the decision to eschew CRDTs predates my time here, and I've pushed for a CRDT rewrite quite a bit since I believe it could be done. The other main concern they had was memory usage, but it seems like EG Walker would solve that. Our system uses a "Commit DAG", (an Event DAG by another name), and does a three-way merge using a common ancestor of the diverged documents, and so a lot of the bones of EG Walker are there, and I'm exploring ways in which we could gradually move to it.
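
To make the shape of that concrete, here is a minimal sketch of a three-way merge against a common ancestor with per-field custom mergers. It assumes flat dict documents and uses `None` to mean "field absent"; `three_way_merge` and the `stock` merger are hypothetical illustrations, not the system described above.

```python
def three_way_merge(base, ours, theirs, mergers=None):
    """Merge two diverged documents using their common ancestor.

    `mergers` maps a field name to a function (base, ours, theirs) -> value
    that resolves a true conflict with business logic.
    """
    mergers = mergers or {}
    merged = {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:            # both sides agree (or neither changed it)
            result = o
        elif o == b:          # only their side changed it
            result = t
        elif t == b:          # only our side changed it
            result = o
        elif key in mergers:  # true conflict: defer to the custom merger
            result = mergers[key](b, o, t)
        else:
            raise ValueError(f"unresolved conflict on field {key!r}")
        if result is not None:  # None means the field was deleted
            merged[key] = result
    return merged


# Hypothetical merger: treat the field as a counter and apply both deltas.
mergers = {"stock": lambda b, o, t: o + t - b}

base = {"title": "A", "stock": 10}
ours = {"title": "B", "stock": 8}
theirs = {"title": "A", "stock": 7}
print(three_way_merge(base, ours, theirs, mergers))
```

An invariant that spans several fields would need a hook that sees the whole merged document rather than one field at a time; this sketch leaves that out.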

teiferer · 2026-06-10T06:09:51 1781071791

> wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it.

For such a data structure, "nailing it" means a formal proof of correctness. Fuzzing, as useful as it is, is merely throwing dirt at the wall and seeing if anything sticks.

josephg · 2026-06-10T06:13:23 1781072003

I’ll ask it for a formal proof when I get home and see how it goes.

I’ve read plenty of papers with “formal proofs of correctness” that turned out to have huge flaws. Machine verifiable proofs I trust. But I’ve personally found more bugs with fuzzing than I have via proofs.

noduerme · 2026-06-10T06:15:32 1781072132

In the real world, many of us don't have the time to create formal proofs. But our instinct in testing where edge cases may exist in code that we wrote is a type of refactoring that happens in our brains during the coding process. Hand the coding off to a machine and you have no idea where to start looking for the flaws.

bluGill · 2026-06-10T13:12:33 1781097153

> Hand the coding off to a machine and you have no idea where to start looking for the flaws.

I have found this quickly becomes false. I have learned I cannot review llm generated code as if it is written by a trusted senior developer (where I often just do a quick look, see nothing obvious and hit approve). Once you start reading the code in depth with the goal of understanding you quickly see the places where flaws are likely. Sure I start with no clue where to look, but it doesn't take long to see things.

hnewsdaniel · 2026-06-10T07:34:50 1781076890

Hello joseph,

I was scanning the comments and saw you mentioned CRDT. Just wanted to mention that I implemented a CRDT-flavoured sync engine for the product I'm working on a while ago, I think it was with Opus 4.6 if I'm not mistaken (or earlier) so it's not something new to Fable 5, just fyi.

josephg · 2026-06-10T13:27:04 1781098024

Yeah, you've certainly been able to get Opus to write a CRDT. It just needs a lot of hand-holding to make it correct. Opus always seems pretty bad at coming up with invariants and using them to make a piece of software correct. Without invariants, you end up with lots of hacky workarounds to avoidable problems.

So far at least - and it's been less than a day - Fable seems better at this.

I think I also do my CRDTs differently from others. I've grown to like the pure-oplog approach after making eg-walker. LLMs are much worse at this!

weatherlite · 2026-06-10T07:05:10 1781075110

> this is the first model that feels like it's coming for my job

Damn you must be good, I've been feeling this for around 2 years now

literalAardvark · 2026-06-10T08:36:16 1781080576

It's been obvious for at least 2 years, anyone who doesn't see the writing on the wall simply hasn't learned how to use these well or has severe exponential blindness.

"But it doesn't do well when writing my undertrained language" - yeah, fine. Yet. Reasonable code in that is probably one RAG + verification scaffold deployment around Mythos or maybe mythos+1. Just like it was for you learning it, because you knew how to _program_.

weatherlite · 2026-06-10T11:18:48 1781090328

Yeah I agree. We're headed into a rougher job market pretty much across the board for white collar work, hitting junior people worse at this stage. Up to societies around the world to decide how to deal with this - so far we deal with it by ignoring it, it seems.

10GBps · 2026-06-10T13:04:34 1781096674

The monks got mad too when the printing press was invented because it took their jobs of hoarding knowledge.

AI is just another tool, learn to use it.

FeteCommuniste · 2026-06-10T16:41:55 1781109715

And then in a couple years the AI gets better at "using AI" than the bottom 99.999% of knowledge workers, who are now out of work.

OtomotO · 2026-06-10T13:33:09 1781098389

We are all doomed! Doomed I say!

spoiler · 2026-06-10T15:26:18 1781105178

Gosh, I must be doing something wrong. I spent 15 minutes (of which a lot was waiting while it was thinking about "backwards rationalising" its decision and "gaslighting"[1]) arguing with it over why it keeps using `node -e "console.log(require('fs').readdirSync('…'))"` instead of `ls -l …`.

Like it did everything:

- this is not a Linux system (true, it was macOS)
- it is not an available command
- the binary is corrupted
- node/js is more precise
- V8 JavaScript is faster than bash (true technically??? But not in this context lol)
- JavaScript is more versatile

I forgot what else we went through but there were a few more things. I indulged it because it was incredulous and funny. The prompts from my side were all questions, never instructions. I assume an instruction would've helped here, but also I don't think Opus ever did this (but on the other hand Opus wrote python scripts to format/indent, instead of just running cargo fmt, so I guess potato potato)

boc · 2026-06-09T22:30:51 1781044251

Yeah same here, Fable on "high" is producing substantially better results than Opus 4.8 on xhigh for me and my actual real-world evals today. It "feels" smarter and doesn't use nearly as many tokens running in circles. As a result I've been able to run two large refactors today without hitting the context limit danger zones - it's more expensive but also more efficient. It's been able to find some bugs that Opus missed. Pretty impressive stuff.

garciasn · 2026-06-09T22:49:48 1781045388

I keep getting this message:

> Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more

I'm working on an internal tool that does new business prospecting data collection, scoring, etc. This is ridiculous.

andy12_ · 2026-06-10T13:14:44 1781097284

I don't know if you are aware, but some people reported on Twitter that Fable 5 may flag the message regardless of content if it knows (from either pretraining knowledge or memories) that you work in either of those fields. I don't know if that's your case.

https://x.com/i/status/2064449457869984035

algoth1 · 2026-06-09T22:51:16 1781045476

It’s unusable for me due to the refusals. I’m using claude to find patterns in health data

yakz · 2026-06-09T23:56:15 1781049375

I do some work in laboratory automation and it was quick to refuse the first thing I asked it to do. There wasn't anything spicy in the request, just basic liquid-handling protocol implementation. Their position seems to be that they're too stupid to classify requests safely, and that seems reasonable to me. I'd guess the classifier will improve rapidly.

5d41402abc4b · 2026-06-10T04:32:34 1781065954

Have you tried locally running qwen?

mrbuttons454 · 2026-06-10T15:12:46 1781104366

Is there a Qwen that I can run locally that is anywhere near these frontier models?

Der_Einzige · 2026-06-10T17:23:49 1781112229

No, and don't let anyone gas light you into thinking the answer is yes.

dmd · 2026-06-09T23:28:46 1781047726

Same. I'm working on a set of python and matlab scripts that deals with segmenting MRI images into brain vs skull, and it thinks that's bioterrorism.

rvnx · 2026-06-10T01:07:14 1781053634

Quite counterproductive to refuse to help on health issues too. If they detect health data, they can add a disclaimer, but not hide the information.

secult · 2026-06-10T07:18:40 1781075920

You miss the point - by collecting and processing medical data they would fall into a thoroughly regulated industry. Not because they may provide you incorrect data, but because they are not allowed to process it.

girafffe_i · 2026-06-10T04:33:00 1781065980

There’s no way around it? Can’t you obfuscate as generic data and use keys to map to the real data?

algoth1 · 2026-06-10T11:01:27 1781089287

I guess you could even turn everything into numbers, not a bad idea at all!
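
A minimal sketch of that idea, assuming records are flat dicts and the sensitive values are hashable: swap each sensitive value for an opaque token, keep the token-to-value map locally, and restore after processing. `pseudonymize` and `restore` are hypothetical names, and whether this changes how any classifier reacts is an open question.

```python
import secrets


def pseudonymize(records, sensitive_fields):
    """Replace sensitive values with opaque tokens; return (masked, key map)."""
    forward = {}  # (field, real value) -> token, so repeats map consistently
    reverse = {}  # token -> real value; keep this local, never send it out
    masked = []
    for rec in records:
        clean = dict(rec)
        for field in sensitive_fields:
            if field not in clean:
                continue
            lookup = (field, clean[field])
            if lookup not in forward:
                token = f"id_{secrets.token_hex(8)}"
                while token in reverse:  # re-draw on the unlikely collision
                    token = f"id_{secrets.token_hex(8)}"
                forward[lookup] = token
                reverse[token] = clean[field]
            clean[field] = forward[lookup]
        masked.append(clean)
    return masked, reverse


def restore(records, reverse):
    """Map tokens back to the real values using the locally held key map."""
    return [
        {k: reverse.get(v, v) if isinstance(v, str) else v for k, v in rec.items()}
        for rec in records
    ]


records = [
    {"patient": "Alice", "heart_rate": 72},
    {"patient": "Bob", "heart_rate": 80},
    {"patient": "Alice", "heart_rate": 75},
]
masked, key_map = pseudonymize(records, ["patient"])
assert restore(masked, key_map) == records
```

Field names and the measurements themselves are left as-is here, so they can still reveal the domain; renaming columns or rescaling numbers would be a further step.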

fragmede · 2026-06-10T02:52:51 1781059971

What custom prompt do you have set up? If you tell it your occupation, does it turn helpful? There was a study where, if you told the models they tested that you're a patient, they would refuse, but tell them you're a doctor and suddenly they turn helpful.

garciasn · 2026-06-10T03:43:12 1781062992

According to the model, it’s not the model itself that’s doing this, it’s the harness.

Assuming the model is being “truthful”, CC is just being stupid in its detection mechanism.

5d41402abc4b · 2026-06-10T04:32:02 1781065922

what prompts do you use for this?

UltraSane · 2026-06-09T23:49:29 1781048969

Anthropic knows it refuses too much, they want to be very cautious to avoid any scandals. I think this is why they want to store all Fable and Mythos chats for 30 days so they can use the data to improve.

hirako2000 · 2026-06-10T01:56:58 1781056618

They want to be very cautious to honour the important doctrine at least until the IPO launches: we are so good we have to nerf our products.

fn-mote · 2026-06-10T01:31:21 1781055081

I’m at a point where I expect everything I do will be retained indefinitely.

I’m having a really hard time believing some weak reason for a 30 day retention policy.

garciasn · 2026-06-09T22:54:30 1781045670

I wonder if it sees Healthcare companies being targeted and that's why it's freaking out; clearly they have some pretty stupid regexes in the harness to detect this sort of shit.

e: I quit the session and went back in. Set it to Fable and told it to continue the last session. It's moving along as if none of that had happened.

How weird.

throwaway20222 · 2026-06-09T23:08:58 1781046538

I wonder if this letter has anything to do with why anything even remotely related to biology is getting flagged.

https://www.wired.com/story/openai-anthropic-letter-ai-biolo...

iambateman · 2026-06-10T01:13:27 1781054007

I asked a question for my son about how mosquitos carry malaria and Fable was like “ok now hold it right there”

the__alchemist · 2026-06-10T13:22:29 1781097749

Interesting! I have not used Fable, but so far have not hit trouble. I'm a hobby biologist with a home mol bio lab. It wouldn't answer my questions about LNPs, but so far has been fine for my recombinant DNA workflows, lab techniques, environmental DNA protocols etc. I suspect this may become more difficult!

LouisvilleGeek · 2026-06-10T01:20:12 1781054412

Same here. It's been rushed for the IPO (in my opinion).

fragmede · 2026-06-10T02:50:37 1781059837

Or people were quitting their subscription for codex-5.5 and it was beginning to show up in their metrics.

brookst · 2026-06-10T03:08:48 1781060928

Or development had gotten to a point where they need real world usage to tune product and refusals.

Or Fable’s arch is different enough that they allocated clusters of compute targeting a date, and here we are, ready or not.

Or…

fumar · 2026-06-10T00:49:00 1781052540

Same. I am working on music firmware for an existing device. I can't proceed as it keeps switching to Opus.

piokoch · 2026-06-10T07:16:36 1781075796

Obviously, soon, for anything valuable, you will have to buy from Anthropic a "special license for biology/security/finance advice".

Question is if there will be any competition in this area...

black_knight · 2026-06-09T22:20:07 1781043607

Still does not crack my hardest nuts. Gave it one of them and it blew through my entire allowance on thinking about one question, with no apparent answer in sight!

I see a lot of people saying they are happy with weaker models, but I am the opposite, I need more strength, more intelligence!

I am quite happy that Opus 4.8 can do some medium intelligence problems. And maybe Fable 5 can do some more of those! I have a lot of problems to solve!

user43928 · 2026-06-10T07:25:39 1781076339

I also see a lot of people saying they are happy with weaker models.

At work I had to switch to using GPT 5.4 Mini and Qwen 3.6 27B.

The results were near useless.

The error rate is through the roof, it's constantly incorrect in its conclusions even when investigating very simple issues.

Further the models are too unreliable to even move 20 line snippets around without inadvertently modifying them. Ask them to correct it and they still get it wrong.

Maybe the larger Chinese models are better, but the Mini stuff is next to useless to me.

black_knight · 2026-06-10T07:34:59 1781076899

I have Qwen 3.6 27B and 35B running locally and coming from Opus it feels like talking to an imposter. Someone who pretends to be competent, but really isn’t. Results are always disappointing. Sonnet is better, but I have given up on asking it. Even for simple things I wait for my Opus limits to reset.

abalashov · 2026-06-10T11:58:59 1781092739

Have you tried Kimi K2.6 or DeepSeek V4 (Flash or Pro)?

daymanstep · 2026-06-09T22:25:31 1781043931

What kind of problems are you trying to have it solve ?

_kb · 2026-06-09T22:35:34 1781044534

The Riemann hypothesis, PvNP, and the Collatz conjecture.

black_knight · 2026-06-09T22:42:53 1781044973

Not these. I wonder if the well is poisoned there. The models know that these are "unpossible", so it might not solve them just because… Maybe some day.

I am just testing it on stuff I know intimately myself. I would probably not understand a proof of Collatz if it was dancing in front of me!

Lerc · 2026-06-10T00:55:18 1781052918

That's a bit of a tricky point. I have had quite a lot of problems with models informing me what I am attempting is impossible. If no-one has done it, or at least it doesn't know about it being done it tends to fall back on people voicing their baseless speculations, and for just about anything you propose, you can find a person who will loudly proclaim it is impossible.

The curse of the 'use case' comes in here too. When people think that everything should have a use case, that's a lot of training data suggesting to a model that things should only be used for what someone has already thought of.

A couple of times I have had to manually code proof of concept pieces so that the model breaks out of that "unpossible" mode and actually helps me.

I can't remember if it was ChatGPT or Claude, but when I showed it how to get a MessagePort in its JavaScript executor through to the artifact/canvas, it quickly went from "That can't be done" to positively enthusiastic about the possibilities. I suspect those shenanigans will be well off the table for Fable though.

komali2 · 2026-06-10T02:33:44 1781058824

So, what kind of problems are you having it try to solve?

Sorry to belabor this but it's basically pointless saying you have nuts it can't crack without showing us the nuts.

black_knight · 2026-06-10T06:01:48 1781071308

I don’t care to share my exact problems. Mostly because GPT-5.5 hallucinates false solutions, and I would rather not have people reply with "Oh but ChatGPT solves it!", because it takes expert knowledge to debunk them. To its credit, ChatGPT will admit its very fundamental mistakes when they are pointed out. But also because no-one would really care.

I gave a high level description of the problems in a sibling thread. They are the kind of small problems which I suppose every researcher has lying around, waiting for them to think about some day. But not the big problem everyone is waiting for to be solved.

My comment was not meant to be a tease – sorry! I assumed there would be other people in a similar situation, who might relate.

neonstatic · 2026-06-10T02:56:05 1781060165

Bro, you are being left behind bro, it's amazing bro...

unnouinceput · 2026-06-10T01:45:51 1781055951

Stop dancing and share the prompt, we're dying to see it

black_knight · 2026-06-10T06:24:31 1781072671

Hey, stop asking to see my nuts! My nuts are private – okay?

(Joking aside, see sibling threads.)

moffkalast · 2026-06-10T09:18:53 1781083133

Ayy lmao

mastermage · 2026-06-10T06:27:37 1781072857

Is this a joke? Seriously? These are some of the hardest problems in math, period. Hundreds if not thousands of the greatest minds in history have attempted to solve these problems. And you think that the current level of AI can blow through them? It is also a possibility that, for example, the Riemann Hypothesis is just not provable (Gödel's theorem).

black_knight · 2026-06-10T06:36:15 1781073375

No one is expecting that! I expect _kb was sarcastic/making a point.

Recently (last couple of months?) these models are becoming useful tools for mathematicians, because they can solve easier problems more quickly, meaning that one can tackle bigger challenges (but maybe not RH et al) piece by piece.

But, there are still definite limits, where one could expect an expert human to solve things, given time, but models do not. Thus, more intelligence would be nice!

mastermage · 2026-06-10T06:42:06 1781073726

if it was sarcastic then whoosh on me.

black_knight · 2026-06-09T22:38:59 1781044739

The medium ones are results where one needs to construct some object, which my intuition tells me should exist. The difficult ones are typically to show that certain objects can not be constructed.

These are not Fields medal type problems, nor known difficult/open conjectures. Just small stuff I have collected in my todo list over the years.

Certhas · 2026-06-10T07:26:46 1781076406

I have some medium difficulty math problems where I have used the models for the last year and a half repeatedly. Back then they were already good at pointing out obstructions and constructing counterexamples. So that tracks. But at first glance it looks like Fable actually made real progress on one problem for the first time.

A year ago my judgement was that I had wasted my time on trying to work with the models and doing things myself would have been more productive as I would have gained intuition from the failures. Now it definitely seems to have figured out stuff that would have taken me more time than I have to spare on this problem...

black_knight · 2026-06-10T07:48:43 1781077723

Cool! Yes, we are getting there.

Being a theory builder more than a problem solver I am excited for the future.

Also excited for fully formalised mathematics to hit main stream!

tclancy · 2026-06-10T01:49:32 1781056172

Perhaps you should rephrase those nuts?

larodi · 2026-06-10T07:05:08 1781075108

One thing I can tell you is you are either favored by Anthropic, or your version of the CLI does not exhaust limits, or there's some major bug, as two people around me (myself included) claim it took half an hour to hit the ceiling. Which makes it practically unusable, where the same workflow a day ago produced a good 5-6 hours of workload with several agents.

witx · 2026-06-10T11:32:11 1781091131

They are most likely shills from Anthropic, there's quite a few here every time new models come out.

miyoji · 2026-06-10T14:29:17 1781101757

That's not fair. Simon is a well-known shill for the entire AI industry, not just Anthropic.

simonw · 2026-06-10T14:44:28 1781102668

What's your definition of "shill"?

Jensson · 2026-06-10T15:46:17 1781106377

Probably means fan, shills have undisclosed ties and I doubt he means Simon has undisclosed ties to the entire AI industry, that would be very impressive if so.

cedws · 2026-06-10T13:18:50 1781097530

It’s not meant for subscription users; the subscriptions are just the gateway drug to Enterprise pricing which Anthropic intends to use to juice their numbers before IPO.

desmond1303 · 2026-06-10T14:01:01 1781100061

Or use API billing? We have access to it at my company with no limits

piokoch · 2026-06-10T07:13:18 1781075598

Monetization is coming. They'll tell companies, AI is replacing your workers, so it is still worth paying 100K/year for the license, as those AI are not going to jump to another job, get sick, be late, complain, require free coffee and so on.

Soon the times of AI for $20/$200 a month will be long gone.

tarkin2 · 2026-06-10T09:29:49 1781083789

Get people hooked, tell them spending time coding is no longer needed, let their skills deteriorate, tell them they need to cough up for a licence to do their job

Forcing developers to pay for models that were built on code they scraped scot-free

A tax to do their job that developers are jumping at the chance to pay

Everybody's finally realising that node dependencies are a threat, but letting these AI companies gatekeep the industry is a bandwagon people are scrambling towards

witx · 2026-06-10T11:45:13 1781091913

> Forcing developers to pay for models that were built on code they scraped scot-free.

Yes, this makes me sad beyond explanation. Especially when I see open source developers happily using these tools. These companies stole your, free, hard work and charge you a subscription!! Not to speak about them torrenting books and (most likely) training on private repos.

This and devs paying a subscription to use a tool that is marketed as trying to replace them.

I had a 150$ monthly budget that I used for various open source projects and I've cut that entirely.

simonw · 2026-06-10T13:39:16 1781098756

> These companies stole your, free, hard work and charge you a subscription!!

In case you weren't aware, Anthropic, OpenAI and GitHub Copilot all have programs that provide access to open source maintainers for free:

GitHub: https://docs.github.com/en/copilot/how-tos/copilot-on-github...

Anthropic: https://claude.com/contact-sales/claude-for-oss

OpenAI: https://developers.openai.com/community/codex-for-oss

majora2007 · 2026-06-10T13:32:42 1781098362

I don't get what you're saying. You're frustrated that Open Source projects were used to build these AIs and that OS devs (or devs in general) are paying to use AI.

Then you say you had money that you used to donate(?) to OS and have cut that because of the frustration?

Open source just means sharing the source code for people to learn off or have the ability to customize on their own. I don't think there is any need to be frustrated about that (now if it was copyright/private of course).

witx · 2026-06-10T13:43:41 1781099021

> Open source just means sharing the source code for people to learn off or have the ability to customize on their own.

Yes, people, not corporations. The point is there are licenses to be respected that weren't.

lkjdsklf · 2026-06-10T15:42:27 1781106147

Model training pretty clearly falls under fair use.

We could fix that, but it requires a political will to change the law.

witx · 2026-06-10T16:46:56 1781110016

If you look carefully model training is a very good relicensing exercise of your code

bingaweek · 2026-06-10T16:09:07 1781107747

This has not been determined in courts and your willingness to speak so confidently about it speaks volumes.

simonw · 2026-06-10T16:20:01 1781108401

The closest we've come to a court decision on this so far has been the Anthropic case, which did indeed find that training on unlicensed data falls under fair use: https://www.documentcloud.org/documents/25982181-authors-v-a...

> To summarize the analysis that now follows, the use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies.

paganel · 2026-06-10T09:54:24 1781085264

> Forcing developers to pay for models that were built on code they scraped scot-free

That's also caused by some very smart (even brilliant) developers (you can see many of them in this very thread) choosing to be oblivious about all this and bury us all under, hoping that they'll be among the last ones to go. Writing this down I realise that they maybe aren't all that smart.

thewebguyd · 2026-06-10T15:17:30 1781104650

I've been saying this since the beginning, the rug pull is coming. If these models can eventually replace a human worker, there is no reason these companies won't charge (and get away with it) very close to a typical SWE salary.

It would not surprise me one bit to see anywhere from $80k-$100k/seat pricing.

larodi · 2026-06-10T07:18:08 1781075888

As someone noted here recently - use the frontier models as much as u can, while you can.

miroljub · 2026-06-10T08:12:17 1781079137

Thankfully, we have Chinese models we can use for a fraction of the price.

Not everyone needs a Ferrari to go for a weekly shopping.

baq · 2026-06-10T09:45:20 1781084720

A Ferrari will likely lap you when you’re racing, though, and the market and the economy is a race. You’ll be facing a question soon, or your employer will, whether to spend a significant chunk of free cash on fable-class tokens or on literally anything else instead - wages and salaries included.

iugtmkbdfil834 · 2026-06-10T10:25:05 1781087105

<< You’ll be facing a question soon, or your employer will

Maybe? If you talk to executives, the impression that I am getting is that they tend to be somewhat misinformed at best, which, yes, is bound to result in some really bad decisions down the road. But, and it is not a small but, the ones I did talk to ( and, amusingly, those are the ones with strong opinions ) don't seem to have a lot of, um, practical exposure to this tech beyond what they heard at the watercooler. Honestly, it is kinda infuriating. And all this before we get to how companies want to say they use AI, but also keep cost down.

miroljub · 2026-06-10T10:45:19 1781088319

Yeah, sure. In the same way I can see only Ferraris driving as taxis, company cars, transport vehicles, used by post, delivery services ...

You and your work are not that special, you are not participating in car races, and you don't need a Ferrari.

simonw · 2026-06-10T13:00:56 1781096456

Are you on the $100/month subscription?

joshstrange · 2026-06-10T13:28:20 1781098100

I am, and I used up the entire 5 hour window in 8min using the highest thinking setting. It also ate up $15 of extra usage before I noticed.

I’ve done the same thing with opus multiple times with no issue. According to ccusage I racked up just shy of $100 of tokens using Fable.

It spun up subagents or workflows or whatever so obviously that contributed but “double opus” was not my experience. I’ve done the exact same prompt with opus on the highest setting and only once before (not even while using this prompt) hit my limits.

My prompt? I’m not a prompt wizard or anything but it was literally:

> Please review the uncommitted code in this repo for bugs/issues/code smells.

I use variations on that all the time with opus and never had issues. I figured it was a good one to kick the tires with Fable. Little did I know it would mean no more Claude Code for the next 4.5hrs (unless I wanted to pay) after this being the first time I had used CC that day (yesterday).

All in all, a pretty crappy first experience.

simonw · 2026-06-10T14:09:29 1781100569

Try running this command and see what it thinks you spent at API prices:

  uvx agentsview usage daily

Then edit the config file to add Fable pricing as described here: https://til.simonwillison.net/llms/agentsview-custom-model-p...

And run the command again. I get $126.89 for yesterday.

joshstrange · 2026-06-10T14:21:37 1781101297

Hmm, I tried that and made the config file change but it didn't work for me. I just see:

    DATE        INPUT    OUTPUT   CACHE_CR  CACHE_RD   COST     MODELS
    ----        -----    ------   --------  --------   ----     ------
    2026-06-09  142015   85315    321224    6880110    $10.96   claude-fable-5, gpt-5.5, claude-haiku-4-5-20251001

I tried to filter down to just fable (or 5.5 so I could deduct it) but the `--agent` flag doesn't seem to work how I'd expect...

I think the $10.96 is coming from gpt-5.5 since I switched to it once I exhausted all my usage on CC. CCusage reports completely different numbers so I don't know which one of those is right.

Thanks for trying, for yesterday ccusage says "$92.02" for claude, which I assumed was the Fable usage.

simonw · 2026-06-10T14:28:44 1781101724

If you run this:

  uvx agentsview serve

You'll get a localhost web application which makes it much easier to filter by model.

joshstrange · 2026-06-10T14:38:41 1781102321

That's very interesting, I had not used agentsview at all before today and I'll have to keep that in my back pocket.

Unfortunately it's not telling the whole story. The last message from the _only_ Fable session it monitored was:

> The data layer looks clean — <REDACTED>. Now waiting on the 11-angle workflow — verification and the gap sweep run after the finders; I'll compile the full ranked findings list when it completes.

And my memory jibes with that, I could see in the footer that it had spun up 11 agents (though agentsview says it used 0 subagents, don't know if it was "actually" workflows that it spun up?). It's like it didn't record the sub-sessions/sub-agents info?

I'm still shocked that my prompt (which I now can see thanks to this tool) of:

> Please review all the uncommitted work in this repo and identify any issues.

was able to burn so much, so quickly, and, most frustratingly, without actually doing anything useful because killing it was my only option lest it spend even more of "extra usage".

Overview of usage: https://cs.joshstrange.com/RjGzWVXy

Stats for that 1 session: https://cs.joshstrange.com/Fj5qv1wl

simonw · 2026-06-10T14:43:38 1781102618

Can you tell in AgentsView if Fable spun up a bunch of Opus/Haiku/etc subagents that burned tokens as well?

joshstrange · 2026-06-10T14:49:11 1781102951

It's as if it spun up a bunch of subagents but agentsview doesn't report on it. I see a tiny bit of Haiku use once I turn on all models (except gpt-5.5).

https://cs.joshstrange.com/z9x6SPcC

jsw97 · 2026-06-10T13:42:39 1781098959

simonw, if you are not bumping up against the same false-positive guardrail problems and budget consumption that everyone else is, then that is something worth digging into. I would normally say that's crazy but IPOs put weird pressure on companies.

simonw · 2026-06-10T15:02:31 1781103751

I've had a couple of guardrail blocks.

I've been watching my usage quota bars drop as I use the model, so I don't think I have a weird quota issue going on here.

sexylinux · 2026-06-10T07:19:07 1781075947

It still does make errors, yes? Because it is not usable, if we need to verify everything. AI is only interesting if it can do things that humans can not do. If you can verify results because you can do it yourself, then why use AI? It will just tie up highly skilled people in verification work. Instead these people should do the actual work; results will come quicker.

So AI is only interesting to you / your org / humans if it can do things that you cannot achieve. But if it still makes errors, how could we ever know that a super-invention by AI is not wrong?

If we cannot rely on the correctness of the result, it is not usable at all. AI must always produce reliable and correct results. That was a very fundamental requirement for computing. This problem has not been solved.

zahlman · 2026-06-10T12:38:55 1781095135

> AI is only interesting if it can do things that humans can not do.

AI is interesting as long as it can save time and/or money in getting an acceptable result. Anything that runs on a computer and can do "things that humans can do" will automatically end up doing things that humans won't do, simply by virtue of the fact that it runs on a machine that doesn't require sleep, doesn't get bored or demotivated, etc.

Verifying code (to a level where a responsible person is willing to take ownership for it) isn't trivial, sure; but writing the code by hand requires the same level of care, and the fact that the same person wrote it doesn't actually allow for shortcuts (if we're being properly responsible).

iwontberude · 2026-06-10T14:33:24 1781102004

It doesn’t get bored or demotivated, but it also lacks interest and motivation generally, so it comes with the same pitfalls of having nothing to lose and being utterly unaccountable (e.g. destructive actions, lying, and being coercive or Machiavellian for no reason other than efficiency in achieving an arbitrary and artificial status of completion).

Lutger · 2026-06-10T08:20:36 1781079636

Humans make mistakes too; does that mean humans are unusable? We accept as empirical fact that most production-quality code has 2-10 bugs per 1k LoC. According to your premise, virtually all existing software is therefore unusable.

What if an LLM overall starts to make fewer mistakes than a medium developer, costs less than that developer's salary and is 100x faster? For sure, the companies that leverage these with just a few senior devs doing prompting, testing and requirements analysis will outcompete other organizations.
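
To put the 2-10 bugs per 1k LoC figure in concrete terms, here is a minimal arithmetic sketch. The 50,000-line codebase is a hypothetical size chosen only for illustration, and `expected_bugs` is a made-up helper name, not anything from a real tool.

```python
def expected_bugs(lines_of_code, low_per_kloc=2, high_per_kloc=10):
    """Return the (low, high) range of expected bugs for a codebase.

    The default rates are the 2-10 bugs per 1k LoC quoted in the comment;
    they are a rule of thumb, not a measurement of any specific project.
    """
    kloc = lines_of_code / 1000
    return kloc * low_per_kloc, kloc * high_per_kloc


# Hypothetical codebase size, assumed only for this example.
low, high = expected_bugs(50_000)
print(f"Expected bugs: between {low:.0f} and {high:.0f}")
```

By that rule of thumb, even a mid-sized codebase that people use every day would be carrying a three-digit number of bugs.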

nalekberov · 2026-06-10T09:43:48 1781084628

Humans make mistakes and then learn from them. A really good expert would never deliberately copy-paste an obscure solution from the internet and then ask for forgiveness later.

AI agents do that; perhaps not always, but they still do. Now the question: would I trust AI without verifying its output?

camdenreslink · 2026-06-10T16:24:42 1781108682

Humans also make mistakes in ways that other humans can understand or expect. Sometimes LLMs make mistakes in a way that makes you say “no human would have ever done that”.

CookieCrisp · 2026-06-10T15:27:47 1781105267

There is plenty of work that does not need to be perfectly verified, because the risk is controlled. Prototyping a javascript game for example. Or code that runs just on your local machine where good enough is good enough. I'm sure a lot of you do super important work that needs 100% quality code all the time, but... some of us don't.

dbbk · 2026-06-10T15:42:00 1781106120

This is what tests are for.

OvervCW · 2026-06-10T10:53:35 1781088815

One does not need to be able to create it themselves to evaluate if the output is correct. Consider for example that you can easily determine if a meal tastes delicious without being an expert chef, or the fact that NP problems are very difficult to solve but make for easily verifiable solutions.
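
A minimal sketch of that asymmetry, using subset sum as the NP problem (my choice of example; the function names and the sample numbers are assumptions for illustration). Checking a proposed answer takes a short pass over it, while finding one by brute force may have to try every subset.

```python
from itertools import combinations


def verify(numbers, target, candidate):
    """Check a proposed answer in polynomial time.

    The candidate must be non-empty, use only values available in
    `numbers` (respecting duplicates), and sum to `target`.
    """
    pool = list(numbers)
    for value in candidate:
        if value not in pool:
            return False
        pool.remove(value)
    return len(candidate) > 0 and sum(candidate) == target


def solve(numbers, target):
    """Find an answer by brute force, trying up to 2**n - 1 subsets."""
    for size in range(1, len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None


numbers = [3, 34, 4, 12, 5, 2]
answer = solve(numbers, 9)
print(answer, verify(numbers, 9, answer))
```

The reviewer only needs `verify`; whoever (or whatever) produced the answer is the one paying for `solve`.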

anygivnthursday · 2026-06-10T08:29:36 1781080176

Yeah, it makes the same old errors, being confidently wrong then sorry... I mean, it is still an LLM

misja111 · 2026-06-10T07:22:39 1781076159

AI is like a junior developer. You have to review her code carefully but she is most definitely useful.

rllj · 2026-06-10T07:56:00 1781078160

Why is your AI a she? What's up with gendering LLMs? Reminds me of Richard Dawkins calling Claude "Claudia" and insisting that it is conscious.

zahlman · 2026-06-10T12:40:37 1781095237

I think GP was gendering the hypothetical junior dev, rather than the AI.

latentsea · 2026-06-10T10:14:16 1781086456

This is part of the training data now. She can hear you, you know...

naasking · 2026-06-10T15:30:23 1781105423

> Because it is not usable, if we need to verify everything.

Do you verify every line of code written by your fellow developers? I doubt it, which is strange because they make errors, don't they?

What matters is the error rate. Past some threshold, they're better than senior devs who you don't supervise closely.