If it is a one-off task, it doesn't matter whether you use a GUI or terminal commands to do it. But as soon as you need to do it more than once, the terminal starts paying off, IMO.
Here are some advantages.
- Repeatability. You can do exactly what you did before. With ZSH history + FZF, recalling a command is a breeze.
- Auditability. The command in your shell history is there for you to revisit and serves as a permanent record of something you did (or didn't do).
- Reliability. A command line doesn't make a mistake on the 10th run due to fatigue, inattention, etc.
- Reusability. You may have to repeat the same command for different folders (or remote servers). A slight modification of the previous command will do it for you.
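To make the reusability point concrete, here is a minimal sketch in Python (rather than raw shell) that runs one command unchanged across several folders. The helper name `run_in_each` and the folder paths in the usage note are made up for illustration; they are not from any particular tool.

```python
import subprocess


def run_in_each(command, folders):
    """Run the same command in every folder.

    `command` is a list of arguments (no shell involved).
    Returns a list of (folder, exit_code, stdout) tuples, in input order.
    """
    results = []
    for folder in folders:
        completed = subprocess.run(
            command,
            cwd=folder,           # run the command from inside this folder
            capture_output=True,  # keep the output instead of printing it
            text=True,            # decode stdout/stderr as str
        )
        results.append((folder, completed.returncode, completed.stdout.strip()))
    return results
```

Something like `run_in_each(["git", "status", "--short"], ["/srv/app", "/srv/lib"])` (hypothetical paths) would then repeat the same check in both places, and the call itself stays in your history as a record of what was run.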
Not just _wrong_. It is confused! It is actually right in the second sentence.
This was Friday, Opus 4.6.
>I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.
This is actually a good diagnostic of whether the model is skimping on the thinking loop. Try raising thinking effort and it should get it right. Of course, if you're running this in a coding harness with a whole lot of extraneous context, the model will be awfully confused as to what it should be thinking about.
Yeah, it was probably patched. It could reason about novel problems only if you asked it to pay attention to some particular detail, a.k.a. handholding.
The same would happen with the sheep, the wolf and the cabbage puzzle. If you formulated it similarly, with a wolf and a cabbage but without mentioning the sheep, it would summon the sheep into existence at a random step. It was patched shortly after.
I’m not sure ‘patched’ is the right word here. Are you suggesting they edited the LLM weights to fix cabbage transportation and car wash question answering?
Absolutely not my area of expertise, but giving it a few examples of the expected answer in a fine-tuning step seems like a reasonable thing to do, and I would expect it to "fix" the problem, as in making the model less likely to fall into the trap.
At the same time, I wouldn't be surprised if some of these were "patched" via a simple prompt rewrite, e.g. for the strawberry one they might just recognize the question and add some clarifying sentence to your prompt (or the system prompt) before letting it go to the inference step?
But I'm just thinking out loud, don't take it too seriously.
You are right, this is not a rewrite like the Bun case.
The real news is, at 50M LOC, it is able to handle and do _something_ coherent.