NitpickLawyer's comments | Hacker News

> a lot of the hype around running these models locally is bullshit. Sure, you can make it do something but certainly nothing useful or substantial.

There is certainly a lot of hype around local models. Some of it is overhype, and some of it is just "people finding out" and discovering what cool stuff you can do. I suspect the post is a reply to the one from a few days ago where someone from hf posted a pic of themselves on a plane, using a local model, and saying it's really, really close to opus. That was BS.

That being said, I've been working with local LMs since before chatgpt launched. The progress we've made from the likes of gpt-j (6B) and gpt-neox (20B) (some of the first models you could run on regular consumer hardware) is absolutely amazing. It has gone way beyond my expectations. We're past "we have chatgpt at home" (as it was at launch); they're now actually usable for a lot of tasks. Nowhere near SotA, but "good enough".

I will push back a bit on the "substantial" part, and I will push a lot on "nothing useful". You absolutely can get useful stuff out of these models. Not in the claude-code "leave it to cook for 6 hours and get a working product" way, but with a bit of hand holding and scope reduction you can get useful stuff. When devstral came out (24B) I ran it for about a week as a "daily driver" just to see where it was at. It was ok-ish. Lots of hand holding, and I figured out I couldn't use it for planning much (its plans looked fine at a glance, but either didn't make sense or used outdated stuff). But given a better plan, it could handle implementation fine. I coded 2 small services that have been running in prod for ~6mo without any issues. That is useful, imo. And the current models are waaay better than devstral1.

As to substantial, eh... Your substantial can be someone else's Taj Mahal, and their substantial could be your toy project. It all depends. I draw the line at useful: if you can string together a couple of useful tasks, it starts to become substantial.


Someone probably did it for an internal demo, as a joke. Then people pushed it upwards, until someone clueless approved it.

> With MoE models, you could fetch expert layers from the network on demand

This is a common misconception, probably due to the unfortunate naming. "Experts" in an MoE are not experts on any particular subject, and the "active" parameter count only refers to the parameters activated per token; the router picks a different subset of experts for every token. So you'd still need all (or nearly all) of the weights on hand for any particular query, even if some experts have a very low chance of being activated.
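A quick back-of-the-envelope sketch of why (the layer count, expert count, and top-k below are made up for illustration, not any particular model's):

    import random

    NUM_LAYERS = 32    # MoE layers in a hypothetical model
    NUM_EXPERTS = 64   # experts per MoE layer
    TOP_K = 4          # experts the router activates per token, per layer

    def fraction_of_experts_touched(num_tokens: int) -> float:
        """Fraction of all experts activated at least once while
        generating num_tokens tokens, assuming uniform routing."""
        touched = set()
        for _ in range(num_tokens):
            for layer in range(NUM_LAYERS):
                # the router picks a token-dependent subset per layer
                for expert in random.sample(range(NUM_EXPERTS), TOP_K):
                    touched.add((layer, expert))
        return len(touched) / (NUM_LAYERS * NUM_EXPERTS)

    for n in (1, 10, 100, 1000):
        print(f"{n:>5} tokens -> ~{fraction_of_experts_touched(n):.0%} of experts")

With these made-up numbers a single token only touches ~6% of the experts, but after a few hundred tokens nearly all of them have been hit, so fetching experts on demand converges to downloading the whole model anyway.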

All in all, you'd be better off lazy-loading the entire model; at least then you'd know you have the capability to run inference from that point on.


Ultimately it would amount to lazy-loading the model, but the parameters themselves would be fetched from the network as needed, which still decreases time-to-first-token. It's true that "expert" choices will span most of the model, regardless of any particular "subject" or "topic" choice, but if we simply care about time-to-first-token it's still a viable strategy.

Perhaps you could generate a few tokens before the entire model is downloaded, but since every token takes a potentially different "path" through an MoE model, you'd still need to wait for the entire download before getting deeper than a handful of tokens... which is not really a UX improvement imo.

Even at its worst, it's a minor UX improvement compared to having to download everything prior to getting to the first token. Ultimately we will complete the download, but we can still pick the best priority so that the first handful of tokens goes through.

> used to be skeptical of the government provenance

Do you mean skeptical about which government was responsible, or skeptical that it was in fact a government effort at all?

I can see how attribution could be debatable (mainly between two main suspects), but are/were there any good arguments against this being a gov effort? I would find it highly unlikely that anyone other than a gov could muster that much domain knowledge, source pristine 0days, and be so stealthy at the same time.


I didn't want to give bumptious government CNE teams that much credit, and also a lot of the indicators people were giving of state origin didn't seem all that predictive. I don't agree with your premise that it takes a state-level adversary to collect the domain knowledge needed to do this stuff, and I certainly don't agree about the "pristine zero days".

> What I do not understand

It's not you. It's the people who were somehow convinced that serving crap is gonna "hurt" the models. These are people who have 0 clue about how models are trained and how they work, but have been riled up by others who similarly don't understand the technical details yet have strong biases against them. This is ignorance signalling at its finest.

And, as expected, it's hurting their (regular) users more than it'll ever hurt the model trainers. Oh well..


> Remember when people thought solving Erdos problems required intelligence? Is there anything an LLM could ever do that would count as intelligence?

Hah. It reminds me of this great quote, from the late '70s:

> There is a related “Theorem” about progress in AI: once some mental function is programmed, people soon cease to consider it as an essential ingredient of “real thinking”. The ineluctable core of intelligence is always in that next thing which hasn’t yet been programmed. This “Theorem” was first proposed to me by Larry Tesler, so I call it Tesler’s Theorem: “AI is whatever hasn’t been done yet.”

We are seeing this right now in the comments. 50 years later, people are still doing this! Oh, this was solved, but it was trivial; of course this isn't real intelligence.


That is a “gotcha” born of either ignorance (nothing wrong with that, we’re all ignorant of something) or bad faith. Definitions shift as we learn more. Darwin’s definition of life is not the same as Descartes’ or Plato’s or anyone’s in between or since, because we learn and evolve our thinking.

Are you also going to argue that definitions of life from before we even learned of microscopic or single-cell organisms are correct, and that the definitions we use today are wrong? That they are shifting goal posts? That “centuries later, people are still doing this”? No, that would be absurd.


I don't see it as a gotcha. Just an (evergreen, it seems) observation that people will absolutely move the goalposts every time there's something new. And the people doing it can be ignorant outsiders or experts in the field alike.

For example, ~2 years ago, an expert in ML publicly made this remark on stage: LLMs can't do math. Today they absolutely and obviously can. Yet somehow it's not impressive anymore. Or, and this is the key part of the quote, it's somehow not related to "intelligence". Something that 2 years ago was not possible (again, according to a leading expert in the field) is possible today. And yet it gets reframed as something they could always do, and since they're doing it today, it's suddenly no longer important. On to the next one!

No idea why this is related to Darwin or definitions of life. The definitions don't change. What people considered important 2 years ago is suddenly not important anymore; the only thing that changed is that today we can see that capability. Ergo, the quote holds.


> For example, ~2 years ago, an expert in ML

See, that’s a poor argument already. Anyone could counter that with other experts in ML publicly making remarks that AI would have replaced 80% of the work force or cured multiple diseases by now, which obviously hasn’t happened. That’s about as good an argument as when people countered NFT critics by citing how Clifford Stoll said the internet was a fad.

> made this remark on stage: LLMs can't do math. Today they absolutely and obviously can.

How exactly are “LLMs can’t” and “do math” defined? As you described it, that sentence does not mean “will never be able to”, so there’s no contradiction. Furthermore, it continues to be true that you cannot trust LLMs on their own for basic arithmetic. They may e.g. call an external tool to do it, but pattern matching on text isn’t sufficient.

> The definitions don't change.

Of course they do, what are you talking about? Definitions change all the time with new information. That’s called science.


The definition of "can/cannot do math" didn't change. That's not up for debate. 2 years ago they couldn't solve an Erdos problem (people have tried; Tao tried ~1 year ago). Today they can.

Definitions don't change. What's changing is the idea that, now that they can, it's no longer intelligence. And that's literally moving the goalposts. Read the thread here, go to the bottom part. There are zillions of comments saying this.

You seem keen on not trying to understand what the quote is saying. This is not good faith discussion, and it's not going anywhere. We're already miles from where we started. The quote is an observation (and an old one at that) about goalposts moving. If you can't or won't see that, there's no reason to continue this thread.


> The definition of "can/cannot do math" didn't change. That's not up for debate.

That is not the argument. The point is that the way you phrased it is ambiguous. “Math” isn’t a single thing, and “cannot” can mean either “cannot yet” or “cannot ever”. I don’t know what the “expert” said since you haven’t provided that information; I’m directly asking you to clarify the meaning of their words (better yet, link to them so we can properly arrive at a consensus).

> Definitions don't change.

Yes they do! All the time!

https://www.merriam-webster.com/wordplay/words-that-used-to-...

> And that's literally moving the goalposts.

Good example. There are no literal goal posts here to be moved. But with the new accepted definition of the words, that’s OK.

> There are zillions of comments saying this.

Saying what, exactly? Please be clear, you keep being ambiguous. The thread barely crossed a couple of hundred comments as of now; there are not “zillions” of comments in agreement about anything.

> You seem keen on not trying to understand what the quote is saying. (…) If you can't or won't see that, there's no reason to continue this thread.

Indeed, if you ascribe wrong motivations and put up a wall before understanding what someone is arguing, there is no reason to continue the thread. The only wrong part of your assessment is who is doing the thing you’re complaining about.


He’s a booster and I don’t think he argues in good faith.

He seems to be fixated on the notion that humans are static and do not evolve, which is clearly false. What people think of as a determinant of intelligence also changes as things evolve.


New, unbenched problems are really the only way to differentiate the models, and every time I see one it goes the same way: models from top labs are neck and neck, and the rest of the bunch are nowhere near. This should kinda calm down the "opus killer" marketing we've seen these past few months every time a new model releases, esp the small ones from china.

It's funny that even one of the strongest research labs in china (deepseek) has said there's still a gap to opus, after releasing a humongous 1.6T model, yet the internet goes crazy and we now have people claiming [1] a 27b dense model is "as good as opus"...

I'm a huge fan of local models and have been using them regularly ever since devstral1 released, but you really have to adapt to their limitations if you want to do anything productive. Same with the other "cheap" "opus killers" from china: some work, some look like they work, but they go haywire at first contact with a real, non-benchmarked task.

[1] - https://x.com/julien_c/status/2047647522173104145


Benchmarks for LLMs without complete information about the tested models are hard to interpret.

For the OpenAI and Anthropic models, it is clear that they have been run by their owners, but for the other models there are a great number of ways to run them, which may use the full models or only quantized variants, with very different performance.

For instance, the model list contains both "moonshotai/kimi-k2.6" and "kimi-k2.6", with very different results, but there is no information about what the difference is between these 2 labels, which presumably refer to the same LLM.

Moreover, as others have said, such a benchmark does not prove that a certain cheaper model cannot solve a problem. It happened to not solve it within the benchmark, but running it multiple times, possibly with adjusted prompts, may still solve the problem.

While for commercial models running them many times can be too expensive, when you run an LLM locally you can afford to run it many more times than when you are afraid of the token price or of reaching the subscription limits.


Agreed. But, at least as of yesterday, dsv4 was only served by deepseek. And, more importantly, that's what the "average" experience would be if you set up something easy like openrouter. Sure, with proper tuning and so on you can be sure you're getting the model at its best. But are you, if you just set up openrouter and go brrr? Maybe. Maybe not.

I think it's important to point out that DeepSeek was basically soft-launching their v4 model, and they weren't emphasizing it as some sort of SOTA-killer but more as proof of a potentially non-NVIDIA serving world, and as a venue for their current research approaches.

I think/hope we'll see a 4.2 that looks a lot better, same as 3.2 was quite competitive at the time it launched.


I feel like many of the “as good as opus” crowd would achieve the same with sonnet tbh. Actually reaching the ceiling of what Opus can do is maybe 10% of tasks; the rest is wasted compute on a too-strong model they default to for whatever they are doing. Hence they see little drop in output quality when trying out smaller open models.

It is a price-signalling problem, both in the API and in the subscriptions.

The difference between $3/MTok and $5/MTok does not reflect the capability difference. Similarly, the subscriptions have extra Opus-specific allowances. I prefer Sonnet; for most tasks it is good enough, but sometimes I am forced to use Opus because I am out of Sonnet for the week. It feels like Opus is an unwanted second option.

If they priced Sonnet closer to $1/MTok like other models, it would signal the value of Opus better.


The Eval problem: “Alice is supposedly smarter than Bob, but they can both tie their shoes just as fast”.

The question isn't whether it's "as good as Opus" but whether there exists something that costs 1/10th as much and can still competently write code.

Honestly, I was "happy" with December 2025 time frame AI, or even earlier. Yes, what's come after has been smarter, faster, cleverer, but the biggest boost in productivity was just the release of Opus 4.5 and GPT 5.2/5.3.

And yes it might be a competitive disadvantage for an engineer not to have access to the SOTA models from Anthropic/OpenAI, but at the same time I feel like the missing piece at this point is improvements in the tooling/harness/review tools, not better-yet models.

They already write more than we can keep up with.


Also, at some point the open models may be poured into silicon and become 100x as fast. Maybe when allowed to use more tokens, the models can solve problems which they now cannot. So this would imho be an interesting addition to the benchmark.

Oh, I agree. Last year I tried making each model a "daily driver", including small ones like gpt5-mini / haiku, and open ones, like glm, minimax and even local ones like devstral. They can all do some tasks reliably, while struggling at other tasks. But yeah, there comes a point where, depending on your workflows, some smaller / cheaper models become good enough.

The problem is the overhypers: they hype small / open models and make it sound like they're close to SotA. They really aren't. It's one thing to say "this small model is good enough to handle some tasks in production code", and a different thing to say "close to opus". One makes sense; the other just sets the wrong expectations, and is obviously false.


There is no doubt that for many tasks the SotA models of OpenAI and Anthropic are better than the available open weights models.

Nevertheless, I do not believe that OpenAI, Anthropic, or Google knows any secret sauce for training LLMs better. I believe that their current superiority is just due to brute force: their LLMs are bigger and have been trained on much more data than other LLM producers have been able to access.

Moreover, for myself, I can extract much more value from an LLM that is not constrained by token-cost metering and where I have full control over the harness used to run the model. Even if the OpenAI or Anthropic models were much better than the competing models, I would still be able to accomplish more useful work with an open-weights model.

I have already been through this transition once, from fast mainframes and minicomputers that I accessed remotely, shared with other users, to slow personal computers over which I had absolute control. Despite the differences in theoretical performance, I could do much more with a PC, and the same is true when I have absolute control over an LLM.


I am desperate for the tooling that puts me back in charge. And just has the models as advisor. In which case the "smart level" is just a dial.

I'm probably going to have to make it myself.


Some IDEs already have this. In Zed you can stick it in “ask” mode.

Being able to use it as a rubber duck while it can also read the code works quite well.

There are a few APIs at work I have never worked on, and the person who wrote them no longer works with us, so AI fills that gap well.


Our philosophy is that you can design problems so that they scale through a few release cycles by making the environments more complex, with no known ceiling. The key to scalability is not having a single correct answer (even though Victor's benchmark is interesting) while still being objectively scorable.

That's what we've done with our comprehensive reasoning and coding benchmark at https://gertlabs.com
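To make "objectively scorable without a single correct answer" concrete, here is a toy sketch (illustrative only, not our actual harness): a job-scheduling task where many different assignments are acceptable, yet each one gets an objective score.

    import random

    def score_schedule(durations, assignment, num_workers):
        """Score in (0, 1]: theoretical lower bound / achieved makespan.
        Many different assignments can score well; there is no single
        "correct" answer, yet the score is fully objective."""
        loads = [0.0] * num_workers
        for dur, worker in zip(durations, assignment):
            loads[worker] += dur
        lower_bound = max(max(durations), sum(durations) / num_workers)
        return lower_bound / max(loads)

    jobs = [random.uniform(1, 10) for _ in range(20)]
    round_robin = [i % 4 for i in range(len(jobs))]
    print(f"round-robin scores {score_schedule(jobs, round_robin, 4):.2f}")

Difficulty then scales for free: more jobs, more workers, extra constraints.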


"almost as good as opus at writing python/js/... when given a spec" might be enough for a lot of people, especially if its 10x cheaper

i dunno, Opus is losing its edge imo. i regularly use a mix of models, including Opus, glm 5.1, kimi 2.6, etc., and i find that all of them are pretty much equally good at "average" coding, while on difficult stuff they're nearly equally bad. i can't deny that Opus has an edge, but it's not a huge one.

tbf, we've learned (ha!) more from smashing teeny tiny particles and "looking" at what comes out than from, say, 40 years of string theory. Sometimes doing stuff works, and the theory (hopefully) follows.

Same with electricity. We had Ohm's Law and were building electrical devices (e.g. telegraph, lightbulb) long before we discovered the electron.

As this is a new arch with tons of optimisations, it'll take some time for inference engines to support it properly, and then we'll see more 3rd-party providers offer it. Once that settles we'll have a median price for an optimised 1.6T model, and we can "guesstimate" from there what the big labs can reasonably serve for the same price. But yeah, it's been said for a while that the big labs are ok on API costs. The only unknown is whether the subscriptions are profitable or not. They've all been reducing the limits lately, it seems.

Is there evidence that frontier models at anthropic, openai, or google are not using comparable optimizations to draw down their costs, and that their markup is just higher because they can?

> (better than Opus 4.6)

There we go again :) It seems we have a release each day claiming that. What's weird is that even deepseek doesn't claim it's better than opus w/ thinking. No idea why you'd say that, but anyway.

Dsv3 was a good model. Not benchmaxxed at all; it was pretty stable where it was. It did well on tasks that were OOD for benchmarks, even if it was behind SotA.

This seems to be similar. Behind SotA, but not by much, and at a much lower price. The big one is being served (by ds themselves for now; more providers will come and we'll see the median price) at $1.74 in / $3.48 out / $0.14 cache per MTok. Really cheap for what it offers.

The small one is at $0.14 in / $0.28 out / $0.028 cache, which is pretty much "too cheap to matter". This will be what people can realistically run "at home", and it should be a contender for things like haiku/gemini-flash, if it can deliver at those levels.
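As a rough sanity check on what those per-MTok prices mean per request (the 20k-input / 80% cache-hit / 2k-output workload below is made up, purely for illustration):

    # prices in $ per million tokens, from the figures above
    BIG   = {"in": 1.74, "out": 3.48, "cache": 0.14}
    SMALL = {"in": 0.14, "out": 0.28, "cache": 0.028}

    def request_cost(prices, in_tok, out_tok, cache_hit=0.8):
        """Cost of one request; cache_hit is the assumed fraction
        of input tokens served from the prompt cache."""
        fresh  = in_tok * (1 - cache_hit) * prices["in"]    / 1e6
        cached = in_tok * cache_hit       * prices["cache"] / 1e6
        out    = out_tok * prices["out"] / 1e6
        return fresh + cached + out

    for name, prices in (("big", BIG), ("small", SMALL)):
        print(f"{name}: ${request_cost(prices, 20_000, 2_000):.4f} per request")

That comes out to roughly $0.016 vs $0.0016 for such a request, i.e. the small one is still an order of magnitude cheaper.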


Anthropic fans would claim God itself is behind Opus by 3-6 months and then willingly be abused by Boris and one of his gaslighting tweets.

LMAO


> Anthropic fans ...

I have no idea why you'd think that, but this is straight from their announcement here (https://mp.weixin.qq.com/s/8bxXqS2R8Fx5-1TLDBiEDg):

> According to evaluation feedback, its user experience is better than Sonnet 4.5, and its delivery quality is close to Opus 4.6's non-thinking mode, but there is still a certain gap compared to Opus 4.6's thinking mode.

This is the model creators saying it, not me.

