Hacker News | kgeist's comments

Judging by the benchmarks on Artificial Analysis, "a very real leap over every model" is 2-3 points over competitors (say, 62 for Fable 5 vs. 59 for ChatGPT 5.5 xhigh for coding).

>LLM-type AI exacts huge costs because it is terrible at reporting "I don't know". When it doesn't know, it generates noise and polishes it.

>If a "confidence too low for output" signal could be extracted, this whole technology would be a lot more useful

Anthropic's interpretability research explored this topic a bit in 2025. Apparently, the signal is extractable:

  It turns out that, in Claude, refusal to answer is the default behavior: we find a circuit that is "on" by default and that causes the model to state that it has insufficient information to answer any given question. However, when the model is asked about something it knows well—say, the basketball player Michael Jordan—a competing feature representing "known entities" activates and inhibits this default circuit (see also this recent paper for related findings). This allows Claude to answer the question when it knows the answer. In contrast, when asked about an unknown entity ("Michael Batkin"), it declines to answer.

  By intervening in the model and activating the "known answer" features (or inhibiting the "unknown name" or "can’t answer" features), we’re able to cause the model to hallucinate (quite consistently!) that Michael Batkin plays chess.

  Sometimes, this sort of “misfire” of the “known answer” circuit happens naturally, without us intervening, resulting in a hallucination. In our paper, we show that such misfires can occur when Claude recognizes a name but doesn't know anything else about that person. In cases like this, the “known entity” feature might still activate, and then suppress the default "don't know" feature—in this case incorrectly. Once the model has decided that it needs to answer the question, it proceeds to confabulate: to generate a plausible—but unfortunately untrue—response.
https://www.anthropic.com/research/tracing-thoughts-language...
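
To make the quoted mechanism concrete, here is a toy sketch of the logic as described: a "can't answer" signal that is on by default and gets inhibited by a "known entity" signal. This is not Anthropic's actual circuit or code; the function name, the bias, and the inhibition strength are all made up for illustration.

```python
# Toy illustration only: NOT the real circuit, just the logic from the quote.
# `responds`, `refusal_bias` and `inhibition_strength` are hypothetical names/values.

def responds(known_entity_activation: float,
             refusal_bias: float = 1.0,
             inhibition_strength: float = 2.0) -> str:
    """Return 'answer' or 'decline' for a single query."""
    # The "can't answer" feature is on by default (positive bias) and is
    # pushed down by the "known entity" feature.
    refusal = refusal_bias - inhibition_strength * known_entity_activation
    return "decline" if refusal > 0 else "answer"

print(responds(0.9))  # well-known entity: the default refusal is inhibited
print(responds(0.0))  # unknown entity: the default refusal stays on
print(responds(0.6))  # name merely recognized: a "misfire", the model answers anyway
```

In this picture, a "confidence too low for output" signal would be the value of `refusal` before thresholding, and the hallucination case is the one where recognition alone is enough to push it below zero.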

That's real progress. The paper behind it is [1]. They try to extract "attribution graphs" to understand why the LLM produced some result. It's encouraging to see more work on what's going on inside. They obtained insight into a specific kind of hallucination: not finding a specific fact. "We uncover circuit mechanisms that allow the model to distinguish between familiar and unfamiliar entities, which determine whether it elects to answer a factual question or profess ignorance. “Misfires” of this circuit can cause hallucinations." That should be tested on queries which resulted in making up legal citations.

[1] https://transformer-circuits.pub/2025/attribution-graphs/met...


What is consciousness? For me, it's being aware of one's internal processes. Evolutionarily, I view it as dynamic intelligence: static intelligence has a fixed in-out pipeline, while dynamic intelligence allows one to reflect on the reasoning pipeline itself and make dynamic corrections to it => better adaptability.

If we define consciousness this way, then a plain transformer is not conscious because it's not able to explain its outputs properly or make corrections to the pipeline (i.e. "cannot modify its own system prompt", if simplified). But an ensemble of LLMs ("orchestrator/analyzer + reasoning subagents") can probably be viewed as something approximating "consciousness".

I think a chess engine can be proclaimed conscious if it has the properties listed above. However, my very simple and mechanistic definition of consciousness is debatable, especially since by many it's conflated with "soul".


If an LLM, which abstractly is just a stack of transformers, is able to reason about itself, why wouldn't a chess engine (also a stack of transformers) also be able to reason about itself?

That reasoning may not manifest as English that we can read, but that doesn't mean it isn't there.

As information bounces through the many layers of the model, I find it plausible that you could get reflection in there somewhere.

I also find it reasonable that consciousness may not even require reflection / introspection.

That's the main problem with consciousness. We really have no idea how it works, so it's really difficult to conclusively state that something lacks consciousness.


I have to disagree with most claims. I run Qwen3.6-27b at 260k context and 40-60 tok/sec. It handles most coding problems as well as Sonnet 4.6 under OpenCode on our production tasks. (As an experiment, I run the same prompts for the same issues in parallel for Qwen 3.6 and Sonnet 4.6 and usually see little difference in performance). I see zero degradation from quantization in practice.

Settings: RTX 5090, 5-bit weights (Unsloth), FP8 KV cache.

Last time I tried running large MoEs on this PC, they had inferior quality at 2-3 bits compared to much smaller dense models at 5-6 bits, and were slower anyway.


A 260k context (close to the stock maximum for Qwen, though it's possible to extend it) will take ~16GB RAM for storing the KV cache, barring quantization tricks which severely degrade quality. That's a whole lot more than what DeepSeek requires for a similar context length, and makes it infeasible to batch multiple inferences together. This used to be the status quo for consumer inference; in fact, it still is for models like Kimi and GLM (which can sometimes be smarter than even DeepSeek V4 Pro!), but we can also do better nowadays.
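
For anyone who wants to sanity-check a figure like that, the usual back-of-the-envelope formula is 2 (keys and values) x attention layers x KV heads x head dimension x bytes per element x tokens. A minimal sketch follows; the layer/head numbers are hypothetical, picked only so the result lands in the same ballpark as the figure above, and are not the real Qwen config.

```python
def kv_cache_bytes(tokens: int, attn_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Size of a standard attention KV cache for one sequence."""
    # Factor 2: one key vector and one value vector per KV head,
    # per attention layer, per cached token.
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * tokens

GIB = 2**30

# Hypothetical architecture numbers, for illustration only.
cfg = dict(attn_layers=16, kv_heads=4, head_dim=256)

fp16 = kv_cache_bytes(260_000, bytes_per_elem=2, **cfg)  # 16-bit cache
fp8 = kv_cache_bytes(260_000, bytes_per_elem=1, **cfg)   # 8-bit cache

print(f"{fp16 / GIB:.1f} GiB at 16-bit, {fp8 / GIB:.1f} GiB at 8-bit")
```

Only layers that keep a per-token KV cache count towards `attn_layers`, so hybrid architectures with fewer full-attention layers shrink this number directly, as does a narrower cache dtype.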

>the reasoning behind the argument is bizarre

>Decomposing the complex activity into simple steps like 'predicting the next word' and claiming that surely can't have consciousness

I agree. I think the whole point of natural "intelligence" is to predict future events in order to properly plan actions for survival. The only difference is that next-word prediction happens in a different space (in a non-physical, textual form). But I don't think the distinction matters that much, because by the time the signals reach our brain's "intelligence core," they're already preprocessed multiple times. We can't see real physical reality, we "hallucinate" colors, touch, etc. (I think schizophrenia occurs when this kind of "controlled hallucination" goes wonky, but I may be wrong). So we're not that different from LLMs here.

I'd define consciousness as meta-intelligence, i.e. the ability to reflect on why/how a prediction was made (and make corrections to the pipeline). LLMs so far cannot properly explain why a certain prediction took place, but I'm not sure humans can fully either! I remember there was research showing that by scanning the brain (or some other signals), you can predict a person's choice before they're even aware of it. It's possible that our explanations are post hoc as well, and that the meta-cognitive ability to explain our own reasoning is as rudimentary as LLMs' (see: all the biases).

If you think about it this way, the only difference left is that everything we do is based on survival. LLMs don't have this goal. But I'm not sure it's relevant to the concept of consciousness.

A few months ago, I recreated Qwen3's architecture in 30 lines of code, and it gave me a sort of existential crisis: does it really take just 30 lines of code and a float array to recreate something that thinks and sounds almost like me? Is that all there is to it? There's often the argument that the brain is much more complex than 30 lines of code, which is true, but in my opinion, a lot of brain structures are basically archaic legacy systems (or auxiliary subsystems) that are not strictly necessary for human intelligence (which is bolted onto those legacy systems). If you carefully remove 95% of the brain (if you know where to remove), you can probably still have consciousness left in some form, it just wouldn't be very capable of surviving on its own.

I think our ability to debate consciousness in machine learning systems is greatly hindered by our subconscious existential fear: the fear that we ourselves can be reduced to non-conscious bits and simple mechanisms. In a way, it feels like a destruction of the ego.


I applaud the effort, but every time there's a new hobbyist programming language on HN, almost always it's something I've already seen in countless other hobbyist languages, just a slight variation of it based on the author's personal tastes. It doesn't tell me why I should adopt it over language X. What I'd like to see is exploration of novel practical ideas that would make certain types of projects much faster to write/read compared to most other languages.

For example, a typical web service I work on:

    - uses JSON APIs
    - it's fully stateless (uses external DBs/caches for persistence)
    - has the concepts of value objects, entities, architectural layers (app, domain, infra), ports/adapters etc.
    - only entities are proper rich objects, while most of the code is stateless services that operate on requests + entities + value objects
    - stateless services are composed (via interfaces) into a dependency tree (stored in the dependency container)
Currently I'm playing around with an idea for a language that makes writing things like that fast and compact to read. Something like:

    module my_service

    layer app {
        service Adder {   // stateless service
            uses base int // a value-based dependency, injected in the container below

            method add(x int) int {
                return base + x
            }
        }

        service Doubler {
            uses a Adder  // delegates to another service

            method double(x int) int {
                return a.add(x) + a.add(x)
            }
        }
    }

    container {       // dependency container construction with injections
        A = Adder { base: 10 }
        D = Doubler { a: A }
    }

    // automatically generates a web server that exposes a JSON API with method "double" and accepts the "n" argument
    endpoint double(n int) int {
        return D.double(n)
    }
This is a synthetic example, but you get the idea (entities and value objects omitted here).
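
For comparison, this is roughly what the same thing looks like hand-written in plain Python today. It is only a sketch of the semantics of the snippet above, not a proposed implementation; `endpoint_double` is a hypothetical stand-in for the generated JSON endpoint, and the web server part is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adder:              # stateless service
    base: int             # value-based dependency, injected by the container

    def add(self, x: int) -> int:
        return self.base + x


@dataclass(frozen=True)
class Doubler:
    a: Adder              # delegates to another service

    def double(self, x: int) -> int:
        return self.a.add(x) + self.a.add(x)


# "container": the dependency tree is wired once, at startup
A = Adder(base=10)
D = Doubler(a=A)


# "endpoint": the part the language would generate; here just a plain function
def endpoint_double(n: int) -> int:
    return D.double(n)


print(endpoint_double(5))
```

Nothing here stops a service from keeping mutable state or reaching into the wrong layer, which is the gap the language-level constructs are meant to close.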

What do you think? Does it make sense? It basically moves something usually implemented by a framework into the language, but that's the entire point: a language optimized for writing compact, architecturally safe stateless services in a few lines of code. For example, since we know a request's memory is bound to that request (no global state), we can have very optimized memory management without a full GC => improved latency. Or for example, we can have compile-time checks for things like dependency direction validation (i.e. the domain layer cannot reference the infrastructure layer) to keep the architecture clean, etc.
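
As a rough idea of what the dependency direction check amounts to, here is a minimal sketch. The layer ranking and the `(from_layer, to_layer)` input format are assumptions for illustration; a real compiler would derive the references from the parsed `uses` declarations.

```python
# Assumed ordering: outer layers may depend on inner ones, never the reverse.
LAYER_RANK = {"domain": 0, "app": 1, "infra": 2}


def direction_violations(references):
    """references: iterable of (from_layer, to_layer) pairs found in the code.

    Returns the pairs where an inner layer references an outer one.
    """
    return [(src, dst) for src, dst in references
            if LAYER_RANK[src] < LAYER_RANK[dst]]


refs = [("app", "domain"), ("infra", "domain"), ("domain", "infra")]
print(direction_violations(refs))  # only the domain -> infra reference is flagged
```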


I like these hobby languages just because they help expose and experiment with interesting higher level language constructs. Because of that, I don't really care if they try to sell me on the language or not.

As for your concept, I think this is super interesting. A language catered towards higher level abstractions that we use for web services these days is very appealing. The service and container constructs are particularly enticing.


I would recommend building a macro system and/or library for an existing language - that most closely aligns with your goals.

It seems like your goal is to make things more declarative / readable.

Creating a language is a pretty large undertaking, and unless you need to do it to achieve your goals, I wouldn't recommend it - unless you really just want to see what it's all about and make one.


>but there remains a seemingly obvious use case for non-latin languages to do things from scratch

>see sarvam.ai and their tokenisation improvements on local languages

You don't need to build from scratch to improve tokenization, though.

Russia's T-Bank was able to increase generation speeds by 1.5-3x by changing a stock Qwen's tokenizer to include 5 times more Cyrillic tokens (+ post-training on a Russian corpus).
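
The arithmetic behind that kind of speedup is simple: the model still decodes at the same tokens per second, but each token covers more text. A tiny sketch with made-up tokens-per-word figures (not T-Bank's measurements):

```python
def text_speedup(tokens_per_word_before: float, tokens_per_word_after: float) -> float:
    """How much faster text comes out at a fixed tokens/sec decode rate."""
    return tokens_per_word_before / tokens_per_word_after


# Hypothetical numbers: 3.0 tokens per word with the stock tokenizer,
# 1.5 tokens per word after adding more Cyrillic tokens.
print(text_speedup(3.0, 1.5))
```

The same ratio also applies to how much text fits in a given context window.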


the improvements for sarvam were in the number of tokens used to represent words in english vs non-english languages.

the great thing about the current momentum is that someone can test this hypothesis by applying the T-Bank approach to the same set of languages and compare outcomes.

unfortunately not everyone has the same level of respectable compute this easily available. at least those outside of the ZIRP/VC ecosystem of the valley.


Isn't that how many people program too? I remember some idea or pattern from previous projects, or something I read about on the internet. Then I code it in the most straightforward way, whatever comes to mind first. Then I sit back and analyze: does it look good architecturally? Do I like it? Does it even compile? Then I rewrite some parts to make it more sound. Rinse and repeat, until I'm satisfied. I usually don't come up with entirely novel ideas on the first attempt. I usually just rehash known concepts over the course of many iterations.


It's absolutely valid, my point was just about the "agents are way more than that" part, which simply isn't true.

I try to not re-iterate too much, but maybe that's due to me not wanting to work and working for a startup, so time and motivation are hard to find.


I administer a simple AI server in the office, which just uses a single RTX 5090 but is able to serve ~80 people throughout the day. I'm impressed by Qwen3.6-27b's capabilities in agentic coding/tasks so far. Devs say it's not much different from Sonnet 4.6 on many tasks (sometimes it even outperformed it), 40-60 tok/sec, up to 260k context. The server cost about $10k with all the bells and whistles.

I spent a lot of time researching/adding/benchmarking many custom modifications to the software stack and its settings to make the server optimally handle the load with just 1 RTX 5090 without losing quality, but it's still not enough, and the wait times in the queue are getting longer. We're at the limits of the hardware, and I'm out of tricks.

The experiment was kind of a success, and the CTO agrees we should scale it. With our own infra, we could run agents 24/7 on everything. Currently, a lot of use cases for the cloud providers are completely blocked by PII/trade secret concerns (our infosec department doesn't buy the "zero retention" promise), plus you don't have to think about billing/budgets/etc. anymore.

Now I can't decide how to scale it. On one hand, I'd like to run larger models. And we have the budget to buy, say, 8xH200. But in many benchmarks, the larger models that do fit in 8xH200 comfortably and can serve many parallel requests with acceptable speed/quality don't seem to outperform Qwen3.6 that much in agentic coding/tasks to justify the price.

So another option is just to buy a bunch of RTX 6000s and scale horizontally instead: run a copy of a midrange LLM like Qwen3.6 on each GPU. It's cheaper and easier to scale/replace, but then we'll run into problems running larger models in the future if we have to, because of no NVLink support (say, if Alibaba & Co. stop releasing ~30b models and/or ~30b models start falling behind 400b+ models considerably)

Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)


> our infosec department doesn't buy the "zero retention" promise

They are wise to be skeptical! It is neither a promise nor zero data retention.

Look at Anthropic's Zero Data Retention policy -- and remember, this is the policy that applies to the exclusively eligible enterprise partners who can even qualify for a ZDR agreement with Anthropic:

> When ZDR is enabled, prompts and model responses generated during Claude Code sessions are processed in real time and not stored by Anthropic after the response is returned, *except where needed to comply with law or combat misuse*.

> Even with ZDR enabled, Anthropic may retain data where required by law or to address Usage Policy violations. If a session is flagged for a policy violation, *Anthropic may retain the associated inputs and outputs for up to 2 years*....

This means that Anthropic is actively inspecting all of your data with machine learning classifiers. When the usage is flagged for whatever reason as violating any aspect of Anthropic's Usage Policy, then they get to keep your data for 2 years, with no apparent limitation on what they can then use it for.

Crucially, you have ZERO guarantees about the sensitivity or specificity of these classifiers. For all anyone knows, Anthropic is silently flagging 75% of queries and retaining the data.

https://code.claude.com/docs/en/zero-data-retention


I wonder how AWS handles this in Bedrock. Do they use Anthropic's classifiers? Or their own? Or none? Would their data policing be different in Bedrock than their other services?


I think it’s a cost/opportunity tradeoff at best with any agreement, regardless. The rest of the contract may make it difficult to impossible to do anything about it, starting with basic arbitration clauses and ending in a ton of other provisions that can make any legal action futile. I doubt there’s much room to negotiate too.

Given that all labs need to diversify to become profitable, they’ll end up competing with their customers, and there's nothing that exposes a business more than having AI offload every job function for every account, every mail etc.

Assuming this won’t be an issue is naive at best.


I have a 5090 machine sitting idle that I'm considering turning into a machine for my own small team (3 devs).

Are you willing to share any lessons learned, etc. that I could make use of? We are evaluating paying for a SOTA sub or trying this, and the talk about Qwen3.6-27B makes me want to try deploying this machine.


Sell the machine for $4K, use it to pay for Codex Pro for everyone for a year. Everyone will be significantly more productive and happy.

It's not even a real comparison if they are actually using them for coding.

If you are deploying always running agents (e.g. monitoring logs and services) then sure - a QWEN local server is a good choice. But for coding the cost in productivity of using a lower performing model is way too high.


The 5h quota of Codex Pro on GPT 5.4 Medium lasts me for around an hour and a half, maybe 2 hours. And this is already the "savvy" setup. Enable GPT 5.5 High fast and you will be beached in 30 minutes with active development.

For continuous all-day work you definitely need a higher tier sub level.

I'm actually looking into deploying a GPU at my company because we cannot give out our code. Qwen 3.6 looks good.


this might be true for the plus account. For the "Pro" tier ($100-$200/month) the 5h limit is never a problem.


Right, I did swap that. Still, you have to pay that 4k then every year and give out the code. I also assume that prices will go up as no AI company (but NVIDIA -> selling shovels) is currently making any money.

For some projects the giving out the code part might be ok (I use Codex there too) but for the core app at the company I'm working at there is currently a strict no-AI policy. A local GPU solves this.


Anyone who frivolously suggests throwing away possible independence in favor of dependence on a Silicon Valley company is either incredibly naïve or acting in bad faith.


Not necessarily so. I can see how a bid to predict how things will be in 1 year in AI-based coding is likely a losing one. So the idea is to extract the maximum value now, and turn it into profits that would buy you whatever is adequate for the next steps. For comparison, the AI-based coding landscape a year ago, in May 2025, wasn't even close to what we have now, and half the key tools did not exist.

OTOH, as we see, the larger models demonstrate diminishing returns, smaller models demonstrate improvements, and hardware does not show any signs of becoming cheaper, so holding on to existing decent GPUs may, too, be a winning strategy in the longer term.


I'll choose not to respond to your personal attack.

But in terms of actually running a dev team - you are free to use QWEN or another quantized local model that can run on an RTX 5090 for coding if it makes you feel more independent. However, you would struggle and spend many, many more hours achieving the same thing, with a lot more debugging time, long delays before it's done, and many more prompts.

It's just not the right approach. I use QWEN and other local models all the time, but for more clearly defined monitoring and classification tasks.


> Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)

Join the RTX6kPRO tribe!

- https://discord.gg/pYCvaQTf

- https://github.com/local-inference-lab/rtx6kpro


> Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)

For what it's worth, I've been seeing ~100 tps with 4-bit MiniMax 2.7 on two RTX 6000 boards, just running under llama-server without any optimization effort at all. I have no serious long-context experience with that setup, but at 30K context it's still above 90 tps.

If you are happy with Qwen 3.6 27B, I would personally switch the 5090 out for 2x RTX 6000s and keep running 27B. That will give you ~2x your current throughput with a lot more headroom for multiple users. More important, it would buy time to see how things develop over the next few months before you spend a whole lot of money.


With that amount of memory can you run 4-bit DeepSeek 4 Flash? It is way more efficient in the KV cache department so may be worth a try


I haven't looked into DS4 yet but based on antirez's results on 128 GB Macbooks, it shouldn't be a problem to run it on a pair of RTX6000 Pros.

Also see https://www.reddit.com/r/LocalLLaMA/comments/1sv649s/to_run_... .


How can a single 5090 serve 80 people? Something doesn't add up here.


They don't use the server all at once. In the UI, users typically ask a question, get a response, and continue with their work. In the case of autonomous agentic loops, an agent simply waits its turn until the server is ready to accept the request. Agents don't hammer the server 24/7 every second either, because they either need to be triggered or are busy doing other work, such as compiling or running tests.
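
A quick way to reason about this is Little's law: the average number of requests being served at once is the arrival rate times the time each request takes. The usage numbers below are invented for illustration, not measurements from this server.

```python
def avg_in_flight(users: int, requests_per_user_per_hour: float,
                  seconds_per_request: float) -> float:
    """Average number of requests being served at any moment."""
    arrival_rate = users * requests_per_user_per_hour / 3600.0  # requests per second
    # Little's law applied to requests in service: L = lambda * W
    return arrival_rate * seconds_per_request


# Hypothetical load: 80 people, 6 requests each per hour, 30 s of GPU time per request.
print(avg_in_flight(users=80, requests_per_user_per_hour=6, seconds_per_request=30))
```

With numbers like these, the head count matters much less than the per-request time and the bursts, which is where the queue comes in.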


It would be more interesting to know how many simultaneous users this setup can serve. Otherwise I can just say it serves 500 users but not all of them use it at the same time which doesn't communicate the right level of detail.


Depends on TTFT and tokens per second you want.


I also call this "bollocks": there is no way this workflow is even 1/10 of what you can get with Codex/Claude Code.

A normal engineer may be running a couple of sessions with every session spawning sub agents left and right.

80 persons or even 10 having this workflow on this setup doesn't work, and this is the standard engineer workflow today.


Subagent swarms are actually great for the local inference scenario because they can share a whole lot of KV cache. You get to raise the compute intensity of decode (i.e. the aggregate tok/s) essentially for free.
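
A minimal sketch of the accounting, assuming the subagents really do start from a common prefix (say, the same system prompt and tool definitions) and the server can reuse cached prefixes. The token counts are hypothetical.

```python
def cached_tokens(n_agents: int, shared_prefix: int, private_suffix: int,
                  share_prefix: bool) -> int:
    """Total tokens that must live in the KV cache for n_agents sequences."""
    if share_prefix:
        # The prefix is stored once; only each agent's own continuation is extra.
        return shared_prefix + n_agents * private_suffix
    # Without sharing, every agent carries its own copy of the prefix.
    return n_agents * (shared_prefix + private_suffix)


# Hypothetical swarm: 8 subagents, 20k-token common prefix, 2k private tokens each.
shared = cached_tokens(8, 20_000, 2_000, share_prefix=True)
separate = cached_tokens(8, 20_000, 2_000, share_prefix=False)
print(shared, separate)
```

The saving disappears to the extent that each subagent is given a fresh, unrelated context.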


Hum I normally am doing a clean context for the sub agent. If I want my context I do it in the main session, if it’s side work I want a clean small context with just the directions.


With parallelism of 16 you can still get around 25 to 30 tokens per user when all 16 channels are running. Not everyone will use the model at the same time but it certainly will be tight, especially for agentic coding. For pure chat applications this should be quite fine.


The problem with wide parallelism with most models is that it blows up your KV cache. There are open models with KV caches lean enough to parallelize inference or even to offload the KV cache itself to disk without immediately running into wearout concerns, but they're quite exceptional.


They are using it as an assistant, not running multiple fully automated agent loops?


Wouldn't that be a fairly ideal setup for layer parallelism? That doesn't need the high-performance communication of tensor parallelism, and the high-concurrency regime would make it easy to keep the pipeline full with microbatches. You'd also be able to scale out your KV cache storage since that naturally splits layer-wise.
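
To illustrate the layer-wise split, here is a small sketch that assigns contiguous blocks of layers to GPUs; each GPU would then hold the weights and the KV cache only for its own block. The layer and GPU counts are hypothetical, and this ignores everything an actual inference engine does for scheduling microbatches.

```python
def split_layers(n_layers: int, n_gpus: int) -> list[range]:
    """Assign contiguous, near-equal blocks of layers to pipeline stages."""
    base, extra = divmod(n_layers, n_gpus)
    stages, start = [], 0
    for gpu in range(n_gpus):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if gpu < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Hypothetical model with 62 layers on 4 GPUs.
for gpu, layers in enumerate(split_layers(62, 4)):
    print(f"GPU {gpu}: layers {layers.start}..{layers.stop - 1}")
```

Only the activations at each stage boundary cross the bus, which is why this style of parallelism is less sensitive to interconnect bandwidth than tensor parallelism.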


> don't seem to outperform Qwen3.6 that much in agentic coding/tasks

idk i imagine you'll hit less edges with a larger model just because.. more data

if you think of them as a kind of NN compression, it's ~obvious that the larger model can have more stuff encoded in it and hopefully accessible

i don't use LLMs much right now but using midrange models seems like an unnecessary compromise in most cases, especially since the big open models sound to be rivaling opus and not just sonnet :p


Qwen 3.6 27B is fine but it's not in the same ballpark as GLM-5.1 or Kimi K2.6.

If you truly want to scale up, you should get the 8xH200 with NVLink.


Thank you for the insight. This makes me feel confident that the L40S we are about to acquire with 48GB VRAM for engineering applications should be useful for agentic coding as well.


I thought NVLINK didn't matter anymore because of the latest PCI-E speeds. Am I wrong there?


Are we talking 1 GPU or 8?


> 260k context

with a single 5090?


Yep, Gated DeltaNet in Qwen3.6 requires much less VRAM for the KV cache than previous generations. Plus the KV cache is 8-bit.


is it in llama.cpp?


>But the results are the same. Reforged models do better than bare, even at those sizes

>I haven't published those evals yet

Don't forget to post the complete settings for those evals, please, because local LLMs' failure modes are often caused by incorrect setups (bad quants, bad chat templates, non-recommended temperatures, ridiculously small context, not enabling "preserve thinking" etc.). In my setup I've never seen Qwen3.6-27b get truly stuck so far. What it usually gets wrong are poor architectural decisions or forgetting to update something.


Good call! The latest forge version has per-model-parameter configs sourced from official sources (can be overridden), that's what I'll use for evals and each eval set will be paired with a commit hash. But I'll make sure to call out the location of the params and maybe highlight some for the popular models.

For the paper - more academic in nature - I wanted to isolate the model performance variable from guardrail lift. The delta is what mattered more than final score. For the paper, everyone got temp=0.7 - that was intentional.

As for Qwen3.6, it's really solid. It'll do really well on forge I can call that now. When I pushed it into agentic coding specifically and the eval suite I use there (separate from forge), even it needed help on long-running tasks - but it's definitely a top model right now.

However, entirely possible there are better settings than the "official recommendations" I found - which would be a neat finding in itself.

