
Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB for the 754B parameters (a quick bits-per-weight check is sketched below). This is definitely a model your average local LLM enthusiast is not going to be able to run, even with high-end hardware.

[0] https://huggingface.co/unsloth/GLM-5.1-GGUF
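
As a rough sanity check on those figures (the size and parameter count are quoted from the comment above; whether "361 GB" means decimal GB or GiB is an assumption), the implied bits per weight land in the right ballpark for an IQ4_XS-class quant:

    # Effective bits per weight implied by the quoted file size and
    # parameter count. Both unit interpretations are shown since the
    # original is ambiguous.
    params = 754e9

    for label, size_bytes in [("361 GB (decimal)", 361e9),
                              ("361 GiB (binary)", 361 * 1024**3)]:
        bpw = size_bytes * 8 / params
        print(f"{label}: ~{bpw:.2f} bits/weight")
    # -> ~3.83 and ~4.11 bits/weight, near the ~4-4.25 bpw you'd expect
    #    for an IQ4_XS-class quant (some tensors are usually kept at
    #    higher precision, so the average varies per model).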



SSD offload is always a possibility with good software support (a minimal sketch of the idea is below). Of course you might object that the model would not be "running" then, more like crawling. Still, you'd be able to execute it locally and get it to respond after some time.

Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques, where SSD offload is planned for in advance when the architecture is designed.
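
For concreteness, one common way to get SSD offload "for free" is to memory-map the weight file so the OS pages in only the blocks a forward pass actually touches (llama.cpp mmaps GGUF files this way by default). A minimal sketch, with a hypothetical file name and tensor shape:

    import numpy as np

    # SSD offload via memory mapping: the OS pages weight blocks in from
    # disk on demand and evicts them under memory pressure. The file name
    # and shape here are made up for illustration.
    weights = np.memmap("expert_37.bin", dtype=np.float16,
                        mode="r", shape=(4096, 11008))

    x = np.random.randn(11008).astype(np.float16)
    y = weights @ x   # pages the needed weight blocks in from SSD on demand
    print(y.shape)    # (4096,)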


For conversational purposes that may be too slow, but as a coding assistant this should work, especially if many tasks are batched so that they all progress through a single pass over the SSD-resident weights.
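
The idea is that each layer's weights are read from the SSD once per step and applied to every queued task, so the disk bandwidth is amortized across the whole batch. A rough sketch under that assumption (load_layer and apply_layer are hypothetical stand-ins for real inference code):

    import numpy as np

    # Amortizing one pass over SSD-resident weights across many queued
    # tasks: each layer is read from disk once per step and applied to
    # the whole batch. Shapes, file names, and the layer function are
    # illustrative placeholders.
    def load_layer(path):
        return np.fromfile(path, dtype=np.float16).reshape(4096, 4096)

    def apply_layer(w, h):
        return np.tanh(h @ w.T)   # stand-in for the real attention/FFN math

    layer_paths = [f"layer_{i:02d}.bin" for i in range(80)]
    hidden = np.random.randn(32, 4096).astype(np.float16)  # 32 batched tasks

    for path in layer_paths:
        w = load_layer(path)             # one SSD read per layer per step...
        hidden = apply_layer(w, hidden)  # ...shared by all 32 sequences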


Three hour coffee break while the LLM prepares scaffolding for the project.


Like computing used to be. When I first compiled a Linux kernel it ran overnight on a Pentium-S. I had little idea what I was doing, probably compiled all the modules by mistake.


I remember that time, when compiling a Linux kernel was measured in hours. Then multi-core computing arrived, and after a few years it was down to 10 minutes.

With LLMs it feels more like the old punchcards, though.


At least the compiler was free


The point of doing local inference with huge models stored on an SSD is to do it for free, even if slowly.


You are just trading opex for capex. Local GPUs aren't free.


True, but this is not only a trade-off between opex and capex.

Local inference using open-weight models provides guaranteed performance that remains stable over time and is available at any moment.

As many current HN threads show, depending on external AI inference providers is extremely risky: their performance can be degraded, or their prices raised, unpredictably and at any time.

Being dependent on a subscription for your programming workflow is a huge bet that you will gain more from the slightly higher quality of proprietary models than you will lose if the service is degraded in the future.

As recent history has shown, many have already lost this bet.

I am not a gambler, so I have made my choice, which is local AI inference, using a variety of models depending on the task: small models that run entirely on relatively cheap GPUs (like the new Intel GPUs), medium models that need e.g. 128 GB of RAM on a CPU, and huge models that must be stored on fast SSDs (e.g. interleaved across multiple PCIe 5.0 SSDs); a rough throughput estimate for that last tier is sketched below.

Such a strategy is achievable with a modest capex, in the lower half of the 4-digit range.
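
A back-of-the-envelope bound for the SSD-resident tier: decode speed is limited by how many bytes of active parameters must be read per token versus the aggregate SSD read bandwidth. All numbers here are assumptions for illustration, not figures for any specific model:

    # Decode-speed upper bound for a sparse MoE streamed from SSD:
    # tokens/s <= aggregate read bandwidth / bytes touched per token.
    # Every number below is a hypothetical assumption.
    active_params_per_token = 32e9          # assumed active-parameter count
    bits_per_weight = 4.25                  # ~IQ4_XS-class quantization
    bytes_per_token = active_params_per_token * bits_per_weight / 8

    ssd_bandwidth = 2 * 12e9                # two PCIe 5.0 SSDs, ~12 GB/s each

    print(f"~{ssd_bandwidth / bytes_per_token:.1f} tokens/s upper bound")  # ~1.4

In practice, keeping the shared layers and the hottest experts in RAM pushes the real number above this bound, since only cold experts need to come off the SSD.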


I agree in principle that more democratic compute = better, and that third parties introduce additional risk outside of your control. That said, I just don't see it working economically: either you have an underpowered GPU (4-digit range), at which point you get a weak model, a slow model, or probably both; or you have an expensive GPU cluster, but then you also need to consider utilization, since you are probably not streaming tokens 24/7, and at that point the TCO of self-hosting is drastically higher.

Personally I hope we see a third way: strong open-weight models hosted by a variety of companies actually competing on price and 9s of availability. That way the capex-heavy GPUs are fully utilized and users can rent intelligence as a commodity.

There is a very apt analogy to virtual server hosting: VPS/shared web hosting is a commodity, and it does not make financial sense for most users to host their websites on physical servers in their basements.


Rather, imagine you have 2-3 of these working 24/7 on top of what you're doing today. What does your backlog look like a month from now?




Batching many disparate tasks together is good for compute efficiency, but it makes it harder to keep the full KV cache for each task in RAM. In a pinch you could dump some of that KV cache to storage (this is how prompt caching works too, AIUI) and stream it back in as needed, but that adds a lot more overhead than offloading sparsely used experts, since the KV cache is accessed far more heavily.
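
To put a number on how heavy that is: the KV cache is re-read on every decoded token, and even a modest per-sequence context adds up quickly. A rough estimate with hypothetical dimensions (a GQA model with 8 KV heads):

    # Per-sequence KV-cache footprint; every byte of it is re-read on
    # each decoding step, unlike a cold expert that may never be hit.
    # All dimensions are hypothetical.
    layers = 80
    kv_heads = 8
    head_dim = 128
    bytes_per_elem = 2            # fp16
    context = 32_000              # tokens

    per_token = layers * 2 * kv_heads * head_dim * bytes_per_elem  # K and V
    total = per_token * context
    print(f"{per_token // 1024} KiB per token, {total / 2**30:.1f} GiB per sequence")
    # -> 320 KiB per token, ~9.8 GiB per sequence; multiply by the batch size.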



