More

dakolli · 2026-05-11T15:20:21 1778512821

Cloudflare for email people, its the best and free

xp84 · 2026-05-11T16:34:41 1778517281

Can you clarify this statement please?

traderj0e · 2026-05-11T17:49:20 1778521760

I'm confused too. Cloudflare is a DNS, anti-ddos, CDN, and light cloud

dakolli · 2026-05-11T19:00:59 1778526059

https://www.cloudflare.com/developer-platform/products/email...

petu · 2026-05-11T19:37:06 1778528226

But it's only for receiving and you still need external mail server/address?

dakolli · 2026-05-11T21:14:14 1778534054

https://developers.cloudflare.com/email-service/get-started/...

dakolli · 2026-05-11T15:16:41 1778512601

My grandpa retired as an IL police officer in his 50s and lived for 30+ more years making 6 figures from his pension and getting 3% or 5% (I forgot) adjustments every year. He probably had the most chill retirement of anybody I've ever known (outside of getting cancer twice). He was making six figures a year living on a lake near Dixon, you do not need six figures in Dixon lol

dakolli · 2026-05-10T21:46:15 1778449575

So you're running an llm to do data transformation that deterministic processes would be much better suited for and running 1,000 watt power supply to do so. Wild.

dakolli · 2026-05-10T21:20:27 1778448027

This is simply delusional, It cost 20-30k a month to run Kimi 2.6. The tokens are sold for $3 per mm.

To sell tokens profitably you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month.

I don't think people realize how expensive it is to host decently capable models and how much their use of capable models is subsidized.

You can only squeeze so many parameters on consumer grade hardware(that's actually affordable, two 4090s is not consumer grade and neither is 128gb macbooks, this is incredibly expensive for the average person, and the models you can still run are not "good enough" they are still essentially useless).

People are betting their competency on a future where billionaires are forever generous, subsidizing inference at a 10-1 20-1 loss ratio. Guess what, that WILL end and probably soon. This idea that companies can afford to give you access to 2mm in GPUs for 5 hours a day at a rate of $200.00 a month is simply unsustainable.

Right now they are trying to get you hooked, DON'T FALL FOR IT. Study, work hard, sweat and you'll reap the benefits. The guy making handmade watches, one a month in Switzerland makes a whole lot more than the guy running a manufacturing line make 50k in China. Just write your own fkin code people.

Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge and competency isn't fungible, the llm hype is a lie to convince you that it is.

zozbot234 · 2026-05-10T21:29:45 1778448585

No one runs SOTA models 24/7 for individual use or even for a single household or small business, whereas you can run your own hardware basically 24/7 for AI inference.

With the new DeepSeek V4 series and its uniquely memory-light KV cache you can even extend this to parallel inference in order to hide memory bandwidth bottlenecks and increase compute intensity.

This is perhaps not so useful on a 128GB or 96GB RAM Apple Silicon device (I've seen recent reports of DS4 runs with even one agent flow hitting serious thermal and power limits on these devices, so increasing compute intensity will probably not be helpful there) but it will become useful with 64GB devices or lower that have to stream from a slow disk, or with things like the DGX Spark or to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth.

doctorpangloss · 2026-05-11T02:40:53 1778467253

deepseek v4 flash on mlx at 1m context runs at 20 t/s decode on a mac studio m3 ultra with 512gb of RAM

alfiedotwtf · 2026-05-11T06:37:44 1778481464

What is everyone running DeepSeek v4 Flash with?!

It’s currently unsupported on Llama.cpp and vllm doesn’t support GPU+CPU MoE, so unless all of you have an array of DGX Sparks in your bedroom, what’s the secret sauce?!

zozbot234 · 2026-05-11T07:47:19 1778485639

https://www.github.com/antirez/ds4 (from Antirez of Redis fame) runs a 2-bit quant on Apple Silicon hardware and 96GB or 128GB RAM.

alfiedotwtf · 2026-05-11T16:07:16 1778515636

I've been keeping an eye on Antirez's Metal fork for llama.cpp, but I totally missed this. Whoa, nice. Giving it a go, thanks!!

zozbot234 · 2026-05-11T16:29:50 1778516990

What kind of hardware are you planning to run this on? As mentioned already, I've been trying to understand how gracefully it might degrade on 64GB RAM or perhaps lower (the total weights size is 80GB at the provided quant) using SSD offload for the weights, and then (assuming it works and doesn't just OOM) whether the tok/s figures might meaningfully improve in that scenario by running multiple sessions in parallel.

alfiedotwtf · 2026-05-11T17:19:09 1778519949

I've got a 4060 Ti 12Gb with 128Gb RAM. I was hoping once I could demonstrate to myself that I could run Deepseek v4 Flash locally (even at really slow speeds), then it would be worth my time and money to get something to run it > 20t/s.

... currently testing out Stepfun 3.5 Flash Q4_k_m as a stop gap (unless it blows my socks off first).

zozbot234 · 2026-05-11T18:51:38 1778525498

I don't think the DS4 project supports the CPU/GPU split approach you'd need for best performance on that kind of hardware (shared layers on GPU, most experts on CPU). CPU-only inference would work but might be slow.

alfiedotwtf · 2026-05-11T22:48:27 1778539707

Ah dang. Hmm, damn this hobby is expensive. Maybe I should just take up drugs instead

doctorpangloss · 2026-05-11T16:27:11 1778516831

you can run it today with mlx if you have 256g or 512g mac studio. no "antirez" fork needed.

it isn't that large of a model and the compressed kv implementation is not that complicated

the problem is that they released the model in a quantized format that is more complex than it appears, and people make a lot of mistakes working with it. it is quantization-aware-trained, so you can't "just" upscale it and scale down.

vllm runs dsv4 flash fine right right now

dgx sparks cannot really run it correctly right now with released vllm but there are PRs, it's just a matter of time. you would need 3 of them. they will still be almost 1/2 as fast as the mac studio.

so the punchline is, well, this is why the 512g mac studio is such a hot commodity right now.

zozbot234 · 2026-05-11T16:36:53 1778517413

If you have a 256 GB or 512 GB Mac Studio, the real game is to run multiple sessions in parallel in order to make the best use of your limited memory bandwidth. You'd have plenty of excess RAM for that given how small the KV cache is even at max context.

alfiedotwtf · 2026-05-11T17:20:44 1778520044

Unfortunately I didn't get a Mac with big ram at the time it was cheap, and I'd personally focus on moving away from Apple and going Linux fulltime at work and home (currently Macbook for laptop connected to my big rig, well it's not that big compared to the AI people in here).

zozbot234 · 2026-05-11T18:57:49 1778525869

What kind of RAM does your MacBook have? It might still be worth experimenting w/ DS4 using disk offload, though it would be dog slow at best and the RAM would be much too limited for meaningful parallelism, especially for larger contexts.

alfiedotwtf · 2026-05-11T22:49:14 1778539754

This might be my only hope until RAM prices come down to human levels again

dakolli · 2026-05-11T03:05:05 1778468705

Just because you read it on a github repo doesn't make it true, it also doesn't take into account cpu temps and inevitable throttling you'll encounter.

doctorpangloss · 2026-05-11T03:34:46 1778470486

i ran it on my own device haha

i don't comprehend why people are in such disbelief at how much better this stuff runs on a mac studio than on NVIDIA hardware with 1/5th the VRAM. look, what can i say? NVIDIA is a bigger rip off than Apple is!

platevoltage · 2026-05-11T03:44:12 1778471052

Which is good, because Nvidia pulling a Micron and ceasing consumer hardware production is right around the corner.

NitpickLawyer · 2026-05-10T22:04:22 1778450662

API prices are most likely not subsidised. A brief look at openrouter can tell you that. There are plenty of providers that have 0 reason to subsidise that sell models at roughly the same average price. So the model works for them (or they wouldn't do it otherwise).

ai_fry_ur_brain · 2026-05-10T23:04:18 1778454258

They are subsidized, heavily. This is simple math, there are lots of reasons to subsidize. Please go look up the hardware requirements to run your favorite model and a given tok/ps then multiple that by 86400 (seconds in a day) then divide that by 1mm and multiple by the $ per mm tokens, then ask yourself if there's any possibility they could be profitable or even close to break even.

You are going off vibes alone, this is easily verified, please go verify.

What makes you think they have zero reason to subsidize, because the providers aren't a household names you assume they wouldn't operate at a loss? Whats your logic here? You make no sense.

gpugreg · 2026-05-11T09:02:57 1778490177

Serving a single user is likely not profitable, but total throughput rises a lot when serving many concurrent users, because the same weights can be used to generate tokens for all users at once, which increases efficiency.

Also, a lot of money is being made on input tokens and cached tokens, which are much cheaper to compute.

DeepSeek published their math for serving the V3/R1 models. They were 535% profitable: https://github.com/deepseek-ai/open-infra-index/blob/main/20...

hibikir · 2026-05-11T02:26:59 1778466419

The amounts of API tokens many large companies are using through, say AWS bedrock are quite high. We've seen leaks on the bills for real world use cases. It's not unreasonable to see normal individual subscriptions as possibly subsidized.... but do we think someone like Anthropic is going to be subsidizing 7, 8, or even 9 figures monthly bills from megacorps? Because said megacorps will swap out to a competitor immediately, so your subsidy is unlikely to lead to loyalty or anything.

If Anthropic and OpenAI are subsidizing the metered API usage, their model is going to end up just as successful as MoviePass. They are burning enough money on the training costs already.

dakolli · 2026-05-11T02:54:57 1778468097

Large companies are paying an arm and a leg, but I'm still certain even at $15.00 per million tokens they are not profitible.

If you have a machine running at 150 tok/ps you can only make $5820 a month at $15 per 1mm running 24/7. It costs a hell of a lot more than 6k a month to run Claude 4.7 @ 150 tok/ps on that machine 24/7.

This math is a bit off, because you have input tokens too, but regardless its still not profitable especially for how long it takes to turn around a request and the caching is probably not all that profitable.

NitpickLawyer · 2026-05-11T04:26:37 1778473597

You are all over this thread, but you have no idea how inference works, and it's obvious. Your napkin math is off because you don't know what to add up, you lack the necessary background. And yet you persist and reply all over this thread. I don't get it.

Serving models on dedicated hardware is not the same as your at home 150t/s thing. Inference is measured in thousands of tokens / s in aggregate (i.e. for all the sessions in parallel). That's how they make money.

CuriouslyC · 2026-05-11T02:06:37 1778465197

Anthropic and OpenAI make money on API calls, margins have been reported in public filings. Subs are subsidized.

dakolli · 2026-05-11T02:56:00 1778468160

That's not possible, read my comment above. These are private companies, there are no public filings regarding their profitability in any sense. You're just making things up.

If you have a machine running at 150 tok/ps you can only make $5820 a month at $15 per 1mm running 24/7. It costs a hell of a lot more than 6k a month to run Claude 4.7 @ 150 tok/ps on that machine 24/7.

This math is a bit off, because you have input tokens too, but regardless its still not profitable especially for how long it takes to turn around a request and the caching is probably not all that profitable.

mtone · 2026-05-11T03:59:13 1778471953

You're forgetting a critical factor: concurrency. If a given hardware serves a single request at 150 tokens/s, it can also serve 20-30 requests at 100 tokens/s. Suddenly your $5K becomes $100K/month, enough to recoup the cost of the hardware in a year or so.

The reason it works: each time you read the model (memory bound) to calculate the next token, you can also update multiple requests (compute bound) while at it. It's also much more energy-efficient per token.

[1] https://aimultiple.com/gpu-benchmark

dakolli · 2026-05-11T04:52:38 1778475158

Interesting I didn't know about this, but it makes sense after reading the article. They are benchmarking on a single GPU on a 20bb param model. Does it scale across 60 H100s over NVLink/NVSwitch. I would be interested to see those benchmarks.

The idea that everyone is spinning up a $2 million in GPUs to scan their email inbox, search the web or avoid learning something is still ridiculous to me regardless.

vachina · 2026-05-11T11:11:23 1778497883

Training to be artisanal coder now.

CamperBob2 · 2026-05-10T22:04:05 1778450645

It cost 20-30k a month to run Kimi 2.6. The tokens are sold for $3 per mm.

Not if you're OK with 4-bit quantization. More like $30K-$50K one time.

Spring for 8 RTX6000s instead of 4, and you can use the full-precision K2.6 weights ( https://github.com/local-inference-lab/rtx6kpro/blob/master/... ).

reissbaker · 2026-05-10T22:32:08 1778452328

RTX 6000 Pro retails for $10k so an 8x is $80k before anything else in the computer, and long-context will have... pretty bad performance (20+ seconds of waiting before any tokens come out), but it's true it technically works.

I don't think cloud models are going away; the hardware for good perf is expensive and higher param count models will remain smarter for a looong time. Even if the hardware cost for kind-of-usable perf fell to only $10k, cloud ones will be way faster and you'd need a lot of tokens to break even.

zozbot234 · 2026-05-10T22:48:52 1778453332

> I don't think cloud models are going away; the hardware for good perf is expensive

I think local AI will win in its niche by repurposing users' existing hardware, especially as cloud hardware itself gets increasingly bottlenecked in all sorts of ways and the price of cloud tokens rises. You don't have to care about "bad" performance when you've got dedicated hardware that runs your workloads 24/7. Time-critical work that also requires the latest and greatest model can stay on the cloud, but a vast amount of AI work just isn't that critical.

reissbaker · 2026-05-11T02:53:28 1778468008

Users do not have an existing $80k of hardware, are not going to buy $80k of hardware for worse performance than paying $100/month, and models are continuing to grow in size while memory grows in price.

zozbot234 · 2026-05-11T07:44:23 1778485463

You said you need $80k in hardware for "good performance". I'm saying the local AI inference workflow will be a lot more flexible about performance than that, and can get away with something vastly cheaper and in line with what the user owns already.

otabdeveloper4 · 2026-05-11T05:40:17 1778478017

> paying $100/month

There will not ever be a monthly subscription for LLM tokens. The economics isn't there.

Local tokens will always be cheaper.

reissbaker · 2026-05-11T20:07:36 1778530056

There already are many subscriptions for LLM tokens: OpenAI, Claude, Synthetic (shameless plug), Zai...

I'm not sure what you mean by "There will not ever be a monthly subscription for LLM tokens." That already exists!

entrope · 2026-05-11T10:40:15 1778496015

What's the basis for saying local tokens will always be cheaper? As others have outlined, LLMs serving one user at a time are pretty expensive, but concurrent users become much more cost-effective (assuming there's enough RAM for the contexts). If "local" to you means ~10 hours daily use by a team of employees, the company still has to balance against cloud services that can amortize non-recurring costs over 24 hours of service per day.

zozbot234 · 2026-05-11T11:31:24 1778499084

Why would a team of employees not be able to run AI workloads 24/7? Not all workloads are time sensitive.

entrope · 2026-05-11T13:20:26 1778505626

Both my experience, and Anthropic's off-peak promotion, indicate that there are very uneven levels of demand for peak hours versus off-peak hours. How close do you think they are?

zozbot234 · 2026-05-11T15:39:22 1778513962

But that's demand for cloud inference that's priced on a flat-rate basis with some adjustments (like "off-peak hours"). Not a local rig where inference is effectively free aside from the cost of power whenever the system isn't congested.

reissbaker · 2026-05-11T20:15:10 1778530510

The local rig is not free and requires very large capital expenditures while producing very low token throughput for large models. Within any time budget, you can get many orders of magnitude more large-model tokens off an 8xB200 than off a local rig. Therefore cloud tokens have a huge capital efficiency advantage over local rigs. That will continue basically forever, since there will always be large cloud companies willing to spend millions of dollars for more capital-efficient hardware, so Nvidia and friends will continue to spare no expense producing it, meaning the cloud hardware will be way too expensive if you're not a large inference company. You can also buy local rigs, but they will be less capital efficient per token, not more.

(This is a generous argument: it also ignores the massive software stack optimization the cloud companies do that doesn't trickle down to local-rig-sized deployments; for example, prefill/decode disaggregation, which would double the VRAM requirements for a local rig — if you could even do it on a local rig, which you can't, because local rigs don't have Infiniband. But at scale, prefill/decode disaggregation improves capital efficiency, since you can tune the compute-bound prefill node differently than the memory-bound decode node.)

The advantage of local rigs is not capital-efficient tokens. It's privacy. But then again, you can get zero-data-retention options from many inference companies, so for many use cases it may not matter unless you need strict guarantees the data never leaves the building...

zozbot234 · 2026-05-11T22:21:31 1778538091

> The local rig is not free and requires very large capital expenditures while producing very low token throughput for large models.

Sometimes it really is free though, because the hardware was bought to serve some other existing needs and that capital expense was fully depreciated quite some time ago. Underutilised hardware is essentially ubiquitous.

> Within any time budget, you can get many orders of magnitude more large-model tokens off an 8xB200 than off a local rig.

But using that 8xB200 setup to run inference on cheap, non-frontier models is a plain waste. Its highest and best use is in an AI datacenter serving exceptionally smart models like Gemini DeepThink, GPT Pro or Claude Mythos. (If this isn't true, it means that the current level of large-scale investment in frontier, super intelligent AI is misplaced, and you should worry about that; not whether some models are best ran on lower-end hardware!)

ai_fry_ur_brain · 2026-05-10T23:05:13 1778454313

"I think"

Well your thinking is completely vibes based and not cemented in any reality I exist in.

CamperBob2 · 2026-05-11T01:46:38 1778463998

Other sites beckon.

otabdeveloper4 · 2026-05-11T05:25:34 1778477134

> higher param count models will remain smarter for a looong time

They're not smarter, they just know more stuff.

You probably don't need knowledge about Pokemon or the Diamond Sutra in your enterprise coding LLM.

The "smarts" comes from post-training, especially around tool use.

anon7725 · 2026-05-11T05:56:55 1778479015

If the smarts came from post-training, we could show significant gains by doing that post-training again for previous generations of models. But we know that isn’t happening - effective post training is necessary but not sufficient for model performance.

otabdeveloper4 · 2026-05-11T17:48:25 1778521705

> we could show significant gains by doing that post-training again for previous generations of models

That's what Chinese models are doing, and beating Opus et al.

CamperBob2 · 2026-05-11T16:35:57 1778517357

You probably don't need knowledge about Pokemon or the Diamond Sutra in your enterprise coding LLM.

That's one of the biggest remaining head-scratchers in this whole business. You do need all that unrelated stuff to make a good coding model.

Nobody knows why you can't build a coding model by training on nothing but code, CS texts, specifications, and case studies, but so far it appears that you can't.

alfiedotwtf · 2026-05-11T06:44:25 1778481865

If 8 x RTX 6000 is getting you 20s before initial token, how are cloud vendors doing this?

CamperBob2 · 2026-05-11T16:39:50 1778517590

RTX6000s are great but they are several times slower than a real datacenter-grade GPU. They still use DDR memory rather than HBM, for example.

zozbot234 · 2026-05-10T22:04:55 1778450695

4-bit quantization is native for Kimi 2.x series.

CamperBob2 · 2026-05-10T22:15:04 1778451304

You're right, I was thinking of Qwen. K2.6 will run at UD-Q2_K_XL precision on 4x RTX6000 boards, but I have no idea if it's worthwhile.

hparadiz · 2026-05-10T21:28:22 1778448502

Posts like this are so funny to me. I'm staring at a mountain of old hardware right now that cost about $20k ten years ago. I have to pay someone now to come haul it away. What makes you think the current new hardware won't end up with the same fate.

> Just write your own fkin code people

Bro is nostalgic for googling random stack overflow threads for 10 days to figure out a bug the agent fixes in an hour.

HWR_14 · 2026-05-11T05:36:37 1778477797

Do you have any old laptop ram?

hparadiz · 2026-05-11T05:46:22 1778478382

It's old rack mounts. Only one of them has some ECC DDR4 worth something.

dakolli · 2026-05-10T21:36:33 1778448993

I'm just saying that agent that can fix your bugs actually cost $100-150 an hour to run and you're getting it essentially for $200.00 a month.

The cost of cloud compute actually hasn't gone down for old hardware all that much, it still costs $500.00 a year rent 4 core i7700k that's 10 years old. Don't expect much more valuable hardware, like modern GPUs to deflate in price all that quickly.

There's 3 fabs in the world that make ddr7 and they aren't going to be selling their stock to consumers going forward, it will be purchased by datacenters almost entirely and stay in them until EOL.

Your brain is going to atrophy (this is proven), they'll raise the price to something thats closer to break even and you'll be forced to pay it because you no longer have those muscles.

hparadiz · 2026-05-10T22:00:30 1778450430

The architectural problems I deal with day in day out leave no room for atrophy. This is just cope.

platevoltage · 2026-05-11T03:49:18 1778471358

You're going to see major cope once that bargain $200/month plan goes away, and every person or company that has embedded these services into their workflows gets to see their actual costs.

hparadiz · 2026-05-11T05:01:13 1778475673

Have you actually tried this stuff or are you just saying stuff you hear on the internet?

platevoltage · 2026-05-11T17:25:09 1778520309

Yes. I have tried this stuff. I really don't see how my use, or non-use of AI APIs changes this reality. Github Copilot announced it's going to per-token pricing in less than a month. I heard that on the internet by the way.

hparadiz · 2026-05-11T18:10:21 1778523021

I'm watching people host models on things like LangSmith and OpenRouter for a fraction of the cost you are talking about. We have other people reporting their M4 Macs providing them with performance close to what they get with ChatGPT and Claude all locally with just a 24 GB M4 Mac. We already spend money on laptops. I can put in a ticket for an M4 Macbook Pro from IT right now.

platevoltage · 2026-05-11T19:38:41 1778528321

Fantastic. I run local models too. This was specifically about APIs

hparadiz · 2026-05-11T19:49:32 1778528972

Why do you think costs will go up when the competition is increasing while hardware prices per compute cycle go down? That's the part I don't get.

nullc · 2026-05-10T21:36:27 1778448987

> two 4090s is not consumer grade

I think that is a very narrow perspective. Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"?

I agree with your view that cheap tokens on SOTA are a trap-- people should use local AI or no AI.

ac29 · 2026-05-10T22:58:08 1778453888

> Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"?

$50k is a median priced car in the US. I'd guess >99.9% of people do not own $4000 of GPUs. I consider myself a computer person and I dont think I even own $4000 of computer hardware in total

swiftcoder · 2026-05-11T06:35:41 1778481341

> I consider myself a computer person and I dont think I even own $4000 of computer hardware in total

A top-spec MacBook Pro is >$4k, so I assure you that plenty of computer people do own $4k of computer hardware.

Hell, most tech folks are wandering around with a ~$1k smartphone in their pocket too.

janalsncm · 2026-05-11T05:25:14 1778477114

Fwiw you can finance a car over something like 7 years now. So a lot of people will be paying like $750 per month, not $50k lump sum.

zozbot234 · 2026-05-10T23:04:38 1778454278

Plenty of gamers own serious GPU rigs that are reusable (at least to some extent) for local AI inference. That's almost certainly more than 0.1% of the populatiom.

nullc · 2026-05-10T23:17:56 1778455076

I guess I wasn't clear-- I wasn't so much making the point people do own $4000 in GPUs (though I suspect you are massively underestimating the number who do, also before the current market conditions this would have been more like $2500 in gpus...), but they certainly could per the evidence of car ownership.

A car is super useful, so is an AI. But even if we decide cars are incomparably more useful a great many people pay much more than $4000 over the minimum viable car, and that's money that could be deployed to secure access to private, secure, and autonomous AI facilities. A few thousand dollars in computing is consumer hardware, or at least could easily be with more reason and awareness driving adoption.

People spend a LOT of money in things less useful than local copy of qwen3.6-27b can be.

dakolli · 2026-05-10T21:43:04 1778449384

I would still question what usefulness there is with a local model even with 10k in GPUs. I certainly haven't seen any great uses myself from these smaller models (<500 parameters) except claims from people who are totally enamored with AI and basically anything output from an LLM impresses them like a toddler who's entertained by the sound their velcro shoes makes.

robot-wrangler · 2026-05-10T22:40:24 1778452824

Probably you're focused on coding agents? I bet someone could use that kind of hardware to filter snarky comments

nullc · 2026-05-10T22:20:17 1778451617

Here is an example-- I'm running hermes + qwen3.6-27b on a workstation GPU (an older RTX A6000 which gets 55tok/s, though people run this model on more limited hardware).

A friend an I had previously worked on an entropy extraction scheme and he recently got around to making a writeup about our work: https://wuille.net/posts/binomial-randomness-extractors/

I instructed the agent to read the URL, implement the technique in C++ for 32-bit registers, then make a SIMD version that interleaves several extractors in parallel for better performance. It implemented it (not hard since there was an implementation there that it read), then wrote more extensive tests. Then it vectorized it. It got confused a few times during debugging because the algorithm uses some number theory tricks so that overflows of intermediate products don't matter and it was obviously trained a lot on ordinary code were such overflows are usually fatal. I instructed it to comment the code explaining why the overflows are fine and had it continue which mostly solved its confusion.

It successfully got the initial 12MB/s scalar implementation to about 48MB/s. Then I told it to keep optimizing until it reaches 100MB/s. I came back the next day and it had stopped after 6 hours when it achieved just over 100MB/s. Reading what it did: it went off looking at disassembly, figured out what hardware it was running on, and reading microarch timing tables online and made some better decisions, tried a lot of things that didn't work, etc. (And of course, the implementation is correct).

I'm pretty skeptical about AI and borderline hateful of many people who (ab)use it and are deluded by it-- but I think this experience shows that a small local model can be objectively useful.

(oh and this experience was also while I only had the model running at 19tok/s)

Running the model in a loop where it can get feedback from actually testing stuff allows you to make progress in spite of making many mistakes.

I could have done this work myself but I didn't have to and I certainly spent less time checking in and prodding it than it would have taken me to do it. In my case I wondered how much faster parallel extractors using SIMD might be-- an idle curiosity that would have gone unanswered if not for the AI.

ai_fry_ur_brain · 2026-05-10T23:13:49 1778454829

This is maybe the first time Ive seen someone claim to do something useful with such a small model.

Congrats, but you're in the 0.0001% thats not just frying their brains, fapping to their local models or doing various magic tricks like a toddler entertained by playing with velcro.

At the end of the day you lost an opportunity to improve yourself and excercise your brain, maybe the opportunity cost is worth it idk, but Im going to keep taking things slow.

Handmade swiss watches > mass manufactured immitations. Handmade clothes > walmart clothes.

otabdeveloper4 · 2026-05-11T05:43:11 1778478191

Sounds like you're coping for the vendor lock-in you cornered yourself into.

nullc · 2026-05-10T23:23:59 1778455439

This is a change that's been happening gradually over time-- I don't think I could have done this on a local model that could run on a consumer class gpu a couple months ago.

There are plenty of other uses that people have been making for a long time-- e.g. I know someone who uses a fine tuned local model to sort their incoming email and scan their outgoing messages for accidental privacy leaks.

I don't agree with your assessment on an opportunity lost-- I got my reps in on the original work, the AI gave an incremental step forward which made the whole exercise somewhat more valuable to me with minimal additional cost. I think this improves the cost vs benefit in a way that makes me more likely to try other pointless activities, knowing that when I run out of gas I can toss it to AI to try some variations.

Sometimes you're also 27 steps deep on a nested subproblem and you're really just trying to solve sometime. Even in finr craftsmanship not every step needs to be about maximum craftsmanship. :) Sometimes it's just good to get something done.

I think this is much like any other tool. One can carve furniture using only hand tools, but the benefits of a router are hard to dispute. Both approaches exist in the world and sometimes both are used in concert.

As far as people frying their brains with AI -- you don't need local models for that, plenty of people are driving themselves into deep personally and socially destructive delusion just using the chat interfaces.

ai_fry_ur_brain · 2026-05-10T23:37:55 1778456275

I do think post training smaller open source models for very narrow tasks is largely overlooked and there'll be lots of value there if one puts in the effort. However, in a lot of cases we're just compeleting a circle back to deterministic behavior at 1000x the memory/compute requirements just to avoid writing regex.

I agree with you, there's a way to use them responsibly like your router anology, I just think most aren't doing this correctly and its a slippery slope. I'll contend that you probably have used them responsibly in your example.

dakolli · 2026-05-08T23:51:15 1778284275

LLMs aren't capable of doing this, and never will be no matter what Anthropic tries tell you.

fragmede · 2026-05-09T00:10:53 1778285453

Mozilla seems to think it can.

https://blog.mozilla.org/en/privacy-security/ai-security-zer...

dakolli · 2026-05-09T02:39:16 1778294356

Ahh yes, I'm sure agents did this all autonomously without any human in the loop what so ever. They are useless without experts to handle them.

fragmede · 2026-05-09T08:18:27 1778314707

So then have the Linux-using organizations employ experts to handle them then.

a_vanderbilt · 2026-05-09T00:47:27 1778287647

That's the same mindset some people had 3 years ago when they said AI wouldn't be capable of software development. Look where we are now.

dakolli · 2026-05-09T02:42:41 1778294561

I have unlimited access to every single frontier model, I've tested all of them, they are not good at writing software.

They are basically slot machines, sometimes you win a little bit and sometimes you win a lot but usually you just burn a ton of time and money sitting and staring at a screen (and frying your brain).

dakolli · 2026-05-08T22:29:35 1778279375

Why is every startup using that same Serif font now, Garamond or whatever. Is it an LLM design phenomenon? Its kinda ruining that font style for me.

Also $1,500 a month for 10 "influencers" is wild. This doesn't seem that sophisticated unless they're doing something special to increase trust scores of accounts. They say they have "in house warming algorithm" which honestly doesn't inspire confidence for me.

Whats funny is its almost a certainty (if they are doing things correctly) that they have literal farms of phones (probably in SEA). The only real way to keep trust high is to have a real mobile connection and unique devices. Proxies are okay, but you really need to use the apps on real hardware.

etaioinshrdlu · 2026-05-08T22:55:57 1778280957

I think the font is mimicking old Apple ads, eg: https://i.insider.com/5bf8592eb73c284de50e2f28

dakolli · 2026-05-08T23:47:36 1778284056

Ahh, that makes sense.

alexspring · 2026-05-09T01:22:50 1778289770

Yep. They got hacked in the past, 1k+ smartphones reported.

The cost is the attestation keys of a real phone. Once it gets burned, the phone is useless to them.

https://www.penligent.ai/hackinglabs/inside-the-ai-phone-far...

dakolli · 2026-05-09T02:37:10 1778294230

Interesting article, thanks. I've done a bit of small scale phone farming (for my own cheap mobile proxies). In all reality the phones aren't that expensive, I went with Moto 5gs that cost $130 (retail), so in their case the phones pay for themselves in the first month.

Probably a decent amount of compute cost for video generation, but I'm sure they have access to free compute and inference for being in bed with a16z.

Velocifyer · 2026-05-09T11:37:56 1778326676

If you are OK with carrier locks (eg if you don't need cell service) and are in the USA, you can actually get mot 5Gs for $30 at walmart. https://www.walmart.com/ip/Straight-Talk-Motorola-Moto-g-202...

dr_kiszonka · 2026-05-09T06:05:45 1778306745

Reckless Condensed?

dakolli · 2026-05-08T21:23:33 1778275413

In 2008, the Department of Homeland Security (DHS) contacted Unspam Technologies, asking, "Do you have any idea how valuable the data you have is?" The DHS' email served as the impetus for Cloudflare, a technology company Prince co-founded with Holloway and fellow Harvard Business School graduate Michelle Zatlyn the following year.

https://en.wikipedia.org/wiki/Matthew_Prince#:~:text=In%2020...

They're literally a government surveillance program larping as a private company, many such cases.

dakolli · 2026-05-08T20:56:36 1778273796

llms aren't doing any of that. Some smart people are using llms find those vulns, there's a huge difference.

block_dagger · 2026-05-08T21:05:00 1778274300

Your definition of "do" seems different than mine.

nozzlegear · 2026-05-08T23:18:12 1778282292

Call it agentic all we want, the LLM has no agency. It's not a living thing, it's a tool employed by humans and it helps humans do things we wouldn't normally be able to do, like a calculator. The fact that Claude is getting the credit for it and not the humans guiding it is just an artifact of Anthropic's marketing.

dakolli · 2026-05-08T18:29:35 1778264975

Everyone thinks this is funny and ignoring the fact that its a great showcase of how net negative and useless llms are to society.

dakolli · 2026-05-08T18:21:34 1778264494

I personally think its easier to detect llm controlled browser sessions, the people deploying them are far more naive and inexperienced than traditional scrapers/crawlers.

insert You wouldn't bring a 40 Petabyte Zip Bomb to School, would you? meme

jeroenhd · 2026-05-08T19:01:30 1778266890

Part of the problem is also that Google wants to permit crawlers to do some things but jot others.

Their announcement is full of buzzwords about "agentic" things. Detecting LLMs is one thing, but imagine the power of being able to pick which LLM browsers are permitted and which aren't!

I think Google is being too early to the party with this. Cloudflare still has CAPTCHAs to throw at the wall. There are ways other than attestation to verify that someone is a real human, but they're getting more and more annoying to real users and harder and harder to implement on a small website.

Despite the massive implications, this is a simple system that just works for the 99% of people who use Chrome or Safari or at least have access to an Android phone or iPhone somewhere. It's quick, doesn't require installing apps or creating accounts, and it just works from both the website perspective and the user perspective.

Of course when you start thinking about people with disabilities things become problematic, but when have tech companies ever really cared about that sort of thing? Inclusiveness was fun and all for a while, but the clowns the American people elected banned that sort of thing for any company considering government contracts, and big tech licked that boot like it was made of honey.

The world becomes a lot easier if you just decide to ignore all edge cases and assume customers who disagree with you didn't matter anyway. And infuriating as it may be, for companies like Google, that business model works.