Hacker News | djsjajah's comments

You just failed the Turing test.


The Turing test just failed you. I'll go one better: physics isn't reality, it's a model of reality expressed in math.


And I'll go one better: you haven't said anything here at all, you've just left a representation of what you understand yourself to be saying.


We only perceive the past.


Maybe he passed the Turing test with 88.2%, which is 1.8% higher than the competition.


Fortunately for me equivalents to Turing exist: https://en.wikipedia.org/wiki/Turing_machine_equivalents


I don't follow. Can you explain how your comment is relevant to mine? It might help if you also explain how you interpreted my comment.


I have 2 of them. I would advise against it if you want to run things like vllm. I have had the cards for months and I still have not been able to create a uv env with trl and vllm. For vllm, it works fine in Docker for some models. With one GPU, gpt-oss 20b was decoding at a cumulative 600-800 tps with 32 concurrent requests, depending on context length, but I was getting trash performance out of Qwen3.5 and Gemma4.
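
A rough way to sanity-check that cumulative throughput against a local vLLM server (a sketch only; it assumes the OpenAI-compatible endpoint on port 8000 and that the served model name matches, so adjust to your setup):

    # Sketch: measure cumulative decode throughput against a local vLLM
    # OpenAI-compatible server. Endpoint, model name, and prompt are
    # assumptions; swap in whatever your setup actually uses.
    import time
    import requests
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://localhost:8000/v1/completions"
    MODEL = "openai/gpt-oss-20b"   # assumption: whatever name the server reports
    CONCURRENCY = 32

    def one_request(_):
        r = requests.post(URL, json={
            "model": MODEL,
            "prompt": "Write a short story about a robot.",
            "max_tokens": 256,
        })
        return r.json()["usage"]["completion_tokens"]

    start = time.time()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        total = sum(pool.map(one_request, range(CONCURRENCY)))
    elapsed = time.time() - start
    print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.0f} tok/s cumulative")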

If I were to do it again, I'd probably just get a DGX Spark. I don't think it's been worth the hassle.


FWIW I’m in love with my Asus GX10 and have been learning CUDA on it while playing with vllm and such. Qwen3.5 122B A10 at ~50tps is quite neat.

But do beware, it's weird hardware and not really Blackwell. We are only just starting to squeeze full performance out of SM12.1!


> or by the community

Hmmm


Yes, but the difference between one model and one 4x larger is usually a lot more than that.

It is not a question of whether to run Qwen 8b at bf16 or a quantized version of it. It's more a question of whether to run Qwen 8b at full precision or a quantized version of Qwen 27b.

You will find that you are usually better off with the larger model.
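
Back-of-the-envelope weight memory makes the trade-off concrete (a sketch; parameter counts and bit widths are illustrative and ignore KV cache and activations):

    # Rough weight-memory arithmetic: full-precision small model vs quantized
    # larger model. Numbers are approximations for illustration only.
    def weight_gb(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    print(f"8B  @ bf16 (16-bit): ~{weight_gb(8, 16):.0f} GB")    # ~16 GB
    print(f"27B @ 4-bit quant:   ~{weight_gb(27, 4):.1f} GB")    # ~13.5 GB

The quantized 27B fits in roughly the same (or less) memory as the bf16 8B, and in practice it usually wins on quality too.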


trl. Give me a uv command to get that working.

But even within the AMD stack (things like CK and aiter), consumer cards are not even second-class citizens; they are a distant third at best. If you just want to run vllm with the latest model, and you can get it running at all, there are going to be paper cuts all along the way, and even then the performance won't be close to what you could be getting out of the hardware.


It is not perfect, but it isn't that bad anymore. Tons of improvements over the last year.


No. It seems to me that the comment is objectively incorrect. The original comment was talking about inference, and from what I can tell, it is strictly going to run slower than the model trained to the same loss without this approach (it has "minimal overhead"). The main point is that you won't need to train that model for as long.


That’s kind of a moot point. Even if none of those overheads existed you would still be getting a fraction of peak MFU. Models are fundamentally limited by memory bandwidth even in the best-case scenarios of SFT or prefill.

And what are you doing that I/O is a bottleneck?


> That’s kind of a moot point.

I don't believe it's moot, but I understand your point. The fact that models are memory bandwidth bound does not at all mean that other overhead is insignificant. Your practical delivered throughput is the minimum of compute ceiling, bandwidth ceiling, and all the unrelated speed limits you hit in the stack. Kernel launch latency, Python dispatch, framework bookkeeping, allocator churn, graph breaks, and sync points can all reduce effective speed. There are so many points in the training and inference loop where the model isn't even executing.
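
As a toy way to picture it (every number here is made up purely for illustration):

    # Toy "min of ceilings" model: delivered throughput is capped by whichever
    # limit you hit first, then shaved further by all the time the GPU spends
    # not executing the model. Illustrative numbers, not measurements.
    compute_ceiling_tps   = 4000   # if purely compute bound
    bandwidth_ceiling_tps = 1200   # if purely memory-bandwidth bound
    overhead_fraction     = 0.25   # kernel launches, Python dispatch, sync points, ...

    roofline_tps  = min(compute_ceiling_tps, bandwidth_ceiling_tps)
    delivered_tps = roofline_tps * (1 - overhead_fraction)
    print(f"roofline {roofline_tps} tok/s -> delivered ~{delivered_tps:.0f} tok/s")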

> And what are you doing that I/O is a bottleneck?

We do a fair amount of RLVR at my org. That's almost entirely waiting for servers/envs to do things, not the model doing prefill or decode (or even up/down weighting trajectories). The model is the cheap part in wall clock terms. The hard limits are in the verifier and environment pipeline. Spinning up sandboxes, running tests, reading and writing artifacts, and shuttling results through queues all create long idle gaps where the GPU is just waiting for something to do.


> That's almost entirely waiting for servers/envs to do things

I'm not sure why, sandboxes/envs should be small and easy to scale horizontally to the point where your throughput is no longer limited by them, and the maximum latency involved should also be quite tiny (if adequately optimized). What am I missing?


First, as an aside, remember that this entire thread is about using local compute. What you're alluding to is some fantasy infinite budget where you have limitless commodity compute. That's not at all the context of this thread.

But disregarding that, this isn't a problem you can solve by turning a knob akin to scaling a stateless k8s cluster.

The whole vertical of distributed RL has been struggling with this for a while. You can in theory just keep adding sandboxes in parallel, but in RLVR you are constrained by 1) the amount of rollout work you can do per gradient update, and 2) the verification and pruning pipeline that gates the reward signal.

You can't just arbitrarily have a large batch size for every rollout phase. Large batches often reduce effective diversity or get dominated by stragglers. And the outer loop is inherently sequential, because each gradient update depends on data generated by a particular policy snapshot. You can parallelize rollouts and the training step internally, but you can’t fully remove the policy-version dependency without drifting off-policy and taking on extra stability headaches.
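
In loop form, the constraint looks roughly like this (a sketch with stub functions, not any particular framework):

    # Sketch of an RLVR outer loop. Rollouts and verification can be fanned out,
    # but each gradient update has to wait for data generated by the current
    # policy snapshot, so the outer loop stays serial. All functions are stubs.
    from concurrent.futures import ThreadPoolExecutor

    def rollout(policy, prompt):       # stub: sample a trajectory in a sandbox/env
        return {"prompt": prompt, "trajectory": f"{policy}:{prompt}"}

    def verify(traj):                  # stub: run tests / reward checks (often the slow part)
        return {**traj, "reward": len(traj["trajectory"]) % 2}

    def update(policy, batch):         # stub: one gradient step on the verified batch
        return policy + 1

    policy, prompts = 0, [f"task-{i}" for i in range(64)]
    for step in range(3):                                    # serial outer loop
        with ThreadPoolExecutor(max_workers=16) as pool:     # parallel inner work
            trajs = list(pool.map(lambda p: rollout(policy, p), prompts))
            batch = list(pool.map(verify, trajs))
        policy = update(policy, batch)                       # can't start the next phase early
    print("final policy snapshot:", policy)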


> including all previous experiments

How far back do you go? What about experiments into architecture features that didn’t make the cut? What about pre-transformer attention?

But more generally, why are you so sure that the team that built Gemini didn’t exclusively use TPUs while they were developing it?

I think that one of the reasons Gemini caught up so quickly is that they have so much compute at a fraction of the price everyone else pays.


Not only can it be streamed, but lz4 will probably make things quicker.
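
Something along these lines (a sketch; it assumes the `lz4` Python package and a hypothetical dump.lz4 in frame format):

    # Sketch: stream-decompress an lz4 frame without holding the whole payload
    # in memory. "dump.lz4" is a hypothetical file name; swap in your own source.
    import lz4.frame

    total = 0
    with lz4.frame.open("dump.lz4", mode="rb") as f:
        while True:
            chunk = f.read(1 << 20)   # 1 MiB at a time
            if not chunk:
                break
            total += len(chunk)       # stand-in for whatever actually consumes the stream
    print(f"decompressed {total} bytes")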


You just ruined my day. The post makes it sound like Gel is now dead. The post by Vercel does not give me much hope either [1]. The last commit on the Gel repo was two weeks ago.

[1] https://vercel.com/blog/investing-in-the-python-ecosystem


From discord:

> There has been a ton of interest expressed this week about potential community maintenance of Gel moving forward. To help organize and channel these hopes, I'm putting out a call for volunteers to join a Gel Community Fork Working Group (...GCFWG??). We are looking for 3-5 enthusiastic, trustworthy, and competent engineers to form a working group to create a "blessed" community-maintained fork of Gel. I would be available as an advisor to the WG, on a limited basis, in the beginning.

> The goal would be to produce a fork with its own build and distribution infrastructure and a credible commitment to maintainership. If successful, we will link to the project from the old Gel repos before archiving them, and potentially make the final CLI release support upgrading to the community fork.

> Applications accepted here: https://forms.gle/GcooC6ZDTjNRen939

> I'll be reaching out to people about applications in January.


I would think a two-week break over the holiday season wouldn't be a death knell.

