I would have liked to see some discussions on cost and comparisons to dGPUs in t...

roenxi · 2025-06-06T12:46:51 1749214011

We have a GPU/CPU fusion chip that is unremarkable, performant and runs well under linux out of the box. There isn't a lot of novelty in that which is, in itself, pretty remarkable.

Plus, although I can't really swear to understand how these chips work, my read is this is basically a graphics card that can be configured with 64GB of memory. If I'm not misreading that it actually sounds quite interesting; even AMDs hopeless compute drivers might potentially be useful for AI work if enough RAM gets thrown into the mix. Although my burn wounds from buying AMD haven't healed yet so I'll let someone else fund that experiment.

jakogut · 2025-06-06T14:31:33 1749220293

I've done it. I have a GPD Pocket 4 with 64 GB of RAM and the less capable HX 370 Strix Point chip.

Using ollama, hardware acceleration doesn't really work through ROCm. The framework doesn't officially support gfx1151 (Strix Point RDNA 3.5+), though you can override it to fake gfx1150 (Strix Halo, also RDNA 3.5+ and UMA), and it works.

I think I got it to work for smaller models that fit entirely into the preallocated VRAM buffer, but my machine only allows for statically allocating up to 16 GB for the GPU, and where's the fun in that? This is a unified memory architecture chip, I want to be able to run 30+ GB models seamlessly.

It turns out, you can. Just build llama.cpp from source with the Vulkan backend enabled. You can use a 2 GB static VRAM allocation and any additional data spills into GTT which the driver handles mapping into the GPU's address space seamlessly.

You can see a benchmark I performed of a small model on GitHub [0], but I've done up to Gemma3 27b (~21 GB) and other large models with decent performance, and Strix Halo is supposed to have 2-3x the memory bandwidth and compute performance. Even 8b models perform well with the GPU in power saving mode, inside ~8W.

Come to think of it, those results might make a good blog post.

[0] https://github.com/ggml-org/llama.cpp/discussions/10879

Search for "HX 370"

SecretDreams · 2025-06-06T15:43:30 1749224610

> We have a GPU/CPU fusion chip that is unremarkable, performant and runs well under linux out of the box. There isn't a lot of novelty in that which is, in itself, pretty remarkable.

With this reasoning, I'd probably argue all modern cpus and GPUs aren't particularly remarkable/novel. That could help even be fine.

At the end of the day, these benchmarks are all meant to inform on relative performance, price, and power consumption for end users to make informed decisions (imo). The relative* comparisons are low-key just as important as the new bench data point.

jauntywundrkind · 2025-06-06T16:25:05 1749227105

Agreed that for GPU oomph, you'd do better with a dGPU; there's some very good deals for dGPU laptops but even this next-tier-down 8050S is still a rather expensive new purchase by compare (for now; Strix Halo is brand new). But the power consumption will likely be much higher.

Strix Halo as an APU has two very clear advantages. First, I expect power consumption is somewhat better, due to using LPDDR5(x?) and not needing to go over PCIe.

But the real win is that you can get a 64GB or 128GB GPU (well somewhat less than that)! And there's not really anything stopping 192GB or 256GB builds from happening, now that bigger ram sizes are finally available in the world. But so far all Strix Halo offerings are soldered on ram (non user upgradeable, no camm2 offerings yet), and no one's doing more than 128GB. But that's still a huge LLM compared to what consumers could run before! Or lots of LLMs loaded and ready to go! We see similar things with the large unified memory on Mac APUs; it's why are minis are sometimes popular for LLMs.

Meanwhile Nvidia is charging $20k+ for an A100 GPU with 80GB ram. You won't have that level of performance, but you I'll be able to fit an even bigger LLM than it. For 1/10th the price.

There's also a lot of neat applications for DB's or any kind of data-intense processing. Because unified memory means the work can move between CPU and GPU, without having to move the workload. Normally to use a GPU you end up copying data out of main memory then writing it to the GPU* then reading it to do work, and you can skip 2/3rds of these read/write steps here.

There's some very awesome potential for doing query processing on GPU (ex: PG-Strom). Might also be a pretty interesting for a GPU based router, ala PacketShader (2010).

* Note that PCIe p2p-dma / device RDMA / dma-buf has been getting progressively much better, a lot of attention, across the past half deacde, such that say a nic can send network data direct to GPU memory, or a NVMe drive can send data direct to GPU or network, without bouncing through main memory. One recent example of many: https://www.phoronix.com/news/Device-Memory-TCP-TX-Linux-6.1...

Giving an APU actually fast ram has some cool use cases. I'm excited to see the lines blur in computing like this, to see

dragonwriter · 2025-06-06T17:00:31 1749229231

> Meanwhile Nvidia is charging $20k+ for an A100 GPU with 80GB ram.

Or, sometime in the next month or so, NVidia GB10-based miniPC form factor devices with 128GB (with high-speed interconnect to allow two to serve as a single 256GB system) from various brands (including direct from Nvidia) for $3000-4000 depending on exact configuration and who is assembling the completed system.

opencl · 2025-06-06T13:43:07 1749217387

The Strix Halo GPU is roughly around RTX 4060 (laptop version) performance.

Phoronix just doesn't do much mobile dGPU testing in general to have any data to compare with there.

michaellarabel · 2025-06-06T13:46:55 1749217615

Right, unfortunately, was limited by the laptops I have on-hand for (re)testing... With routinely re-testing all laptops fresh, in this case on Ubuntu 25.04, not able to compare to prior dGPU-enabled laptops that since had to be returned to vendors, etc.

SecretDreams · 2025-06-06T15:41:30 1749224490

It makes sense, just leaves the article feeling kind of incomplete as a result. It's more like a data point to be compiled against other existing data. They could pull more clicks if they either had some of that other data done in house or collaborated with some other shop that does have the data (or a laptop to loan) readily available.

rbanffy · 2025-06-06T12:34:28 1749213268

It appears it's using significantly more power than the Intel ones as well. I would be more interested in GPU computing performance for LLMs than graphics, as all I need is a frame buffer.