Does anyone have tips for optimising prompt processing? That's the slowest part for me: it takes a few minutes before OpenCode, with ~20k tokens of initial context, first responds, though subsequent responses are pretty fast thanks to caching.
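For what it's worth, on the llama.cpp side the knobs I know of that affect prefill are batch size, flash attention, and full GPU offload. A minimal sketch of what I'd try (the model path is a placeholder and exact flag syntax varies between builds, so treat this as a starting point, not a recipe):

```python
import subprocess

# Rough llama-server invocation; the values here are guesses to tune from.
subprocess.run([
    "llama-server",
    "-m", "model.gguf",   # placeholder path
    "-c", "32768",        # room for the ~20k initial context
    "-ngl", "99",         # offload all layers; partial offload tanks prefill
    "-b", "2048",         # logical batch size for prompt processing
    "-ub", "512",         # physical micro-batch; larger can help on GPU
    "-fa",                # flash attention (newer builds may want `-fa on`)
])
```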
Many of us tested 27B and 35B side by side, and the dense model is significantly smarter. It is indeed slower, but 35B makes a lot of mistakes that 27B doesn't.
Honestly, I haven't dug in to figure out whether there's a hardware reason for it, but prompt processing has always been a lot slower for me on Macs in general. I mostly use MLX on my 24GB M4 Pro, though, so I'll pull llama.cpp onto it as well to see what the prefill is like.
I've gotten around 16 t/s generation with 4-bit and mxfp4 on that model. The 3090 I mentioned has a little over 900 GB/s of memory bandwidth, while those Macs are, I think, around 270 GB/s. If my understanding is correct, Macs do utilize the bandwidth better in this case, but it still doesn't make up the difference (on the 3090 it's around 30-35 t/s depending on context size).
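For a rough sanity check on those numbers, here's the back-of-the-envelope math I'm using, assuming decode is bandwidth-bound and each generated token streams the full set of weights once (~15 GB for a 27B model at ~4.5 effective bits/weight):

```python
def decode_ceiling(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on tokens/s if every token reads all weights once."""
    return bandwidth_gbs / model_gb

model_gb = 27e9 * 4.5 / 8 / 1e9   # ~15.2 GB

print(decode_ceiling(900, model_gb))  # 3090: ~59 t/s ceiling vs ~30-35 observed (~55%)
print(decode_ceiling(270, model_gb))  # M4 Pro: ~18 t/s ceiling vs ~16 observed (~90%)
```

That lines up with the bandwidth-utilization point: the Mac runs much closer to its theoretical ceiling, but the raw bandwidth gap still dominates.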
Also, if you want to tinker with it a bit more, run a quick experiment removing the cache quants; IIRC KV quantization adds a small overhead during prefill.
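Something like this is how I'd A/B it with llama-bench, assuming a build that has the -ctk/-ctv cache-type flags (the model path is a placeholder; note that quantizing the V cache generally requires flash attention):

```python
import subprocess

# Compare prefill/decode speed with full-precision vs quantized KV cache.
for cache_type in ("f16", "q8_0"):
    subprocess.run([
        "llama-bench",
        "-m", "model.gguf",   # placeholder path
        "-p", "20480",        # prefill roughly matching the ~20k context
        "-n", "128",          # short decode run
        "-fa", "1",           # flash attention on, needed for V-cache quant
        "-ctk", cache_type,
        "-ctv", cache_type,
    ])
```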
I would be very interested to know your prefill and generation numbers.