Does anyone have tips for optimising prompt processing? That's the slowest part for me: it takes a few minutes before OpenCode, with ~20k tokens of initial context, first responds, though subsequent responses are pretty fast thanks to caching.
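For what it's worth, on the llama.cpp side the knobs I know of that affect prefill are batch size, flash attention, and full GPU offload. A minimal sketch of what I'd try (the model path is a placeholder and exact flag syntax varies between builds, so treat this as a starting point, not a recipe):

```python
import subprocess

# Rough llama-server invocation; the values here are guesses to tune from.
subprocess.run([
    "llama-server",
    "-m", "model.gguf",   # placeholder path
    "-c", "32768",        # room for the ~20k initial context
    "-ngl", "99",         # offload all layers; partial offload tanks prefill
    "-b", "2048",         # logical batch size for prompt processing
    "-ub", "512",         # physical micro-batch; larger can help on GPU
    "-fa",                # flash attention (newer builds may want `-fa on`)
])
```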
Many of us tested 27B and 35B side by side, and the dense model is significantly smarter. It is indeed slower, but 35B makes a lot of mistakes that 27B doesn't.
Honestly, I haven't dug in to figure out whether there's a hardware reason for it, but prompt processing has always been a lot slower for me on Macs in general. I mostly use MLX on my 24GB M4 Pro, though, so I'll pull llama.cpp onto it as well to see what the prefill is like.
I've gotten around 16 t/s generation with 4-bit and mxfp4 on that model. The 3090 I mentioned has a little over 900 GB/s of memory bandwidth, while those Macs are, I think, around 270 GB/s. If my understanding is correct, Macs do utilize the bandwidth better in this case, but it still doesn't make up the difference (on the 3090 it's around 30-35 t/s depending on context size).
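For a rough sanity check on those numbers, here's the back-of-the-envelope math I'm using, assuming decode is bandwidth-bound and each generated token streams the full set of weights once (~15 GB for a 27B model at ~4.5 effective bits/weight):

```python
def decode_ceiling(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on tokens/s if every token reads all weights once."""
    return bandwidth_gbs / model_gb

model_gb = 27e9 * 4.5 / 8 / 1e9   # ~15.2 GB

print(decode_ceiling(900, model_gb))  # 3090: ~59 t/s ceiling vs ~30-35 observed (~55%)
print(decode_ceiling(270, model_gb))  # M4 Pro: ~18 t/s ceiling vs ~16 observed (~90%)
```

That lines up with the bandwidth-utilization point: the Mac runs much closer to its theoretical ceiling, but the raw bandwidth gap still dominates.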
Also, if you want to tinker with it a bit more, run a quick experiment removing the cache quants; IIRC KV quantization adds a small overhead during prefill.
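Something like this is how I'd A/B it with llama-bench, assuming a build that has the -ctk/-ctv cache-type flags (the model path is a placeholder; note that quantizing the V cache generally requires flash attention):

```python
import subprocess

# Compare prefill/decode speed with full-precision vs quantized KV cache.
for cache_type in ("f16", "q8_0"):
    subprocess.run([
        "llama-bench",
        "-m", "model.gguf",   # placeholder path
        "-p", "20480",        # prefill roughly matching the ~20k context
        "-n", "128",          # short decode run
        "-fa", "1",           # flash attention on, needed for V-cache quant
        "-ctk", cache_type,
        "-ctv", cache_type,
    ])
```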
I would be very interested to know your prefill and generation numbers.