
This can be combined with Grouped Query Attention or Multi-Query Attention for an even further reduction in the size of the KV Cache!
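For a rough sense of scale, here is a back-of-envelope sketch of how the number of KV heads drives cache size. The model dimensions (32 layers, 32 query heads, head dimension 128, 4096-token context, fp16) are made-up illustrative values, not taken from the linked work, and `kv_cache_bytes` is a hypothetical helper. It only counts bytes and says nothing about the effect on quality.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Approximate KV cache size for one decoding batch.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # multi-head: one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # grouped-query: 4 query heads share a KV head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # multi-query: all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Because the cache scales linearly with the number of KV heads, any further reduction from another technique multiplies with the GQA or MQA savings rather than replacing them.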


But it’s not free: it cuts quality significantly.

It’s not hard to find ways to speed up transformers if you’re willing to give up quality.

You could argue some tradeoffs are worth it, and sometimes that’s true, but I don’t see that they’ve made the case for it here.



