tripplyons on May 20, 2024 | on: 26× Faster Inference with Layer-Condensed KV Cache...

This can be combined with Grouped Query Attention or Multi-Query Attention for an even further reduction in the size of the KV Cache!
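A rough sketch of why the two reductions would stack, assuming the layer-condensed approach amounts to storing keys and values for only a few layers, while GQA/MQA shrink the number of KV heads stored per layer. The model shape, the `kv_cache_bytes` helper, and the choice of 2 cached layers are all hypothetical, picked only to make the arithmetic concrete; none of them come from the paper.

```python
def kv_cache_bytes(cached_layers, kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes (fp16 by default)."""
    # Leading 2 = one key tensor plus one value tensor per cached layer.
    return 2 * cached_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, for illustration only.
LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

# Multi-head attention: one KV head per query head, every layer cached.
baseline = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped Query Attention: assume 8 KV heads shared across the 32 query heads.
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)
# Multi-Query Attention: a single KV head shared by all query heads.
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)
# Assumption: only 2 layers keep a cache, combined with GQA.
few_layers_gqa = kv_cache_bytes(2, 8, HEAD_DIM, SEQ_LEN)

for name, size in [("MHA, all layers", baseline),
                   ("GQA, all layers", gqa),
                   ("MQA, all layers", mqa),
                   ("GQA, 2 cached layers", few_layers_gqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB ({baseline // size}x smaller)")
```

The savings multiply because the number of cached layers and the number of KV heads are independent factors in the same product.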

WhitneyLand on May 20, 2024

But it’s not free: it cuts quality significantly.

It’s not hard to find ways to speed up transformers if you’re willing to give up quality.

You could argue some tradeoffs are worth it, and that’s true sometimes, but I don’t see that they’ve made the case for it here.
