
joaquincabezas on May 20, 2024 \| on: 26× Faster Inference with Layer-Condensed KV Cache...

LLM inference optimization was key to the OpenAI GPT-4o presentation (2x faster, 50% cheaper), and it's driving lots of industry research because it translates directly into cost savings. Still, it's refreshing to see so many techniques published as papers (e.g., from Stanford, Berkeley…)