Is there a way to amortize that cost over several queries, i.e. "pre-bake" a doc...

simonw · on May 14, 2024

They announced that today, calling it "context caching" - but it looks like it's only going to be available for Gemini Pro 1.5, not for Gemini Flash.

It reduces prompt costs by half for those shared prefix tokens, but you have to pay $4.50/million tokens/hour to keep that cache warm - so probably not a useful optimization for most lower traffic applications.

https://ai.google.dev/gemini-api/docs/caching

dragonwriter · on May 14, 2024

> It reduces prompt costs by half for those shared prefix tokens, but you have to pay $4.50/million tokens/hour to keep that cache warm - so probably not a useful optimization for most lower traffic applications

That's on a model with a $3.50/1M input token cost, so half price on cached prefix tokens saves $1.75/1M per request. Against $4.50/1M/hour to keep the cache warm, that breaks even at a little over 2.5 requests/hour using the cached prefix.
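For anyone who wants to plug in other prices, here is a rough sketch of that break-even arithmetic. The function and parameter names are made up for illustration, and it assumes the only costs that matter are the per-request input discount and the hourly storage fee (it ignores output tokens and any cost of creating the cache in the first place). The prefix size cancels out, so everything is expressed per 1M tokens.

```python
def break_even_requests_per_hour(
    input_price_per_m: float,         # $ per 1M uncached input tokens
    cached_discount: float,           # fraction saved on cached tokens (0.5 = half price)
    storage_price_per_m_hour: float,  # $ per 1M cached tokens per hour
) -> float:
    """Requests/hour at which the caching savings equal the hourly storage fee."""
    savings_per_request = input_price_per_m * cached_discount
    return storage_price_per_m_hour / savings_per_request


# Figures quoted in this thread: $3.50/1M input, half price when cached,
# $4.50/1M/hour to keep the cache warm.
rate = break_even_requests_per_hour(3.50, 0.5, 4.50)
print(f"break-even: {rate:.2f} requests/hour")
```

Below that rate the storage fee costs more than the discount saves; above it, caching comes out ahead under these assumptions.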

inlined · on May 14, 2024

Though I'm not familiar with the specifics, they announced "context caching".

gcanyon · on May 14, 2024

Depending on the output window limit, the first query could be something like: "Summarize this down to its essential details" -- then use that to feed future queries.

More tediously, you could do this chapter by chapter to get past the output limit, building up a combined summary to feed into future queries.

Of course, the summary might not fulfill the same functionality as the original source document. YMMV
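A minimal sketch of the chapter-by-chapter idea, assuming you already have the document split into chapters. `summarize` is a hypothetical callable standing in for whatever model call you use (it is not part of any real API), and `first_sentence` is just a toy stand-in so the example runs without a model.

```python
from typing import Callable, List


def summarize_by_chapter(
    chapters: List[str],
    summarize: Callable[[str], str],  # hypothetical: one model call per chapter
    separator: str = "\n\n",
) -> str:
    """Summarize each chapter on its own, then join the pieces into one digest.

    Each call only has to fit one chapter's summary in the output limit,
    so the combined digest can be longer than a single response.
    """
    summaries = [summarize(chapter) for chapter in chapters]
    return separator.join(summaries)


def first_sentence(text: str) -> str:
    # Toy stand-in for a real summarizer: keep only the first sentence.
    return text.split(".")[0].strip() + "."


chapters = [
    "Alice meets Bob. They talk at length.",
    "Bob leaves town. Alice stays.",
]
digest = summarize_by_chapter(chapters, first_sentence)
print(digest)
```

The digest would then be used as the shared prefix for later queries in place of the full document, with the same caveat as above about what a summary loses.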