Is there a way to amortize that cost over several queries, i.e. "pre-bake" a document into a context persisted in some form to allow cheaper follow-up queries about it?
They announced that today, calling it "context caching" - but it looks like it's only going to be available for Gemini 1.5 Pro, not for Gemini 1.5 Flash.
It reduces prompt costs by half for those shared prefix tokens, but you have to pay $4.50/million tokens/hour to keep that cache warm - so probably not a useful optimization for most lower traffic applications.
> It reduces prompt costs by half for those shared prefix tokens, but you have to pay $4.50/million tokens/hour to keep that cache warm - so probably not a useful optimization for most lower traffic applications
That's on a model with a $3.50/1M input token cost, so half price on cached prefix tokens saves $1.75/1M per request. Against $4.50/1M/hour to keep the cache warm, that breaks even at a little over 2.5 requests/hour using the cached prefix.
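For anyone who wants to plug in their own numbers, here is a rough sketch of that break-even arithmetic. The function and parameter names are made up for illustration, and it assumes the only costs that matter are the per-token discount and the hourly storage fee (it ignores any cost of creating the cache and any tokens outside the shared prefix).

```python
def break_even_requests_per_hour(
    input_price_per_m: float,         # normal input price, $ per 1M tokens
    cached_discount: float,           # fraction saved on cached tokens (0.5 = half price)
    storage_price_per_m_hour: float,  # cache storage fee, $ per 1M tokens per hour
) -> float:
    """Requests per hour at which caching costs the same as not caching."""
    savings_per_request = input_price_per_m * cached_discount
    return storage_price_per_m_hour / savings_per_request


# Figures quoted in this thread: $3.50/1M input, half price when cached,
# $4.50/1M/hour for storage.
rate = break_even_requests_per_hour(3.50, 0.5, 4.50)
print(f"{rate:.2f} requests/hour")  # 2.57 requests/hour
```

The prefix size cancels out, so the break-even rate is the same whether the cached prefix is 100K tokens or 1M tokens.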
Depending on the output window limit, the first query could be something like: "Summarize this down to its essential details" -- then use that to feed future queries.
More tediously, you could do this chapter by chapter to get around the output limit, building up a combined summary to feed into future queries.
Of course, the summary might not fulfill the same functionality as the original source document. YMMV
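A minimal sketch of the chapter-by-chapter idea, assuming you already have the document split into chapters. The `summarize` argument is a placeholder for whatever model call you would actually make; `first_sentence` below is only a stand-in so the example runs without an API.

```python
from typing import Callable, Iterable


def summarize_by_chapter(
    chapters: Iterable[str],
    summarize: Callable[[str], str],
) -> str:
    """Summarize each chapter separately and join the results.

    Each call stays under the output limit; the joined text can be
    larger than any single response and reused as input later.
    """
    summaries = [summarize(chapter) for chapter in chapters]
    return "\n\n".join(summaries)


def first_sentence(text: str) -> str:
    # Stand-in for a real model call: keeps only the first sentence.
    return text.split(".")[0].strip() + "."


condensed = summarize_by_chapter(
    ["A b. C d.", "E f. G."],
    first_sentence,
)
print(condensed)
```

The condensed text then replaces the full document in follow-up prompts, with the same caveat as above about what the summary leaves out.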