|
|
Claude Prompt Caching - Reducing Costs on Repeated Context
Author: Venkata Sudhakar
Prompt caching lets Claude reuse the KV cache from previous requests when the same large prefix (system prompt, documents, examples) appears repeatedly. For ShopMax India, this is a major cost saver: a support agent that prepends 10,000 tokens of product catalog context on every request pays full price for those tokens each time without caching, but with caching pays only 10% of the input cost on cache hits. Cache TTL is 5 minutes and resets on each hit, so active sessions stay cached automatically.
To enable caching, add a cache_control block with type ephemeral to the content you want cached. The cache breakpoint must be placed at the end of the reusable prefix - everything before it is eligible for caching. The usage object in the response includes cache_creation_input_tokens (tokens written to cache on the first request) and cache_read_input_tokens (tokens read from cache on subsequent requests). Cache writes cost 25% more than normal input tokens but cache reads cost only 10% of normal, making it profitable after just 2 requests.
The following example shows ShopMax India caching a large product catalog system prompt that is shared across all customer queries, measuring the token savings on repeated calls:
It gives the following output,
CACHE WRITE
Answer: We have two great 4K TVs under Rs 50,000 at ShopMax India: the LG 50-inch
4K NanoCell at Rs 44,990 with WebOS and Dolby Vision ...
Input tokens: 312
Cache write tokens: 298
Output tokens: 58
CACHE READ
Answer: Yes! We have the IFB 6.5kg Front Load Washing Machine at Rs 23,490 and
the LG 8kg Front Load at Rs 31,990, both available in Mumbai ...
Input tokens: 14
Cache read tokens: 298
Output tokens: 62
CACHE READ
Answer: For Delhi climate, I recommend the Voltas 1T Window AC at Rs 27,990
specifically designed for Delhi conditions, and the Daikin 1.5T Split ...
Input tokens: 14
Cache read tokens: 298
Output tokens: 71
For ShopMax India production, place the cache breakpoint after any content that stays constant across requests - product catalogs, policy documents, few-shot examples. Never cache content that varies per user (personalization context, cart state) since it prevents cache hits. The 5-minute TTL means cache stays warm during an active chat session but expires between sessions. For a support agent handling 1000 queries per hour with a 10KB system prompt, caching reduces input token costs by roughly 85% after the first request in each 5-minute window. Monitor cache_read_input_tokens in your logging to verify cache hit rates in production.
|
|