
Claude Prompt Caching - Reducing Costs on Repeated Context

Author: Venkata Sudhakar

Prompt caching lets Claude reuse the KV cache from previous requests when the same large prefix (system prompt, documents, examples) appears repeatedly. For ShopMax India, this is a major cost saver. A support agent that prepends 10,000 tokens of product catalog context to every request pays full price for those tokens each time without caching. With caching, it pays only 10% of the normal input price for them on every cache hit. The default cache TTL is 5 minutes and is reset on each hit, so active sessions stay cached automatically.
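The sliding TTL can be pictured with a small simulation. This is only a sketch of the behaviour described above, not of how the service is implemented: the function name classify_requests is hypothetical, and treating a request that arrives exactly at the expiry instant as a miss is an assumption.

```python
CACHE_TTL_SECONDS = 5 * 60  # the 5-minute TTL described above


def classify_requests(timestamps, ttl=CACHE_TTL_SECONDS):
    """Label each request 'write' (cache miss) or 'read' (cache hit).

    timestamps: request times in seconds, in ascending order.
    Assumption: a request arriving exactly at the expiry time is a miss.
    """
    labels = []
    expires_at = None  # no cache entry exists yet
    for t in timestamps:
        if expires_at is not None and t < expires_at:
            labels.append("read")
        else:
            labels.append("write")
        # Both a fresh write and a hit restart the TTL countdown.
        expires_at = t + ttl
    return labels


print(classify_requests([0, 120, 400, 800]))
# → ['write', 'read', 'read', 'write']
```

The request at 400 seconds is still a hit even though more than 5 minutes have passed since the first write, because the hit at 120 seconds reset the timer. The request at 800 seconds arrives after a gap longer than the TTL and pays for a new cache write.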

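To enable caching, add a cache_control block with type ephemeral to the content you want cached. Place this cache breakpoint on the last block of the reusable prefix: everything up to and including that block is eligible for caching. Prefixes shorter than the model's minimum cacheable length (1,024 tokens or more, depending on the model) are processed without caching. The usage object in the response includes cache_creation_input_tokens (tokens written to the cache, normally on the first request) and cache_read_input_tokens (tokens read from the cache on subsequent requests). Cache writes cost 25% more than normal input tokens, but cache reads cost only 10% of the normal price, so caching pays for itself from the second request onward: one write plus one read costs 1.35 times the prefix price, against 2 times without caching.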
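The break-even point can be checked with a few lines of arithmetic. The sketch below uses only the two multipliers quoted above and expresses cost in units of "base-price input tokens", so no actual prices are assumed. The function name prefix_cost is hypothetical, and the calculation assumes the cache never expires between requests.

```python
WRITE_MULTIPLIER = 1.25  # cache write: 25% above the normal input price
READ_MULTIPLIER = 0.10   # cache read: 10% of the normal input price


def prefix_cost(n_requests, prefix_tokens, cached):
    """Cost of the shared prefix over n_requests, in base-price token units.

    Assumes one cache write followed by n_requests - 1 cache reads,
    i.e. the cache stays warm for the whole run.
    """
    if n_requests == 0:
        return 0.0
    if not cached:
        return float(n_requests * prefix_tokens)
    return prefix_tokens * (WRITE_MULTIPLIER + READ_MULTIPLIER * (n_requests - 1))


for n in (1, 2, 10):
    plain = prefix_cost(n, 10_000, cached=False)
    with_cache = prefix_cost(n, 10_000, cached=True)
    print(f"{n} requests: saving {1 - with_cache / plain:.1%}")
```

With a 10,000-token prefix, a single request costs 25% more with caching, two requests cost about 32.5% less, and ten requests cost about 78.5% less. Only the prefix is counted here; the per-request question and the output tokens are billed at their usual rates either way.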

The following example shows ShopMax India caching a large product catalog system prompt that is shared across all customer queries, measuring the token savings on repeated calls:

import anthropic
import os

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

PRODUCT_CATALOG = """
ShopMax India Product Catalog - Electronics Division

Televisions:
- Samsung 55-inch 4K QLED (SM-TV-55Q): Rs 54990, Quantum HDR, Tizen OS, 2 HDMI
- LG 50-inch 4K NanoCell (LG-TV-50N): Rs 44990, WebOS, Dolby Vision, 3 HDMI
- Sony 43-inch Bravia (SNY-TV-43B): Rs 38990, Google TV, X-Reality Pro

Washing Machines:
- IFB 6.5kg Front Load (IFB-WM-65F): Rs 23490, 1200 RPM, 5 star energy
- Samsung 7kg Top Load (SAM-WM-7T): Rs 19990, Wobble technology, quick wash
- LG 8kg Front Load (LG-WM-8F): Rs 31990, AI DD, steam wash, ThinQ

Air Conditioners:
- Daikin 1.5T Split (DKN-AC-15S): Rs 38990, 5 star, inverter, PM 2.5 filter
- Voltas 1T Window (VLT-AC-1W): Rs 27990, 3 star, fast cooling, Delhi climate
- Blue Star 2T Split (BS-AC-2S): Rs 44990, 5 star, anti-bacterial, Inverter

Service cities: Mumbai, Bangalore, Delhi, Hyderabad, Chennai
Warranty: 1 year comprehensive, extended plans available
EMI: 0% EMI available on HDFC, ICICI, Axis cards
"""

def ask_with_cache(question: str) -> anthropic.types.Message:
    """Send one customer question, reusing the cached catalog prefix when possible."""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=300,
        system=[
            {
                "type": "text",
                "text": "You are the ShopMax India product assistant. Answer customer queries using the catalog below.\n\n" + PRODUCT_CATALOG,
                # Cache breakpoint: everything up to and including this block is cacheable.
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": question}]
    )
    usage = response.usage
    # The cache fields can be missing or None; treat both cases as 0.
    cache_write = getattr(usage, "cache_creation_input_tokens", None) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", None) or 0
    # Label the call from what the API reports, not from the call order.
    if cache_read:
        print("CACHE READ")
    elif cache_write:
        print("CACHE WRITE")
    else:
        print("NO CACHE")
    print("Answer:", response.content[0].text[:120], "...")
    print("Input tokens:", usage.input_tokens)
    if cache_write:
        print("Cache write tokens:", cache_write)
    if cache_read:
        print("Cache read tokens:", cache_read)
    print("Output tokens:", usage.output_tokens)
    print()
    return response

ask_with_cache("What TVs do you have under Rs 50000?")
ask_with_cache("Do you have washing machines with front load in Mumbai?")
ask_with_cache("What ACs are available for Delhi climate?")

It gives output similar to the following (token counts are illustrative):

CACHE WRITE
Answer: We have two great 4K TVs under Rs 50,000 at ShopMax India: the LG 50-inch
4K NanoCell at Rs 44,990 with WebOS and Dolby Vision ...
Input tokens: 312
Cache write tokens: 298
Output tokens: 58

CACHE READ
Answer: Yes! We have the IFB 6.5kg Front Load Washing Machine at Rs 23,490 and
the LG 8kg Front Load at Rs 31,990, both available in Mumbai ...
Input tokens: 14
Cache read tokens: 298
Output tokens: 62

CACHE READ
Answer: For Delhi climate, I recommend the Voltas 1T Window AC at Rs 27,990
specifically designed for Delhi conditions, and the Daikin 1.5T Split ...
Input tokens: 14
Cache read tokens: 298
Output tokens: 71

For ShopMax India production, place the cache breakpoint after content that stays constant across requests: product catalogs, policy documents, few-shot examples. Keep content that varies per user (personalization context, cart state) after the breakpoint, because any change inside the cached prefix produces a different prefix and therefore a cache miss. The 5-minute TTL keeps the cache warm during an active chat session, but a prefix that goes 5 minutes without a hit expires and must be written again. A support agent handling 1000 queries per hour with a 10KB system prompt sends requests far more often than once every 5 minutes, so almost every request is a cache hit and input token costs fall by roughly 85%. Monitor cache_read_input_tokens in your logs to verify cache hit rates in production.
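One way to turn logged usage data into a hit-rate figure is sketched below. The helper name cache_hit_rate and the sample records are hypothetical. The sketch assumes each log record is a plain dict holding the usage fields named above, and that input_tokens counts only the tokens that were neither written to nor read from the cache.

```python
def cache_hit_rate(usages):
    """Fraction of all prompt tokens that were served from the cache.

    usages: iterable of dicts with the keys input_tokens,
    cache_creation_input_tokens and cache_read_input_tokens.
    Missing or None values are counted as 0.
    """
    read = written = uncached = 0
    for u in usages:
        read += u.get("cache_read_input_tokens") or 0
        written += u.get("cache_creation_input_tokens") or 0
        uncached += u.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Hypothetical log: one cache write followed by two cache hits.
logged = [
    {"input_tokens": 20, "cache_creation_input_tokens": 10_000, "cache_read_input_tokens": 0},
    {"input_tokens": 20, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10_000},
    {"input_tokens": 20, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10_000},
]
print(f"Cache hit rate: {cache_hit_rate(logged):.1%}")
```

A rate that stays well below expectations usually means the prefix is changing between requests or traffic is too sparse to keep the cache warm within the TTL.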
