Prompt Compression with LLMLingua - Reducing Token Costs

Author: Venkata Sudhakar

Prompt compression reduces token count by removing redundant words from long prompts while preserving their meaning. At ShopMax India, the product descriptions, customer feedback, and policy documents fed into prompts can be compressed before they are sent to the LLM, cutting input token costs by 30-60% with minimal quality loss. LLMLingua is an open-source library from Microsoft that does this automatically, using a small language model to score tokens and drop the less important ones.

LLMLingua uses a small LM (such as GPT-2 or Llama) to compute the perplexity of each token. Low-perplexity tokens, which are highly predictable from their context, carry little information and are candidates for removal. A compression ratio parameter controls how aggressively tokens are dropped. Typical settings that keep 50-70% of the tokens (0.5-0.7) produce good compression with minimal loss of answer quality on summarization and extraction tasks.
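To make the scoring idea concrete, here is a toy sketch. It is not LLMLingua's implementation: it swaps the small neural LM for an add-one-smoothed bigram model estimated from the text itself, scores each word by its surprisal (negative log probability given the previous word), and keeps the highest-scoring fraction in the original order. The names `surprisals` and `prune` and the `keep` parameter are hypothetical.

```python
import math
from collections import Counter


def surprisals(tokens):
    """Score each token by -log2 P(token | previous token) under an
    add-one-smoothed bigram model estimated from the tokens themselves.
    (Toy stand-in for the small LM; not what LLMLingua actually runs.)"""
    vocab_size = len(set(tokens))
    prev_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = [float("inf")]  # first token has no context: always keep it
    for prev, tok in zip(tokens, tokens[1:]):
        p = (pair_counts[(prev, tok)] + 1) / (prev_counts[prev] + vocab_size)
        scores.append(-math.log2(p))
    return scores


def prune(text, keep=0.5):
    """Keep the `keep` fraction of tokens with the highest surprisal,
    preserving their original order."""
    tokens = text.split()
    if not tokens:
        return ""
    n_keep = max(1, round(len(tokens) * keep))
    scores = surprisals(tokens)
    # Rank positions by score, breaking ties in favour of earlier positions
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:n_keep])
    return " ".join(tokens[i] for i in kept)


sample = "the room is hot and the room is hot and the technician never came"
print(prune(sample, keep=0.5))
```

Repeated phrases are predictable and go first, while the rare words that carry the complaint survive. A real compressor gets far better estimates from a pretrained LM, but the keep-the-surprising-tokens principle is the same.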

The example below shows ShopMax India compressing a long customer feedback document before sending it to Claude for sentiment analysis. The compression reduces token count significantly while the LLM still extracts the correct sentiment and key issues.

from llmlingua import PromptCompressor
import anthropic

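# The Anthropic client reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()
# GPT-2 acts as the small scoring model; device_map="cpu" avoids needing a GPU
compressor = PromptCompressor(model_name="openai-community/gpt2", device_map="cpu")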

LONG_FEEDBACK = """
I purchased the LG 1.5 ton 5 star inverter split air conditioner from ShopMax India
last month. The delivery was done in 4 days to my home in Hyderabad which I thought
was quite reasonable given the current logistics situation across the country.
The installation team arrived on time and completed the work professionally.
However I have been facing issues with the cooling performance of the unit.
Despite running the air conditioner for several hours on 18 degrees setting,
the room temperature remains quite high especially during the peak afternoon hours.
I have raised this issue with customer support twice but have not received any
satisfactory resolution so far. The support team keeps asking me to wait for a
technician visit but no one has shown up yet. Overall I am quite dissatisfied
with the post-purchase experience even though the product itself seems well-built.
"""

instruction = "Summarize the customer feedback: identify sentiment, main issue, and resolution status. Be concise."

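original_prompt = instruction + "\n\nFeedback:\n" + LONG_FEEDBACK
# Whitespace word counts are only a rough proxy for model tokens
print("Original tokens (approx):", len(original_prompt.split()))

# Compress only the feedback. compress_prompt also accepts instruction=, but it
# then prepends the instruction to compressed_prompt, which would duplicate it
# in the prompt assembled below.
# ratio=0.5 targets roughly half the tokens; newer LLMLingua releases name this
# parameter rate (fraction of tokens kept), so check your installed version.
compressed = compressor.compress_prompt(
    LONG_FEEDBACK,
    ratio=0.5
)
compressed_prompt = instruction + "\n\nFeedback (compressed):\n" + compressed["compressed_prompt"]
print("Compressed tokens (approx):", len(compressed_prompt.split()))
print("Compression ratio achieved:", compressed.get("ratio", "N/A"))
print()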

response = client.messages.create(
    model="claude-haiku-4-5", max_tokens=150,
    messages=[{"role": "user", "content": compressed_prompt}]
)
print("Analysis from compressed prompt:")
print(response.content[0].text)

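It gives output similar to the following. Exact counts and the model's wording vary between runs and library versions, and LLMLingua may report the ratio as a multiplier such as 2.0x rather than a fraction.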

Original tokens (approx): 148
Compressed tokens (approx): 74
Compression ratio achieved: 0.50

Analysis from compressed prompt:
Sentiment: NEGATIVE
Main issue: Insufficient cooling performance - room temperature remains high
despite 18 degree setting, especially during afternoon peak hours.
Resolution status: UNRESOLVED - customer raised issue twice with support,
technician visit promised but not yet fulfilled. Post-purchase experience
rated as unsatisfactory despite product quality being acceptable.
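
The saving implied by the counts printed above can be estimated with simple arithmetic. The sketch below is illustrative only: `input_savings` is a hypothetical helper, `PRICE_PER_MILLION_INPUT_TOKENS` is a placeholder rather than a real price, and the approximate word counts stand in for true token counts.

```python
def input_savings(original_tokens, compressed_tokens, price_per_million):
    """Return (fraction of input tokens saved, money saved per request)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    saved = original_tokens - compressed_tokens
    return saved / original_tokens, saved * price_per_million / 1_000_000


# Placeholder price in your billing currency; substitute your provider's rate.
PRICE_PER_MILLION_INPUT_TOKENS = 1.00

fraction, per_request = input_savings(148, 74, PRICE_PER_MILLION_INPUT_TOKENS)
print(f"Input tokens saved: {fraction:.0%}")
print(f"Saved per request: {per_request:.6f}")
print(f"Saved per 100,000 requests: {per_request * 100_000:.2f}")
```

Only input tokens shrink; output tokens are billed as before, so the saving matters most for prompts dominated by long documents.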

At ShopMax India, apply LLMLingua compression to the customer feedback and document sections of a prompt, not to the instruction itself, because compressing instructions degrades task accuracy. Benchmark compression ratio against output quality for each task type before deploying: customer feedback summaries typically tolerate 50% compression well, while legal or policy documents may need a milder setting that keeps 70% of tokens (dropping only 30%). Run a nightly batch job that compresses product reviews before they are fed into daily analytics prompts, and store the compressed versions so repeated queries do not pay the compression overhead again.
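
One way to store compressed versions so that repeated queries skip re-compression is a cache keyed by a hash of the text plus the compression setting. This is a minimal in-memory sketch under stated assumptions: `CompressionCache` and `compress_fn` are hypothetical names, and the stand-in compressor merely keeps every other word so the example runs without LLMLingua. In practice `compress_fn` would wrap `compressor.compress_prompt`, and the dictionary would be replaced by a database or file store.

```python
import hashlib


class CompressionCache:
    """Memoize compressed text by (content hash, ratio)."""

    def __init__(self, compress_fn):
        self._compress_fn = compress_fn  # callable: (text, ratio) -> str
        self._store = {}
        self.misses = 0  # number of times the compressor actually ran

    def _key(self, text, ratio):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{digest}:{ratio}"

    def get(self, text, ratio):
        key = self._key(text, ratio)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compress_fn(text, ratio)
        return self._store[key]


def drop_alternate_words(text, ratio):
    """Stand-in compressor (NOT LLMLingua): keeps every other word and
    ignores `ratio`; it exists only so the sketch is runnable."""
    return " ".join(text.split()[::2])


cache = CompressionCache(drop_alternate_words)
reviews = ["Cooling is weak in the afternoon", "Delivery was quick and professional"]

for review in reviews + reviews:  # the second pass is served from the cache
    cache.get(review, 0.5)

print("Compressor calls:", cache.misses)
```

Including the ratio in the key means a review compressed at 0.5 for summaries and at 0.7 for policy checks is stored as two separate entries, so changing the setting never returns a stale result.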
