
Prompt Compression with LLMLingua - Reducing Token Costs

Author: Venkata Sudhakar

Prompt compression reduces token count by removing redundant tokens from long prompts while preserving their meaning. At ShopMax India, product descriptions, customer feedback, and policy documents fed into prompts can be compressed before they are sent to the LLM, cutting input token costs by 30-60% with minimal quality loss. LLMLingua is an open-source library from Microsoft that does this automatically, using a small language model to score tokens and drop the less important ones.

LLMLingua uses a small LM (such as GPT-2 or Llama) to compute the perplexity of each token given the preceding context. Low-perplexity tokens, which are highly predictable from that context, carry little new information and are candidates for removal. A compression ratio parameter controls how aggressively tokens are dropped. In this article the ratio is the fraction of tokens kept, so typical settings of 0.5-0.7 keep 50-70% of the tokens; these produce good compression with minimal loss of answer quality for summarization and extraction tasks.
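
The score-and-drop idea can be illustrated with a minimal, dependency-free sketch. This is not LLMLingua's implementation: the real library scores each token with a causal LM conditioned on the preceding context, whereas the hypothetical `token_surprisals` helper below substitutes a simple unigram frequency count so the example runs without a model. The names `compress` and `keep_rate` are assumptions made for illustration.

```python
import math
from collections import Counter


def token_surprisals(tokens):
    """Toy stand-in for the small LM: unigram surprisal, -log2 p(token).

    A real compressor would use a causal LM's conditional probabilities;
    frequency counts are used here only to keep the sketch self-contained.
    """
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    return [-math.log2(counts[t.lower()] / total) for t in tokens]


def compress(text, keep_rate=0.5):
    """Keep the `keep_rate` fraction of tokens with the highest surprisal,
    preserving their original order."""
    tokens = text.split()
    if not tokens:
        return ""
    n_keep = max(1, round(len(tokens) * keep_rate))
    scores = token_surprisals(tokens)
    # Rank positions by surprisal (most informative first); ties are broken
    # by position so earlier tokens win.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    keep = sorted(ranked[:n_keep])
    return " ".join(tokens[i] for i in keep)


feedback = ("the ac is not cooling the room and the technician "
            "did not visit the house today")
print(compress(feedback, keep_rate=0.5))
# prints: ac is cooling room and technician did visit
```

The toy scorer drops both occurrences of "not" because they are frequent, which flips the meaning of the sentence. A context-aware LM scorer is meant to avoid this kind of loss, and it is one reason to benchmark quality before choosing a ratio.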

The example below shows ShopMax India compressing a long customer feedback document before sending it to Claude for sentiment analysis. The compression reduces token count significantly while the LLM still extracts the correct sentiment and key issues.


It gives the following output:

Original tokens (approx): 148
Compressed tokens (approx): 74
Compression ratio achieved: 0.50

Analysis from compressed prompt:
Sentiment: NEGATIVE
Main issue: Insufficient cooling performance - room temperature remains high
despite 18 degree setting, especially during afternoon peak hours.
Resolution status: UNRESOLVED - customer raised issue twice with support,
technician visit promised but not yet fulfilled. Post-purchase experience
rated as unsatisfactory despite product quality being acceptable.
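
The figures in the output above can be checked with a short calculation. The sketch below is illustrative only: `compression_stats` is a hypothetical helper, and `price_per_million` is an assumed pricing parameter that should be set from your provider's actual input token price.

```python
def compression_stats(original_tokens, compressed_tokens, price_per_million=0.0):
    """Return the kept-token ratio and the savings from compression.

    `price_per_million` is an assumed input price per one million tokens;
    it defaults to 0.0 so that no cost figure is invented.
    """
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    saved = original_tokens - compressed_tokens
    return {
        "ratio": compressed_tokens / original_tokens,   # fraction of tokens kept
        "tokens_saved": saved,
        "saving_pct": 100 * saved / original_tokens,
        "cost_saved": saved * price_per_million / 1_000_000,
    }


stats = compression_stats(148, 74)
print(f"Compression ratio achieved: {stats['ratio']:.2f}")
# prints: Compression ratio achieved: 0.50
```

With 148 original tokens and 74 compressed tokens, the ratio is 0.50 and the saving is 50% of input tokens, which falls inside the 30-60% range quoted earlier.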

At ShopMax India, apply LLMLingua compression to the customer feedback and document sections of prompts, not to the instruction itself, because compressing instructions degrades task accuracy. Benchmark compression ratio against quality for your specific task types before deploying: customer feedback summaries typically tolerate a 50% ratio well, while legal or policy documents may need a milder 70% ratio (dropping only 30% of tokens). Run a nightly batch compression job on product reviews before they are fed into daily analytics prompts, and store the compressed versions to avoid re-compression overhead on repeated queries.
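
Two of these recommendations, leaving the instruction uncompressed and storing compressed documents for reuse, can be sketched as follows. This is a minimal illustration under stated assumptions: `CompressedPromptBuilder`, `compress_fn`, and `keep_every_other_word` are hypothetical names, the cache is a plain in-memory dict rather than a real store, and in practice `compress_fn` would wrap a real compressor such as LLMLingua instead of the trivial stand-in used here.

```python
import hashlib
from typing import Callable, Dict


class CompressedPromptBuilder:
    """Compress only the document part of a prompt and cache the result."""

    def __init__(self, compress_fn: Callable[[str, float], str]):
        self._compress_fn = compress_fn      # (document, keep_rate) -> compressed text
        self._cache: Dict[str, str] = {}     # stand-in for a persistent store
        self.cache_hits = 0

    @staticmethod
    def _key(document: str, keep_rate: float) -> str:
        # Key on both the text and the rate, since the same document
        # compressed at a different rate is a different result.
        raw = f"{keep_rate:.2f}\n{document}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def compressed(self, document: str, keep_rate: float) -> str:
        key = self._key(document, keep_rate)
        if key in self._cache:
            self.cache_hits += 1
        else:
            self._cache[key] = self._compress_fn(document, keep_rate)
        return self._cache[key]

    def build(self, instruction: str, document: str, keep_rate: float = 0.5) -> str:
        # The instruction is passed through untouched; only the document shrinks.
        body = self.compressed(document, keep_rate)
        return f"{instruction}\n\nCustomer feedback:\n{body}"


def keep_every_other_word(document: str, keep_rate: float) -> str:
    """Trivial stand-in compressor for the demo; ignores keep_rate."""
    return " ".join(document.split()[::2])


builder = CompressedPromptBuilder(keep_every_other_word)
prompt = builder.build(
    "Classify the sentiment and list the main issue.",
    "The AC does not cool the room even at the 18 degree setting.",
)
print(prompt)
```

Keying the cache on a hash of the document and the rate means a nightly batch job can populate it once, and later queries over the same reviews reuse the stored text instead of recompressing.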


 
  


  