|
|
Calibration Prompting - Getting Confidence Scores from LLMs
Author: Venkata Sudhakar
Calibration prompting helps ShopMax India measure how confident an LLM is in its answers, letting the platform surface high-confidence responses to customers while flagging uncertain ones for human review. When a customer asks whether a specific laptop model supports a software requirement, ShopMax needs a confidence score alongside the answer - not just the answer itself.
LLMs do not natively output probability scores, but you can prompt them to express confidence using structured formats. Common techniques include asking the model to rate its own certainty on a 0-10 scale, requesting it to output a JSON field like 'confidence': 0.85, or using token log probabilities from the API when available. OpenAI and Anthropic APIs expose logprobs parameters that let you compute calibration metrics programmatically.
The following example shows ShopMax India using OpenAI with logprobs enabled to extract the top-token probability for a product compatibility question. The script computes a simple confidence score from the log probability of the first output token and routes the response based on a threshold.
It gives the following output,
Q: Does the Dell XPS 15 support Ubuntu 22.04 for software development?
A: Yes, the Dell XPS 15 supports Ubuntu 22.04 and is widely...
Confidence: 0.94 [HIGH CONFIDENCE]
Q: Is the Samsung Galaxy S24 compatible with all Indian telecom bands?
A: Yes, the Samsung Galaxy S24 supports major Indian 4G and 5G...
Confidence: 0.87 [HIGH CONFIDENCE]
Q: Can the Sony WH-1000XM5 headphones connect to two devices simultaneously?
A: Yes, Sony WH-1000XM5 supports Multipoint Connection allowing...
Confidence: 0.91 [HIGH CONFIDENCE]
For production at ShopMax India, set your confidence threshold based on the consequences of a wrong answer. Product compatibility claims (Rs 80,000 laptop) warrant a higher threshold than general product descriptions. Log all low-confidence responses to a review queue and periodically audit them to refine your prompt design. If logprobs are not available for your model, use a self-rating prompt like 'Rate your certainty 1-10 before answering' as a practical fallback.
|
|