Calibration Prompting - Getting Confidence Scores from LLMs

Author: Venkata Sudhakar

Calibration prompting helps ShopMax India measure how confident an LLM is in its answers, letting the platform surface high-confidence responses to customers while flagging uncertain ones for human review. When a customer asks whether a specific laptop model supports a software requirement, ShopMax needs a confidence score alongside the answer - not just the answer itself.

LLMs do not return a calibrated probability alongside their answers, but you can prompt them to express confidence in a structured format. Common techniques include asking the model to rate its own certainty on a 0-10 scale, asking it to output a JSON field such as "confidence": 0.85, or reading token log probabilities from the API when they are available. OpenAI's Chat Completions API exposes logprobs and top_logprobs parameters for this purpose. Not every provider returns token log probabilities, so check your provider's API reference before relying on them.
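The JSON technique needs a parsing step, because a model may wrap the JSON in extra text or return a malformed value. The following is a minimal sketch of that step, not a definitive implementation. The instruction wording, the "answer" and "confidence" field names, and the function name parse_self_rated_reply are all assumptions made for illustration.

```python
import json
import re

# Hypothetical instruction to append to the system prompt (wording is an assumption).
SELF_RATING_INSTRUCTION = (
    "Reply with a single JSON object of the form "
    '{"answer": "<text>", "confidence": <number between 0 and 1>}.'
)

def parse_self_rated_reply(reply_text):
    """Return (answer, confidence); confidence is None when it cannot be parsed."""
    # Take the span from the first "{" to the last "}" in case the model added extra text.
    match = re.search(r"\{.*\}", reply_text, re.DOTALL)
    if match is None:
        return reply_text, None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return reply_text, None
    answer = data.get("answer", reply_text)
    confidence = data.get("confidence")
    # Reject non-numeric values such as "high" (bool is excluded explicitly).
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return answer, None
    # Clamp out-of-range self-ratings into [0, 1].
    return answer, min(1.0, max(0.0, float(confidence)))
```

A reply whose confidence comes back as None can be treated the same way as a low-confidence answer and sent for human review.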

The following example shows ShopMax India calling OpenAI with logprobs enabled on a product compatibility question. Because the system prompt asks for Yes or No first, the log probability of the first output token indicates how strongly the model favors that verdict. The script converts this log probability into a simple confidence score and routes the response based on a threshold.

import openai
import math

client = openai.OpenAI(api_key="sk-...")

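def get_answer_with_confidence(question):
    # Ask for the verdict first so the first output token carries the Yes/No decision.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer with Yes or No first, then explain briefly."},
            {"role": "user", "content": question}
        ],
        logprobs=True,       # return log probabilities for the generated tokens
        top_logprobs=3,      # also return the 3 most likely alternatives per position
        max_tokens=100
    )
    answer = response.choices[0].message.content
    # Log probability of the first generated token (the Yes/No verdict).
    top_logprob = response.choices[0].logprobs.content[0].logprob
    # exp() converts the log probability back to a probability in [0, 1].
    confidence = math.exp(top_logprob)
    return answer, confidence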

questions = [
    "Does the Dell XPS 15 support Ubuntu 22.04 for software development?",
    "Is the Samsung Galaxy S24 compatible with all Indian telecom bands?",
    "Can the Sony WH-1000XM5 headphones connect to two devices simultaneously?"
]

for q in questions:
    answer, conf = get_answer_with_confidence(q)
    status = "HIGH CONFIDENCE" if conf > 0.80 else "NEEDS REVIEW"
    print(f"Q: {q}")
    print(f"A: {answer[:60]}...")
    print(f"Confidence: {conf:.2f} [{status}]")
    print()

It produces output similar to the following. Exact answers and scores can vary between runs and model versions.

Q: Does the Dell XPS 15 support Ubuntu 22.04 for software development?
A: Yes, the Dell XPS 15 supports Ubuntu 22.04 and is widely...
Confidence: 0.94 [HIGH CONFIDENCE]

Q: Is the Samsung Galaxy S24 compatible with all Indian telecom bands?
A: Yes, the Samsung Galaxy S24 supports major Indian 4G and 5G...
Confidence: 0.87 [HIGH CONFIDENCE]

Q: Can the Sony WH-1000XM5 headphones connect to two devices simultaneously?
A: Yes, Sony WH-1000XM5 supports Multipoint Connection allowing...
Confidence: 0.91 [HIGH CONFIDENCE]
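The script requests top_logprobs=3 but reads only the log probability of the token that was actually generated. One possible refinement, sketched here under assumptions, is to compare the probability mass on "Yes" against the mass on "No" among the returned alternatives, so that casing or whitespace variants of the same verdict are counted together. The function name yes_no_confidence and the example values are hypothetical and are not output from a real API call.

```python
import math

def yes_no_confidence(top_logprobs):
    """Return (label, confidence) from (token, logprob) pairs for the first output position."""
    p_yes = 0.0
    p_no = 0.0
    for token, logprob in top_logprobs:
        normalized = token.strip().lower()
        if normalized == "yes":
            p_yes += math.exp(logprob)
        elif normalized == "no":
            p_no += math.exp(logprob)
    total = p_yes + p_no
    if total == 0.0:
        # Neither verdict appears among the alternatives: treat as no confidence.
        return None, 0.0
    # Renormalize over the two verdicts only.
    if p_yes >= p_no:
        return "Yes", p_yes / total
    return "No", p_no / total

# Made-up alternatives for illustration only.
example = [("Yes", math.log(0.70)), (" yes", math.log(0.10)), ("No", math.log(0.15))]
label, confidence = yes_no_confidence(example)
print(label, round(confidence, 2))
```

With the OpenAI Python SDK, the input pairs could be built as [(t.token, t.logprob) for t in response.choices[0].logprobs.content[0].top_logprobs].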

For production at ShopMax India, set the confidence threshold according to the cost of a wrong answer. A compatibility claim about an Rs 80,000 laptop warrants a higher threshold than a general product description. Log every low-confidence response to a review queue and audit the queue periodically to refine your prompt design. If logprobs are not available for your model, a self-rating prompt such as 'Rate your certainty 0-10 before answering' is a practical fallback.
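The routing and audit advice can be sketched roughly as follows. The category names, threshold values, in-memory review_queue, and both function names are assumptions for illustration; a real deployment would use its own categories and a persistent queue. The second function computes a simple expected calibration error, assuming reviewers have labeled a sample of past answers as correct or incorrect.

```python
# Hypothetical per-category thresholds; tune these to the cost of a wrong answer.
THRESHOLDS = {"compatibility": 0.95, "description": 0.80}
DEFAULT_THRESHOLD = 0.90

review_queue = []  # stand-in for a persistent queue or database table

def route_response(category, question, answer, confidence):
    """Show the answer if it clears the category threshold, else queue it for review."""
    threshold = THRESHOLDS.get(category, DEFAULT_THRESHOLD)
    if confidence >= threshold:
        return "SHOW_TO_CUSTOMER"
    review_queue.append({
        "category": category,
        "question": question,
        "answer": answer,
        "confidence": confidence,
    })
    return "NEEDS_REVIEW"

def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy across equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / len(confidences)) * abs(mean_conf - accuracy)
    return ece
```

A value near zero suggests the scores track observed accuracy on the labeled sample. A large value suggests that the threshold or the prompt needs revisiting.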
