
LLM Latency Profiling and Bottleneck Detection in Python

Author: Venkata Sudhakar

ShopMax India processes thousands of LLM calls per hour across recommendation, search, and support features. When response times degrade, it is critical to pinpoint whether the bottleneck is in the LLM API call, the retrieval step, the post-processing logic, or the network. Python's time module combined with a profiling context manager gives per-step latency measurements without adding external dependencies.

The profiling approach wraps each pipeline stage in a context manager that records start and end timestamps. Results are collected in a dictionary keyed by stage name. After each request, the latency breakdown is logged to a file or monitoring system. This works for any LLM pipeline - RAG, agent chains, or simple completions - without modifying core business logic.
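A minimal sketch of such a context manager, using only the standard library. The stage name and the `store` dictionary passed in are illustrative; a real pipeline would supply its own collector per request:

```python
import time
from contextlib import contextmanager

@contextmanager
def profile_stage(name, store):
    """Record the wall-clock latency of a pipeline stage, in milliseconds,
    under `name` in the `store` dictionary."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the elapsed time even if the stage raises an exception.
        store[name] = (time.perf_counter() - start) * 1000.0

# Usage: wrap any stage without touching its internals.
latencies = {}
with profile_stage("retrieval", latencies):
    time.sleep(0.02)  # stand-in for a real vector-store query
```

Because the timing lives in `finally`, a stage that fails still contributes a latency sample, which is exactly when the breakdown is most useful.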

The example below profiles a three-stage ShopMax India pipeline: embedding generation, vector retrieval, and LLM completion. It reports the latency for each stage and identifies the bottleneck.
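A runnable sketch of this pipeline, with the three stages simulated by `time.sleep`. The stage functions (`generate_embedding`, `retrieve_documents`, `call_llm`) and their durations are illustrative stand-ins for real embedding, vector-store, and LLM API calls, so the absolute numbers will differ from the output shown below:

```python
import time
from contextlib import contextmanager

@contextmanager
def profile_stage(name, store):
    """Record the wall-clock latency of a stage, in ms, under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[name] = (time.perf_counter() - start) * 1000.0

# Simulated stages; real code would call the embedding model,
# vector store, and LLM API here.
def generate_embedding(query):
    time.sleep(0.05)
    return [0.1] * 8

def retrieve_documents(vector):
    time.sleep(0.08)
    return ["doc1", "doc2"]

def call_llm(query, docs):
    time.sleep(0.12)  # the LLM call typically dominates in production
    return "answer"

def run_pipeline(query):
    latencies = {}
    with profile_stage("embedding", latencies):
        vector = generate_embedding(query)
    with profile_stage("retrieval", latencies):
        docs = retrieve_documents(vector)
    with profile_stage("llm_completion", latencies):
        answer = call_llm(query, docs)
    return answer, latencies

if __name__ == "__main__":
    _, latencies = run_pipeline("best budget headphones")
    print("Latency breakdown (ms):")
    for stage, ms in latencies.items():
        print(f"  {stage}: {ms:.1f} ms")
    bottleneck = max(latencies, key=latencies.get)
    print(f"\nBottleneck: {bottleneck} ({latencies[bottleneck]:.1f} ms)")
```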


It gives the following output:

Latency breakdown (ms):
  embedding: 51.2 ms
  retrieval: 82.4 ms
  llm_completion: 1243.7 ms

Bottleneck: llm_completion (1243.7 ms)

Log latencies to a time-series database such as InfluxDB or Prometheus for trend analysis. Set alert thresholds - if LLM completion exceeds 2000ms, fire an alert to the on-call team. For RAG pipelines, retrieval latency above 200ms usually indicates an index that needs optimisation. Cache embedding results for repeated queries to bring that stage latency close to zero.
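The threshold rule can be expressed as a small check run after each request. The `ALERT_THRESHOLDS_MS` values mirror the limits suggested above, but the names are illustrative, and the actual notification hook is left to the team's alerting stack:

```python
# Per-stage alert thresholds in milliseconds (illustrative values
# matching the guidance above; tune for your own pipeline).
ALERT_THRESHOLDS_MS = {
    "llm_completion": 2000.0,
    "retrieval": 200.0,
}

def check_thresholds(latencies, thresholds=ALERT_THRESHOLDS_MS):
    """Return the stages whose measured latency exceeds their threshold."""
    return [stage for stage, ms in latencies.items()
            if ms > thresholds.get(stage, float("inf"))]

breaches = check_thresholds({"embedding": 51.2,
                             "retrieval": 82.4,
                             "llm_completion": 2431.9})
print(breaches)  # → ['llm_completion']
```

Stages without a configured threshold default to infinity, so adding a new pipeline stage never fires spurious alerts until a limit is deliberately chosen for it.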


 
  


  