
LLM Latency Profiling and Bottleneck Detection in Python

Author: Venkata Sudhakar

ShopMax India processes thousands of LLM calls per hour across recommendation, search, and support features. When response times degrade, it is critical to pinpoint whether the bottleneck is in the LLM API call, the retrieval step, the post-processing logic, or the network. Python's time module combined with a profiling context manager gives per-step latency measurements without adding external dependencies.

The profiling approach wraps each pipeline stage in a context manager that records start and end timestamps. Results are collected in a dictionary keyed by stage name. After each request, the latency breakdown is logged to a file or monitoring system. This works for any LLM pipeline - RAG, agent chains, or simple completions - without modifying core business logic.
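A minimal sketch of such a context manager, using only the standard library. The stage name and the `store` dictionary passed in are illustrative; a real pipeline would supply its own collector per request:

```python
import time
from contextlib import contextmanager

@contextmanager
def profile_stage(name, store):
    """Record the wall-clock latency of a pipeline stage, in milliseconds,
    under `name` in the `store` dictionary."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the elapsed time even if the stage raises an exception.
        store[name] = (time.perf_counter() - start) * 1000.0

# Usage: wrap any stage without touching its internals.
latencies = {}
with profile_stage("retrieval", latencies):
    time.sleep(0.02)  # stand-in for a real vector-store query
```

Because the timing lives in `finally`, a stage that fails still contributes a latency sample, which is exactly when the breakdown is most useful.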

The example below profiles a three-stage ShopMax India pipeline: embedding generation, vector retrieval, and LLM completion. It reports the latency for each stage and identifies the bottleneck.
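A runnable sketch of this pipeline, with the three stages simulated by `time.sleep`. The stage functions (`generate_embedding`, `retrieve_documents`, `call_llm`) and their durations are illustrative stand-ins for real embedding, vector-store, and LLM API calls, so the absolute numbers will differ from the output shown below:

```python
import time
from contextlib import contextmanager

@contextmanager
def profile_stage(name, store):
    """Record the wall-clock latency of a stage, in ms, under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[name] = (time.perf_counter() - start) * 1000.0

# Simulated stages; real code would call the embedding model,
# vector store, and LLM API here.
def generate_embedding(query):
    time.sleep(0.05)
    return [0.1] * 8

def retrieve_documents(vector):
    time.sleep(0.08)
    return ["doc1", "doc2"]

def call_llm(query, docs):
    time.sleep(0.12)  # the LLM call typically dominates in production
    return "answer"

def run_pipeline(query):
    latencies = {}
    with profile_stage("embedding", latencies):
        vector = generate_embedding(query)
    with profile_stage("retrieval", latencies):
        docs = retrieve_documents(vector)
    with profile_stage("llm_completion", latencies):
        answer = call_llm(query, docs)
    return answer, latencies

if __name__ == "__main__":
    _, latencies = run_pipeline("best budget headphones")
    print("Latency breakdown (ms):")
    for stage, ms in latencies.items():
        print(f"  {stage}: {ms:.1f} ms")
    bottleneck = max(latencies, key=latencies.get)
    print(f"\nBottleneck: {bottleneck} ({latencies[bottleneck]:.1f} ms)")
```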


It gives the following output:

Latency breakdown (ms):
  embedding: 51.2 ms
  retrieval: 82.4 ms
  llm_completion: 1243.7 ms

Bottleneck: llm_completion (1243.7 ms)

Log latencies to a time-series database such as InfluxDB or Prometheus for trend analysis. Set alert thresholds - if LLM completion exceeds 2000ms, fire an alert to the on-call team. For RAG pipelines, retrieval latency above 200ms usually indicates an index that needs optimisation. Cache embedding results for repeated queries to bring that stage latency close to zero.
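The threshold rule can be expressed as a small check run after each request. The `ALERT_THRESHOLDS_MS` values mirror the limits suggested above, but the names are illustrative, and the actual notification hook is left to the team's alerting stack:

```python
# Per-stage alert thresholds in milliseconds (illustrative values
# matching the guidance above; tune for your own pipeline).
ALERT_THRESHOLDS_MS = {
    "llm_completion": 2000.0,
    "retrieval": 200.0,
}

def check_thresholds(latencies, thresholds=ALERT_THRESHOLDS_MS):
    """Return the stages whose measured latency exceeds their threshold."""
    return [stage for stage, ms in latencies.items()
            if ms > thresholds.get(stage, float("inf"))]

breaches = check_thresholds({"embedding": 51.2,
                             "retrieval": 82.4,
                             "llm_completion": 2431.9})
print(breaches)  # → ['llm_completion']
```

Stages without a configured threshold default to infinity, so adding a new pipeline stage never fires spurious alerts until a limit is deliberately chosen for it.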


 
  


  