Speech Recognition with Hugging Face Whisper

Author: Venkata Sudhakar

Speech recognition converts spoken audio into text. ShopMax India uses it to transcribe customer voice complaints received through its IVR helpline, turning audio recordings into searchable text that support agents can read, categorise, and resolve without listening to each call manually. Whisper generally handles Indian-accented English well without fine-tuning, although accuracy depends on the model size and the quality of the recording.

OpenAI's Whisper model is available on Hugging Face through the automatic-speech-recognition pipeline of the transformers library. Whisper is a transformer-based model trained on 680,000 hours of multilingual audio. It supports automatic language detection and can transcribe English, Hindi, and other Indian languages. The pipeline accepts local audio file paths or raw audio arrays; Whisper itself works on audio sampled at 16 kHz.
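As a rough illustration of the raw-array input, the sketch below reads a 16-bit mono WAV file with the Python standard library and hands it to the pipeline as a dictionary with raw and sampling_rate keys. The function names and the file name are made up for this example, and it assumes numpy and transformers are installed.

```python
import array
import sys
import wave


def read_wav_as_floats(path):
    """Read a 16-bit mono WAV file and return (samples in [-1, 1], sample rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono WAV files")
        rate = wf.getframerate()
        pcm = array.array("h")
        pcm.frombytes(wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    return [s / 32768.0 for s in pcm], rate


def transcribe_wav(path, model_name="openai/whisper-small"):
    """Hypothetical helper: pass raw audio to the pipeline instead of a file path."""
    # Heavy imports are kept local so the reader above works without them.
    import numpy as np
    from transformers import pipeline

    samples, rate = read_wav_as_floats(path)
    asr = pipeline("automatic-speech-recognition", model=model_name)
    # The dict form tells the pipeline the sampling rate of the raw array.
    audio = {"raw": np.asarray(samples, dtype=np.float32), "sampling_rate": rate}
    return asr(audio)["text"]
```

Calling transcribe_wav("complaint.wav"), where complaint.wav is a placeholder file name, would return the transcribed text. Passing a file path directly is usually simpler; the raw form is mainly useful when the audio is already in memory.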

The example below transcribes a customer voice complaint audio file for ShopMax India's support team, using the Whisper small model for fast local inference.

It gives the following output:

Transcription:
 I have a dream that one day this nation will rise up and live
 out the true meaning of its creed.

Chunked transcription:
 I have a dream that one day this nation will rise up and live
 out the true meaning of its creed.
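A minimal sketch of a script that prints a plain and a chunked transcription in the format above, assuming the transformers library and the openai/whisper-small checkpoint. The helper names and the audio path are placeholders chosen for illustration, not necessarily what was used to produce the output shown.

```python
def build_transcriber(model_name="openai/whisper-small", chunk_length_s=None):
    """Create a Hugging Face ASR pipeline (imported lazily because it is heavy)."""
    from transformers import pipeline

    kwargs = {}
    if chunk_length_s is not None:
        # Split long recordings into windows of this many seconds.
        kwargs["chunk_length_s"] = chunk_length_s
    return pipeline("automatic-speech-recognition", model=model_name, **kwargs)


def transcribe(asr, audio_path):
    """Run any pipeline-like callable on an audio file and return its text."""
    result = asr(audio_path)
    return result["text"].strip()


def run_example(audio_path):
    """Print a plain and a chunked transcription of the same file."""
    print("Transcription:")
    print(transcribe(build_transcriber(), audio_path))
    print()
    print("Chunked transcription:")
    print(transcribe(build_transcriber(chunk_length_s=30), audio_path))
```

Calling run_example("complaint.wav") with a real recording downloads the model weights on first use and needs ffmpeg to decode the file. Whisper works on 30-second windows, so chunk_length_s=30 lets the pipeline split longer calls into pieces and join the results.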

For ShopMax India's IVR pipeline, use openai/whisper-medium or openai/whisper-large-v3 for better accuracy on Indian-accented English and Hindi. Pass return_timestamps=True to get segment-level timestamps, or return_timestamps="word" to get word-level timestamps, which are useful for highlighting key complaint segments. Process calls in batches overnight to reduce GPU costs. Store transcriptions in your support database indexed by order ID for fast retrieval.
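The storage and batching advice could look roughly like the sketch below. It uses SQLite from the standard library as a stand-in for the support database; the table layout, function names, and order IDs are all assumptions made for illustration. It expects results shaped like the pipeline's output with return_timestamps=True, that is, a text field plus a chunks list of timestamped segments.

```python
import sqlite3


def open_store(db_path=":memory:"):
    """Create the (hypothetical) transcripts table with an index on order_id."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transcripts ("
        "id INTEGER PRIMARY KEY, order_id TEXT NOT NULL, "
        "start_s REAL, end_s REAL, text TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_transcripts_order ON transcripts(order_id)"
    )
    return conn


def save_result(conn, order_id, result):
    """Store one row per timestamped segment, or one row if there are none."""
    chunks = result.get("chunks") or [
        {"timestamp": (None, None), "text": result["text"]}
    ]
    rows = [
        (order_id, c["timestamp"][0], c["timestamp"][1], c["text"].strip())
        for c in chunks
    ]
    conn.executemany(
        "INSERT INTO transcripts (order_id, start_s, end_s, text) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


def segments_for_order(conn, order_id):
    """Fetch all stored segments for an order, in insertion order."""
    cur = conn.execute(
        "SELECT start_s, end_s, text FROM transcripts WHERE order_id = ? ORDER BY id",
        (order_id,),
    )
    return cur.fetchall()


def transcribe_batch(paths, model_name="openai/whisper-medium", batch_size=8):
    """Transcribe many files in one go; returns one result dict per path."""
    from transformers import pipeline  # heavy import kept local

    asr = pipeline(
        "automatic-speech-recognition", model=model_name, chunk_length_s=30
    )
    return asr(list(paths), batch_size=batch_size, return_timestamps=True)
```

A nightly job could call transcribe_batch on the day's recordings and pass each result to save_result with the matching order ID. The batch size of 8 is an arbitrary starting point and should be tuned to the available GPU memory.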
