We conducted a comprehensive evaluation of Automatic Speech Recognition (ASR) systems using carefully curated benchmark datasets in English and Hindi sourced from YouTube videos. The providers and models evaluated are listed below:
| Provider | Model Evaluated |
|---|---|
| SandLogic STT (Speech-to-Text) | Shakti |
| Deepgram | nova-3 (for English), nova-2 (for Hindi) |
| Google | Chirp_3 |
| Microsoft Azure | Best/Default |
| ElevenLabs | Scribe_v2 |
| Sarvam | Saaras_V3 |
The benchmark was curated from publicly available long-form online videos based on the following considerations:

- Preference was given to conversational audio, monologues, and interviews rather than synthesized or scripted speech.
- Samples include American, British, Indian, and regionally influenced English accents to evaluate accent generalization.
- Audio spans varying levels of background sound, from clean studio-grade recordings to noisy clips containing music, sound effects, and external speech fragments.
- Domains include personal development, global socio-economic topics, entertainment, automotive industry insights, and language education, introducing terminology from business, psychology, technology, politics, and grammar instruction.
All dataset transcriptions were manually created, reviewed twice by humans, and normalized to provide a consistent reference for evaluation. Normalization removed punctuation and ignored capitalization; the ground truth transcription otherwise remained unchanged.
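The normalization step described above (strip punctuation, ignore case) can be sketched as a small helper; this is an illustrative implementation of the stated rules, not the benchmark's actual tooling:

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    reference and hypothesis transcripts compare on words alone."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize("Hello, World!  This is a Test."))
# → hello world this is a test
```

Applying the same function to both the reference and each model's output keeps punctuation and casing differences from being counted as word errors.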
Each provider's API was carefully integrated according to official documentation, ensuring fair and accurate comparison.
All models were evaluated in asynchronous (file/batch) transcription mode to ensure consistency in testing.
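One way to keep asynchronous (file/batch) mode consistent across vendors is to hide each SDK behind a single interface. The sketch below is purely illustrative: `BatchTranscriber`, `transcribe_file`, and `run_benchmark` are our own assumed names, not any provider's real API.

```python
from abc import ABC, abstractmethod

class BatchTranscriber(ABC):
    """Hypothetical provider-agnostic wrapper; each real provider's SDK
    (Deepgram, Azure, etc.) would be adapted behind this interface."""

    @abstractmethod
    def transcribe_file(self, audio_path: str, language: str) -> str:
        """Submit one audio file for asynchronous (batch) transcription
        and block until the final transcript is returned."""

def run_benchmark(transcriber: BatchTranscriber, samples):
    """samples: list of (audio_path, language) pairs.
    Returns one hypothesis transcript per sample, in order."""
    return [transcriber.transcribe_file(path, lang) for path, lang in samples]
```

With every vendor adapted to the same entry point, the scoring code never needs provider-specific branches.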
This benchmark measures how well a speech-to-text model performs in realistic conditions, not just on clean, simple audio. The dataset supports evaluation along the following dimensions:
- Different English accents test whether the model can understand speakers from various regions.
- Some samples contain music or sound effects, checking whether the model still produces correct text when noise is present.
- Conversations and interviews test whether the model can handle more than one speaker without mixing their words.
- Videos include terms related to business, vehicles, grammar, and global topics, checking whether the model can transcribe uncommon or subject-specific words.
- Speakers talk at different speeds, testing whether the model handles fast speech and long sentences without errors.
- Not all recordings are equally clear, checking whether the model works well with both clean and slightly noisy audio.
| Metric | SL STT | Deepgram nova-3 | Google Chirp 3 | ElevenLabs Scribe v2 | Microsoft Azure | Sarvam Saaras v3 |
|---|---|---|---|---|---|---|
| Average WER (%) | 13.27 | 17.53 | 24.47 | 23.19 | 21.93 | 13.55 |
| Average CER (%) | 11.36 | 10.16 | 12.24 | 10.16 | 8.45 | 13.41 |
| Total Words | 46482 | 46482 | 46482 | 46482 | 46482 | 46482 |
| Total Deletions | 3200 | 4400 | 4600 | 4300 | 3900 | 1816 |
| Total Insertions | 900 | 1300 | 3000 | 3100 | 2200 | 1861 |
| Total Substitutions | 2066 | 2445 | 3771 | 3376 | 4096 | 2621 |
| Total Hits | 41216 | 39637 | 38111 | 38806 | 38486 | 42045 |
| Total Errors | 6166 | 8145 | 11371 | 10776 | 10196 | 6298 |
Result: SandLogic STT achieved the highest speech recognition accuracy in English.
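The hit, substitution, deletion, and insertion counts in these tables follow from a word-level Levenshtein alignment between reference and hypothesis, with WER = (S + D + I) / N over N reference words; Average CER is the same computation at the character level. A minimal sketch (function names are ours, not the benchmark tooling's):

```python
def align_counts(ref, hyp):
    """Word-level edit-distance alignment of two token lists.
    Returns (hits, substitutions, deletions, insertions)."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimal edit cost aligning ref[:i] with hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(diag, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace to classify each aligned position.
    hits = subs = dels = ins = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins

def wer(ref_text, hyp_text):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = ref_text.split(), hyp_text.split()
    h, s, d, i = align_counts(ref, hyp)
    return (s + d + i) / max(len(ref), 1)

h, s, d, i = align_counts("the cat sat on the mat".split(), "the cat sit on mat".split())
print(h, s, d, i)  # hits=4, substitutions=1, deletions=1, insertions=0
```

Note that Hits + Substitutions + Deletions always equals Total Words, which the tables above satisfy for every provider.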
| Metric | SL STT | Deepgram nova-2 | Google Chirp 3 | ElevenLabs Scribe v2 | Microsoft Azure | Sarvam Saaras v3 |
|---|---|---|---|---|---|---|
| Average WER (%) | 16.5 | 13.8 | 29.55 | 19.99 | 29.35 | 17.52 |
| Average CER (%) | 10.95 | 7.75 | 10.83 | 10.22 | 12.29 | 13.51 |
| Total Words | 54833 | 54833 | 54833 | 54833 | 54833 | 54833 |
| Total Deletions | 3600 | 3100 | 6100 | 4100 | 6300 | 2526 |
| Total Insertions | 1100 | 1200 | 4300 | 2600 | 4200 | 1050 |
| Total Substitutions | 4347 | 3266 | 5806 | 4263 | 5585 | 6035 |
| Total Hits | 46886 | 48467 | 42927 | 46470 | 42948 | 45222 |
| Total Errors | 9047 | 7566 | 16206 | 10963 | 16085 | 9611 |
Result: Deepgram nova-2 achieved the highest speech recognition accuracy in Hindi.
This evaluation provides a transparent, rigorous comparison of speech recognition performance across industry-leading providers in English and Hindi. Because the datasets are drawn from real-world YouTube audio, the benchmarks reflect authentic deployment scenarios.