Abstract

We conducted a comprehensive evaluation of Automatic Speech Recognition (ASR) systems using carefully curated benchmark datasets in English and Hindi sourced from YouTube videos. The benchmarking results are summarized as follows:

  • Languages evaluated: 2 (English and Hindi)
  • Total samples: 345 (118 English + 227 Hindi)
  • Total duration: ~10 hours of audio
  • Evaluation datasets: Real-world YouTube videos covering diverse acoustic conditions, speaking styles, accents, and topics.
  • Ground truth transcriptions: Manually transcribed and double-reviewed by humans, then normalized for fair evaluation.
  • Processing mode: Asynchronous (file/batch) transcription
  • Results: SandLogic STT achieved the highest accuracy in English; Deepgram nova-2 achieved the highest accuracy in Hindi

Models Evaluated

| Provider | Model Evaluated |
| --- | --- |
| SandLogic STT (Speech-to-Text) | Shakti |
| Deepgram | nova-3 (for English), nova-2 (for Hindi) |
| Google | Chirp_3 |
| Microsoft Azure | Best/Default |
| ElevenLabs | Scribe_v2 |
| Sarvam | Saaras_V3 |

Evaluation Process

1. Dataset Selection Criteria

The benchmark was curated from publicly available long-form online videos based on the following considerations:

a) Authenticity of Human Speech

Preference was given to conversational audio, monologues, and interviews rather than synthesized or scripted speech.

b) Accent Diversity

Samples include American, British, Indian, and regionally influenced English accents to evaluate accent generalization capabilities.

c) Acoustic Variance

Audio was selected to span varying levels of background sound, ranging from clean studio-grade recordings to noisy clips containing music, sound effects, and fragments of external speech.

d) Topical Breadth

Domains include personal development, global socio-economic topics, entertainment, automotive industry insights, and language education. This diversity introduces terminology from business, psychology, technology, politics, and grammar instruction.

English Content Categories:

  • Personal productivity discussions
  • Technical and business analysis
  • Cross-cultural conversations
  • Entertainment and career dialogues
  • Grammar and language tutorials

Hindi Content Categories:

  • Logical thinking and decision-making talks
  • Grammar teaching sessions
  • Motivational speeches
  • Career guidance talks

2. Transcription & Ground Truth Creation

All dataset transcriptions were manually created, double-reviewed by humans, and normalized to provide a consistent reference for evaluation. Normalization consisted of removing punctuation and lowercasing the text; the ground truth transcriptions were otherwise left unchanged.
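The report does not include the normalization code itself; the following is a minimal sketch of the step as described (lowercasing plus punctuation removal), applied identically to references and hypotheses before scoring:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace.

    Assumed implementation: the report only states that punctuation is
    removed and capitalization ignored. Note that `\w` in Python regexes
    matches Devanagari letters as well, so Hindi text survives intact
    while danda marks (U+0964/U+0965) are stripped as punctuation.
    """
    text = re.sub(r"[^\w\s]|_", " ", text.lower())
    return " ".join(text.split())

assert normalize("Hello, World!  This is a test.") == "hello world this is a test"
```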

3. Model Integration

Each provider's API was integrated according to its official documentation to ensure a fair and accurate comparison.
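The integration code is not part of the report. The sketch below shows one common way such a comparison is structured, with each vendor's real client hidden behind a uniform interface; every name here (AsyncSTTProvider, transcribe_file, run_benchmark) is invented for illustration and comes from neither the report nor any vendor SDK:

```python
from abc import ABC, abstractmethod

class AsyncSTTProvider(ABC):
    """Hypothetical uniform wrapper so every provider is invoked identically."""

    @abstractmethod
    def transcribe_file(self, audio_path: str, language: str) -> str:
        """Submit an audio file in batch mode and return the transcript."""

class ExampleProviderAdapter(AsyncSTTProvider):
    def transcribe_file(self, audio_path: str, language: str) -> str:
        # A real adapter would call the vendor's batch/pre-recorded
        # endpoint per its official documentation and return the text.
        raise NotImplementedError

def run_benchmark(providers: dict[str, AsyncSTTProvider],
                  audio_path: str, language: str) -> dict[str, str]:
    """Collect one transcript per provider for a single sample."""
    return {name: p.transcribe_file(audio_path, language)
            for name, p in providers.items()}
```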

4. Evaluation Metrics

  • WER (Word Error Rate): Measures transcription errors at the word level, computed as (substitutions + deletions + insertions) divided by the number of reference words. Lower values indicate superior performance.
  • CER (Character Error Rate): The same computation at the character level, which penalizes near-miss spellings less harshly than WER; reported alongside WER in the results tables below. Lower values indicate superior performance.
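The report does not name its scoring tooling. As one concrete possibility, the open-source jiwer library computes both metrics as well as the per-error-type counts (hits, substitutions, deletions, insertions) shown in the result tables:

```python
import jiwer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Corpus-level rates as fractions (multiply by 100 for the percentages
# used in the tables below).
print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")   # 2 errors / 9 words
print(f"CER: {jiwer.cer(reference, hypothesis):.4f}")

# The word-level alignment also exposes the hit / substitution /
# deletion / insertion counts reported per model.
out = jiwer.process_words(reference, hypothesis)
print(out.hits, out.substitutions, out.deletions, out.insertions)
```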

5. Processing Mode

All models were evaluated in asynchronous (file/batch) transcription mode to ensure consistency in testing.
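As an illustration of what batch-mode evaluation can look like in practice, the hypothetical runner below reuses the AsyncSTTProvider interface sketched earlier; none of this code is from the report:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def transcribe_batch(provider: "AsyncSTTProvider",
                     audio_dir: str, language: str) -> dict[str, str]:
    """Submit every audio file to one provider in file/batch mode."""
    files = sorted(Path(audio_dir).glob("*.wav"))  # assumed WAV inputs
    with ThreadPoolExecutor(max_workers=4) as pool:
        transcripts = pool.map(
            lambda f: provider.transcribe_file(str(f), language), files
        )
    return {f.name: text for f, text in zip(files, transcripts)}
```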

How the Benchmark Helps Evaluate STT Models

This benchmark checks how well a Speech-to-Text model performs in real situations, not just on clean and simple audio. The dataset allows evaluation across the following points:

1. Accent Handling

Different English accents help test if the model can understand speakers from various regions.

2. Background Noise

Some samples have music or sound effects. This checks if the model can still produce correct text when noise is present.

3. Multiple Speakers

Conversations and interviews test whether the model can handle more than one speaker without mixing their words.

4. Topic-Specific Words

Videos include terms related to business, vehicles, grammar, and global topics. This checks if the model can understand and transcribe uncommon or subject-specific words.

5. Speaking Speed

Speakers talk at different speeds. This tests if the model can handle fast speech or long sentences without errors.

6. Audio Quality Differences

Not all recordings are equally clear. This checks if the model can work well with both clean and slightly noisy audio.

Benchmark Outcomes

English Benchmark

  • Total English Samples: 118
  • Total Audio Duration: 4 hours 56 minutes

Performance Findings

| Metric | SL STT | Deepgram nova-3 | Google Chirp 3 | ElevenLabs Scribe v2 | Microsoft Azure | Sarvam Saaras v3 |
| --- | --- | --- | --- | --- | --- | --- |
| Average WER (%) | 13.27 | 17.53 | 24.47 | 23.19 | 21.93 | 13.55 |
| Average CER (%) | 11.36 | 10.16 | 12.24 | 10.16 | 8.45 | 13.41 |
| Total Words | 46482 | 46482 | 46482 | 46482 | 46482 | 46482 |
| Total Deletions | 3200 | 4400 | 4600 | 4300 | 3900 | 1816 |
| Total Insertions | 900 | 1300 | 3000 | 3100 | 2200 | 1861 |
| Total Substitutions | 2066 | 2445 | 3771 | 3376 | 4096 | 2621 |
| Total Hits | 41216 | 39637 | 38111 | 38806 | 38486 | 42045 |
| Total Errors | 6166 | 8145 | 11371 | 10776 | 10196 | 6298 |

Result: SandLogic STT achieved the highest speech recognition accuracy in English.
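
As a consistency check, the corpus-level WER implied by the totals tracks the reported averages closely: for SandLogic STT, WER = (substitutions + deletions + insertions) / total words = (2066 + 3200 + 900) / 46482 = 6166 / 46482 ≈ 13.27%.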

Hindi Benchmark

  • Total Hindi Samples: 227
  • Total Audio Duration: 5 hours

Performance Findings

| Metric | SL STT | Deepgram nova-2 | Google Chirp 3 | ElevenLabs Scribe v2 | Microsoft Azure | Sarvam Saaras v3 |
| --- | --- | --- | --- | --- | --- | --- |
| Average WER (%) | 16.50 | 13.80 | 29.55 | 19.99 | 29.35 | 17.52 |
| Average CER (%) | 10.95 | 7.75 | 10.83 | 10.22 | 12.29 | 13.51 |
| Total Words | 54833 | 54833 | 54833 | 54833 | 54833 | 54833 |
| Total Deletions | 3600 | 3100 | 6100 | 4100 | 6300 | 2526 |
| Total Insertions | 1100 | 1200 | 4300 | 2600 | 4200 | 1050 |
| Total Substitutions | 4347 | 3266 | 5806 | 4263 | 5585 | 6035 |
| Total Hits | 46886 | 48467 | 42927 | 46470 | 42948 | 45222 |
| Total Errors | 9047 | 7566 | 16206 | 10963 | 16085 | 9611 |

Result: Deepgram nova-2 achieved the highest speech recognition accuracy in Hindi.
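
As with English, the totals are consistent with the headline number: for Deepgram nova-2, WER = (3266 + 3100 + 1200) / 54833 = 7566 / 54833 ≈ 13.80%.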

Speech-to-Text Benchmark Findings: English and Hindi

English ASR – Error Rate Comparison

English ASR – Error Types Breakdown

Hindi ASR – Error Rate Comparison

Hindi ASR – Error Types Breakdown

Conclusion

This evaluation provides a transparent and rigorous comparison of speech recognition performance across industry-leading providers for English and Hindi. The real-world YouTube datasets ensure the benchmarks reflect authentic deployment scenarios.

Key Findings:

  • English: SandLogic STT achieved the highest speech recognition accuracy (13.27% WER)
  • Hindi: Deepgram nova-2 achieved the highest speech recognition accuracy (13.80% WER)