Abstract

We conducted a comprehensive evaluation of Automatic Speech Recognition (ASR) systems using carefully curated benchmark datasets in English and Hindi sourced from YouTube videos. The benchmarking results are summarized as follows:

  • Languages evaluated: 2 (English and Hindi)
  • Total samples: 345 (118 English + 227 Hindi)
  • Total duration: ~10 hours of audio
  • Evaluation datasets: Real-world YouTube videos covering diverse acoustic conditions, speaking styles, accents, and topics.
  • Ground truth transcriptions: Manually transcribed and double-reviewed by humans, then normalized for fair evaluation.
  • Processing mode: Asynchronous (file/batch) transcription
  • Results: SandLogic STT achieved the highest accuracy in English; Deepgram nova-2 achieved the highest accuracy in Hindi

Models Evaluated

| Provider | Model Evaluated |
| --- | --- |
| SandLogic STT (Speech-to-Text) | Shakti |
| Deepgram | nova-3 (for English), nova-2 (for Hindi) |
| Google | Chirp_3 |
| Microsoft Azure | Best/Default |
| ElevenLabs | Scribe_v2 |
| Sarvam | Saaras_V3 |

Evaluation Process

1. Dataset Selection Criteria

The benchmark was curated from publicly available long-form online videos based on the following considerations:

a) Authenticity of Human Speech

Preference was given to conversational audio, monologues, and interviews rather than synthesized or scripted speech.

b) Accent Diversity

Samples include American, British, Indian, and regionally influenced English accents to evaluate accent generalization capabilities.

c) Acoustic Variance

Audio was selected to span varying levels of background sound, ranging from clean studio-grade recordings to noisy clips containing music, sound effects, and fragments of external speech.

d) Topical Breadth

Domains include personal development, global socio-economic topics, entertainment, automotive industry insights, and language education. This diversity introduces terminology from business, psychology, technology, politics, and grammar instruction.

English Content Categories:

  • Personal productivity discussions
  • Technical and business analysis
  • Cross-cultural conversations
  • Entertainment and career dialogues
  • Grammar and language tutorials

Hindi Content Categories:

  • Logical thinking and decision-making talks
  • Grammar teaching sessions
  • Motivational speeches
  • Career guidance talks

2. Transcription & Ground Truth Creation

All dataset transcriptions were manually created, double-reviewed by humans, and normalized to provide a consistent reference for evaluation. Normalization consisted of removing punctuation and lowercasing the text; the ground truth transcriptions were otherwise left unchanged.
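The report does not include the normalization code itself; the following is a minimal sketch of the step as described (lowercasing plus punctuation removal), applied identically to references and hypotheses before scoring:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace.

    Assumed implementation: the report only states that punctuation is
    removed and capitalization ignored. Note that `\w` in Python regexes
    matches Devanagari letters as well, so Hindi text survives intact
    while danda marks (U+0964/U+0965) are stripped as punctuation.
    """
    text = re.sub(r"[^\w\s]|_", " ", text.lower())
    return " ".join(text.split())

assert normalize("Hello, World!  This is a test.") == "hello world this is a test"
```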

3. Model Integration

Each provider's API was integrated according to its official documentation to ensure a fair and accurate comparison.
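The integration code is not part of the report. The sketch below shows one common way such a comparison is structured, with each vendor's real client hidden behind a uniform interface; every name here (AsyncSTTProvider, transcribe_file, run_benchmark) is invented for illustration and comes from neither the report nor any vendor SDK:

```python
from abc import ABC, abstractmethod

class AsyncSTTProvider(ABC):
    """Hypothetical uniform wrapper so every provider is invoked identically."""

    @abstractmethod
    def transcribe_file(self, audio_path: str, language: str) -> str:
        """Submit an audio file in batch mode and return the transcript."""

class ExampleProviderAdapter(AsyncSTTProvider):
    def transcribe_file(self, audio_path: str, language: str) -> str:
        # A real adapter would call the vendor's batch/pre-recorded
        # endpoint per its official documentation and return the text.
        raise NotImplementedError

def run_benchmark(providers: dict[str, AsyncSTTProvider],
                  audio_path: str, language: str) -> dict[str, str]:
    """Collect one transcript per provider for a single sample."""
    return {name: p.transcribe_file(audio_path, language)
            for name, p in providers.items()}
```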

4. Evaluation Metrics

  • WER (Word Error Rate): Measures transcription errors at the word level, computed as (substitutions + deletions + insertions) divided by the number of reference words. Lower values indicate superior performance.
  • CER (Character Error Rate): The same computation at the character level, which penalizes near-miss spellings less harshly than WER; reported alongside WER in the results tables below. Lower values indicate superior performance.
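The report does not name its scoring tooling. As one concrete possibility, the open-source jiwer library computes both metrics as well as the per-error-type counts (hits, substitutions, deletions, insertions) shown in the result tables:

```python
import jiwer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Corpus-level rates as fractions (multiply by 100 for the percentages
# used in the tables below).
print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")   # 2 errors / 9 words
print(f"CER: {jiwer.cer(reference, hypothesis):.4f}")

# The word-level alignment also exposes the hit / substitution /
# deletion / insertion counts reported per model.
out = jiwer.process_words(reference, hypothesis)
print(out.hits, out.substitutions, out.deletions, out.insertions)
```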

5. Processing Mode

All models were evaluated in asynchronous (file/batch) transcription mode to ensure consistency in testing.
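As an illustration of what batch-mode evaluation can look like in practice, the hypothetical runner below reuses the AsyncSTTProvider interface sketched earlier; none of this code is from the report:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def transcribe_batch(provider: "AsyncSTTProvider",
                     audio_dir: str, language: str) -> dict[str, str]:
    """Submit every audio file to one provider in file/batch mode."""
    files = sorted(Path(audio_dir).glob("*.wav"))  # assumed WAV inputs
    with ThreadPoolExecutor(max_workers=4) as pool:
        transcripts = pool.map(
            lambda f: provider.transcribe_file(str(f), language), files
        )
    return {f.name: text for f, text in zip(files, transcripts)}
```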

How the Benchmark Helps Evaluate STT Models

This benchmark checks how well a Speech-to-Text model performs in real situations, not just on clean and simple audio. The dataset allows evaluation across the following points:

1. Accent Handling

Different English accents help test if the model can understand speakers from various regions.

2. Background Noise

Some samples have music or sound effects. This checks if the model can still produce correct text when noise is present.

3. Multiple Speakers

Conversations and interviews test whether the model can handle more than one speaker without mixing their words.

4. Topic-Specific Words

Videos include terms related to business, vehicles, grammar, and global topics. This checks if the model can understand and transcribe uncommon or subject-specific words.

5. Speaking Speed

Speakers talk at different speeds. This tests if the model can handle fast speech or long sentences without errors.

6. Audio Quality Differences

Not all recordings are equally clear. This checks if the model can work well with both clean and slightly noisy audio.

Benchmark Outcomes

English Benchmark

  • Total English Samples: 118
  • Total Audio Duration: 4 hours 56 minutes

Performance Findings

| Metric | SL STT | Deepgram nova-3 | Google Chirp 3 | ElevenLabs Scribe v2 | Microsoft Azure | Sarvam Saaras v3 |
| --- | --- | --- | --- | --- | --- | --- |
| Average WER (%) | 13.27 | 17.53 | 24.47 | 23.19 | 21.93 | 13.55 |
| Average CER (%) | 11.36 | 10.16 | 12.24 | 10.16 | 8.45 | 13.41 |
| Total Words | 46482 | 46482 | 46482 | 46482 | 46482 | 46482 |
| Total Deletions | 3200 | 4400 | 4600 | 4300 | 3900 | 1816 |
| Total Insertions | 900 | 1300 | 3000 | 3100 | 2200 | 1861 |
| Total Substitutions | 2066 | 2445 | 3771 | 3376 | 4096 | 2621 |
| Total Hits | 41216 | 39637 | 38111 | 38806 | 38486 | 42045 |
| Total Errors | 6166 | 8145 | 11371 | 10776 | 10196 | 6298 |

Result: SandLogic STT achieved the highest speech recognition accuracy in English.
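
As a consistency check, the corpus-level WER implied by the totals tracks the reported averages closely: for SandLogic STT, WER = (substitutions + deletions + insertions) / total words = (2066 + 3200 + 900) / 46482 = 6166 / 46482 ≈ 13.27%.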

Hindi Benchmark

  • Total Hindi Samples: 227
  • Total Audio Duration: 5 hours

Performance Findings

| Metric | SL STT | Deepgram nova-2 | Google Chirp 3 | ElevenLabs Scribe v2 | Microsoft Azure | Sarvam Saaras v3 |
| --- | --- | --- | --- | --- | --- | --- |
| Average WER (%) | 16.50 | 13.80 | 29.55 | 19.99 | 29.35 | 17.52 |
| Average CER (%) | 10.95 | 7.75 | 10.83 | 10.22 | 12.29 | 13.51 |
| Total Words | 54833 | 54833 | 54833 | 54833 | 54833 | 54833 |
| Total Deletions | 3600 | 3100 | 6100 | 4100 | 6300 | 2526 |
| Total Insertions | 1100 | 1200 | 4300 | 2600 | 4200 | 1050 |
| Total Substitutions | 4347 | 3266 | 5806 | 4263 | 5585 | 6035 |
| Total Hits | 46886 | 48467 | 42927 | 46470 | 42948 | 45222 |
| Total Errors | 9047 | 7566 | 16206 | 10963 | 16085 | 9611 |

Result: Deepgram nova-2 achieved the highest speech recognition accuracy in Hindi.
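
As with English, the totals are consistent with the headline number: for Deepgram nova-2, WER = (3266 + 3100 + 1200) / 54833 = 7566 / 54833 ≈ 13.80%.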

Speech-to-Text Benchmark Findings: English and Hindi

English ASR – Error Rate Comparison

English ASR – Error Types Breakdown

Hindi ASR – Error Rate Comparison

Hindi ASR – Error Types Breakdown

Conclusion

This evaluation provides a transparent and rigorous comparison of speech recognition performance across industry-leading providers for English and Hindi. The real-world YouTube datasets ensure the benchmarks reflect authentic deployment scenarios.

Key Findings:

  • English: SandLogic STT achieved the highest speech recognition accuracy (13.27% WER)
  • Hindi: Deepgram nova-2 achieved the highest speech recognition accuracy (13.80% WER)