ASR

Also known as: Automatic Speech Recognition, speech recognition, STT, speech-to-text

Automatic Speech Recognition — the technology that converts spoken audio into text, used by all automated captioning systems.

ASR systems take an audio waveform as input and produce a sequence of words as output. Modern ASR uses deep neural networks trained on large speech corpora; on clean, single-speaker audio, word-level accuracy now exceeds 95% in major languages.

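ASR performance degrades when audio quality drops or when speech departs from the training data: room reverb, overlapping speakers, technical jargon outside the training distribution, and accented speech all increase the word error rate (WER), which is the number of word substitutions, deletions, and insertions divided by the number of words in the reference transcript.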
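WER is conventionally computed as (S + D + I) / N, where S, D, and I are the substituted, deleted, and inserted words in the ASR output and N is the number of words in the reference transcript. The following is a minimal sketch of that calculation using a word-level edit distance. The function name `word_error_rate` is illustrative, and the whitespace tokenization is a simplifying assumption; real evaluations usually also normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()   # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the output contains many more words than the reference.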

Custom vocabulary (a list of domain terms and names supplied to the recognizer) and speaker adaptation are the standard mitigations for ASR weaknesses on domain-specific content. Most modern captioning platforms expose these as per-event settings.
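To illustrate the idea behind custom vocabulary, the sketch below snaps near-miss words in an ASR transcript to the closest entry in a supplied term list. This is only a post-processing approximation and an assumption about one possible approach: captioning platforms typically bias the recognizer itself rather than correcting its output afterwards. The function name `apply_custom_vocabulary` and the `cutoff` value are hypothetical.

```python
import difflib


def apply_custom_vocabulary(words, vocabulary, cutoff=0.85):
    """Replace words that closely resemble a vocabulary term with that term."""
    # Compare case-insensitively, but emit the term as written in the vocabulary.
    lookup = {term.lower(): term for term in vocabulary}
    corrected = []
    for word in words:
        match = difflib.get_close_matches(
            word.lower(), list(lookup), n=1, cutoff=cutoff
        )
        corrected.append(lookup[match[0]] if match else word)
    return corrected


transcript = ["the", "diarisation", "step"]
print(apply_custom_vocabulary(transcript, ["diarization", "WER"]))
```

A high cutoff keeps ordinary words from being rewritten; lowering it catches more misrecognitions at the cost of false corrections.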

Related terms

STT
WER
Custom vocabulary
Live captioning
Diarization
