ASR
Automatic Speech Recognition — the technology that converts spoken audio into text, used by all automated captioning systems.
ASR systems take an audio waveform as input and produce a sequence of words as output. Modern ASR uses deep neural networks trained on large speech corpora; word-level accuracy on clean, single-speaker audio now exceeds 95% in major languages.
ASR performance degrades as audio quality drops and as speech diverges from the training data: room reverb, multiple overlapping speakers, technical jargon outside the training distribution, and accented speech all increase the word error rate (WER).
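WER is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the ASR output into a reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation, assuming whitespace tokenization and case-insensitive matching; the function name word_error_rate is illustrative, and production scoring tools usually apply further text normalization (punctuation, numerals) that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase and split on whitespace; no punctuation handling.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row.
    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the quick brown fox jumps"
    hyp = "the quick brown fax jumps"
    print(word_error_rate(ref, hyp))  # one substitution in five words
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference. If accuracy is read as 1 - WER, a figure above 95% corresponds to fewer than one error per 20 reference words.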
Custom vocabulary and speaker adaptation are the standard mitigations for ASR weaknesses on domain-specific content. Most modern captioning platforms expose these as per-event settings.