WER
Word Error Rate: the standard accuracy metric for speech recognition, measuring the proportion of word-level errors in a transcript relative to a reference transcript.
WER is calculated as the sum of substitutions, insertions, and deletions in the produced transcript, divided by the total number of words in the reference. Lower WER is better: a WER of 0% means perfect transcription, and because insertions count as errors, WER can exceed 100%.
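The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `word_error_rate` is hypothetical, and the normalization (lowercasing and splitting on whitespace) is an assumption, since real evaluation pipelines usually also strip punctuation and normalize numbers. The minimum number of substitutions, insertions, and deletions is found with a standard word-level edit distance.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction (0.0 = perfect; values above 1.0 are possible)."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with nothing = i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    # Total errors divided by the number of words in the reference.
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"  # one substitution, one deletion
    print(f"{word_error_rate(ref, hyp):.1%}")  # prints 33.3%
```

In the usage example, two errors against a six-word reference give a WER of 2/6. Dividing by the reference length rather than the hypothesis length is what allows the result to exceed 100% when the transcript contains many inserted words.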
Modern ASR systems achieve WER below 5% on clean audio with a single speaker in major languages, which puts their accuracy well above the 'understandable' threshold relevant for accessibility compliance.
WER alone doesn't capture all dimensions of caption quality. A captioning system might have a low overall WER but consistently mis-transcribe brand names, technical jargon, or speaker names, and those errors matter more to readers than the average suggests. Supplying the recognizer with a custom vocabulary of such terms mitigates this.
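One way to make this gap visible is to score important terms separately from overall WER. The sketch below is an illustrative assumption rather than a standard metric: the function `key_term_recall`, its helper, and the sample terms are all hypothetical. It counts how many reference occurrences of each key term also appear in the transcript, using the same simple lowercase, whitespace-split normalization.

```python
def _count_phrase(tokens: list[str], phrase: list[str]) -> int:
    """Count occurrences of a word sequence inside a token list."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def key_term_recall(reference: str, hypothesis: str, key_terms: list[str]) -> float:
    """Fraction of key-term occurrences in the reference that survive in the hypothesis."""
    # Assumed normalization: lowercase and whitespace split (no punctuation handling).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    expected = 0  # key-term occurrences in the reference
    found = 0     # of those, how many the hypothesis also contains
    for term in key_terms:
        phrase = term.lower().split()
        if not phrase:
            continue  # ignore empty terms
        in_ref = _count_phrase(ref, phrase)
        in_hyp = _count_phrase(hyp, phrase)
        expected += in_ref
        found += min(in_ref, in_hyp)  # do not reward extra occurrences

    if expected == 0:
        raise ValueError("none of the key terms occur in the reference")
    return found / expected


if __name__ == "__main__":
    # Hypothetical brand and speaker names, for illustration only.
    ref = "welcome to acme cloud with priya nair"
    hyp = "welcome to acne cloud with priya nair"
    print(key_term_recall(ref, hyp, ["Acme", "Priya Nair"]))  # prints 0.5
```

Here a single substituted word out of seven leaves overall WER fairly low, yet half of the terms a reader is likely to care about are wrong. This check counts occurrences without aligning positions, so it is a rough proxy rather than a precise measure.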