How to evaluate quality and accuracy of closed captions

Updated July 24, 2023


Unfortunately, there are no uniform, clear rules for determining and measuring the quality and accuracy of (speech-only) closed captions.

There are mandatory standards for Subtitles for the Deaf and Hard of Hearing (SDH). Examples include the Web Content Accessibility Guidelines (WCAG), referenced in various international legislation, and the rulings of the Federal Communications Commission (FCC) in the U.S. for previously televised content. However, these SDH standards are mandatory for only a very limited subset of online videos.

It is important to note that SDH standards were developed specifically for the hearing impaired, not for standard captions, which contain only the spoken words. Of course, some SDH guidelines can be applied to standard captions, bearing in mind that they are used for different purposes by different audiences. Also keep in mind that the WCAG standard and the FCC rulings contain only general principles, not clear guidelines on what good closed captions should look like.

In this guide, we’ll try to establish some basic principles to evaluate the quality of standard captions. This will be based partly on existing SDH standards and guidelines and partly on factual standards and best practices in the industry. Please note that the scope of this guide is limited to standard closed captions, and thus doesn’t include SDH-specific conditions.

What determines the quality of closed captions?

All SDH standards refer to the same four criteria that should be met for closed caption quality. These criteria can also be applied to determine the quality of standard captions: accuracy, synchronicity, readability and formatting.

Accuracy

It goes without saying that captions must be accurate to be understood. The current industry standard requires 99% accuracy or more, both for Video on Demand (VoD) and live streams.

Unfortunately, there is no single way to interpret and measure caption accuracy across vendors. For example, correct punctuation is required by most vendors for VoD content. For real-time captioning, this is much more ambiguous: since captions are often generated while sentences are still incomplete, it is sometimes impossible to know which punctuation mark is needed.

Formatting

The caption text should be placed in a clear and readable manner on the screen, ensuring that it does not obscure relevant visual content or on-screen text.

It is considered good practice to use a semi-transparent rectangular background for captions, to help separate the text from the video content and ensure that there is sufficient contrast.

The text must be sufficiently large to ensure readability on different screen sizes. There is, however, no general standard regarding choice of font and text size.

Readability

Properly distributing the text across several captions and caption lines makes it easier to read and comprehend. There is general consensus on these rules:

  • A caption should not contain more than 2 lines.
  • Lines should be centre-aligned and more or less equal in length.
  • A line may contain up to 50 characters, punctuation and spaces included.
  • Captions and lines should be broken at logical points. The ideal line-break will be at a piece of punctuation like a full stop, comma or dash.
  • A new sentence should preferably start on a new line, unless it is very short.

These principles should also be pursued for real-time captioning, but the last two will often be difficult to achieve (as captions are generated while sentences are still incomplete).
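As an illustration, the rules above can be sketched in code. The helper below is a hypothetical example (the function name is made up; the 50-character limit comes from the list above) that splits a caption into at most two roughly balanced lines, preferring a break after punctuation:

```python
def split_caption(text, max_chars=50):
    """Split caption text into at most two lines of <= max_chars each,
    preferring a break after punctuation close to the midpoint."""
    if len(text) <= max_chars:
        return [text]  # short enough for a single line
    mid = len(text) // 2
    spaces = [i for i, ch in enumerate(text) if ch == " "]
    # Prefer breaking right after a punctuation mark, as recommended above.
    punct = [i for i in spaces if text[i - 1] in ",.;:-"]
    pool = punct or spaces
    # Keep only break points that leave both lines within the limit.
    legal = [i for i in pool if i <= max_chars and len(text) - i - 1 <= max_chars]
    if not legal:
        raise ValueError("text does not fit in a two-line caption")
    best = min(legal, key=lambda i: abs(i - mid))  # most balanced break
    return [text[:best], text[best + 1:]]
```

A production implementation would also consider grammatical units and spread longer text across multiple captions.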

Synchronicity

Closed captions should be delivered synchronously with the video content. They should not lag behind or precede the associated audio.

  • Captions should be shown long enough for viewers to understand them.
  • Captions should generally not be visible for longer than 6 seconds.
  • It is considered good practice not to leave gaps shorter than about a second and a half between captions (to avoid a jerky effect).
  • Ideally, captions should match the pace of speaking and change when a scene changes.

These provisions apply to VoD and live streams. For live streams, captions should remain visible for a sufficiently long time, as the readability of some captions may be sub-optimal (e.g. imperfect line breaks).

Completeness (SDH only)

There is a clear distinction between speech-only captions and SDH captions. Whereas speech-only captions only contain a transcription of what was said, SDH captions should also contain sounds and music.

Therefore, completeness is the fifth quality requirement for SDH captions. They should include all significant dialogue, lyrics, and sounds that are essential to understanding the audiovisual content. If necessary, they should also provide contextual information. When multiple speakers are involved, SDH captions should also identify the speakers to aid comprehension. This can be achieved through speaker labels, such as speaker names or other visual cues.

Since this guide is about standard captions, we won’t discuss this criterion any further.

Methods to evaluate closed caption quality

Several methods exist for evaluating closed caption quality, including some (partly) automated ones. However, human assessment remains paramount, as automated methods have limitations and cannot assess all aspects of quality. Therefore, we recommend the practical, human-led approach described below to compare (live) captioning services.

Formatting

Assessment of formatting is partly subjective. After all, elements like font, text size and colour of captions also depend on personal preference. In some cases, these elements can be adapted to the user’s personal liking. Since formatting can largely be judged separately from the captioning process, it often suffices to watch videos and/or live streams and assess how the captions are formatted.

Readability

Although there are formulas to assess readability, such as the Flesch-Kincaid Grade Level or the Gunning Fog Index, it is difficult to apply them automatically to closed captions. After all, the readability of captions cannot be separated from their accuracy and synchronicity, and should therefore be assessed in that context. Human evaluation and user feedback thus remain crucial for evaluating the readability of closed captions.

Synchronicity

There are automatic alignment analysis methods to measure the synchronization of closed captions. These methods use computational techniques to analyze the timing and alignment of captions with the corresponding audio. However, it’s important to note that these methods may not always be 100% accurate. Therefore, alignment analysis should preferably be combined with human inspection of the caption files.
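As a rough illustration, the sketch below (a simplified, hypothetical example rather than a production tool) extracts cue start times from two WebVTT files that are assumed to contain the same cues in the same order, and computes the average start-time offset between them:

```python
import re

# Matches the start timestamp of a WebVTT timing line, e.g.
# "00:00:01.000 --> 00:00:03.000"
TS = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> ")

def cue_starts(vtt_text):
    """Return the start time (in seconds) of every cue in a WebVTT string."""
    return [int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
            for h, m, s, ms in TS.findall(vtt_text)]

def mean_offset(vtt_a, vtt_b):
    """Average start-time difference in seconds between matching cues."""
    pairs = list(zip(cue_starts(vtt_a), cue_starts(vtt_b)))
    return sum(a - b for a, b in pairs) / len(pairs)
```

A positive mean_offset(live, reference) would indicate that the live captions lag behind the reference timing.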

Accuracy

In practice, accuracy is usually evaluated together with synchronicity and readability by comparing the caption file to a reference file. Below, we have listed various quality metrics that have been developed to compare the error rate of different captions. However, none of these metrics captures the full context or semantic accuracy of the captions.

Metrics for closed caption accuracy

Accuracy is usually assessed by comparing a caption file (e.g. in WebVTT or SRT format) with a reference file that is considered to be captioned correctly. Several metrics exist to determine the deviation from the reference file. Currently, however, none of these metrics is widely used in the streaming industry. This makes it difficult to compare the accuracy claimed by different vendors.

The best-known metric is Word Error Rate (WER), which is primarily used to evaluate the accuracy of Automatic Speech Recognition (ASR) systems. In our opinion, however, the NER model is more suitable for assessing captions, because it also takes the meaning of the text into account and therefore better reflects how well the captions can be understood.


Word Error Rate (WER)

WER measures the percentage of words in the captions that differ from the reference transcription of the spoken content. It takes into account substitutions, insertions, and deletions. A lower WER indicates higher accuracy.
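As a sketch, WER can be computed with a word-level edit distance. The minimal implementation below assumes whitespace tokenization and leaves case and punctuation normalization to the caller:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions)
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```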


Levenshtein Distance

The Levenshtein distance calculates the minimum number of single-character edits (substitutions, insertions, and deletions) required to transform the captions into the reference transcription. A lower Levenshtein distance indicates higher accuracy.
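A minimal sketch of the Levenshtein distance, using the classic dynamic-programming recurrence with two rows:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # deleting i characters of a yields ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[len(b)]
```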


Number of Words, Edition Errors, and Recognition Errors (NER)

The NER model is designed to determine the accuracy of live subtitles in television broadcasts and events that are produced using speech recognition. It is used in several countries as an alternative to the WER model.
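In outline, the NER model scores accuracy as (N - E - R) / N x 100, where N is the number of words in the captions and E and R are the totals of edition and recognition errors, with each error typically weighted by severity (commonly 0.25 for minor, 0.5 for standard and 1 for serious errors). A minimal sketch of the scoring formula:

```python
def ner_accuracy(n_words, edition_errors, recognition_errors):
    """NER accuracy: (N - E - R) / N * 100, where the error arguments
    are severity-weighted totals rather than raw counts."""
    return (n_words - edition_errors - recognition_errors) / n_words * 100
```

For example, 200 words with two standard edition errors (2 x 0.5 = 1.0) and one serious recognition error (1.0) yields an accuracy of 99%.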

Bilingual Evaluation Understudy (BLEU)

Originally developed for machine translation evaluation, BLEU compares the n-gram overlap between the captions and the reference transcription. It provides a measure of how closely the captions match the reference transcription.
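A simplified sketch of the idea: modified n-gram precision with clipped counts and a brevity penalty. Real BLEU implementations normally use up to 4-grams and add smoothing; this toy version uses unigrams and bigrams only:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * geo_mean
```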

A practical approach to evaluating closed captioning

The industry standard for closed captioning accuracy is 99% or more, which is considered necessary for most types of video content.

However, not all vendors measure accuracy in the same way, and in reality the accuracy will often be lower, which can make closed captions difficult or even impossible to understand. Even an accuracy of 95% will cause difficulties: at an average sentence length of 8 words, 95% accuracy means that an error occurs every 2.5 sentences on average.

Moreover, for a viewer to understand the captions, it is also necessary for the captions to be readable and in sync with the video and audio. So testing for accuracy also involves testing the readability and synchronicity of the captions.

Closed captions should only be considered accurate if they are correctly transcribed and displayed in readable lines that are in sync with the audiovisual content.

Requirements for accurate closed captions

Accurate captioning should at least meet these requirements:


The captions should match the spoken words

This also applies to names and abbreviations, at least to the extent that they are commonly known. Unfamiliar names and abbreviations are preferably communicated in advance (or corrected in real-time), to avoid misspellings.

Punctuation should be present

The exact punctuation requirements may vary, depending on your criteria and on whether captioning is done offline or real-time (for live streams). During real-time captioning, it is sometimes difficult to predict where a sentence is going, which can result in anomalous or even missing punctuation marks.


The captions should be comprehensible

The caption lines and the time they are displayed should ensure that the text is easy to understand. This assumes that the text is not split into too many different parts, and that captions are displayed long enough. Other requirements may vary. In particular, ensuring suitable line breaks is more difficult during real-time captioning.


The captions should be in sync

The captions should not lag behind or precede the associated audio.

Determining closed caption accuracy

To determine the accuracy of a captioning solution or service, we strongly recommend that you test it in advance using your own video content. This way, you can count the number of captioning errors and calculate the accuracy.

1. Use test videos or live streams that are similar to your real content

Keep in mind that accuracy may not be the same for every language or type of content. If multiple languages are spoken in your video, make sure to test with videos that contain multiple languages.

2. Create accurate closed captions yourself

These caption files, preferably in WebVTT format, will serve as your reference files.

3. Let the vendor and/or solution create closed captions

Obtain the caption files, in the same format as your reference files. In case of a live stream, you should be able to retrieve the recording afterwards.

4. Compare the obtained caption file to your reference file

Register an error every time the requirements for accurate captioning are not met. These requirements may be stricter, depending on your own criteria.

5. Calculate the accuracy rate

Compare the number of errors (= Errors) to the total number of words (= Totals) in the caption file. You can use this formula to get the accuracy rate:
(Totals – Errors) / Totals * 100 = % Accuracy
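The same formula as a small helper function (names are illustrative):

```python
def accuracy_rate(total_words, errors):
    """Accuracy rate in percent: (Totals - Errors) / Totals * 100."""
    return (total_words - errors) / total_words * 100
```

For example, 10 errors in a 1,000-word caption file yields an accuracy of 99%.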

Alternatively, you could use the NER model to determine the accuracy rate (step 4 + 5). In that case, you mainly measure whether the captions correctly convey the meaning of the words.

Accuracy of real-time captioning

Adding captions to a live stream in real time has evolved significantly in recent years. Just a few years ago, accurate real-time captioning required the presence of professional captioners with a stenotype keyboard. Thanks to the great leaps that AI technology has made, Automatic Speech Recognition (ASR) now provides equal, and in some cases even better, accuracy.

Keep in mind that differences in quality between vendors can be significant: not only the accuracy of captions can differ, but also their readability and synchronicity. These aspects greatly affect the user experience and how easily the content is understood.

The combination of ASR and manual captioning is also possible. In that case, automatically generated captions can be corrected in real-time through a web interface. This lets a human editor correct misspelled names and abbreviations, improve punctuation and adjust line breaks. In the near future, this hybrid solution will probably be the best choice for most high-visibility live streams.