Standards and Guidelines for Closed Captions

Updated July 18, 2023


Unfortunately, there is no general set of requirements for closed captions in online video.

For Subtitles for the Deaf and Hard of Hearing (SDH), though, there have been several standardisation initiatives. But the existing SDH standards are not mandatory for most online video content. Moreover, there is a lack of uniform technical requirements and quality metrics.

However, the vast majority of online captions are speech-only captions, which only contain a text representation of the spoken words. For speech-only captions, no specific standards or guidelines exist. Of course, some SDH guidelines can also be applied to speech-only captions, bearing in mind that they are used by different audiences and for different purposes.

The info on this page therefore mainly applies to SDH captions. For more info on speech-only captions, please refer to the page on Accuracy and Quality of Closed Captions.

Key standards and guidelines

The main standardisation effort in this area is the Web Content Accessibility Guidelines (WCAG) by the World Wide Web Consortium (W3C). WCAG is widely regarded as a best practice for web accessibility and has been adopted into a variety of international legislation. WCAG clearly states the key aspects that SDH captions should meet, but does not include technical requirements or quality metrics.

There are also a number of international initiatives in this area. Here are some of the most important ones:

  • The U.S. regulations for closed captions include the 21st Century Communications and Video Accessibility Act (CVAA), FCC rulings and the Americans with Disabilities Act (ADA). They require SDH captions for certain types of online video, albeit to a limited extent. They were created through legislation and case law, parallel to the WCAG. Although only mandatory in the U.S., they have also had an impact on international best practices and ongoing standardisation efforts.
  • Also in the U.S., the Described and Captioned Media Program (DCMP) has developed Captioning Key, a set of voluntary guidelines focused on the accessibility of educational and training materials.
  • One of the oldest and most detailed sets of SDH guidelines is the BBC subtitling guidelines, developed by the British public broadcaster. These (private) guidelines differ from Captioning Key in certain aspects, in part because they cover captioning not only for Web video but also for television.

What makes ‘good’ closed captions

Based on the existing SDH standards and guidelines, we can conclude that good closed captions should meet requirements in terms of accuracy, synchronicity, readability, placement and formatting. SDH captioning additionally requires completeness of additional information, such as music descriptions and background noises.


Accuracy

The accuracy of closed captions is paramount in ensuring that viewers, particularly those with hearing impairments, receive an authentic and reliable representation of the content.

Accurate captions should include proper spelling, grammar and punctuation. They should not contain mistakes, typos, or other types of inaccuracies.

The industry standard is an accuracy of 99% or more. For most vendors, this includes punctuation. But since punctuation is subjective, most vendors only count a missing punctuation mark as an error when it makes the text more difficult to understand.
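As an illustration, word-level accuracy can be computed by aligning the captions against a reference transcript and counting substitutions, insertions and deletions as errors. This is a minimal sketch (the function name and approach are our own, not prescribed by any standard):

```python
from difflib import SequenceMatcher

def caption_accuracy(reference: str, captions: str) -> float:
    """Word-level accuracy (%) of captions against a reference transcript.

    Substitutions, insertions and deletions all count as errors,
    i.e. this is (1 - word error rate) expressed as a percentage.
    """
    ref_words = reference.split()
    cap_words = captions.split()
    errors = 0
    for op, i1, i2, j1, j2 in SequenceMatcher(None, ref_words, cap_words).get_opcodes():
        if op != "equal":
            # count each affected word in the larger side of the span once
            errors += max(i2 - i1, j2 - j1)
    return 100.0 * (1 - errors / max(len(ref_words), 1))

# One dropped word out of ten yields 90% accuracy, well below the 99% bar.
print(caption_accuracy(
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog"))
```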

For offline captioning, 99% accuracy doesn’t pose a problem, as the captions can be reviewed and edited while being added to the Video on-Demand content.

For real-time captions, it is more difficult to ensure 99% accuracy. Until 2022, it was thought that this was only possible through manual captioning using a stenotype keyboard. However, with the evolution of AI, it is now possible to generate real-time captions through ASR that are 99+% accurate.

Note: Clevercast also has support for real-time editing of live captions. This way, you can use ASR to add closed captions to a live stream with a slight latency, barely distinguishable from the latency inherent in the HTTP Live Streaming protocol.


Synchronicity

Closed captions should be delivered synchronously with the video content they’re attached to, and displayed long enough to make them readable. They should not lag behind or precede the associated audio.

Some rules of thumb:

  • Ideally, captions should match the pace of speaking and change when a scene changes.
  • DCMP suggests captioning between 130 and 160 words per minute, while the BBC recommends 160-180 words per minute.
  • DCMP recommends that captions should last at least 40 frames, but should not be visible for longer than 6 seconds.
  • BBC recommends not to leave gaps between captions that are shorter than a second and a half, to avoid a jerky effect.
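These rules of thumb lend themselves to automated checks. Below is a hedged sketch; the Cue type, the 25 fps frame rate and the combined 130-180 words-per-minute range are our own assumptions for illustration:

```python
from dataclasses import dataclass

FPS = 25                 # assumed frame rate; adjust to the actual video
MIN_FRAMES = 40          # DCMP: visible for at least 40 frames
MAX_SECONDS = 6.0        # DCMP: visible for no longer than 6 seconds
WPM_RANGE = (130, 180)   # union of the DCMP (130-160) and BBC (160-180) ranges

@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str

def timing_issues(cue: Cue) -> list[str]:
    """Return rule-of-thumb violations for a single caption cue."""
    issues = []
    duration = cue.end - cue.start
    if duration < MIN_FRAMES / FPS:
        issues.append("visible for fewer than 40 frames")
    if duration > MAX_SECONDS:
        issues.append("visible for longer than 6 seconds")
    wpm = len(cue.text.split()) / duration * 60
    if not WPM_RANGE[0] <= wpm <= WPM_RANGE[1]:
        issues.append(f"{wpm:.0f} wpm is outside the {WPM_RANGE[0]}-{WPM_RANGE[1]} range")
    return issues
```

A five-word cue displayed for two seconds (150 wpm) passes all three checks; cramming eleven words into one second fails both the duration and the pace check.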

For offline captioning, synchronization can be fine-tuned while adding the captions to the Video on-Demand content.

For real-time captioning, a distinction should be made between meetings and live streams. For meetings, where there is very little latency, rolling text is probably the most suitable caption format, as recommended by the BBC. For live streams, with latency, it should be possible to fine-tune the timing to some extent.


Readability

Captions should be presented in a way that makes them easy to read and comprehend. They should contain punctuation and be free from spelling and grammatical errors.

  • There is a general consensus to show no more than 2 lines of text.
  • Lines should be centre-aligned and, as far as possible, more or less equal in length.
  • A new sentence should preferably start on a new line, unless it is very short.
  • Captions and lines should be broken at logical points. The ideal line-break will be at a piece of punctuation like a full stop, comma or dash.
  • BBC does not prescribe a maximum number of characters per line. DCMP sticks to a (very low) maximum of 32 characters.
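For offline captioning, these rules can be approximated with a simple greedy splitter. The sketch below is our own illustration using DCMP's 32-character limit; it prefers a break right after punctuation and falls back to the space nearest the midpoint:

```python
MAX_CHARS = 32  # DCMP's per-line maximum

def _split_after(text: str, i: int) -> tuple[str, str]:
    """Split the text after index i, trimming surrounding whitespace."""
    return text[:i + 1].rstrip(), text[i + 1:].lstrip()

def break_caption(text: str) -> list[str]:
    """Break a caption into at most two lines.

    Prefers a break right after punctuation, otherwise the space nearest
    the midpoint, so the lines come out roughly equal in length.
    Illustrative only: real captioners also weigh grammatical units.
    """
    if len(text) <= MAX_CHARS:
        return [text]
    punct = [i for i in range(len(text) - 1)
             if text[i] in ",.;:" and text[i + 1] == " "]
    spaces = [i for i, ch in enumerate(text) if ch == " "]
    if not spaces:
        return [text]  # a single unbreakable word; leave as-is

    def fits(i: int) -> bool:
        return all(len(part) <= MAX_CHARS for part in _split_after(text, i))

    candidates = ([i for i in punct if fits(i)]
                  or [i for i in spaces if fits(i)]
                  or spaces)
    mid = len(text) // 2  # aim for roughly equal line lengths
    best = min(candidates, key=lambda i: abs(i - mid))
    return list(_split_after(text, best))
```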

Breaking up sentences optimally can be difficult for real-time captioning, as the rest of the sentence may not be known when a caption is generated. If the live stream has latency, this can be solved by intelligent post-processing.

Placement and Formatting

Captions should be placed in a clear and readable manner on the screen, ensuring that they do not obscure relevant visual content or on-screen text. Usually, this means that captions should be at the bottom of the screen. They should have sufficient contrast with the background to facilitate easy reading.

It is recommended to use a semi-transparent rectangular background for captions to help separate the text from the video content. The captions should overlay the video image, and may be placed within any black bars at the top or bottom of the video.

The text size of the captions should be large enough to ensure readability on different screen sizes. By default, the text height should be at least 1/16th of the screen height, but this may vary on certain devices.

Font choice and size should also focus on readability. DCMP recommends using sans serif fonts with medium weight, while BBC recommends using system fonts for readability (e.g. Helvetica for iOS and Roboto for Android).
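Some of this guidance can be expressed in the caption file itself. As a sketch, the helper below (our own, not taken from any guideline) emits a WebVTT cue positioned near the bottom of the frame and centre-aligned; colours, background and fonts are left to the player's styling (e.g. CSS ::cue rules):

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def vtt_cue(start: float, end: float, lines: list[str]) -> str:
    """One WebVTT cue: 'line:90%' keeps it near the bottom of the frame,
    'align:center' centre-aligns the (at most two) caption lines."""
    header = f"{fmt_ts(start)} --> {fmt_ts(end)} line:90% align:center"
    return header + "\n" + "\n".join(lines)

print(vtt_cue(1.0, 3.5, ["Ideally, captions should match", "the pace of speaking."]))
```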

Subtitles for the Deaf and Hard of Hearing (SDH)

There is a clear distinction between speech-only captions and SDH captions. Whereas speech-only captions only contain a transcription of what was said, SDH captions should also contain sounds and music.

Therefore, SDH captions should be complete. They should include all significant dialogue, lyrics, and sounds that are essential to understanding the audiovisual content. If necessary, they should also provide context information.

When multiple speakers are involved, SDH captions should also identify the speakers to aid comprehension. This can be achieved through the use of speaker labels, such as speaker names or other visual cues.
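To illustrate, here is a short (hypothetical) WebVTT fragment that uses voice tags for speaker labels and bracketed descriptions for non-speech sounds:

```
WEBVTT

00:00:01.000 --> 00:00:03.000
<v Anna>Did you hear that?

00:00:03.200 --> 00:00:05.200
[dog barking in the distance]

00:00:05.400 --> 00:00:07.400
<v Tom>It's just the neighbours' dog.
```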

In some cases, it is not possible to follow the pace of speech, while also adding context to the captions. In that case, the BBC guidelines recommend editing or paraphrasing the spoken words to keep the captions in sync.

Both the DCMP Captioning Key and BBC Subtitling Guidelines were drafted with SDH captions in mind. They contain numerous recommendations and guidelines to increase accessibility for the hearing impaired.

Offline vs real-time captioning

In WCAG, which is the main standard for SDH subtitles, the level of accessibility determines whether live captions are required or not.

  • Video on-Demand (VoD) captions are required for WCAG Level A, which is the lowest level of accessibility.
  • Live captions are required for WCAG Level AA, which is a more advanced level of accessibility.

In practice, WCAG Level AA is increasingly demanded in both the public sector (mandatory) and the corporate world.

Organizations such as the BBC, DCMP, and the Federal Communications Commission (FCC) in the U.S. have established detailed guidelines and best practices for SDH captioning. These guidelines may differ in certain aspects such as text placement, timing, style, and presentation of non-verbal information. They all describe how SDH captions should be displayed, without specifying how to achieve this.

Therefore, some of these rules are only feasible for VoD content, as they require time in which captions can be created, adjusted and reviewed. During a live stream, where captions need to be added in real-time, it is impossible to follow some of the guidelines (e.g. regarding line breaking).

There are few guidelines that are specific to real-time captioning. The BBC has some, but assumes rolling text, which is more suitable for meetings than for one-to-many live streams. It is generally assumed that most requirements for pre-recorded content should also be pursued for live streams. However, there are no instructions on how this should be done and what conditions should at least be met.