The Essential Guide to Closed Captions
Updated July 24, 2023
In today’s digital age, closed captions play a vital role in making content inclusive for all viewers. Whether you’re a content creator, video producer, or simply interested in learning about closed captions, this guide is designed to provide you with essential knowledge and practical insights.
What are Closed Captions?
Closed captions are a synchronized textual representation of the audio content in a video, allowing viewers to read along as they watch the video. They are typically displayed at the bottom of the screen and can be turned on or off by the viewer. Each line of text represents a segment of speech or sound, providing a visual representation of what is being said or heard.
Closed captions were initially designed for people who have difficulty hearing what is being said. Nowadays, closed captions offer many other benefits: studies have shown that the majority of people without a hearing impairment also watch content with captions regularly.
Closed captions have their own specific terminology. Some terms are commonly used to distinguish between different types of closed captions.
Closed vs Open Captions
Closed captions can be turned on or off by viewers. Video publishers and/or viewers can usually customize them to a certain extent, for example by adjusting the font size, color and background.
Open captions are burned into the video and cannot be turned off or customized. Since open captions have to be encoded as part of the video, they are less suitable for real-time use.
Offline vs Real-time Captions
Offline captions (or “captions for Video on-Demand”) are generated and added to videos after the content has been recorded or produced. Note that most existing captioning guidelines are designed for offline captions.
Real-time captions (or “live captions”), on the other hand, are generated instantaneously as the audio is being spoken or broadcasted, which makes it more difficult to perfect the accuracy and timing of the captions.
Subtitles for the Deaf and Hard of Hearing (SDH Subtitles)
SDH Subtitles or SDH captions must include sound effects, speaker IDs and other non-speech elements. By comparison, standard captions (or speech-only captions) only contain the spoken words.
In most parts of the world, the terms closed captions and subtitles are used interchangeably. In the U.S., however, the term ‘closed captions’ often refers to SDH captions, while ‘subtitles’ refers to speech-only captions.
How are closed captions created?
Closed captions can be created by human transcription, artificial intelligence (AI) technology, or a combination of both. Which method is most appropriate depends on factors such as the captioning method (offline or real-time), expected turnaround time, speaker language(s), complexity of the audio, available budget and required accuracy.
Human transcription
A human transcriber (or ‘captioner’) listens to the audio and manually converts the spoken words into a written text format.
In the case of offline captions, captioners use an application or web interface to subdivide the transcription into captions that are in sync with the audio and video. The Web Accessibility Initiative (WAI) provides instructions for creating captions for the hearing impaired.
Real-time captioning is a much more difficult task, which can only be performed by a limited number of skilled professionals using a stenotype keyboard.
Human transcription is currently still required for real-time captioning for the hearing impaired, since context information has to be added to the captions. If real-time captioning needs to meet specific standards (e.g. speaker identification), human transcription may also be preferable.
Artificial Intelligence (AI)
Automatic Speech Recognition (ASR) systems use machine learning techniques to analyze and interpret audio signals, extract the spoken words and transcribe them into text.
Next, an intelligent rendering engine ensures that the transcription is divided into captions, which are in sync with the audio and video.
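As a rough illustration of that second step, the sketch below groups word-level output into caption cues. This is a minimal sketch, not a real rendering engine: it assumes the ASR system returns `(word, start, end)` timestamps, and the thresholds (`MAX_CHARS`, `MAX_GAP`) are illustrative values, not a standard.

```python
# Sketch: group word-level ASR output into caption cues.
# Assumes the ASR system returns (word, start_sec, end_sec) tuples;
# real rendering engines also handle punctuation, line breaks and
# reading speed.

MAX_CHARS = 32   # rough character limit per caption (illustrative)
MAX_GAP = 1.0    # start a new cue after a long silence (seconds)

def words_to_cues(words):
    cues = []
    text, start, end = "", None, None
    for word, w_start, w_end in words:
        # Start a new cue when the line gets too long or after a silence.
        new_cue = start is not None and (
            len(text) + 1 + len(word) > MAX_CHARS or w_start - end > MAX_GAP
        )
        if new_cue:
            cues.append((start, end, text))
            text, start = "", None
        if start is None:
            start = w_start
        text = f"{text} {word}".strip()
        end = w_end
    if text:
        cues.append((start, end, text))
    return cues
```

A rendering engine would then serialize these cues into a caption format such as WebVTT and keep them in sync with the video timeline.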
For offline captions, the use of ASR is increasingly becoming the norm. Major streaming content providers such as Netflix, Amazon Prime Video and Hulu all use AI to generate captions.
ASR is also increasingly being used for real-time captioning, as recent advances in AI have pushed ASR accuracy to around 99% under good audio conditions. Depending on the type of live stream, budget and specific captioning requirements, ASR – with or without real-time correction by a human editor – may be a better option than human captioning.
AI with human correction
In this case, a rendering engine based on ASR is used to create the initial captions. To avoid errors and improve the quality of captions, the initial captions are checked and corrected by trained professionals.
For offline captions, there are numerous services that do this for you, each with their own quality standards. There are also applications and SaaS software that allow you to do it yourself.
This may also be an option for real-time captions, depending on the delay of the live stream. Currently, Clevercast is one of the few solutions that offers this, both as an event service with correctors included and as a self-service solution. An interface is available for correcting the AI captions, which is easy for non-professional editors to use.
Closed caption formats
There are several protocols for storing and displaying captions in multimedia content. The most popular formats for Web video are WebVTT (Web Video Text Tracks) and SRT (SubRip Subtitle), which are used for both offline and real-time captioning.
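To illustrate how similar the two formats are: a minimal, hypothetical converter from SRT to WebVTT only needs to add the `WEBVTT` header and change the decimal separator in timestamps. This is a sketch; real-world files can contain extra features (styling, positioning) that it ignores.

```python
# Sketch: convert an SRT caption file to WebVTT. The two formats are
# nearly identical: WebVTT adds a "WEBVTT" header and uses '.' instead
# of ',' as the decimal separator in timestamps.
import re

def srt_to_vtt(srt_text):
    # Replace ',' with '.' in timestamp lines only,
    # e.g. "00:00:01,000 --> 00:00:04,000".
    vtt = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + vtt.strip() + "\n"

# A sample SRT cue with a sound effect and speaker ID, as used in SDH:
srt = """1
00:00:01,000 --> 00:00:04,000
[door slams]
JOHN: Did you hear that?
"""
print(srt_to_vtt(srt))
```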
In the case of Video on-Demand, WebVTT or SRT files can be uploaded to the VoD platform. Most platforms allow you to review the closed captions before publishing the video, so you can check their timing and accuracy.
When someone watches the video, the video player fetches a cached copy of the WebVTT or SRT file from the global Content Delivery Network (CDN) used by the platform. In a live stream, this process happens automatically.
There are also formats and protocols for more specific uses such as TTML, IMSC and CEA-608/CEA-708. For more info, see our closed caption format guide.
Quality and accuracy of closed captions
Closed caption quality is not easy to evaluate, as both the captioning method and the requirements may differ. Unfortunately, there are no uniform and clear rules for determining and measuring the quality and accuracy of standard (speech-only) closed captions.
Most standards and guidelines for closed captions refer to five criteria: accuracy, synchronicity, readability, formatting and completeness. The majority of standards concern Subtitles for the Deaf and Hard of Hearing (SDH), of which the Web Content Accessibility Guidelines (WCAG) are the most important.
Almost no guidelines exist for the most common type of closed captions, namely speech-only captions. To determine and measure their quality, one can rely partly on the SDH standards and partly on industry best practices. However, measuring and comparing caption quality remains difficult, as most vendors use different criteria. We therefore propose a practical approach to evaluating the quality and accuracy of closed captions, which you can use to compare vendors and services.
In practice, the possible quality of closed captions depends heavily on how they are created. More specifically, the time available for creating the captions is crucial. Besides the quality of the captions, this may also determine how they are displayed.
Offline Captioning for Video on-Demand (VoD)
The accuracy of captions in VoD content depends on the quality of the captioning service employed and the expected turnaround time. In theory, perfect quality closed captions are possible here.
Here, too, the use of ASR is increasingly becoming the norm. To avoid errors or inaccuracies, the resulting captions are still often checked and corrected by trained professionals, and most captioning services have quality control measures in place.
Real-time Captioning for Virtual Meetings
The real-time nature of a virtual meeting makes it difficult to use ASR technology. Because every word has to be transcribed immediately, the accuracy of an ASR engine decreases due to a lack of context. As a sentence progresses, the engine gains more context, allowing for a better transcription. Rolling captions could therefore be an option, since they allow already visible text to be corrected, but this usually does not lead to a good user experience.
At present, captions generated by professionals using a shorthand (stenotype) keyboard seem to be the best solution for virtual meetings, as they are often better at producing literal captions with sufficient accuracy.
Real-time Captioning for Live Streaming
Because of the one-to-many nature of live streaming, there is a short delay between when something happens and when viewers see it. This is called the ‘latency’ of a live stream: the time it takes for the video and audio data to be captured, encoded, transmitted, and finally decoded and displayed by the video player.
The latency of a live stream varies depending on the streaming protocol and the requirements of the event. Achieving lower latency often involves trade-offs with factors like video quality, reliability and scalability.
By using the live stream’s latency, the accuracy of ASR-generated captions can be increased significantly. For starters, it allows longer audio fragments to be forwarded to the ASR system, so it has more context and will transcribe more accurately. Additionally, it allows for a human to correct the ASR captions before they are displayed. In practice, this makes it possible to achieve a higher quality than human-generated captions at a lower cost.
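A minimal sketch of the correction idea, with illustrative names and timings (not a real captioning API): ASR cues are held in a buffer for the duration of the latency window, during which a human editor can still replace their text.

```python
# Sketch: hold ASR captions in a buffer for the duration of the stream
# latency, so a human corrector can fix them before they are displayed.
# Class and method names are illustrative, not a real API.
from collections import deque

class CaptionBuffer:
    def __init__(self, latency_sec):
        self.latency = latency_sec
        self.pending = deque()   # each entry: [display_time, cue_id, text]

    def add(self, now, cue_id, text):
        # An ASR cue becomes visible only after the latency window.
        self.pending.append([now + self.latency, cue_id, text])

    def correct(self, cue_id, new_text):
        # A human editor replaces the text while the cue is still buffered.
        for cue in self.pending:
            if cue[1] == cue_id:
                cue[2] = new_text

    def flush(self, now):
        # Release the cues whose display time has arrived.
        ready = []
        while self.pending and self.pending[0][0] <= now:
            ready.append(self.pending.popleft()[2])
        return ready
```

In this model, a 30-second stream latency gives the editor up to 30 seconds to fix each cue before viewers see it.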
How are closed captions added to online video?
As described above, the most popular caption formats for Web video are WebVTT (Web Video Text Tracks) and SRT (SubRip Subtitle), with TTML, IMSC and CEA-608/CEA-708 available for more specific uses. How these formats are used depends on the streaming technology.
In the case of Video on-Demand (VoD), subtitle files can be uploaded separately to the video platform. In the case of a cloud recording of a previous live stream, the platform should allow publishing to VoD with closed captions. It may also provide an interface to add or correct the captions manually.
When a viewer watches the video, the video player fetches a cached copy of the caption file (e.g. WebVTT or SRT) from the global Content Delivery Network (CDN) used by the platform.
Closed captions in virtual meetings are usually part of a proprietary platform. Since there should be little or no delay, this is usually done using WebRTC. This makes the technology slightly less robust and more difficult to scale.
When using the HLS or MPEG-DASH protocol, new WebVTT or SRT files are continuously created and made available via a CDN. A video player ensures that they are displayed synchronously with the video and audio streams. Since the player needs to buffer part of the stream to ensure smooth playback, the live stream is shown with a certain delay. Because this is done through standardised protocols, different players can be used to display a live stream with captions.
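As a simplified illustration of that process, the sketch below packages timed cues into fixed-duration WebVTT segments, roughly as a live packager might. The 6-second segment length and the function names are assumptions for illustration only.

```python
# Sketch: package caption cues into fixed-duration WebVTT segments,
# as an HLS/MPEG-DASH packager might do for a live subtitle track.
# Segment length and naming are illustrative assumptions.

SEGMENT_SECONDS = 6

def fmt_ts(sec):
    """Format seconds as a WebVTT timestamp, e.g. 00:00:07.200."""
    ms = int(round(sec * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def to_vtt_segments(cues):
    """Group (start, end, text) cues into WebVTT file bodies,
    one per SEGMENT_SECONDS window, keyed by segment index."""
    segments = {}
    for start, end, text in cues:
        idx = int(start // SEGMENT_SECONDS)
        segments.setdefault(idx, ["WEBVTT", ""]).extend(
            [f"{fmt_ts(start)} --> {fmt_ts(end)}", text, ""]
        )
    return {idx: "\n".join(lines) for idx, lines in segments.items()}
```

Each returned segment is a small standalone WebVTT file; a CDN would serve these alongside the audio/video segments, and the player would fetch and display them in sync.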
Captions can be added to the live stream in two ways, depending on the platform:
- Sent as part of the broadcast, usually using the CEA-608/CEA-708 protocol. This approach has many drawbacks: it is very limited, outdated, and most streaming services do not support it.
- Added by the streaming platform through real-time AI and/or manual transcription. This is already happening in the vast majority of cases.
Alternatives to closed captions
Besides closed captions, there are other ways to enrich and translate online video.
- Open Captions achieve roughly the same goals, only they require the subtitles to be burned into the video. Open captions are less suitable for live streaming: because the subtitles are an integral part of the video, a separate video stream needs to be broadcast and delivered for each language.
- Simultaneous Interpretation is another way to open up a video or live stream to non-native speakers. For the hearing impaired, however, it offers no added value. During live streaming, additional languages can be broadcast as separate audio tracks or channels. Or they can be added directly to the live stream if your streaming platform supports Remote Simultaneous Interpretation (RSI).
- Separate audio transcripts are sometimes easier to provide, especially in the context of virtual meetings. For viewers, however, they have some drawbacks. The text widget is located on a different part of the screen and contains continuous sentences that may not be completely in sync with video, which makes it harder to focus. Separate audio transcripts are also difficult to use on mobile devices, due to their smaller screen size.