Clevercast is now able to add AI-powered closed captions to your live stream with a jaw-dropping 99+% accuracy for commonly spoken languages. This is a unique accomplishment, vastly superior to what other AI-powered caption solutions like YouTube and Vimeo are offering.

For an indication of the difference in accuracy and readability with other platforms, we recorded the same live stream with auto-generated captions in Clevercast, YouTube and Vimeo. All recordings are unedited.

The live stream featured a number of different speakers, each with their own speech pattern and accent.

The live stream used excerpts from the following videos available June 20, 2023 under a CC-BY 4.0 license:

Unique speech-to-text technology

What are AI-powered closed captions

Traditionally, creating closed captions involved human transcription, which is costly and difficult to do in real time. AI-powered solutions eliminate the need for human captioning by using advanced Automatic Speech Recognition (ASR) algorithms to analyze the live audio and convert it to captions in real time.

Machine learning models are trained on vast amounts of data, allowing them to recognize speech patterns, accents, and context. As a result, their output has improved dramatically over time, capturing the nuances of language and delivering a much higher quality transcription.

How the accuracy of captions is determined

WER (Word Error Rate) and FER (Frame Error Rate) are commonly used metrics to assess the accuracy of closed captions. Both rely on a reference transcript from a human transcriber to detect the number of errors. FER is better suited to evaluating real-time captioning because it counts the number of frames or segments in which the live caption differs from the reference transcript. In our metrics, we exclude cosmetic errors (e.g. text not optimally split into two lines) insofar as they do not hinder the viewer’s experience. By accuracy, then, we mean correctly converting speech to text and rendering it in understandable phrases that are in sync with the live stream.

Thus, an accuracy of 99+% means that, on average, our AI-powered captioning gets more than 99 out of every 100 words right and is able to put them in understandable phrases that are in sync with the live stream.¹
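To make the two metrics concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Clevercast's actual measurement code: the function names are hypothetical, words are compared case-insensitively after splitting on whitespace, and the FER variant assumes the reference and the live captions have already been aligned into the same number of segments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the word-level edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from caption
                curr[j - 1] + 1,     # insertion: extra word in caption
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


def frame_error_rate(reference_segments: list[str], caption_segments: list[str]) -> float:
    """Share of pre-aligned segments in which the caption differs from the reference."""
    if not reference_segments:
        raise ValueError("at least one reference segment is required")
    if len(reference_segments) != len(caption_segments):
        raise ValueError("segments must be aligned one-to-one (assumption of this sketch)")
    errors = sum(
        1
        for ref, cap in zip(reference_segments, caption_segments)
        if ref.lower().split() != cap.lower().split()
    )
    return errors / len(reference_segments)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    caption = "the quick brown fox jumped over the lazy dog"
    wer = word_error_rate(reference, caption)
    print(f"WER: {wer:.3f}, word accuracy: {1 - wer:.3f}")
```

In this example, one substituted word in a nine-word reference gives a WER of 1/9. Word accuracy is often reported as 1 minus WER, so a 99+% accuracy corresponds to fewer than one word error per 100 reference words.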

This extremely high success rate is only possible thanks to the huge progress in Automatic Speech Recognition (ASR) technology, where different ways of speaking, dialects, accents, vocabulary… no longer lead to inaccuracies in transcription. In addition, our ASR engines recognize most public names and abbreviations, and are able to render them correctly. Nowadays most ‘errors’ result from unclearly pronounced personal names and other names that are not commonly known.²

¹ This has been tested primarily with English and Spanish content, including many different accents and dialects. However, data from the AI solutions show that this should also apply to other commonly used languages such as French, German, Italian, Portuguese, Japanese and others.

² Misspellings of names that cannot reasonably be assumed to be known by the AI engine, and that have not been added as additional context information, are not counted as errors when calculating the level of accuracy.

How we achieve 99+% accuracy

There is a big difference between Clevercast and most other solutions for live AI captions. How does Clevercast attain 99+% accuracy, while other solutions are stuck between 70% and 90% accuracy? Below, we list some of the things that lead to better live captions. Of course, it also helps that we specialize in multilingual live streaming, and develop and manage our own streaming infrastructure.


Best-in-class ASR solutions

AI keeps evolving, almost on a daily basis. We benchmark different AI solutions, so Clevercast can automatically select the best engine when a live stream is configured.


Optimized AI

We regularly evaluate the performance of the AI system using validation and test data, and seek to continuously improve and refine the system. This is how we ensure ever-higher quality. 


Enhance the audio input

There are gains to be made with the audio that is sent to the ASR engine, for example in terms of audio quality, background noise, intelligent fragment detection …


Provide maximum context

Clevercast slightly increases the latency of the HLS stream. This way, we are able to send more context to the ASR engine, which leads to a better speech-to-text conversion.


Avoid wrong predictions

Language models are predictive by nature. To apply correctly what they have learned, they rely on context, background information and specific terms and instructions.


Intelligent AI output processing

Intelligent post-processing is necessary to catch errors, correct misinterpretations, fix grammatical errors, adjust punctuation, words and phrases and much more.


Apply formatting and line breaks

Factor in readability and time constraints when turning the transcription into captions. Consider the pace of speech and avoid captions that are too short or too long.
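As a rough illustration of the length constraint, the sketch below splits a transcription into captions of at most two lines each. The limits of 42 characters per line and two lines per caption are assumptions chosen for the example (they are common subtitling conventions, not values stated by Clevercast), and the function name is hypothetical.

```python
import textwrap

MAX_CHARS_PER_LINE = 42    # assumed limit, not a Clevercast setting
MAX_LINES_PER_CAPTION = 2  # assumed limit, not a Clevercast setting


def split_into_captions(
    text: str,
    max_chars: int = MAX_CHARS_PER_LINE,
    max_lines: int = MAX_LINES_PER_CAPTION,
) -> list[list[str]]:
    """Wrap text into lines of at most max_chars, then group lines into captions."""
    # break_on_hyphens=False keeps compounds such as "speech-to-text" on one line.
    lines = textwrap.wrap(text, width=max_chars, break_on_hyphens=False)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


if __name__ == "__main__":
    transcript = (
        "Factor in readability and time constraints when turning "
        "the transcription into captions for the live stream"
    )
    for caption in split_into_captions(transcript):
        print("\n".join(caption))
        print("---")
```

A production system would also weigh sentence boundaries and the pace of speech when choosing where to break; this sketch only enforces the length limits.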


Synchronize the captions

Good timing involves more than aligning captions with the spoken words. Proper synchronization ensures that the captions appear at the right time and stay on screen long enough to help viewers follow the content.
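One aspect of this, keeping a caption on screen long enough to be read, can be sketched as follows. This is a simplified illustration, not Clevercast's synchronization logic: the `Cue` type, the function name and the one-second minimum are all assumptions made for the example.

```python
from dataclasses import dataclass

MIN_DISPLAY_SECONDS = 1.0  # assumed minimum on-screen time, not a Clevercast value


@dataclass
class Cue:
    start: float  # seconds from the start of the stream
    end: float
    text: str


def enforce_min_duration(cues: list[Cue], min_duration: float = MIN_DISPLAY_SECONDS) -> list[Cue]:
    """Extend cues that are too short, without running into the next cue."""
    adjusted = []
    for index, cue in enumerate(cues):
        # Keep the cue visible for at least min_duration.
        end = max(cue.end, cue.start + min_duration)
        if index + 1 < len(cues):
            # Never extend past the start of the next cue,
            # and never shorten the cue below its original end time.
            next_start = cues[index + 1].start
            end = min(end, max(cue.end, next_start))
        adjusted.append(Cue(cue.start, end, cue.text))
    return adjusted


if __name__ == "__main__":
    cues = [Cue(0.0, 0.4, "Hi"), Cue(2.0, 3.5, "Welcome everyone")]
    for cue in enforce_min_duration(cues):
        print(f"{cue.start:.1f} -> {cue.end:.1f}: {cue.text}")
```

Here the first cue is extended from 0.4 to 1.0 seconds because there is room before the next cue begins at 2.0 seconds.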

Live captions through speech-to-text conversion

Use our real-time correction interface for 100% accuracy

Our correction interface allows you to edit the AI-generated captions in real time, just before they are sent to the live stream (and translated into other languages). It lets you change words and move them to different lines for improved readability.

Making these corrections is a simple task that requires no experience or training. Our intuitive interface allows anyone to edit the captions in a browser with mouse and keyboard. If desired, we can also provide you with professional correctors for your event.

