Comparing mainstream transcription providers

Gregory Swart

At Comtura, our goal is to extract actionable information about deals from meeting recordings and deliver it in an easily digestible format to our users. Consequently, transcription quality is paramount for us - poor transcription limits our ability to deliver insights, while a high-quality transcript lets us give our users an overview of a meeting without them having to waste time listening to the whole recording.

To find out how to get the best quality transcriptions, I compared four speech-to-text transcription providers: AssemblyAI, AWS, the market leader, and Zoom. The metrics I was interested in were Word Error Rate (WER), Match Error Rate (MER), and Word Information Lost (WIL).

Error metrics across the four transcript providers

Word Error Rate

Is the ratio of errors (substitutions, deletions, and insertions) to the total number of words in the Input (original) text.
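Written as a formula, with S, D, and I the numbers of substitutions, deletions, and insertions, and N the number of words in the Input text:

\[ \mathrm{WER} = \frac{S + D + I}{N} \]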

Match Error Rate

Is the ratio of errors to the total number of Input/Output matched words. The number of Input/Output matched words is computed by comparing the Input (original) and Output (transcribed) texts, identifying corresponding words and accounting for any deletions and insertions made during the transcription process.
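In formula form, with H the number of Hits (correctly transcribed words) and S, D, and I as above:

\[ \mathrm{MER} = \frac{S + D + I}{H + S + D + I} \]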

Word Information Lost

Is equal to 1 - Word Information Preserved (WIP). WIP is derived by taking the ratio of Hits (correctly transcribed words) to the number of words in the Input text multiplied by the ratio of Hits to the number of words in the Output text.
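Or, as formulas:

\[ \mathrm{WIP} = \frac{H}{N_{\text{Input}}} \cdot \frac{H}{N_{\text{Output}}}, \qquad \mathrm{WIL} = 1 - \mathrm{WIP} \]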


I manually transcribed a 30-minute audio recording of a SaaS sales demonstration (24,000 characters, or roughly eight and a half pages) and used this as the reference. In this comparison I only considered a word correctly transcribed if its case and punctuation (excluding commas) were correct, since I considered correctly transcribing participant names and identifying sentence boundaries to be important. The results are as follows:
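For anyone who wants to run a similar comparison, here is a minimal sketch using the open-source jiwer package (an illustration, not necessarily the exact tooling we used); the file names and the comma-stripping step are assumptions:

```python
# Minimal sketch: scoring one provider's transcript against a manual reference
# with the open-source `jiwer` package. File names are illustrative; the only
# normalisation applied here is dropping commas, matching the rules above
# (case and all other punctuation still count towards errors).
import jiwer

def load_text(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

def normalise(text: str) -> str:
    return text.replace(",", "")

reference = normalise(load_text("reference_manual.txt"))      # hand-made transcript
hypothesis = normalise(load_text("provider_transcript.txt"))  # provider output

print("WER:", jiwer.wer(reference, hypothesis))
print("MER:", jiwer.mer(reference, hypothesis))
print("WIL:", jiwer.wil(reference, hypothesis))
```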

Column chart of each error metric for the four providers tested on a 30-minute sales demonstration - the lower the metric, the better the transcription quality

We can see that AssemblyAI boasts the best metrics, while AWS, the market leader, and Zoom score comparable metrics to one another. All transcription methods performed much better towards the middle of the demo, since that part is typically made up of lengthy monologues in which the participants don't interrupt each other. Towards the beginning and end of the sales demonstration, participants tend to engage in small talk and interrupt one another more often, leading to worse transcription quality.

Some subjective remarks:

  • AssemblyAI seemed to cope particularly well with names of companies and people, a very important aspect when analysing these kinds of texts.
  • While AWS, the market leader, and Zoom achieved similar results, in my opinion AWS was the easiest to read, because fewer of its errors changed the meaning of a sentence dramatically, while Zoom was the hardest to understand.
  • The market leader seems to be very good at identifying sentence boundaries.

I also created a heatmap of Match Error Rate per 100 input/output matched words. In each transcription, the number of input/output matched words grows with the number of deletions and insertions, so the text lengths do not match across providers - "-1" acts as a placeholder when a transcription method has fewer I/O matched words than the others.

Heatmap of Match Error Rate (MER) per 100 words in the I/O matched text - the lower the MER, the better. For each column, the top rectangle corresponds to the first 100 words and so on. I/O matched text lengths differ due to differing numbers of deletions and insertions, so "-1" was added as a placeholder to resolve these length differences.
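As a rough illustration of how such a per-chunk view can be produced, the sketch below splits the reference and a provider transcript into consecutive 100-word windows and computes MER per window with jiwer. This is a simplification of the I/O-matched chunking used for the heatmap, and the file names are again assumptions:

```python
# Rough sketch: MER per consecutive 100-word window, again using `jiwer`.
# The heatmap above chunks the I/O matched (aligned) text instead, so the
# numbers here will not line up exactly - this is only an approximation.
import jiwer

def chunks(words, size=100):
    # Split a list of words into consecutive windows of `size` words.
    return [words[i:i + size] for i in range(0, len(words), size)]

reference_words = open("reference_manual.txt", encoding="utf-8").read().replace(",", "").split()
hypothesis_words = open("provider_transcript.txt", encoding="utf-8").read().replace(",", "").split()

# zip() stops at the shorter transcript; trailing words are ignored in this sketch.
for i, (ref, hyp) in enumerate(zip(chunks(reference_words), chunks(hypothesis_words)), start=1):
    mer = jiwer.mer(" ".join(ref), " ".join(hyp))
    print(f"batch {i:3d}: MER = {mer:.3f}")
```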

Some remarks about the heatmap:

  • As in the first comparison, we see that overall AssemblyAI performs the best.
  • We can see clearly that MER spiked at certain times when the participants switched to a more informal manner of speech, resulting in more interruptions and parallel speaking (batches 18-19 and 31-33, and the small talk towards the very end of the meeting).
  • While AWS and the market leader scored similar overall metrics, the heatmap shows that the market leader has a higher variance among its scores, while AWS seems to be less accurate overall but more consistent, dealing with parallel speech and interruptions slightly better.

While these scores are useful for comparing transcription quality in a standardized way, they unfortunately do not take into account how severe the errors are. The best way to really understand how well a transcription service works for you is to read through some transcriptions yourself.

Conclusion

In summary, we saw that of the transcription methods examined using our criteria, AssemblyAI is the clear winner, scoring the best in all three metrics, while the three other providers achieved similar scores across the board. AWS seems to be more robust when faced with simultaneous speaking, while the market leader achieves higher accuracy during lengthy monologues and is good at finding reasonable sentence boundaries.