- Gregory Swart
Last time I compared some widely used transcription providers to understand the strengths and weaknesses of each. I was also interested in how these providers measure up against some off-the-shelf Automatic Speech Recognition (ASR) models. After some browsing I settled on three models from huggingface.co:
- facebook/hubert-large-ls960-ft (Hubert)
- facebook/wav2vec2-base-960h (W2V base)
- facebook/wav2vec2-large-960h-lv60-self (W2V large)
Column chart of each error metric for the seven methods tested on a 30-minute sales demonstration. The lower the metric, the better the transcription quality. This time casing and punctuation were stripped from the text before scoring.
These models only output uncased, unpunctuated text. To keep the comparison fair without overcomplicating the process, I stripped casing and punctuation from my existing transcripts and compared all seven methods I now had.
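The stripping step can be sketched roughly as follows. This is a minimal illustration rather than the exact script used for the comparison: the function name `normalize` is hypothetical, and the choices to lowercase everything, keep apostrophes, and turn other punctuation into spaces are assumptions.

```python
import string


def normalize(text: str) -> str:
    """Lowercase the text, replace punctuation with spaces, collapse whitespace."""
    # Assumption: keep apostrophes so contractions such as "don't" stay one word.
    punctuation = string.punctuation.replace("'", "")
    # Map every remaining punctuation character to a space, so that
    # "30-minute" becomes "30 minute" instead of "30minute".
    table = str.maketrans(punctuation, " " * len(punctuation))
    return " ".join(text.lower().translate(table).split())


print(normalize("Hello, World! How's the 30-minute demo?"))
```

Applying the same function to both the reference transcript and every method's output keeps the comparison consistent, whatever casing a given model emits.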
_Table comparing the error metrics across all seven methods on unpunctuated, uncased text. In parentheses are the scores of some methods with casing and punctuation included._
Heatmap of Match Error Rate (MER) per 100 words in the I/O matched text; the lower the MER, the better. For each column, the top rectangle corresponds to the first 100 words, and so on. I/O matched text lengths differ due to different numbers of deletions and substitutions, so “-1” was added as a placeholder to resolve these length differences. Higher resolution image available here.
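One way to arrive at per-chunk values like those in the heatmap is sketched below. It is not necessarily how the figure was produced: the helper names `align_ops`, `mer_per_chunk` and `pad_columns` are hypothetical, and the alignment is a plain word-level edit-distance alignment written out by hand. MER is taken as (S + D + I) / (H + S + D + I), which is the share of aligned positions that are not hits.

```python
def align_ops(reference, hypothesis):
    """Return one label per aligned position: 'hit', 'sub', 'del' or 'ins'."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            reference[i - 1] != hypothesis[j - 1]
        ):
            ops.append("hit" if reference[i - 1] == hypothesis[j - 1] else "sub")
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append("del")  # reference word missing from the hypothesis
            i -= 1
        else:
            ops.append("ins")  # extra word in the hypothesis
            j -= 1
    ops.reverse()
    return ops


def mer_per_chunk(ops, chunk_size=100):
    """MER for each consecutive chunk of aligned positions."""
    scores = []
    for start in range(0, len(ops), chunk_size):
        chunk = ops[start:start + chunk_size]
        scores.append(sum(op != "hit" for op in chunk) / len(chunk))
    return scores


def pad_columns(columns, fill=-1):
    """Pad shorter columns with a placeholder so all columns have equal length."""
    longest = max(len(col) for col in columns)
    return [col + [fill] * (longest - len(col)) for col in columns]


reference = "the quick brown fox jumps".split()
hypothesis = "the quick brown dog".split()
print(mer_per_chunk(align_ops(reference, hypothesis), chunk_size=2))
```

Each method yields one list of chunk scores; padding the lists to a common length with -1 gives a rectangular grid that can be drawn as a heatmap, with the placeholder cells marking where a shorter alignment ran out.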
- From the second comparison we see that Hubert and W2V base score much worse than the previous four methods, while W2V large achieved scores comparable to them.
- Unsurprisingly, the machine learning models also coped worse during the small-talk sections of the meeting.
In summary, while W2V large achieved promising results, it is important to note that transcribing into unpunctuated, uncased text leaves several tasks to be done. To implement it fully, we would also need a way to restore casing and punctuation in the transcribed text, not to mention speaker detection.