Our Use Case:
Every D2C brand has its call centers set up to interact with customers and resolve their queries and complaints about their products and services. Here at Bewgle, we use our NLP tools to analyze the calls and understand what the customers are saying to find out what they want.
As a first step, we transcribe audio calls to text and analyze what customers are saying about their experience with the brands.
Speech-to-text apps and tools:
Speech-to-text software recognizes spoken language and converts it into text using computational linguistics. It is also known as speech recognition or computer speech recognition. Specific applications, tools, and devices can transcribe audio streams in real time to display text and act on it.
Problems with phone calls:
Phone calls contain a lot of noise and attenuation, and the spoken language itself poses problems. The language in which the speakers are conversing must first be identified. Speakers also sometimes say keywords in their local language, which makes those words difficult for tools to recognize.
We tried the following speech-to-text apps and tools to transcribe audio recordings, but the results weren’t satisfactory for our use case:
Wav2Vec2
- Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. This model is trained using connectionist temporal classification (CTC) so the model output has to be decoded using Wav2Vec2CTCTokenizer.
- It was not able to detect a single word, not even a “Hello”. The transcript we received was: A AA AR A ON LAATER BAT BORTL A AA AAA UA A O AS NO AS WHY O CRIS A AON AAAA AOA.
- This text doesn’t make any sense.
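To make the CTC decoding step mentioned above more concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is not the Wav2Vec2CTCTokenizer implementation, only an illustration of the idea: the model emits one token id per audio frame, repeated ids are collapsed, and the blank token is dropped. The toy vocabulary, the `ctc_greedy_decode` name, and the frame ids are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token,
# and "|" marks a word boundary.
VOCAB = ["<blank>", "H", "E", "L", "O", "|"]
BLANK_ID = 0


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID, word_sep="|"):
    """Collapse repeated frame predictions, drop blanks, and map ids to text."""
    # Step 1: merge runs of identical ids (one spoken character spans many frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token, which only separates repeated characters.
    chars = [vocab[i] for i in collapsed if i != blank_id]
    # Step 3: turn word-boundary markers into spaces.
    return "".join(chars).replace(word_sep, " ").strip()


# Illustrative per-frame ids; the blank between the two L's keeps both of them.
frames = [1, 1, 0, 2, 2, 3, 3, 0, 3, 4, 4, 0, 5]
print(ctc_greedy_decode(frames))
```

When the per-frame predictions are unreliable, as they can be on noisy phone audio, this collapsing step may yield fragments of stray letters much like the transcript shown above.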
Augnito
- The tool is designed only for medical practitioners to write prescriptions.
- It is accurate in identifying hard-to-pronounce words like ‘ophthalmologist’, ‘arrhythmia’, and ‘gonorrhea’.
- However, it failed at our use case of transcribing audio calls to text.
Google Cloud Platform (GCP)
- The Speech-to-Text API enables developers to convert audio to text in over 125 languages and variants, by applying powerful neural network models in an easy-to-use API.
- GCP was at least able to detect some text phonetically but was poor at detecting keywords. The output we received was: “namaste madam basket order management Inc head office se naraz Hona totally black last time dobara Ek Bar to mere ko Markar Biryani chale gaye the vahan per abhi aapke Jhooth bolate vahan Jana Nahin Jana Hai Aap Jana Upar Se 14 kilometre kilometre Kaisa Hai Sar aap log bataiye”.
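One way to put a number on "poor at detecting keywords" is keyword recall: the share of expected keywords that actually appear in the transcript. The sketch below is only an illustration, not the method we used. The `keyword_recall` helper and the keyword list are hypothetical, and the transcript is a fragment of the GCP output quoted above.

```python
import re


def keyword_recall(transcript, keywords):
    """Return (recall, missed) for expected keywords against a transcript."""
    # Lowercase word tokens found in the transcript.
    tokens = set(re.findall(r"\w+", transcript.lower()))
    wanted = [k.lower() for k in keywords]
    # Keywords that never show up as a whole word in the transcript.
    missed = [k for k in wanted if k not in tokens]
    recall = 1.0 - len(missed) / len(wanted) if wanted else 0.0
    return recall, missed


# Fragment of the GCP output quoted above.
transcript = "namaste madam basket order management Inc head office se naraz Hona"
# Hypothetical keywords an analyst might look for; not taken from the real call.
expected = ["order", "refund", "delivery", "office"]

recall, missed = keyword_recall(transcript, expected)
print(f"recall={recall:.2f}, missed={missed}")
```

Exact word matching like this is strict: a keyword that is transcribed phonetically with a different spelling counts as missed, which is the kind of failure we saw on these calls.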
Transcribe – Speech-to-Text
- This app is for Mac/iOS devices.
- It had very low accuracy with a lot of junk in the output and had poor results for Hindi.
- https://apps.apple.com/us/app/transcribe-speech-to-text/id1241342461
Trint
- It had very low accuracy in transcribing speech to text.
- There was a lot of junk in the text, but surprisingly it was able to detect some words that even GCP could not.
- The output, however, can’t be analyzed further.
IBM Watson
- IBM Watson Speech to Text is a cloud API service that enables you to convert audio and voice into written text in a variety of languages within an existing application or within the Watson Assistant.
- It performed excellently when analyzing speech in English. However, in our tests it neither supported nor phonetically detected Hindi or any of the other Indian languages.
Conclusion:
None of these tools was able to transcribe our audio calls accurately, because the calls contained a lot of noise. Some calls were also attenuated, and in others the spoken language was not specified.