How a machine tries to understand a dictation
Remember in school when you had to take dictation? I do, because I was horrible at it! First you needed to understand what the teacher was saying, and then you needed to know how to write the word – something I struggled with a lot. I was told, “Read more, then you will know more words.” In my case, that didn’t help as much as expected (though it was still the best suggestion ever, as now I love reading!). The machine has exactly the same issues with dictation as I had as a child: first it needs to decipher which sounds are important, and then it needs to know the corresponding words to be able to write them down. At least for the machine, the advice to “read more” is quite helpful: as soon as the machine sees a word, it remembers it forever (unlike me – I needed to read the same word repeatedly, and some I still cannot remember!).
Let us have a closer look at how a machine goes about translating speech into text. To begin with, the input and starting point for ‘speech to text’ is the signal recorded by a microphone. Next, the machine needs to segment the signal into syllables (or other small units of sound). Once the syllables are delineated, the machine can look up which words can be built from them. This can yield many hypothetical sequences of words. Each of these potential sequences is then analyzed to determine whether it is grammatically reasonable. If the machine concludes that multiple sequences are still possible, it runs semantic and pragmatic analyses based on the context, which generally results in a single sequence of words. Voilà! Speech is translated into text.
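To make the pipeline a bit more concrete, here is a toy sketch in Python. Everything in it is invented for illustration: the “signal” is just a string with syllable boundaries already marked, the lexicon is a three-entry miniature, and the final selection step simply prefers the hypothesis with the fewest words instead of doing real grammatical and semantic analysis.

```python
# Toy speech-to-text pipeline: signal -> syllables -> word hypotheses -> text.
# The "signal" is a string with syllable boundaries already marked, and the
# lexicon is a made-up miniature example.

LEXICON = {
    ("a",): "a",
    ("bout",): "bout",
    ("a", "bout"): "about",  # competing reading of the same syllables
}

def segment(signal):
    # Stand-in for acoustic segmentation into syllables.
    return signal.split("-")

def word_hypotheses(syllables):
    # Enumerate every way to group the syllables into known words.
    if not syllables:
        return [[]]
    hypotheses = []
    for i in range(1, len(syllables) + 1):
        chunk = tuple(syllables[:i])
        if chunk in LEXICON:
            for rest in word_hypotheses(syllables[i:]):
                hypotheses.append([LEXICON[chunk]] + rest)
    return hypotheses

def speech_to_text(signal):
    hypotheses = word_hypotheses(segment(signal))
    # A real system would score grammar and context here;
    # we simply prefer the hypothesis with the fewest words.
    return " ".join(min(hypotheses, key=len))

print(speech_to_text("a-bout"))  # -> about
```

Note how even this tiny example produces two competing hypotheses (“a bout” and “about”) from the same syllables – exactly the ambiguity the later analysis steps have to resolve.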
However, it isn’t always that simple. At each step of the process, challenges can arise. First, the input is recorded by a microphone. Thanks to Covid-19 and the prevalence of virtual meetings, we all now know how important a good headset is. As the quality of the microphone decreases, it gets harder and harder to understand the person talking. A low-quality microphone can introduce two different issues. The more common one is that the sounds the speaker makes are not properly recorded: missing syllables or even whole words mean the machine has to guess to fill in the blanks. The other issue is that too much is recorded: background noise such as rustling papers, noises from outside, or other people speaking in the background can all degrade the quality of the transcription.
Another issue arises from the natural differences between individual speakers. When segmenting the input into syllables, different pitch levels need to be normalized: a child with a high voice has a different voice profile than an adult – or simply think about how helium affects a person’s voice. The other factor is speaking speed. The time window for a syllable can vary quite a bit between speakers. For example, in Switzerland there is the cliché that people from Bern speak a lot more slowly than the rest of the country – and if you ask Germans about the Swiss, they will tell you that people in Switzerland tend to speak extremely fast. Simply put, the machine needs to normalize both the pitch and the speed before segmenting the input signal into syllables.
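The normalization step can be imagined roughly like this. Keep in mind this is a deliberately simplified sketch: real systems operate on the audio spectrum, and the reference pitch and duration values below are arbitrary numbers chosen for illustration.

```python
# Toy normalization of a speaker's pitch contour and speaking speed.
# The target values are arbitrary; real systems analyze the audio spectrum.

def normalize(pitches_hz, duration_s, target_pitch_hz=120.0, target_duration_s=1.0):
    # Rescale the pitch contour so its mean matches a reference speaker,
    # and compute a factor to stretch/compress time so every recording
    # has the same nominal speaking rate.
    mean_pitch = sum(pitches_hz) / len(pitches_hz)
    pitch_scale = target_pitch_hz / mean_pitch
    speed_factor = target_duration_s / duration_s
    return [p * pitch_scale for p in pitches_hz], speed_factor

# A child's high, fast utterance mapped onto the reference profile:
pitches, speed = normalize([240.0, 260.0, 220.0], duration_s=0.5)
print(pitches, speed)  # -> [120.0, 130.0, 110.0] 2.0
```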
Now we have a sequence of syllables which needs to be translated into words (if you are thinking of the machine translation post now, you are right; this is similar and is often described as one of its subfields). Not only are unknown syllable combinations (unknown words) an issue, but filler words like “ahem” and different pronunciations are also problematic. For example, the German pronunciation of “China” differs by country and region. For those who can read phonetic transcription: Germans mostly pronounce it as ˈçiːnaː, but in the south of Germany, Switzerland, and Austria, it is pronounced as ˈkiːnaː – on the Duden webpage, you can hear the difference (section “Aussprache”). The English transcription is /ˈtʃaɪnə/. And on that note: transcription could be a solution for handling unknown words – just write the phonetic transcription, and the user can read it out loud and will understand it. This is how Swiss German is written: people simply write their dialect the way they speak it, using the Latin alphabet and umlauts. But who can read phonetic transcription? Honestly, I cannot.
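The idea of several pronunciations mapping to one written word, with phonetic transcription as a fallback for anything unknown, could be sketched like this. The pronunciation dictionary is, of course, an invented miniature, and the unknown example string is just a placeholder.

```python
# Several regional pronunciations map to the same written word;
# anything unknown falls back to its phonetic transcription.

PRONUNCIATIONS = {
    "ˈçiːnaː": "China",   # standard German
    "ˈkiːnaː": "China",   # southern Germany, Switzerland, Austria
    "ˈtʃaɪnə": "China",   # English
}

def to_word(phones):
    # Unknown word: keep the raw transcription, as suggested in the text.
    return PRONUNCIATIONS.get(phones, phones)

print(to_word("ˈkiːnaː"))  # -> China
print(to_word("ˈbʏro"))    # -> ˈbʏro (unknown: transcription kept)
```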
In the best-case scenario, we have a few sequences of possible words (and no unknown words). The next thing the machine tries to do is find the correct sequence of words. How do you define a “correct” sequence? You can – and it is often done like this – define it as a well-formed, grammatically correct sentence. This is justifiable when a teacher dictates a textbook passage to their students. In reality, especially in conversations, there will be sentences which suddenly stop – for example, when you are interrupted by another person. There are also “word duplications”, as when you stumble and repeat yourself. For these cases, you need to relax your definition of “correct” – but not too much; otherwise, you will get strange sentences. In addition to relaxing the definition, you probably want to remove the duplicated words and transcribe them only once (depending on the use case).
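Collapsing those stutter-style duplications can be as simple as dropping a word that immediately repeats its predecessor – one possible rule among many, since how aggressively to clean up depends on the use case:

```python
# One possible rule for collapsing stuttered repetitions
# ("I I I think think so" -> "I think so"). Whether and how much
# to collapse depends on the use case.

def collapse_repetitions(words):
    cleaned = []
    for word in words:
        # Keep the word only if it differs from the previous one.
        if not cleaned or cleaned[-1].lower() != word.lower():
            cleaned.append(word)
    return cleaned

print(" ".join(collapse_repetitions("I I I think think so".split())))
# -> I think so
```

Note that this simple rule would also merge legitimate repetitions such as “very very good”, which is exactly why the cleanup has to stay configurable.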
For speech to text, the most common use cases are dictation (e.g. instead of a secretary writing everything down, your computer can now do it) and the transcription of conversations (e.g. subtitles for political debates). For the second use case, a follow-up translation into other languages is often popular as well. If you have ever listened to a press conference given by the Swiss government, you will have noticed that they switch between different languages (German and French), and it makes one wish that subtitles were provided.
To conclude this blog post, I want to point out that there are already a lot of (commercial) solutions to this problem which work quite well. The biggest issue arises when a speaker diverges too much from the norm. And speaking as someone living and working in Switzerland, I can confirm that we do this all the time. We have many dialects which can be vastly different – ever tried to understand someone from Vaud (a French-speaking canton) speaking Swiss German? Even our pronunciation when speaking standard German can be quite different. Besides the China example from earlier, a personal example is my last name, which would be pronounced differently in Germany (with a strange focus on the ‘o’). And don’t forget my favorite issue: unknown words! As the world continues to grow, so does the vocabulary of every spoken natural language. Someday, we will have a solution for it – perhaps by simply teaching phonetic transcription.