As part of new efforts towards accessibility, Google announced Project Euphonia at I/O in May: an attempt to make speech recognition capable of understanding people with non-standard speaking voices or impediments. The company has just published a post and a paper explaining some of the AI work enabling the new capability.
The problem is simple to observe: The speaking voices of those with motor impairments, such as those produced by degenerative diseases like amyotrophic lateral sclerosis (ALS), simply are not understood by existing natural language processing systems.
You can see it in action in the following video of Google research scientist Dimitri Kanevsky, who himself has impaired speech, attempting to interact with one of the company’s own products (and eventually doing so with the help of a related project, Parrotron):
The research team describes the problem as follows:
ASR [automatic speech recognition] systems are most often trained from ‘typical’ speech, which means that underrepresented groups, such as those with speech impairments or heavy accents, don’t experience the same degree of utility.
…Current state-of-the-art ASR models can yield high word error rates (WER) for speakers with only a moderate speech impairment from ALS, effectively barring access to ASR reliant technologies.
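Word error rate, the metric the researchers cite, is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between what was said and what the system heard, divided by the number of words actually said. The following is a minimal Python sketch of that standard definition; the function name `word_error_rate` and the example sentences are chosen here for illustration and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word was dropped
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Two of five words are wrong, so the error rate is 2 / 5.
print(word_error_rate("going back inside the house",
                      "going tack inside the mouse"))  # 0.4
```

A rate of 0 means a perfect transcript. Because insertions count as errors, the rate can exceed 1 when the system produces many more words than were spoken.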
It’s notable that they at least partly blame the training set. That’s one of those implicit biases we find in AI models that can lead to high error rates in other places, like facial recognition or even noticing that a person is present. Failing to include major groups like people with dark skin isn’t a mistake comparable in scale to building a system that leaves out people with impaired speech, but both can be addressed by more inclusive source data.
For Google’s researchers, that meant collecting dozens of hours of spoken audio from people with ALS. As you might expect, each person is affected differently by their condition, so accommodating the effects of the disease is not the same process as accommodating, say, a merely uncommon accent.
A standard voice-recognition model was used as a baseline, then tweaked in a few experimental ways and trained on the new audio. This alone reduced word error rates drastically, and did so with relatively little change to the original model, meaning there’s less need for heavy computation when adjusting to a new voice.
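The general idea of adapting a trained model by changing only a small part of it can be shown with a deliberately tiny example. The Python sketch below is not the researchers' model or training recipe: it assumes a toy linear predictor whose weights are frozen and whose single bias term is fine-tuned on a handful of new examples. All names and numbers (`predict`, `fine_tune_bias`, the data) are hypothetical.

```python
def predict(weights, bias, features):
    """Toy linear model standing in for a pretrained baseline."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mean_squared_error(weights, bias, examples):
    errors = [(predict(weights, bias, f) - t) ** 2 for f, t in examples]
    return sum(errors) / len(errors)


def fine_tune_bias(weights, bias, examples, learning_rate=0.1, epochs=200):
    """Gradient descent on the bias only; the weights are never modified."""
    for _ in range(epochs):
        # Derivative of the mean squared error with respect to the bias.
        gradient = sum(2 * (predict(weights, bias, f) - t)
                       for f, t in examples) / len(examples)
        bias -= learning_rate * gradient
    return bias


# Hypothetical "pretrained" parameters.
base_weights = [2.0, -1.0]
base_bias = 0.0

# Hypothetical data from a new speaker: same trend, shifted by 1.5.
new_examples = [([1.0, 0.0], 3.5), ([0.0, 1.0], 0.5),
                ([1.0, 1.0], 2.5), ([2.0, 1.0], 4.5)]

adapted_bias = fine_tune_bias(base_weights, base_bias, new_examples)
print(mean_squared_error(base_weights, base_bias, new_examples))
print(mean_squared_error(base_weights, adapted_bias, new_examples))
```

Only one number is updated here, which is why the adaptation is cheap. A real speech model has vastly more parameters, but the same trade-off applies: the fewer that change, the less data and computation each new voice requires.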
The researchers found that the model, when it is still confused by a given phoneme (that’s an individual speech sound like an e or f), makes two kinds of errors. First, it doesn’t recognize the phoneme for what was intended, and thus doesn’t recognize the word. Second, the model has to guess at what phoneme the speaker did intend, and might choose the wrong one in cases where two or more words sound roughly similar.
The second error in particular is one that can be handled intelligently. Perhaps you say “I’m going back inside the house,” and the system fails to recognize the “b” in back and the “h” in house; it’s not equally likely that you intended to say “I’m going tack inside the mouse.” The AI system may be able to use what it knows of human language, and of your own voice or the context in which you’re speaking, to fill in the gaps intelligently.
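One common way to let knowledge of language break ties between similar-sounding words is to score each candidate sentence with a language model and keep the most probable one. The Python sketch below illustrates that general idea only; it is not something the researchers describe having built. It assumes a tiny hand-written corpus, a bigram model with add-one smoothing, and a recognizer that has already narrowed each word position to a short list of candidates. Every name and sentence in it is hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical stand-in for a large body of ordinary text.
TOY_CORPUS = [
    "i am going back inside the house",
    "she went back inside the house",
    "the mouse ran inside the house",
    "going back to the house",
]


def train_bigrams(sentences):
    """Count single words and adjacent word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # <s> marks the sentence start
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_probability(words, unigrams, bigrams):
    """Bigram log-probability with add-one smoothing for unseen pairs."""
    vocabulary_size = len(unigrams)
    total, previous = 0.0, "<s>"
    for word in words:
        numerator = bigrams[(previous, word)] + 1
        denominator = unigrams[previous] + vocabulary_size
        total += math.log(numerator / denominator)
        previous = word
    return total


def best_transcript(slots, unigrams, bigrams):
    """Try every combination of candidates and keep the likeliest."""
    best = max(product(*slots),
               key=lambda words: log_probability(words, unigrams, bigrams))
    return " ".join(best)


unigrams, bigrams = train_bigrams(TOY_CORPUS)
# The recognizer is unsure about two words and offers two options each.
slots = [["going"], ["back", "tack"], ["inside"], ["the"], ["house", "mouse"]]
print(best_transcript(slots, unigrams, bigrams))
# going back inside the house
```

Enumerating every combination is only practical for a handful of uncertain words, and this sketch ignores how confident the acoustic model is in each candidate. A real system would weigh both sources of evidence together.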
But that’s left to future research. For now you can read the team’s work so far in the paper “Personalizing ASR for Dysarthric and Accented Speech with Limited Data,” due to be presented at the Interspeech conference in Austria next month.