Why Speech-to-Speech Translation Is So Important For Google
Google has just released an upgraded version of one of its most coveted projects, Translatotron, taking a step closer to universal translation. Technology that breaks the language barrier and lets people communicate with almost anyone is a long-standing dream for AI and machine learning researchers around the world. Known as universal translation and depicted many times in sci-fi movies and books, it was listed in an MIT Technology Review report as one of ten technologies expected to be highly valued in the near future.
Plenty of research has gone into smooth speech-to-speech translation (S2ST) in pursuit of this goal. A conventional system has three main components: automatic speech recognition to transcribe the source speech as text, machine translation to translate the transcribed text into the language of choice, and text-to-speech synthesis to generate speech in the target language.
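The three-stage arrangement described above can be sketched in a few lines of Python. This is only an illustration of how the stages chain together, not how any production system is built: the function names, the tiny lexicon, and the trick of treating bytes as "audio" are all hypothetical stand-ins for trained models.

```python
from typing import Callable

# Hypothetical stage signatures; a real system would wrap trained models here.
Recognizer = Callable[[bytes], str]    # speech -> source-language text
Translator = Callable[[str], str]      # source text -> target-language text
Synthesizer = Callable[[str], bytes]   # target text -> speech


def cascade_s2st(audio: bytes,
                 recognize: Recognizer,
                 translate: Translator,
                 synthesize: Synthesizer) -> bytes:
    """Run the three stages in order; each stage consumes the previous output."""
    source_text = recognize(audio)          # 1. automatic speech recognition
    target_text = translate(source_text)    # 2. machine translation
    return synthesize(target_text)          # 3. text-to-speech synthesis


# Toy stand-ins (assumptions for illustration only): "audio" is UTF-8 text.
def toy_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


TOY_LEXICON = {"hola": "hello", "mundo": "world"}


def toy_translate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_LEXICON.get(word, word) for word in text.lower().split())


def toy_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


result = cascade_s2st(b"Hola mundo", toy_recognize, toy_translate, toy_synthesize)
print(result)
```

Because each stage only sees the output of the one before it, a mistake made during recognition is passed on to translation and then to synthesis, which is the weakness that direct approaches try to avoid.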
Google’s Efforts In S2ST
AI-assisted cross-lingual conversation is a challenging problem. To address it, Google introduced Translatotron in 2019. Translatotron is a direct speech-to-speech translation system built on a single sequence-to-sequence model. Unlike traditional systems, it does not rely on an intermediate text representation. This offers several advantages: faster inference, no compounding of errors between recognition and translation, a straightforward way to retain the original speaker’s voice in the translated speech, and better handling of words that need not be translated.
That said, despite Translatotron’s ability to produce natural-sounding high-fidelity speech translations, the model underperformed compared to strong baseline cascade speech-to-speech translation systems.
To remedy this, Google released Translatotron 2 in July this year. The new version, which applies a new method of transferring the source speaker’s voice to the translated speech, is an improvement over the original. It outperforms Translatotron by a large margin in terms of translation quality and predicted speech naturalness. It has also made the output speech more robust by cutting down on babbling and long pauses.
The original Translatotron could potentially be misused to spoof audio with arbitrary content, much as with deepfake videos. Translatotron 2 addresses this risk by using just a single speech encoder, which is responsible for both linguistic understanding and voice capture. As a result, the trained models cannot reproduce non-source voices.
Google & Babel Fish
In “The Hitchhiker’s Guide to the Galaxy”, author Douglas Adams wrote about the Babel Fish, a small, yellow, leech-like creature that feeds on brainwave energy received from its surroundings. The practical upshot of the Babel Fish is that when you stick it in your ear, you can understand speech in any language.
Researchers have long been working to bring Babel Fish-like devices to fruition. Turing Award recipient Professor Raj Reddy said earlier this year that in ten years’ time, we will have a digital Babel Fish able to translate all the languages of the world. For the uninitiated, Prof Reddy is a pioneer in the field of speech recognition systems. His research has led to the development of several path-breaking innovations, including Apple Siri. Prof Reddy’s Babel Fish prediction was quickly panned by several critics, who called it ‘cockeyed techno-optimism’.
While we may have to wait ten more years to know whether Prof Reddy’s prediction holds true, efforts to get there are already underway.
As for Google in particular, in 2017 the tech giant announced a set of Bluetooth earbuds called the Pixel Buds. Their most remarkable feature is instant translation between 40 different languages when paired with a Pixel smartphone. Adam Champy, then a Google product manager, wrote in a company blog, “It’s like you’ve got your own personal translator with you everywhere you go. Say you’re in Little Italy, and you want to order your pasta like a pro. All you have to do is hold down on the right earbud and say, ‘Help me speak Italian’.”
Beyond these Bluetooth earbuds, speech-to-speech translation is also an important part of Google Translate. It is reasonable to assume that with Translatotron, Google wants to push the envelope further in this field too. The technology would greatly impact individuals and businesses that rely heavily on translation or voice synthesis. According to the company, Google Translate has improved by about 1 BLEU point per year since 2010, but automatic translation remains a major challenge. Even the most advanced models falter on different dialects of a language, produce very literal translations, and perform poorly on informal or spoken language.
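For readers unfamiliar with the metric, BLEU scores a machine translation by how many word sequences (n-grams) it shares with a human reference translation, on a scale often reported from 0 to 100. The Python sketch below is a deliberately simplified, single-sentence, single-reference version with no smoothing, so it should be read as an illustration of the idea rather than the way Google or standard toolkits compute reported scores; the function names and example sentences are made up for this illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU on a 0-100 scale (no smoothing)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it
        # appears in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty discourages candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))  # exact match
print(simple_bleu("the cat sat on a mat", reference))    # one word differs
```

An exact match scores the maximum, and changing a single word lowers the score because several overlapping n-grams stop matching at once. Published figures are normally computed over whole test sets rather than single sentences.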