Meta Releases Voicebox, Multilingual Speech Generation Model
Meta has been killing it in the generative AI field lately. In its latest announcement, Meta AI unveiled Voicebox, a multilingual generative AI model for speech that can perform a range of speech generation tasks through in-context learning, including tasks it was not explicitly trained for.
Check out the paper here.
Amid concerns around deepfakes, the company says Voicebox is capable enough that it has decided not to release it to the public yet.
Voicebox represents a significant leap forward, as it can generate high-quality audio clips and seamlessly edit pre-recorded audio. Notably, it can remove unwanted background noise, such as car horns or barking dogs, while preserving the original content and style of the audio.
Additionally, this versatile AI model is capable of producing speech in six different languages, making it a valuable tool for multilingual applications.
Meta says that this could lend natural-sounding voices to virtual assistants and non-player characters in the emerging metaverse. Visually impaired individuals could also benefit from AI-generated voices that read written messages aloud in voices familiar to them, while content creators would gain new and efficient tools for creating and editing audio tracks for videos, among other possibilities.
Meta highlights four capabilities:
- In-context text-to-speech synthesis: Given an audio sample as short as two seconds, Voicebox can match the sample's audio style and use it for text-to-speech generation.
- Speech editing and noise reduction: Voicebox can seamlessly recreate speech segments interrupted by external noise, or replace misspoken words, without requiring a complete re-recording. It works like an “eraser” for audio editing: users identify the interrupted portion of speech, such as when a dog barks during a recording, and regenerate just that portion.
- Cross-lingual style transfer: Armed with a speech sample and a text passage in English, French, German, Spanish, Polish, or Portuguese, Voicebox can deliver a reading of the text in any of these languages, even if the sample speech and the text are in different languages. This breakthrough feature paves the way for natural and authentic communication between individuals who speak different languages.
- Diverse speech sampling: Voicebox has been trained on a wide range of data, resulting in the generation of speech that more accurately reflects how people naturally communicate in the real world and across the six supported languages.
Meta has also been working to preserve the world's languages. To that end, the company earlier released its Massively Multilingual Speech (MMS) AI research models, which can identify more than 4,000 spoken languages and provide text-to-speech and speech-to-text for more than 1,100 of them.
This comes after Meta released MusicGen, a transformer-based model that generates music from text inputs and outperforms Riffusion, Mousai, and MusicLM on various performance metrics.
Meta AI also recently announced I-JEPA, a self-supervised computer vision model that learns about the world through prediction and is based on Yann LeCun’s vision of autonomous machine intelligence.
The post Meta Releases Voicebox, Multilingual Speech Generation Model appeared first on Analytics India Magazine.


