Microsoft launches Pengi, an Audio Language Model for Open-ended Tasks
Transfer Learning has been instrumental in advancing audio processing and enabling Self-Supervised Learning and Zero-Shot Learning techniques. However, current models lack the ability to generate language for open-ended tasks such as Audio Captioning or Audio Question Answering. In response to this limitation, researchers from Microsoft have launched Pengi, a groundbreaking Audio Language Model that uses Transfer Learning to reframe all audio tasks as text-generation tasks. By taking both audio and text as input, Pengi generates free-form text as output without the need for additional fine-tuning. Extensive evaluation across 22 downstream tasks shows Pengi's state-of-the-art performance, underscoring the progress that integrating language models with audio models brings to general-purpose audio understanding.
The audio language model harnesses transfer learning by treating all audio-related tasks as text-generation tasks. It takes an audio recording and associated text as input and produces free-form text as output. Pengi's unified architecture handles both open-ended and close-ended tasks without requiring extra fine-tuning or task-specific extensions.
During training, Pengi is exposed to a vast dataset of audio-text pairs, spanning human speech, music, and other sounds, each paired with corresponding text. The audio recordings are processed by an audio encoder, which converts them into a sequence of continuous embeddings. In parallel, the accompanying text is processed by a text encoder into its own sequence of continuous embeddings. The two sequences are concatenated and used as a prefix to prompt a pre-trained, frozen language model, which then generates tokens autoregressively, conditioned on the audio and text input.
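This prefix-conditioning idea can be illustrated with a short, hypothetical PyTorch sketch. The toy encoders, dimensions, and the small stand-in for a frozen language model below are illustrative assumptions rather than Microsoft's Pengi implementation; the point is only to show how audio and text embeddings are concatenated into a prefix that a frozen language model completes autoregressively.

# Hypothetical sketch of audio+text prefix conditioning (not Pengi's actual code).
import torch
import torch.nn as nn

D_MODEL, VOCAB = 256, 1000

# Toy "audio encoder": maps frame-level audio features to a sequence of embeddings.
audio_encoder = nn.Sequential(nn.Linear(64, D_MODEL), nn.GELU())
# Toy "text encoder": maps prompt token ids to a sequence of embeddings.
text_encoder = nn.Embedding(VOCAB, D_MODEL)

# Toy causal language model standing in for a pre-trained, frozen LM.
lm_body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=2
)
lm_head = nn.Linear(D_MODEL, VOCAB)
for p in list(lm_body.parameters()) + list(lm_head.parameters()):
    p.requires_grad = False  # the language model stays frozen during training

def generate(audio_feats, prompt_ids, max_new_tokens=8):
    """Autoregressively generate token ids conditioned on an audio+text prefix."""
    audio_prefix = audio_encoder(audio_feats)            # (1, T_audio, D)
    text_prefix = text_encoder(prompt_ids)               # (1, T_text, D)
    seq = torch.cat([audio_prefix, text_prefix], dim=1)  # combined prefix
    out_ids = []
    for _ in range(max_new_tokens):
        T = seq.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = lm_body(seq, mask=causal)                    # causal self-attention
        next_id = lm_head(h[:, -1]).argmax(-1)           # greedy decoding
        out_ids.append(next_id.item())
        seq = torch.cat([seq, text_encoder(next_id).unsqueeze(1)], dim=1)
    return out_ids

# Example: 100 frames of 64-dim audio features plus a short text prompt.
print(generate(torch.randn(1, 100, 64), torch.tensor([[1, 2, 3]])))

In the real system the text prompt doubles as the task instruction, so the same frozen model can be steered toward captioning, question answering, or other outputs simply by changing the text half of the prefix.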
To assess Pengi’s capabilities, evaluations are conducted across 22 downstream tasks, encompassing audio captioning, audio question answering, and audio event detection, among others. Pengi achieves state-of-the-art performance across several of these tasks, affirming its efficacy as a potent audio language model applicable to a wide range of tasks.
Examples of Pengi’s functionalities include generating captions for audio recordings, answering questions related to audio recordings, detecting events within audio recordings, translating audio recordings into text, summarising audio recordings, and generating creative text formats such as poems, code, scripts, musical pieces, emails, and letters.
Although still in development, Pengi has the potential to revolutionise audio interaction. With Pengi, natural conversations with devices become feasible, enabling audio-related capabilities that were previously unattainable.