Now, Transformers are being applied to keyword spotting
The Transformer architecture has been applied successfully across various domains – natural language processing, computer vision and time series analysis, among others. Researchers have now found yet another domain that could benefit from Transformers – keyword spotting.
Researchers from the Arm ML Research Lab and Lund University recently presented a paper titled ‘Keyword Transformer: A Self-Attention Model for Keyword Spotting’ at the InterSpeech conference. Moving away from conventional methods, the researchers explored a range of ways to adapt the Transformer architecture to keyword spotting and proposed the Keyword Transformer, a fully self-attentional architecture that offers state-of-the-art performance without pre-training or additional data.
Keyword spotting
The pipeline of a voice assistant (like Google Assistant or Siri) consists of several stages, the first of which is the trigger stage. In this stage, the assistant listens for a ‘trigger phrase’ or a simple spoken command such as play or pause. Detecting these keywords is far less compute-intensive than full automatic speech recognition (ASR), so it can be performed on-device with low latency. This on-device keyword spotting is also useful when no internet connection is available or when data privacy is a concern.
Credit: ARM
Keyword spotting is an important part of speech-based user interaction on smart devices, and it requires real-time response and high accuracy to offer a good user experience. Conventionally, machine learning techniques such as deep neural networks, convolutional neural networks and recurrent neural networks have been preferred for keyword spotting over older speech recognition algorithms because of their better performance.
Attention mechanism for keyword spotting
Attention mechanisms have been used in keyword spotting before, but only as an extension to the above-mentioned neural networks. With the Keyword Transformer, the researchers explored self-attention on its own for keyword spotting. The model proved to outperform existing approaches on the relatively small Google Speech Commands dataset without any additional data. They also found that applying self-attention in the time domain is more effective than applying it in the frequency domain.
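To make this distinction concrete, here is a minimal PyTorch sketch (our illustration, not the authors’ code) of the two tokenisation choices; the 98-frame × 40-coefficient shape is purely illustrative.

```python
import torch

# Illustrative MFCC features: 98 time frames x 40 coefficients per clip (assumed shape).
mfcc = torch.randn(1, 98, 40)  # (batch, time, frequency)

# Time-domain attention: each of the 98 time frames becomes a token,
# so self-attention relates different moments of the utterance to each other.
time_tokens = mfcc                  # 98 tokens of dimension 40

# Frequency-domain attention: each of the 40 MFCC bins becomes a token,
# so self-attention relates different frequency bands instead.
freq_tokens = mfcc.transpose(1, 2)  # 40 tokens of dimension 98

attn = torch.nn.MultiheadAttention(embed_dim=40, num_heads=4, batch_first=True)
out, _ = attn(time_tokens, time_tokens, time_tokens)  # the time-domain variant
print(out.shape)  # torch.Size([1, 98, 40])
```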
The researchers behind the Keyword Transformer say they were heavily inspired by the Vision Transformer, which computes self-attention between patches of an image. They applied the same idea to keyword spotting, treating patches of the audio spectrogram as input tokens, to see how the technique transfers to a new domain.
Credit: ARM
In the Keyword Transformer, the raw audio waveform is preprocessed by dividing the signal into a set of time slots and extracting Mel-frequency cepstral coefficients (MFCCs) for each slot. Each set of MFCCs is treated as an input token to the Transformer, and audio features are extracted based on how the different time slots interact with each other, which makes the features more descriptive than those of traditional neural networks. The Keyword Transformer outputs a global feature vector that is fed into a multi-layer perceptron (MLP), which classifies the audio into keywords or non-keywords.
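The sketch below shows how such a pipeline could be put together in PyTorch. It follows the description above (MFCC frames as tokens, a global feature vector, an MLP head), but the layer sizes, the torchaudio-based MFCC extraction and the class-token design are our own illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn
import torchaudio

class KeywordTransformerSketch(nn.Module):
    """Illustrative sketch of a KWT-style classifier, not the authors' implementation.

    Each MFCC time frame is linearly projected to an embedding, a learnable
    class token and positional embeddings are added, a Transformer encoder
    mixes information across time slots, and an MLP head maps the class
    token's output (the global feature vector) to keyword classes.
    """

    def __init__(self, n_mfcc=40, n_frames=98, dim=192, depth=12, heads=3, n_classes=12):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, dim)                     # one token per time frame
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable class token
        self.pos_emb = nn.Parameter(torch.zeros(1, n_frames + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, n_classes))

    def forward(self, mfcc):                                   # mfcc: (batch, time, n_mfcc)
        x = self.proj(mfcc)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_emb
        x = self.encoder(x)
        return self.head(x[:, 0])                              # classify from the class token


# Example: one second of 16 kHz audio -> MFCC frames -> keyword logits.
mfcc_fn = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40,
                                     melkwargs={"n_fft": 640, "hop_length": 160, "n_mels": 80})
wave = torch.randn(1, 16000)                # placeholder waveform
feats = mfcc_fn(wave).transpose(1, 2)       # (1, n_frames, 40)
model = KeywordTransformerSketch(n_frames=feats.size(1))
print(model(feats).shape)                   # torch.Size([1, 12])
```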
The researchers also note that the model could benefit from large-scale pre-training, that model compression can reduce latency by 5.5 times, and that sparsity combined with hardware co-design can cut energy consumption by over 4,000 times.
Audiomer
Despite outperforming its traditional counterparts by quite a margin, the Keyword Transformer suffers from a few fundamental drawbacks: parameter and sample inefficiency, an inability to scale to longer inputs because of the quadratic complexity of self-attention, and a fixed maximum audio length.
To overcome these challenges, a group of researchers developed Audiomer, which combines 1D residual networks with Performer attention. It achieves state-of-the-art performance in keyword spotting directly on raw audio waveforms, outperforming previous methods such as the Keyword Transformer while also being computationally cheaper and more parameter-efficient. Because it does not rely on positional encodings, it can also run inference on longer audio clips.
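As a rough illustration of the ingredients Audiomer combines, the sketch below pairs a plain 1D residual convolution block with a kernelised attention layer whose cost grows linearly with sequence length. It uses the simple elu(x)+1 feature map from the linear-attention literature rather than Performer’s FAVOR+ random features, and it is not the Audiomer authors’ implementation.

```python
import torch
import torch.nn as nn

class Residual1D(nn.Module):
    """A plain 1D residual convolution block (illustrative, not Audiomer's exact block)."""
    def __init__(self, channels, kernel_size=15):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):                 # x: (batch, channels, time)
        return self.act(x + self.body(x))

class LinearAttention(nn.Module):
    """Kernelised self-attention with cost linear in sequence length.

    Uses the elu(x)+1 feature map as a stand-in for Performer's FAVOR+ random
    features, so it only illustrates the linear-complexity idea.
    """
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (batch, time, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k = nn.functional.elu(q) + 1, nn.functional.elu(k) + 1
        kv = torch.einsum("btd,bte->bde", k, v)   # fixed-size summary, built in O(T)
        z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)
        return self.out(torch.einsum("btd,bde->bte", q, kv) * z)

# Usage: waveform-derived features (batch, channels=64, time=2000) through one block of each.
x = torch.randn(2, 64, 2000)
x = Residual1D(64)(x)
y = LinearAttention(64)(x.transpose(1, 2))   # attention over time tokens
print(y.shape)                               # torch.Size([2, 2000, 64])
```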
Credit: Audiomer
Transformers’ growing popularity
The machine learning world is moving on from traditional neural networks to Transformers, making the latter the next big thing. Transformers are already the go-to choice for advanced natural language processing and computer vision tasks, and with time and innovation new areas of application keep being discovered. What works in favour of the Transformer is the self-attention mechanism, in which features are computed dynamically by attending different parts of the input to each other.




