Apple Introduces Denoising LM for Correcting Errors in ASR Systems

Apple has introduced the Denoising Language Model (DLM), a scaled error correction model trained with extensive synthetic data that surpasses previous methods and achieves state-of-the-art (SOTA) automatic speech recognition (ASR) performance.

The process involves using text-to-speech (TTS) systems to create audio that is fed into an ASR system, generating noisy hypotheses. These hypotheses are paired with the original texts to train the DLM. 
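To make the data generation step concrete, here is a minimal sketch of that pipeline under stated assumptions: `tts` and `asr` are placeholder callables standing in for a TTS system and an ASR system, not specific Apple components, and `build_dlm_pairs` is a hypothetical helper name.

```python
# Hedged sketch of the synthetic-data pipeline described above: synthesize audio for each
# training sentence with TTS, run it through the ASR system, and keep the resulting
# (noisy hypothesis, clean text) pair for DLM training.
from typing import Callable, Iterable, List, Tuple

import numpy as np


def build_dlm_pairs(
    sentences: Iterable[str],
    tts: Callable[[str], np.ndarray],   # text -> synthesized waveform (placeholder)
    asr: Callable[[np.ndarray], str],   # waveform -> hypothesis transcript (placeholder)
) -> List[Tuple[str, str]]:
    """Generate (noisy ASR hypothesis, clean reference) pairs for DLM training."""
    pairs = []
    for text in sentences:
        audio = tts(text)          # synthesize speech for the clean sentence
        hypothesis = asr(audio)    # the ASR output contains realistic recognition errors
        pairs.append((hypothesis, text))
    return pairs
```

In the paper's setup, multiple zero-shot multi-speaker TTS systems play the role of `tts`, so each sentence can be rendered in several voices.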

The approach includes several key elements: up-scaled model and data sizes, multi-speaker TTS systems, a variety of noise augmentation strategies, and new decoding techniques. Using a Transformer-CTC ASR, the DLM achieves a 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on LibriSpeech, the best reported numbers without external audio data, matching results from self-supervised methods that do use external data.

A single DLM can be applied to different ASRs, significantly outperforming conventional LM-based beam-search rescoring. These results suggest that well-designed error correction models can potentially replace traditional LMs, leading to a new level of accuracy in ASR systems.

A significant challenge for error correction models is the need for a large number of supervised training examples, which are scarce in typical ASR datasets. DLM addresses this by using TTS systems to generate synthetic audio, which is then fed into an ASR system to produce hypotheses paired with the original text, forming the training dataset. This makes it possible to scale up the training data simply by drawing on a larger text corpus.
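Once such (hypothesis, reference) pairs exist, the DLM itself can be trained as a sequence-to-sequence denoiser. The toy character-level encoder-decoder below is only an illustrative stand-in for the scaled Transformer described by Apple; the vocabulary, dimensions, and hyperparameters are assumptions made for the sketch, not values from the paper.

```python
# Minimal sketch of one DLM training step on a (noisy hypothesis, clean text) pair.
import torch
import torch.nn as nn

PAD, VOCAB = 0, 128  # character-level vocabulary for the sketch; the paper's text units differ


def encode(text: str, max_len: int = 64) -> torch.Tensor:
    """Map a string to a fixed-length tensor of character ids (0 is padding)."""
    ids = [min(ord(c), VOCAB - 1) for c in text[:max_len]]
    return torch.tensor(ids + [PAD] * (max_len - len(ids)))


class TinyDLM(nn.Module):
    """Illustrative encoder-decoder that maps a noisy hypothesis to a clean transcript."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.out = nn.Linear(d_model, VOCAB)

    def forward(self, noisy: torch.Tensor, dec_in: torch.Tensor) -> torch.Tensor:
        # Causal mask so the decoder only attends to previous target positions.
        mask = self.transformer.generate_square_subsequent_mask(dec_in.size(1))
        hidden = self.transformer(self.embed(noisy), self.embed(dec_in), tgt_mask=mask)
        return self.out(hidden)


model = TinyDLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

# One toy (ASR hypothesis, reference transcript) pair from the synthetic pipeline.
noisy = torch.stack([encode("the cat sad on the mat")])
clean = torch.stack([encode("the cat sat on the mat")])

dec_in, target = clean[:, :-1], clean[:, 1:]          # shift targets for next-token prediction
loss = loss_fn(model(noisy, dec_in).reshape(-1, VOCAB), target.reshape(-1))
loss.backward()
optimizer.step()
```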

Main Contributions

  • State-of-the-Art ASR Error Correction: DLM achieves a word error rate (WER) of 1.4% on dev-clean and 3.1% on dev-other, and 1.5% on test-clean and 3.3% on test-other on LibriSpeech without using any external audio data.
  • Key Ingredients of DLM:
    • Multiple zero-shot, multi-speaker TTS systems to generate audio in different styles.
    • Mixing real and synthetic data during training to maintain grounding.
    • Combining multiple noise augmentation strategies, such as frequency masking of spectral features and random substitution of ASR hypothesis characters (see the sketch after this list).
  • Universal, Scalable, and Efficient:
    • The same DLM can be applied to different ASR systems and datasets.
    • Performance improves with more TTS speakers, larger models, and larger training text corpora.
    • Can match traditional neural LM results using simple greedy decoding without the need for heavy beam search decoding and rescoring.
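The two augmentation ideas above can be illustrated with a short sketch. The masking width, substitution rate, and alphabet below are made-up defaults for illustration, not values reported in the paper: frequency masking perturbs the spectral features fed to the ASR system, while character substitution perturbs the hypothesis text fed to the DLM.

```python
# Illustrative sketch of the two noise-augmentation strategies listed above.
import random

import numpy as np


def frequency_mask(spectrogram: np.ndarray, max_width: int = 8) -> np.ndarray:
    """Zero out a random band of frequency bins in a (freq, time) spectrogram."""
    spec = spectrogram.copy()
    width = random.randint(1, max_width)
    start = random.randint(0, max(0, spec.shape[0] - width))
    spec[start:start + width, :] = 0.0
    return spec


def substitute_chars(hypothesis: str, rate: float = 0.05,
                     alphabet: str = "abcdefghijklmnopqrstuvwxyz ") -> str:
    """Randomly replace a fraction of characters in an ASR hypothesis."""
    return "".join(
        random.choice(alphabet) if random.random() < rate else ch
        for ch in hypothesis
    )


print(substitute_chars("the cat sat on the mat"))
```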

Three aspects of DLM scalability were tested: number of speakers, model size, and training dataset size. Results show that increasing the model size decreases WER, DLMs outperform neural LMs of the same size, and more training data leads to further gains.

DLMs trained on synthetic audio perform comparably to those trained on real audio, with only small differences in WER, demonstrating that high-quality TTS is not necessary for effective error correction.
