Complete Guide to DeLighT: Deep and Light-weight Transformer
The Transformer and its numerous variants achieve excellent performance today across machine learning applications, including sequence-to-sequence modeling, language modeling and computer vision tasks. The baseline Transformer remains one of the most common choices for language modeling. Most transformer architectures are built from a basic transformer block in both the encoder and the decoder, and each block employs several layers of multi-head attention. One of the major differences between the variants and the baseline Transformer is the number of multi-head attention layers they incorporate. To improve performance, models are scaled wider by increasing the number of units in the hidden layers, or deeper by stacking more transformer blocks. As the number of layers or units grows, so does the number of parameters in the model.
Large-scale transformer variants perform well on their tasks, but they are data-hungry and need careful regularization during training, and developers struggle with the issues caused by the very large number of parameters in their transformer-based models. For instance, the Text-To-Text Transfer Transformer (T5) is a wide variant with a dimension of about 65,000 and 11 billion parameters, while Generative Pre-trained Transformer 3 (GPT-3) is a deep variant with 96 transformer blocks and 175 billion parameters.
Here comes the need for a different architectural approach that retains the essence of the transformer architecture but uses relatively few parameters, saving memory and training time and reducing data requirements. Sachin Mehta and Luke Zettlemoyer of the University of Washington, Marjan Ghazvininejad and Srinivasan Iyer of Facebook AI Research, and Hannaneh Hajishirzi of the Allen Institute for AI introduced a Deep and Light-weight Transformer named DeLighT that allocates parameters more efficiently among the transformer blocks and layers. The approach can be applied to any transformer variant to make it parameter-efficient without degrading performance.
How does DeLighT work?
The Deep and Light-weight Transformer architecture introduces the DeLighT transformation, which builds on Group Linear Transformations (GLTs). It follows an expand-reduce principle to scale the transformer block in width and depth while distributing parameters efficiently. However, GLTs are local in nature, which makes them ill-suited on their own for attention-based blocks that capture global context. DeLighT therefore uses feature shuffling, similar to channel shuffling in convolutional neural networks, to share information among groups and capture global context.
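To make the idea concrete, below is a minimal PyTorch sketch of a group linear transformation with feature shuffling, composed in an expand-reduce fashion. The class name, group count and dimensions are illustrative assumptions, not the exact layers from the official DeLighT codebase.

import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    # Split the features into groups, apply a small linear layer per group,
    # then shuffle features across groups so later layers can mix information
    # globally (analogous to channel shuffling in CNNs).
    def __init__(self, in_features, out_features, groups=4, shuffle=True):
        super().__init__()
        assert in_features % groups == 0 and out_features % groups == 0
        self.groups, self.shuffle = groups, shuffle
        self.layers = nn.ModuleList(
            [nn.Linear(in_features // groups, out_features // groups) for _ in range(groups)]
        )

    def forward(self, x):                          # x: (batch, in_features)
        chunks = x.chunk(self.groups, dim=-1)      # one chunk per group
        out = torch.cat([layer(c) for layer, c in zip(self.layers, chunks)], dim=-1)
        if self.shuffle:                           # feature shuffle across groups
            b, d = out.shape
            out = out.view(b, self.groups, d // self.groups).transpose(1, 2).reshape(b, d)
        return out

# Expand-reduce: widen the representation, then project it back down.
expand = GroupLinear(128, 256, groups=4)
reduce_ = GroupLinear(256, 128, groups=4)
x = torch.randn(8, 128)
y = reduce_(expand(x))                             # shape: (8, 128)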
These wide and deep representations enable the DeLighT architecture to replace multi-head attention with single-head attention and standard feed-forward layers with light-weight feed-forward layers. The DeLighT blocks near the input are narrow and shallow, whereas the blocks near the output are wide and deep, allowing the architecture to distribute a small parameter budget very efficiently.
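As a rough illustration of this block-wise scaling, the toy function below interpolates the per-block depth linearly from a minimum near the input to a maximum near the output. The function name and the bounds are hypothetical and are not the paper's exact schedule.

# Hypothetical block-wise scaling: shallow blocks near the input,
# deeper blocks near the output (linear interpolation between bounds).
def block_depths(num_blocks=8, min_depth=4, max_depth=8):
    return [
        round(min_depth + (max_depth - min_depth) * b / max(num_blocks - 1, 1))
        for b in range(num_blocks)
    ]

print(block_depths())  # [4, 5, 5, 6, 6, 7, 7, 8]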
Performance of DeLighT
DeLighT was trained and evaluated on neural machine translation and language modelling tasks, using several WMT’14 and WMT’16 datasets and the WikiText-103 dataset.
DeLighT outperforms the baseline Transformer with 2.8 times fewer parameters on the WMT’16 En-Ro machine translation task, and on the WMT’14 En-Fr task it improves the BLEU score by 0.4 with 1.8 times fewer parameters.
DeLighT matches Transformer-XL’s performance in language modeling with 1.5 times fewer parameters.
Python Implementation
DeLighT requires PyTorch 1.4.0+, Python 3.6+, an NVIDIA GPU, NVIDIA NCCL and the fairseq toolkit. The following command downloads the source code from the official repository.
!git clone https://github.com/sacmehta/delight
Install dependencies with the following commands.
%%bash
cd delight
pip install --editable ./
The NVIDIA Apex library speeds up training. The following commands download its source code and install it.
%%bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
    --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
    --global-option="--fast_multihead_attn" ./
DeLighT in Neural Machine Translation
WMT’14 En-De Translation task
Download and preprocess the data using the following command.
!bash prepare_nmt_dataset.sh wmt14_en_de
Train the model using the following command. Note that training may require at least eight V100 GPUs with 32GB of memory each.
!python nmt_wmt14_en2de.py --d-m 128
The following code evaluates the model and computes its BLEU score.
%%bash
# evaluate the model and score it with BLEU
GEN_RES_FILE=gen_out.out
python generate.py data-bin/wmt14_en_de/ --path <results_dir>/checkpoint_best.pt \
    --beam 5 --lenpen 0.4 --remove-bpe --batch-size 128 > $GEN_RES_FILE
bash scripts/compound_split_bleu.sh $GEN_RES_FILE
WMT’14 En-Fr Translation task
As with the English-German task, the following commands download and preprocess the data, train the model and evaluate it on the WMT’14 English-French translation task.
# download and preprocess the data
!bash prepare_nmt_dataset.sh wmt14_en_fr

# train the model on a single node of 8 V100 GPUs, each with 32GB of memory
!python nmt_wmt14_en2fr.py --d-m 128

# evaluate the model and compare with the gold-standard BLEU score
!python generate.py data-bin/wmt14_en_fr/ --path <results_dir>/checkpoint_best.pt --beam 5 --lenpen 0.9 --remove-bpe --batch-size 128 --quiet
DeLighT in Language Modelling
DeLighT was also trained and evaluated on the widely used WikiText-103 benchmark. The following commands download the dataset as a zipped file to the local machine or cloud environment.
%%bash
# download the dataset
cd delight/examples/language_model/
bash prepare-wikitext-103.sh
cd ../..
The following commands preprocess the extracted data using the fairseq toolkit.
%%bash
TEXT=examples/language_model/wikitext-103
# preprocess with fairseq
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20
Train the model on a single node with at least eight V100 GPUs, each with 32GB of memory, using the following command.
!python lm_wikitext_103.py --d-m 128
Evaluate the language model on the test set and log the results to a file using the following command.
!python eval_lm.py data-bin/wikitext-103 --path <checkpoint_dir>/checkpoint_best.pt --max-sentences 2 --tokens-per-sample 512 --context-window 400 --gen-subset test --res-file eval_logs.txt
Wrapping up
The DeLighT Transformer matches or outperforms state-of-the-art models in neural machine translation and language modeling while employing far fewer parameters. Reducing the parameter count lets models train with less data, less memory and less time. So far, its developers have applied the DeLighT transformation only to machine translation and language modeling tasks; future work may use the strategy to build parameter-efficient computer vision models.




