Hands-on Guide to Reformer – The Efficient Transformer
Ever since the Transformer came into the picture, there has been a surge of work on developing efficient sequence models. Dependency on the surrounding context plays a key role in these models. Because the context window used by Transformers can encompass thousands of words, their applications extend beyond language to music notes and images.
However, expanding the Transformer's context window much further quickly becomes impractical. The power of the Transformer lies in attention, but attention scales quadratically with sequence length: for a text of 100K words, it would need to compute a 100K x 100K attention matrix at each model layer, and on top of that, these results have to be saved for every individual layer, which is quite unrealistic. Given the enormous computational requirements, such large Transformer models can realistically be trained only in very large industrial research laboratories, or else be applied to just a few paragraphs of text.
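To make the scale concrete, here is a quick back-of-the-envelope calculation (a sketch assuming 32-bit floats, batch size 1 and a single attention head; the 12-layer count is only an illustrative assumption):

```python
# Rough memory estimate for full self-attention over a 100K-token text
# (assumptions: float32, batch size 1, a single attention head)
seq_len = 100_000
bytes_per_float = 4
attention_matrix_bytes = seq_len * seq_len * bytes_per_float
print(f"One attention matrix: {attention_matrix_bytes / 1e9:.0f} GB")  # ~40 GB

num_layers = 12  # illustrative layer count
print(f"Stored across {num_layers} layers: "
      f"{num_layers * attention_matrix_bytes / 1e9:.0f} GB")           # ~480 GB
```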
Through this write-up, we are going to understand an upgrade of the Transformer model called Reformer, which can handle context windows of up to about 1 million words with efficient memory usage. In Reformer, each part of the standard Transformer architecture is re-engineered to minimize memory requirements without a significant drop in performance. “Reformer: The Efficient Transformer” was developed by Nikita Kitaev, Łukasz Kaiser and Anselm Levskaya of the Google Research team and presented as a conference paper at ICLR 2020.
Techniques introduced by Reformer to improve the efficiency of Transformers:
- Locality-Sensitive Hashing Attention: The dot-product attention is replaced by a locality-sensitive hashing technique, which changes its complexity from O(L²) to O(L log L), where L is the length of the sequence. Locality-sensitive hashing is a family of methods that map high-dimensional vectors to a set of discrete values (buckets/clusters). These methods try to assign vectors that are close in the high-dimensional space to the same hash bucket with high probability (a minimal sketch of this bucketing idea appears after this list).
- Chunked Feed Forward Layers: Chunking is a technique that trades lower memory consumption for increased computation time. For more details, see the Reformer Huggingface blog listed in the resources below.
- Reversible Residual Layers: This technique is based on RevNets (reversible residual networks) and allows storing activations only once during training instead of N times, where N is the number of layers. For more details, see the Reformer Huggingface blog listed in the resources below.
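The hashing idea behind LSH attention can be illustrated with a few lines of PyTorch. This is only a minimal sketch of the random-rotation (angular) hashing scheme, not the actual Reformer implementation, which additionally shares the query and key projections and sorts and chunks the buckets before attending:

```python
import torch

def lsh_buckets(vectors: torch.Tensor, num_buckets: int, seed: int = 0) -> torch.Tensor:
    """Hash vectors into buckets via random rotations (angular LSH):
    nearby vectors land in the same bucket with high probability."""
    torch.manual_seed(seed)
    dim = vectors.shape[-1]
    # project onto num_buckets // 2 random directions
    random_rotations = torch.randn(dim, num_buckets // 2)
    rotated = vectors @ random_rotations              # (n, num_buckets // 2)
    rotated = torch.cat([rotated, -rotated], dim=-1)  # (n, num_buckets)
    # the index of the largest projection is the bucket id
    return torch.argmax(rotated, dim=-1)

# two similar vectors and one dissimilar vector
x = torch.tensor([[1.0, 0.1], [0.9, 0.2], [-1.0, -0.3]])
print(lsh_buckets(x, num_buckets=4))
# the first two vectors usually share a bucket; the third usually does not
```

Because attention is then computed only within each bucket (and a limited number of neighbouring chunks), the quadratic cost over the full sequence is avoided.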
Long Sequence Modeling with Reformer
A step-by-step tutorial on how we can use Reformer with the transformers library:
- Check the availability of GPU.
```python
#@title Check available memory of GPU
# Check that we are using 100% of GPU
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip -q install gputil
!pip -q install psutil
!pip -q install humanize

import psutil
import humanize
import os
import GPUtil as GPU

GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn't guaranteed
gpu = GPUs[0]

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(
        gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal))

printm()
```
- If the GPU utilization is not 0%, uncomment the code below, run it, and then check GPU availability again; once the utilization is 0%, proceed further.
```python
#!kill -9 -1
```
- Install all the required libraries: transformers, nlp and pyarrow.
```python
# install the required libraries
# install nlp
!pip install -qq nlp==0.2.0
# install pyarrow
!pip install pyarrow
# install transformers
!pip install -qq transformers==2.10.0
```
Make sure these libraries are installed at the versions specified above; otherwise, the notebook may crash.
- Import the required libraries and methods.
```python
# imports
from transformers import (
    ReformerModelWithLMHead,
    ReformerTokenizer,
    ReformerConfig,
    Trainer,
    DataCollator,
    TrainingArguments,
)
import nlp
import torch
import numpy as np  # used later in compute_metrics
```
- Download the Crime and Punishment book through the nlp library and check how the dataset looks.
```python
# load the dataset
dataset = nlp.load_dataset("crime_and_punish", split="train")
dataset
```
- Download the pre-trained Reformer tokenizer for the crime_and_punish dataset.
```python
# get a pretrained tokenizer
tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
```
- Pack all the data into a single sample using the map function, pad the sample to a length of 524288 (2^19) tokens and create eight copies of the same sample to form the training set.
```python
sequence_length = 2 ** 19  # 524288

# define our map function to reduce the dataset to one sample
def flatten_and_tokenize(batch):
    all_input_text = ["".join(batch["line"])]
    input_ids_dict = tokenizer.batch_encode_plus(
        all_input_text, pad_to_max_length=True, max_length=sequence_length
    )

    # duplicate data 8 times to have 8 examples in dataset
    for key in input_ids_dict.keys():
        input_ids_dict[key] = [8 * [x] for x in input_ids_dict[key]][0]

    return input_ids_dict

# reduce the dataset and set batch_size to all inputs
dataset = dataset.map(
    flatten_and_tokenize, batched=True, batch_size=-1, remove_columns=["line"]
)

# prepare dataset to be in torch format
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
```
Check the dataset.
- We don’t want the model to memorize the dataset during training; we want it to find patterns and generalize. In particular, since we fit a language model on a single novel here, we don’t want the model to learn the dataset by encoding the words in its position embeddings. Hence, at each training iteration we will randomly select how much padding to put before versus after the text.
```python
class ReformerCollator(DataCollator):
    def __init__(self, max_roll_length):
        self.max_roll_length = max_roll_length

    # From the official notebook: "Normally we would have a dataset with many
    # examples, but for this demonstration, we fit a language model on the single
    # novel only. We don't want the model to just memorize the dataset by encoding
    # the words in its position embeddings, so at each training iteration we will
    # randomly select how much padding to put before the text vs. after it"
    def collate_batch(self, features):
        # get random shift int
        random_shift_length = torch.randint(self.max_roll_length, (1,)).item()

        # shift input and mask
        rolled_input_ids = torch.roll(
            features[0]["input_ids"], random_shift_length
        ).unsqueeze(0)
        rolled_attention_mask = torch.roll(
            features[0]["attention_mask"], random_shift_length
        ).unsqueeze(0)

        # return dict having the correct argument naming
        return {
            "input_ids": rolled_input_ids,            # BS x SEQ_LEN
            "labels": rolled_input_ids,               # BS x SEQ_LEN
            "attention_mask": rolled_attention_mask,  # BS x SEQ_LEN
        }
```
- With the Trainer framework of transformers, we can implement this with a Reformer-specific DataCollator that randomly shifts the input_ids to the right and sets the labels accordingly.
To instantiate the data collator, the amount of padding in the input_ids (which defines the maximum shift) needs to be calculated.
```python
# the non_padded_sequence_length defines the max shift for our data collator
non_padded_sequence_length = sequence_length - sum(dataset["attention_mask"][0])

# get the data collator
data_collator = ReformerCollator(non_padded_sequence_length)
```
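To see what the random shift in the collator does, here is a toy illustration (tiny made-up tensors rather than the real 524288-token sample): torch.roll moves some of the trailing padding to the front, so the text starts at a different position on every training iteration.

```python
import torch

# toy example: 6 "real" tokens followed by 4 padding tokens (id 0)
input_ids = torch.tensor([5, 6, 7, 8, 9, 10, 0, 0, 0, 0])
max_roll_length = 4  # amount of padding, analogous to non_padded_sequence_length

shift = torch.randint(max_roll_length, (1,)).item()
print(shift, torch.roll(input_ids, shift))
# e.g. shift=2 -> tensor([ 0,  0,  5,  6,  7,  8,  9, 10,  0,  0])
```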
- Next, we will define our Reformer model via the ReformerConfig. As can be seen, we alternate between local attention layers and LSH attention layers for a total of 6 layers. Note that we factorize the num_buckets and use axial position embeddings.
```python
config = {
    "attention_head_size": 64,
    "attn_layers": ["local", "lsh", "local", "lsh", "local", "lsh"],
    "axial_pos_embds": True,
    "sinusoidal_pos_embds": False,
    "axial_pos_embds_dim": [64, 192],
    "axial_pos_shape": [512, 1024],
    "lsh_attn_chunk_length": 64,
    "local_attn_chunk_length": 64,
    "feed_forward_size": 512,
    "hidden_act": "relu",
    "hidden_size": 256,
    "is_decoder": True,
    "max_position_embeddings": 524288,
    "num_attention_heads": 2,
    "num_buckets": [64, 128],
    "num_hashes": 1,
    "vocab_size": 320,
    "lsh_attention_probs_dropout_prob": 0.0,
    "lsh_num_chunks_before": 1,
    "lsh_num_chunks_after": 0,
    "local_num_chunks_before": 1,
    "local_num_chunks_after": 0,
    "local_attention_probs_dropout_prob": 0.025,
    "hidden_dropout_prob": 0.025,
}

config = ReformerConfig(**config)
model = ReformerModelWithLMHead(config)
model = model.train()
```
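The axial position embedding settings in this config are what make a 524288-token context affordable: the sequence is treated as a 512 x 1024 grid and two small embedding tables (of dimensions 64 and 192, which concatenate to the hidden size of 256) are combined, instead of learning one 256-dimensional vector per position. A rough parameter count based only on the values above:

```python
# Position-embedding parameters: standard learned embeddings vs. the axial
# factorization defined in the config above
max_position_embeddings = 524_288        # 512 * 1024
hidden_size = 256                        # 64 + 192
axial_pos_shape = (512, 1024)
axial_pos_embds_dim = (64, 192)

standard_params = max_position_embeddings * hidden_size
axial_params = (axial_pos_shape[0] * axial_pos_embds_dim[0]
                + axial_pos_shape[1] * axial_pos_embds_dim[1])

print(f"standard: {standard_params:,} parameters")  # 134,217,728
print(f"axial:    {axial_params:,} parameters")     # 229,376
```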
- Next, let’s set up the training arguments.
Note: These training arguments are not fully tuned and can be adjusted for better results.
```python
# define the training args
training_args = {
    "learning_rate": 1e-3,
    "max_steps": 2000,
    "do_train": True,
    "evaluate_during_training": True,
    "gradient_accumulation_steps": 8,
    "logging_steps": 50,
    "warmup_steps": 500,
    "weight_decay": 0.001,
    "fp16": True,
    "per_gpu_train_batch_size": 1,
    "per_gpu_eval_batch_size": 1,
    "save_steps": 50,
    "output_dir": "./",
}

training_args = TrainingArguments(**training_args)
```
- We define a simple “accuracy” metric to keep track of how many tokens are correctly predicted.
```python
def compute_metrics(pred):
    non_padded_indices = (pred.label_ids != -100)

    # correctly shift labels and pred as it's done in forward()
    labels = pred.label_ids[..., 1:][non_padded_indices[..., 1:]]
    pred = np.argmax(pred.predictions[:, :-1], axis=-1)[non_padded_indices[..., :-1]]

    acc = np.mean(np.asarray(pred == labels), dtype=np.float)
    return {"accuracy": acc}
```
- To enable training with mixed precision, we need to download Nvidia’s apex package and build its CUDA extensions. The following commands do all of that, but might take a while to finish.
```python
import os, sys, shutil
import time
import gc
from contextlib import contextmanager
from pathlib import Path
import random
import numpy as np, pandas as pd
from tqdm import tqdm, tqdm_notebook

@contextmanager
def timer(name):
    t0 = time.time()
    yield
    print(f'[{name}] done in {time.time() - t0:.0f} s')

USE_APEX = True
if USE_APEX:
    with timer('install Nvidia apex'):
        # Installing Nvidia Apex
        os.system('git clone https://github.com/NVIDIA/apex; cd apex; pip install -v --no-cache-dir'
                  + ' --global-option="--cpp_ext" --global-option="--cuda_ext" ./')
        os.system('rm -rf apex/.git')  # too many files, Kaggle fails
    from apex import amp
```
- Finally, we can start training. Since Google Colab only gives us a single GPU, this might take quite some time.
```python
# create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    train_dataset=dataset,
    eval_dataset=dataset,
    prediction_loss_only=True,
)

# train
trainer.train()
```
- Let’s see what the model learned to generate.
```python
print(tokenizer.decode(
    model.generate(
        tokenizer.encode("Later that day, he", return_tensors="pt").to(model.device)
    )[0]
))
```
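The call above uses the default greedy decoding. generate also accepts sampling parameters such as do_sample, max_length and temperature for more varied output; the values below are purely illustrative and not taken from the original tutorial:

```python
# Sampled generation with illustrative (untuned) settings
input_ids = tokenizer.encode("Later that day, he", return_tensors="pt").to(model.device)
output_ids = model.generate(
    input_ids,
    do_sample=True,    # sample from the distribution instead of greedy decoding
    max_length=200,    # total length of the generated sequence
    temperature=0.7,   # lower values make the output less random
)
print(tokenizer.decode(output_ids[0]))
```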
Conclusion:
Through this discussion of Reformer, we can conclude that further advances in efficient, long-context Transformer models can lead to new state-of-the-art results across different domains. The walkthrough above covers only one aspect of Reformer. In addition to generating very long text, Reformer can also be applied to many other generative tasks, such as time-series forecasting and music, image and video generation.
Tutorials and other resources used above:
- Reformer Research Paper
- Reformer Official Github
- Reformer Huggingface
- Reformer Huggingface Blog
- Official Tutorials
- Colab Notebook