
Earlier this month, Google released the Generalist Language Model (GLaM), a trillion-weight model that achieves competitive performance on multiple few-shot learning tasks. The full version of GLaM has 1.2 trillion parameters spread across 64 experts per MoE (mixture-of-experts) layer, with 32 MoE layers in all. However, it activates just 97 billion parameters (8 per cent of 1.2 trillion) per token prediction during inference. GLaM is a sparse language model, meaning it activates only part of the architecture for a given task; in this case, only two experts are used for a given input token. This gives the model more capacity while limiting the computation.
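
As a back-of-the-envelope check on those figures, the sketch below (plain Python) shows why routing each token to 2 of 64 experts still leaves roughly 8 per cent of all parameters active: the expert layers are mostly idle, but the attention, embedding and gating weights always run. Only the 1.2 trillion total, the 97 billion active figure and the 2-of-64 routing come from the announcement; everything else is arithmetic.

```python
# Back-of-the-envelope arithmetic for GLaM-style sparse activation.
# Only the totals and the 2-of-64 routing are from the announcement;
# the rest is simple arithmetic, not GLaM's actual layer dimensions.

TOTAL_PARAMS = 1.2e12          # full GLaM parameter count
ACTIVE_PARAMS = 97e9           # parameters used per token prediction
EXPERTS_PER_LAYER = 64
EXPERTS_ACTIVE = 2             # top-2 routing: two experts per input token

# Fraction of each MoE layer's expert weights that actually run per token.
expert_fraction = EXPERTS_ACTIVE / EXPERTS_PER_LAYER
print(f"Active experts per MoE layer: {expert_fraction:.1%}")   # ~3.1%

# The overall active share lands higher (~8%) because the attention
# layers, embeddings and gating networks are dense and always run.
print(f"Active share of all parameters: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```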

GLaM Model

Compare it with Gopher from DeepMind, a 280-billion-parameter model trained on MassiveText, a collection of large English-language text datasets drawn from web pages, articles, code, and books; MassiveText contains 2.35 billion documents, or 10.5 TB of text. Gopher is what we call a dense language model: it activates the entire architecture for any given task.

A Reddit user pointed out that despite being a trillion-parameter model with notable gains in efficiency and energy use, GLaM is comparable in performance to the much smaller Gopher. What really separates sparse models from dense ones, and how does the choice affect an architecture’s overall performance?

Sparsity in neural networks

In 2018, a group of researchers published a paper titled ‘Scalable Training of Artificial Neural Networks With Adaptive Sparse Connectivity Inspired By Network Science’. The authors argued that like biological neural networks, artificial neural networks, too, should not have fully connected layers. They proposed sparse evolutionary training of neural networks. Their technique involved an algorithm that evolves an initial sparse topology of two consecutive layers of neurons into a scale-free topology. “Our approach has the potential to enable artificial neural networks to scale up beyond what is currently possible,” they claimed.
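
The core of their method is a prune-and-regrow step: after each training epoch, the connections closest to zero are dropped and an equal number of new random connections are added, so the layer stays sparse while its topology evolves. Below is a minimal NumPy sketch of just that rewiring step; the layer sizes and the rewiring fraction `zeta` are illustrative assumptions, and the surrounding SGD training loop is omitted.

```python
import numpy as np

# Simplified sketch of the prune-and-regrow step behind Sparse Evolutionary
# Training (SET). Layer sizes and `zeta` are illustrative assumptions; the
# real method wraps this inside a normal training loop, rewiring each epoch.

rng = np.random.default_rng(0)
n_in, n_out, density, zeta = 784, 300, 0.05, 0.3

# Start from a random sparse connectivity mask instead of a dense layer.
mask = rng.random((n_in, n_out)) < density
weights = rng.normal(0.0, 0.1, size=(n_in, n_out)) * mask

def rewire(weights, mask, zeta, rng):
    """Drop the weakest fraction `zeta` of connections, regrow as many new ones."""
    active = np.flatnonzero(mask)
    n_drop = int(zeta * active.size)
    # Prune: remove the connections with the smallest absolute weight.
    weakest = active[np.argsort(np.abs(weights.ravel()[active]))[:n_drop]]
    mask.ravel()[weakest] = False
    weights.ravel()[weakest] = 0.0
    # Regrow: add the same number of connections at random empty positions.
    empty = np.flatnonzero(~mask.ravel())
    new = rng.choice(empty, size=n_drop, replace=False)
    mask.ravel()[new] = True
    weights.ravel()[new] = rng.normal(0.0, 0.1, size=n_drop)
    return weights, mask

weights, mask = rewire(weights, mask, zeta, rng)
print(f"Connections kept sparse: {mask.mean():.1%} of a dense layer")
```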

Sparse and dense language models

Most language models are dense: they use the whole neural network to accomplish a task, whether that task is simple or complex. In most cases, this is inefficient, since a large amount of computational resource is spent on a goal that could be achieved with just a fraction of it. The brain works differently: different parts of the brain perform different tasks, and only the parts relevant to a given scenario are called upon.

Similar logic can be applied to AI models. A typical transformer block is composed of an attention layer followed by a dense feed-forward network, and this dense layer accounts for much of the high cost of training transformer models.
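
To see where that cost comes from, here is a minimal NumPy sketch of a standard dense FFN block; the dimensions are illustrative (roughly GPT-2-small scale), not taken from any particular model. Every token is multiplied against every weight in the block.

```python
import numpy as np

# Sketch of the dense feed-forward block that follows attention in a
# standard transformer. Dimensions are illustrative assumptions.

rng = np.random.default_rng(0)
d_model, d_ff = 768, 3072            # hidden size and FFN expansion

W1 = rng.normal(0, 0.02, (d_model, d_ff))
W2 = rng.normal(0, 0.02, (d_ff, d_model))

def dense_ffn(x):
    """Every token multiplies against every weight: nothing is skipped."""
    return np.maximum(x @ W1, 0.0) @ W2   # ReLU in between

tokens = rng.normal(size=(16, d_model))   # a batch of 16 token vectors
out = dense_ffn(tokens)

params = W1.size + W2.size
print(f"FFN parameters: {params:,}")                  # ~4.7M in this one layer
print(f"FLOPs per token: {2 * params:,}")             # roughly two per weight, every token
```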

Researchers are now building single models that are only sparsely activated: small pathways through the network are called into action only as needed, and the model dynamically learns which part of the network is good for which task. One of the biggest benefits of such an architecture is that the model not only gains a larger capacity to learn a variety of tasks but is also faster and more energy-efficient.

Earlier this year, Google released the Switch Transformer, which has a staggering 1.6 trillion parameters, roughly nine times the 175 billion of GPT-3. With this model, Google unveiled a method to maximise the parameter count in a simple and computationally efficient way: instead of building an ever-larger dense model, it introduced an MoE routing layer to enable sparsity. Here, the dense feed-forward layer is replaced by a Switch FFN layer that looks at each input token and decides which smaller feed-forward network (expert) should process it. Interestingly, because the Switch Transformer uses sparse activation, it consumes less than one-tenth of the energy a similar-sized dense model consumes while retaining the same accuracy.
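
The sketch below illustrates the routing idea with a toy Switch-style layer in NumPy: a small router scores the experts and each token is sent to its single top-scoring expert, so only that expert's weights run for that token. The expert count and dimensions are illustrative assumptions, far smaller than the real model's configuration.

```python
import numpy as np

# Toy Switch-style FFN layer: a router picks one expert per token (top-1),
# so only that expert's weights run. Sizes are illustrative assumptions.

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 64, 256, 8

router = rng.normal(0, 0.02, (d_model, n_experts))
experts = [
    (rng.normal(0, 0.02, (d_model, d_ff)), rng.normal(0, 0.02, (d_ff, d_model)))
    for _ in range(n_experts)
]

def switch_ffn(tokens):
    logits = tokens @ router                      # router score per expert
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)    # softmax over experts
    choice = probs.argmax(axis=-1)                # top-1: one expert per token
    out = np.zeros_like(tokens)
    for e, (W1, W2) in enumerate(experts):
        picked = choice == e
        if picked.any():
            h = np.maximum(tokens[picked] @ W1, 0.0) @ W2
            # Weight the expert output by its gate probability, as the
            # Switch formulation does.
            out[picked] = h * probs[picked, e][:, None]
    return out

tokens = rng.normal(size=(32, d_model))
out = switch_ffn(tokens)
print("Each token used 1 of", n_experts, "experts ->",
      f"{1 / n_experts:.0%} of the expert weights per token")
```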

Similarly, the 1.2 trillion-parameter, sparsely activated GLaM model was shown to achieve better results, on average and on more tasks, than the 175-billion-parameter dense GPT-3 model while using less computation during inference. GLaM also achieved competitive results on zero-shot and one-shot learning.


Whether more language models will adopt sparsity remains to be seen.