TinyLlama Breaks Chinchilla Scaling Law
Meta’s Llama has been a game changer for LLMs. Many assumed that models could not get much smaller and still match the capabilities of their bigger counterparts. Now, with a small Llama to build on, people are tweaking it and shrinking it further, to sizes that make observers question how it is even possible. The latest effort, TinyLlama, is arguably the most ambitious of the lot, aiming to break the scaling laws as well.
Zhang Peiyuan, a research assistant at the Singapore University of Technology and Design, has started training a 1.1 billion parameter model called TinyLlama. Built on the Llama 2 architecture, the ambitious part of the project is that Peiyuan aims to pre-train it on 3 trillion tokens. The plan is to get there within 90 days using just 16 A100-40G GPUs, at a throughput of 24k tokens per second per GPU. For reference, the estimated cost of training this on AWS servers would be around $40,000.
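Those throughput numbers roughly add up. A quick back-of-the-envelope check in Python, assuming the quoted 24k tokens per second per GPU is sustained around the clock, which is optimistic for any real training run:

```python
# Back-of-the-envelope check of TinyLlama's stated training budget.
# Assumes the quoted 24k tokens/s per GPU is sustained non-stop for 90 days.
gpus = 16
tokens_per_sec_per_gpu = 24_000
seconds_in_90_days = 90 * 24 * 60 * 60

total_tokens = gpus * tokens_per_sec_per_gpu * seconds_in_90_days
print(f"~{total_tokens / 1e12:.2f} trillion tokens")  # ~2.99 trillion
```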
If it works, the model would set a new benchmark and serve applications with limited computational resources, since the 1.1 billion weights occupy only around 550MB of RAM. But people are a little sceptical about the project.
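The 550MB figure only works out if the weights are stored at very low precision. A rough sketch of the arithmetic, assuming roughly 4 bits per weight (an assumption on our part, not an official specification):

```python
# Rough memory footprint of 1.1 billion weights at different precisions.
# The 550MB figure in the text implies ~4 bits per weight (our assumption).
params = 1.1e9
for bits in (16, 8, 4):
    megabytes = params * bits / 8 / 1e6
    print(f"{bits}-bit weights: ~{megabytes:,.0f} MB")
# 16-bit: ~2,200 MB; 8-bit: ~1,100 MB; 4-bit: ~550 MB
```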
Chinchilla steps in
The 3 trillion token dataset is a mix of 70% SlimPajama and 30% Starcoderdata. “What would pre-training a 1.1 billion model for so long achieve?” asked a user on HackerNews. “Doesn’t it contradict Chinchilla Scaling Law?”
The Chinchilla scaling law essentially says that to train a transformer-based language model compute-optimally, the number of parameters and the number of training tokens should be scaled in roughly equal proportion.
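A common rule of thumb derived from the Chinchilla paper is roughly 20 training tokens per parameter for compute-optimal training. Applied to TinyLlama purely as an illustration:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
# Purely illustrative; says nothing about TinyLlama's actual loss curve.
params = 1.1e9
chinchilla_optimal_tokens = 20 * params   # ~22 billion tokens
planned_tokens = 3e12                     # TinyLlama's stated target

print(f"Compute-optimal budget: ~{chinchilla_optimal_tokens / 1e9:.0f}B tokens")
print(f"Planned budget: {planned_tokens / 1e12:.0f}T tokens, "
      f"~{planned_tokens / chinchilla_optimal_tokens:.0f}x past the optimum")
```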
For larger models like GPT or PaLM, the saturation point may arrive much later, as they have far more capacity to keep learning from additional training and can eventually overtake smaller models. According to OpenAI, “We expect that larger models should always perform better than smaller models.” The company believes that a model of fixed size will ultimately be capacity-limited.
In other words, because smaller models involve fewer multiplications, they run and train faster. But according to this theory, they eventually hit a ceiling on how much they can learn, and the rate at which they improve drops off. For example, training a 7 billion parameter model on 2 trillion tokens might still yield a better model than training a 1 billion parameter model on 3 trillion tokens.
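It is worth noting that the two setups in that example are not compute-matched. Using the standard approximation that training cost is about 6 times parameters times tokens, the 7B run actually costs several times more compute than the 1B run. The sketch below uses that approximation and is not an exact accounting of either training budget:

```python
# Compare training compute of the two hypothetical setups using C ≈ 6 * N * D,
# a standard approximation rather than an exact measure of either run.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

big_run = train_flops(7e9, 2e12)     # 7B parameters, 2T tokens
small_run = train_flops(1e9, 3e12)   # 1B parameters, 3T tokens

print(f"7B on 2T: {big_run:.1e} FLOPs")       # ~8.4e+22
print(f"1B on 3T: {small_run:.1e} FLOPs")     # ~1.8e+22
print(f"Ratio: ~{big_run / small_run:.1f}x")  # ~4.7x
```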
This is the question hanging over TinyLlama. Is it even reasonable to pre-train a model on 3 trillion tokens if there is a saturation point? Sceptics argue that 3 trillion tokens is far too many for a 1.1 billion parameter model. But that is precisely the point of the experiment.
But Llama disagrees
The debate over whether bigger models are always better has been going on for a while, and Meta has repeatedly tried to prove the assumption wrong with Llama. According to the Llama 2 paper, “We observe that after pretraining on 2 trillion Tokens, the models still did not show any sign of saturation.” This possibly gave Peiyuan the hint that training a model on 3 trillion tokens might still be a reasonable idea.
This raises a question: if Meta believes the Chinchilla scaling law is becoming somewhat redundant, why did the company not keep training Llama 2 beyond 2 trillion tokens and release further updates to the model sooner? The most likely reason is that the expected improvement would be too small for the company to gain much from the extra compute.
Or maybe the next Llama will be even tinier and trained in just this way, on a higher number of tokens. Meta is letting its open-source community test the limits for it, while it might be doing the same thing behind closed doors.
The amount of information we can fit inside smaller models presumably has to hit a limit somewhere; this project aims to prove otherwise. While we wait and track the progress of the training run, it will be interesting to see whether TinyLlama actually kills the Chinchilla scaling law. As of the first checkpoint, TinyLlama is already competitive with StableLM-Alpha-3B and Pythia-1B.
If it succeeds, it would be a huge step towards running AI models on single devices. If it does not, Chinchilla might turn out to be the winner. According to Peiyuan, “I have no idea. It is an open trial which offers no promise nor target. The only target is ‘1.1B on 3T’”.