MosaicML has unveiled its latest research, titled “Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws.” This paper challenges existing paradigms in large language model (LLM) scaling laws, introducing a novel approach that incorporates the often-overlooked factor of inference cost.


Traditionally, LLM scaling laws, such as the widely used DeepMind Chinchilla scaling laws, have focused solely on estimating how model quality changes with parameter count and training data. MosaicML’s research highlights a critical gap: these formulas neglect the cost of inference.
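For background, Chinchilla-style scaling laws model pre-training loss as a simple parametric function of parameter count N and training tokens D. The form below is the one published by DeepMind (Hoffmann et al.), quoted here as context rather than as a result from the MosaicML paper:

$$
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

where $E$ is the irreducible loss and $A$, $B$, $\alpha$, $\beta$ are constants fitted empirically from training runs.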

The core innovation is a modification of the Chinchilla scaling laws that computes the optimal LLM parameter count and pre-training data size for training and then deploying a model of a specified quality under a given inference demand. The researchers conducted a comprehensive analysis, framing the problem both in terms of computational budgets and real-world costs.
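To make the idea concrete, here is a minimal sketch of what such a calculation could look like, assuming the standard rules of thumb of roughly 6ND FLOPs for training and 2ND FLOPs for inference, together with the Chinchilla-style loss fit above. The constants and the grid search are illustrative placeholders, not the paper’s actual fitting procedure.

```python
import numpy as np

# Chinchilla-style loss fit L(N, D) = E + A/N^alpha + B/D^beta.
# Constants are the commonly cited Hoffmann et al. fits, used here
# purely for illustration.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def tokens_for_target_loss(n_params: float, target_loss: float) -> float:
    """Smallest training-token count D such that loss(N, D) <= target_loss."""
    remainder = target_loss - E - A / n_params**ALPHA
    if remainder <= 0:
        return float("inf")  # this model size cannot reach the target at any D
    return (B / remainder) ** (1.0 / BETA)

def total_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    # ~6ND FLOPs for training, ~2ND FLOPs for inference (forward passes only).
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

def cheapest_model(target_loss: float, inference_tokens: float):
    """Grid-search model sizes; return the (N, D, FLOPs) with the lowest total cost."""
    best = None
    for n_params in np.logspace(8, 12, 400):  # 100M to 1T parameters
        d = tokens_for_target_loss(n_params, target_loss)
        if not np.isfinite(d):
            continue
        cost = total_flops(n_params, d, inference_tokens)
        if best is None or cost < best[2]:
            best = (n_params, d, cost)
    return best
```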

Key findings from the study include:

  • Cost-Effective Training: MosaicML’s approach enables training a large language model from scratch for less than $100, offering a cost-effective alternative for researchers and organisations.
  • Encoder Architecture: The model introduced in the research is an encoder (BERT-like) rather than a decoder. This move underscores the ongoing significance of encoder-only models, with the authors expressing satisfaction at the integration of recent LLM advances into BERT-like architectures.

The modification of the Chinchilla scaling laws is crucial for accurately reflecting the practical challenges faced by LLM researchers. The researchers emphasise that their analysis applies not only to pure compute budgets but also to real-world deployments where inference costs and demand are substantial.

Zhang Peiyuan’s TinyLlama is another project that strains conventional scaling laws. The research assistant at the Singapore University of Technology and Design is training a 1.1-billion-parameter model called TinyLlama. Based on Llama 2, the ambitious part of the project is that Peiyuan aims to pre-train it on 3 trillion tokens.
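For a sense of scale, a back-of-the-envelope calculation (our own arithmetic, assuming the commonly cited Chinchilla guideline of roughly 20 training tokens per parameter) shows how far TinyLlama departs from that recipe:

```python
# Back-of-the-envelope comparison of TinyLlama's planned training ratio with
# the roughly-20-tokens-per-parameter Chinchilla rule of thumb.
tinyllama_params = 1.1e9   # 1.1 billion parameters
tinyllama_tokens = 3e12    # 3 trillion pre-training tokens (planned)
chinchilla_ratio = 20      # approximate Chinchilla-optimal tokens per parameter

actual_ratio = tinyllama_tokens / tinyllama_params
print(f"TinyLlama tokens per parameter: {actual_ratio:.0f}")                   # ~2727
print(f"Over-training factor vs. Chinchilla: {actual_ratio / chinchilla_ratio:.0f}x")  # ~136x
```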

As inference demand approaches the pre-training data size, the research indicates that the optimal parameters-to-tokens ratio shifts towards smaller, longer-trained models. However, the authors acknowledge the need for further experimental validation of their formulas, especially in extreme ranges where pre-training tokens exceed model parameters by orders of magnitude.
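Continuing the illustrative sketch above (same assumed constants, not the paper’s numbers), a simple sweep over lifetime inference demand shows the same qualitative shift: the cost-minimising model gets smaller while its training-token budget grows.

```python
# Illustrative sweep: as lifetime inference demand grows, the cheapest model
# that still hits the same target loss gets smaller and is trained on more tokens.
for inference_tokens in [0, 1e11, 1e12, 1e13]:
    n, d, flops = cheapest_model(target_loss=2.0, inference_tokens=inference_tokens)
    print(f"inference={inference_tokens:.0e}  N={n:.2e} params  D={d:.2e} tokens")
```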
