The Rise of Lame LLM Papers
A few months back, AIM had GPT-4 attempt India's toughest exam, the UPSC. ChatGPT powered by GPT-4 was able to crack the exam with 162.76 marks. We noted that by rephrasing the questions, we could eventually prompt the model to generate accurate responses; however, in that experiment we only counted the bot's first responses.
Recently, a paper from MIT researchers titled Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models was trending. The paper claimed that GPT-4 scored 100% on MIT's EECS curriculum, using a dataset of 4,550 questions and solutions. It sounds like a great feat, until some researchers decided to dig deeper.
Raunak Chowdhari, Neil Deshmukh, and David Koplow, three MIT EECS seniors, decided to investigate the paper and were left disappointed with its results. They found that some of the questions in the dataset were incomplete, so there was no way GPT-4 could have answered them correctly, and it didn't.
Moreover, the paper's authors used GPT-4 itself to evaluate and score the answers generated by GPT-4, and they repeatedly prompted the model until it produced the right answer. When repeated prompting still failed, the full solutions were included in the uploaded dataset, letting the model simply echo them back as its own. When the three researchers ran GPT-4 zero-shot on the dataset, it scored only 62.5%, significantly worse than the 90% zero-shot correctness claimed by the paper. The sketch below illustrates why the two protocols give such different numbers.
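To see how much the evaluation protocol alone can inflate a score, consider this minimal, hypothetical Python sketch. The functions query_model() and check_answer() are placeholder stand-ins for an LLM API call and a grader, not anything from the paper's actual codebase; the mock model is simply assumed to answer correctly 60% of the time.

```python
import random

def query_model(prompt: str) -> str:
    # Placeholder for a call to an LLM such as GPT-4. Here it "answers"
    # correctly 60% of the time to mimic an imperfect model.
    return "right" if random.random() < 0.6 else "wrong"

def check_answer(answer: str, expected: str) -> bool:
    # Placeholder grader. The paper reportedly used GPT-4 itself as the
    # grader, which is one of the criticised choices.
    return answer == expected

def zero_shot_score(questions, solutions):
    # Fair protocol: one attempt per question, no retries, no hints.
    correct = sum(check_answer(query_model(q), s)
                  for q, s in zip(questions, solutions))
    return correct / len(questions)

def prompt_until_correct_score(questions, solutions, max_tries=10):
    # Criticised protocol: re-prompt until the grader accepts an answer.
    # With enough retries this approaches 100% regardless of model quality.
    correct = sum(
        any(check_answer(query_model(q), s) for _ in range(max_tries))
        for q, s in zip(questions, solutions)
    )
    return correct / len(questions)

if __name__ == "__main__":
    qs = [f"question {i}" for i in range(1000)]
    sols = ["right"] * 1000
    print("zero-shot:", zero_shot_score(qs, sols))                 # ~0.60
    print("until-correct:", prompt_until_correct_score(qs, sols))  # ~1.00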
The paper has 15 authors, which makes it inexcusable for them to brush off the criticism by saying the paper was not peer reviewed. Any one of them could have spotted and corrected the problems with the dataset and methodology. It looks like an intentional "mistake": publishing false claims for the sake of headlines.
The researchers concluded that a recent trend in AI research is to blame: shrinking research timelines push people towards shortcuts. Combined with treating GPT-4 as the benchmark for testing LLMs, this creates serious problems, since reliability is no longer measured against a human counterpart but against another LLM that is itself prone to hallucination.
Where it all started
OpenAI is no stranger to publishing lame papers either. Earlier, researchers criticised the company's closed-door policy of publishing no technical details whatsoever in the GPT-4 paper. Next came a paper seeking to qualify GPT as a general-purpose technology, measuring its impact on jobs over the years while simultaneously encouraging everyone to use it.
In the research field, the hype around GPT-4's touted capabilities has made everyone compare their work against it. GPT-4 is now treated as the "ground truth" for every new technological advancement, especially in the field of LLMs.
Moreover, the black-box approach of the GPT-4 paper has been copied by anyone and everyone. A user in the Hacker News discussion of the MIT paper argues that machine learning is no longer a scientific field, because "anyone can say and do whatever they like and there's no way to tell them that what they're doing is wrong." It has become like the social sciences: unfalsifiable and unreproducible research built on top of other unfalsifiable and unreproducible research.
It is worth noting that language generation models have lacked good metrics and meaningful benchmarks for decades. People end up citing whichever approach or model suits them for comparison, leaving no standardised metric for the capabilities of a given approach.
Chasing the gold rush
Since the release of OpenAI's GPT models, a lot of research in LLMs and generative AI has simply followed the trend and hype in a bid to stay relevant. The result is a flood of work with no proper grounding or credibility, and a rush of dubious papers on the internet.
This sets a bad precedent for AI research, making everyone question the authenticity of every paper. It also makes one wonder: what fraction of the papers on the internet are equally lame but have simply not faced the same scrutiny? And with the "ground-truth" GPT-4 now being used to write papers, the quality can only be expected to fall further.