ChatGPT is NOT Getting Dumber, You Are
OpenAI introduced custom instructions for ChatGPT yesterday. The feature lets users attach standing requirements to their prompts, which are ‘considered’ in every conversation going forward so users don’t have to keep repeating themselves. Will this update address the recent criticism of the chatbot’s poor responses?
For months, users have taken to various platforms to complain about GPT-4’s declining performance. There is an ongoing discussion on OpenAI’s own forum detailing all the ways GPT’s performance has dropped. The company’s VP of Product, Peter Welinder, however, dismissed these claims, tweeting, “No we haven’t made GPT-4 dumber. Quite the opposite.” Regardless, the perceived drop in quality could be causing a decline in the number of users on the platform.
Researchers test these claims
To systematically study these assertions, researchers at Stanford University and UC Berkeley explored how ChatGPT’s behaviour has changed over time. They published a paper on Tuesday confirming that GPT’s responses to the same questions have indeed changed over time.
The paper assesses the chatbot’s abilities in maths, code generation, answering sensitive questions and visual reasoning, comparing two snapshots taken only a few months apart, in March and June this year. To little surprise, the findings corroborate that GPT-4’s performance has decreased in most of these areas.
In maths, accuracy on identifying prime numbers dropped from 97.6% to 2.4%. In code generation, the share of directly executable outputs fell from 52% to 10%, with additional errors appearing in June. GPT-4 also became more reluctant to answer sensitive questions, responding to only 5% of such queries in June compared to 21% in March. Only in visual reasoning did overall performance improve, with the exact-match rate rising about 2% from March to June.
Interestingly, GPT-3.5 improved at maths over the same period, even as its successor declined. On the whole, GPT-3.5 also improved at answering sensitive questions and at visual reasoning relative to its earlier benchmark.
Response to the paper
The study documents how the models’ behaviour varied over a short period of time, but it does not explain why this is happening.
The paper finds that GPT-4 no longer does well even with the popular Chain-of-Thought technique, which typically improves answers significantly. Of late, GPT-4 has not followed this pattern, skipping the intermediate reasoning steps and answering incorrectly.
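For context, Chain-of-Thought prompting simply asks the model to reason through intermediate steps before committing to a final answer. Here is a minimal sketch using the OpenAI Python client of that era; the prompt wording is illustrative rather than the paper’s exact setup, and it assumes an API key is configured in the environment.

```python
import openai

# Chain-of-Thought prompting: ask the model to show its intermediate
# reasoning before committing to a final answer.
question = "Is 17077 a prime number?"
cot_prompt = f"{question} Think step by step, then state your final answer."

# Illustrative call; assumes the OPENAI_API_KEY environment variable is set.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response["choices"][0]["message"]["content"])
```

The paper’s observation is that the June model tended to skip the requested steps and jump straight to a wrong answer.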
Experts assume that OpenAI is continuously pushing changes and fine-tuning the models, and that there is no public method of evaluating how this process works or whether the models are improving or regressing as a result.
Others point to an inverse relationship between alignment and usefulness, arguing that the models’ increased alignment, along with attempts to make them faster and cheaper, is contributing to the errors.
Only behaviour, not GPT-4’s capabilities
One group of experts questioned the very basis of the paper. Simon Willison tweeted that he found the paper relatively unconvincing. He further said, “A decent portion of their criticism involves whether or not code output is wrapped in Markdown backticks or not.”
He also finds other problems with the paper’s methodology. “It looks to me like they ran temperature 0.1 for everything,” he said. “It makes the results slightly more deterministic, but very few real-world prompts are run at that temperature, so I don’t think it tells us much about real-world use cases for the models.”
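For readers unfamiliar with the setting, temperature controls how much randomness goes into sampling the model’s output. A rough sketch of the kind of near-deterministic call Willison is describing, again using the era’s OpenAI Python client with an illustrative prompt:

```python
import openai

# Temperature close to 0 makes sampling nearly deterministic: the model
# almost always picks its highest-probability next token. Most real-world
# ChatGPT-style usage runs at noticeably higher temperatures.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Write a Python function that checks whether a number is prime."}],
    temperature=0.1,  # the low value Willison flags as unrepresentative
)
print(response["choices"][0]["message"]["content"])
```

His point is that conclusions drawn at temperature 0.1 may not transfer to how people actually use the models.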
Arvind Narayanan, a computer science professor at Princeton, also argues that the paper is being misread and that saying GPT-4 has degraded over time is an oversimplification of what it found. He questioned the researchers’ methods too, noting that a model’s capabilities are not the same as its behaviour.
At the end of his analysis, Narayanan says, “In short, the new paper doesn’t show that GPT-4 capabilities have degraded. But it is a valuable reminder that the kind of fine tuning that LLMs regularly undergo can have unintended effects, including drastic behaviour changes on some tasks. Finally, the pitfalls we uncovered are a reminder of how hard it is to quantitatively evaluate language models.”
Assessing language models becomes even more difficult when companies like OpenAI take a closed approach to AI. Sam Altman has refused to reveal the model’s training data, code, neural network weights or even its architecture, leaving the rest of us to speculate and piece together results from anonymous sources. Researchers are left groping in the dark to define the properties of the very system they are trying to evaluate.
Learn to Prompt Better
Good prompts are the antidote to whatever sickness GPT is going through. The model arguably may have gotten worse over time, but the surest way to get the responses you need is to make sure you’re giving it the right prompts. There are multiple courses online that take you through prompting for specific tasks, and understanding how each language model is trained improves your chances of formulating better prompts. Simple yet effective practices, such as being specific, asking for a step-by-step explanation, including context, and specifying tone, style and examples, noticeably improve responses, as the sketch below illustrates.
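To make those practices concrete, here is a small illustrative contrast between a vague prompt and one that applies them; the wording is hypothetical, not taken from any particular course:

```python
# A vague prompt leaves the model guessing at scope, format and audience.
vague_prompt = "Tell me about prime numbers."

# A structured prompt applies the practices above: specificity, context,
# step-by-step reasoning, tone/style, and an example of the desired output.
structured_prompt = (
    "You are a maths tutor for first-year undergraduates.\n"
    "Explain how to check whether 17077 is prime.\n"
    "Work through the reasoning step by step before answering.\n"
    "Keep the tone friendly and concise, and end with a one-line summary, "
    "for example: 'Conclusion: 17077 is prime.'"
)

print(structured_prompt)
```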