Much like a vehicle, every AI system runs on the fuel it is fed. But rather than gasoline, that fuel is data, and lots of it. The buzz surrounding data in AI has reached a fever pitch, yet there are several challenges that researchers are still trying to solve.

In an exclusive conversation with Analytics India Magazine, Yoshua Bengio gave his two cents on the data problems in AI and what is causing them.

Data is Sparse

Self-supervised learning models were initially introduced in response to the challenges of supervised learning. One of the major issues was obtaining labelled data, which is expensive and sometimes practically impossible. But supervised models face a bigger challenge: as they scale, they can be loaded with poor-quality or mislabelled data, leading to more bias and false outputs.

Bengio believes that data is abundant, but accessibility is one of the issues. In medicine, for instance, researchers may not have access to enough data about the particular phenomenon they are interested in.

Secondly, a significant issue is having suitable data for the specific task and environment at hand; Bengio said there is very little data in such scenarios. He is currently working on introducing notions of causality into neural networks to deal with the issue. “Interestingly, humans seem to be really good at dealing with the sparsity of data on a new task,” Bengio added.

The Other Side

“Companies have pretty much exhausted the amount of data that is available on the internet. So, in other words, the current large language models are trained on everything that is available,” said Bengio. For instance, ChatGPT, which has managed to enthral the world by answering in a “human-adjacent” manner, is based on the GPT-3.5 architecture with 175B parameters.

According to BBC Science Focus, the model was trained using internet databases that included a humongous 570 GB of data sourced from Wikipedia, books, research articles, websites, web texts and other forms of content. To give you an idea, approximately 300 billion words were fed into the system.

The amount of text that humans produce is going to continue to increase, but we have more or less reached a limit, Bengio believes. Further growth of systems like ChatGPT in terms of datasets is therefore limited, and they still don’t do as well as humans in many respects. So, it is interesting to ask what is driving the demand for the data these systems are trained on, he added.

Recently, in a conversation with AIM, Yann LeCun said he believes the prime issue is not data unavailability but that systems cannot take full advantage of the data that is available. For example, the amount of language exposure an infant needs to learn a language is tiny compared to the billions of texts or images that language models must be exposed to in order to perform well.

Building on LeCun’s point of view, Bengio said the magnitude of data these systems need to reach the competence they have is equivalent to a person reading every day, every waking hour, all their life, and then living 1,000 such lives. Yet a four-year-old is able to answer reasoning questions that these models fail at. These machines know much more than a four-year-old, or indeed any normal human, because they are like encyclopaedic thieves: they have read everything, but they don’t understand it as deeply as we do. So, they are not able to reason with that knowledge as consistently as humans can.
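To get a feel for the scale gap Bengio describes, here is a rough back-of-envelope sketch in Python. The reading speed, waking hours and lifespan are illustrative assumptions rather than figures from the interview; the corpus size used is the roughly 300 billion words cited above for GPT-3.

# Rough back-of-envelope comparison of a language model's training corpus
# with how much text one person could read in a lifetime. All constants
# below are illustrative assumptions, not figures from the article.

READING_SPEED_WPM = 250        # assumed adult reading speed (words per minute)
WAKING_HOURS_PER_DAY = 16      # assumed waking hours available for reading
LIFESPAN_YEARS = 80            # assumed lifespan

# Words one person could read if they read every waking hour of their life
words_per_lifetime = READING_SPEED_WPM * 60 * WAKING_HOURS_PER_DAY * 365 * LIFESPAN_YEARS

# Training corpus size: the ~300 billion words cited above for GPT-3
training_corpus_words = 300e9

lifetimes_of_reading = training_corpus_words / words_per_lifetime
print(f"One lifetime of non-stop reading: {words_per_lifetime / 1e9:.1f} billion words")
print(f"The training corpus equals roughly {lifetimes_of_reading:.0f} lifetimes of reading")

The exact ratio depends heavily on the assumptions: with GPT-3-scale data it works out to a few dozen lifetimes, and with the multi-trillion-word corpora behind more recent models it climbs into the hundreds or thousands, the order of magnitude Bengio is describing.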

Reasoning, in simple terms, refers to drawing conclusions by inference from the available information. In humans, reasoning goes beyond the limits of formal logic or the inferences that can be made from it. “My belief is that as more data is better, bigger networks are better. But, we’re still missing some important ingredients to achieve the kind of intelligence that humans have,” Bengio concluded.
