This Indian has Cracked Multilingual Multimodal
In the field of Natural Language Processing (NLP), researchers have been exploring the use of multilingual data to improve models trained on monolingual English datasets. Avinash Madasu, a research assistant at the University of North Carolina, Chapel Hill, is one of them.
Madasu, who normally works with multimodal models, aims to improve video retrieval performance by leveraging multilingual knowledge transfer. “Multilingual data can serve as a powerful augmentation for monolingual models, but creating such data is labour-intensive,” he said. To overcome this, the researchers use state-of-the-art machine translation models to translate English text captions into other languages, creating high-quality multilingual data that does not require human labelling.
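To give a rough sense of how such translation-based augmentation can work in practice, here is a minimal sketch using the open-source Hugging Face transformers library and a publicly available English-to-French MarianMT checkpoint; the specific translation models Madasu's team used are not detailed here.

```python
# Sketch: translating English captions into another language to create
# multilingual training data without human labelling.
# Assumes the Hugging Face `transformers` library and the public
# Helsinki-NLP/opus-mt-en-fr checkpoint; the model actually used by
# Madasu's team is not specified in the article.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

english_captions = [
    "A man is playing the guitar on stage.",
    "Two dogs are running across a field.",
]

# Tokenize, translate, and decode the captions in one batch.
batch = tokenizer(english_captions, return_tensors="pt", padding=True)
translated = model.generate(**batch)
french_captions = tokenizer.batch_decode(translated, skip_special_tokens=True)

for en, fr in zip(english_captions, french_captions):
    print(f"{en} -> {fr}")
```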

“This problem has been ignored in previous work, especially in the multimodal setup,” says Madasu, who set about addressing this gap.
English to Any Other Language
Madasu proposes a model based on OpenAI’s multimodal model CLIP to effectively leverage multilingual knowledge transfer. The model built by Madasu and his team takes a video, English captions and multilingual text captions as inputs and extracts joint video-text representations from them. The team then introduced a Dual Cross-Modal (DCM) encoder block that learns the similarities between English text representations and video representations, as well as the association between video representations and multilingual text representations.
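The idea of a dual cross-attention block can be sketched roughly as below. This is an illustrative PyTorch sketch with assumed dimensions and layer choices, not the authors' actual implementation.

```python
# Minimal PyTorch sketch in the spirit of the DCM block described above:
# one cross-attention path aligns video features with English text features,
# the other aligns them with multilingual text features. All dimensions and
# layer choices here are illustrative assumptions.
import torch
import torch.nn as nn

class DualCrossModalEncoder(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Cross-attention: video queries attend to English text tokens.
        self.video_to_en = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention: video queries attend to multilingual text tokens.
        self.video_to_multi = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video, en_text, multi_text):
        # video:      (batch, num_frames, dim)   frame-level CLIP features
        # en_text:    (batch, en_tokens, dim)    English caption features
        # multi_text: (batch, ml_tokens, dim)    translated caption features
        en_ctx, _ = self.video_to_en(video, en_text, en_text)
        multi_ctx, _ = self.video_to_multi(video, multi_text, multi_text)
        # Fuse both context streams into a joint video-text representation.
        return self.norm(video + en_ctx + multi_ctx)

# Usage with random features standing in for CLIP outputs.
encoder = DualCrossModalEncoder()
video = torch.randn(2, 12, 512)
en_text = torch.randn(2, 20, 512)
multi_text = torch.randn(2, 20, 512)
joint = encoder(video, en_text, multi_text)
print(joint.shape)  # torch.Size([2, 12, 512])
```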
In the common embedding space, the model learns important contextual information from the multilingual representations that is missing from the English text representations; this understanding effectively serves as knowledge transfer. Madasu’s team then validated the proposed model on a video retrieval dataset, demonstrating its superiority over baseline models.
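For context, text-to-video retrieval is commonly evaluated by ranking videos against each caption embedding and reporting Recall@K. The snippet below is a generic sketch of that metric with made-up embeddings, not the team's evaluation code.

```python
# Hedged sketch of a standard text-to-video retrieval metric: rank videos by
# cosine similarity to each caption embedding and report Recall@K.
import torch
import torch.nn.functional as F

def recall_at_k(text_emb, video_emb, k=5):
    # text_emb, video_emb: (N, dim); row i of each corresponds to the same clip.
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    sims = text_emb @ video_emb.T            # (N, N) similarity matrix
    topk = sims.topk(k, dim=-1).indices      # indices of the k closest videos
    targets = torch.arange(len(text_emb)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Random embeddings standing in for real model outputs.
text_emb = torch.randn(100, 512)
video_emb = torch.randn(100, 512)
print(f"Recall@5: {recall_at_k(text_emb, video_emb):.3f}")
```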
Madasu explains that there are over 900 languages in the world that could be used in the model. However, the team still faces a lack of data, particularly for Indian languages, whose datasets are scarce. He notes that only Hindi has sufficient usable datasets, as data labelling is often outsourced to platforms like Amazon Mechanical Turk, which may not have annotators for every language.
Addressing Data Scarcity
While Big Tech companies like Google are trying to collect more data for Indian languages in India, Madasu emphasises that accessibility will remain an issue. He says Google is restrictive and does not readily share data with independent researchers because it has invested in building its own datasets. “Without access to this data, there can be no feedback or improvement on the data, and people will be unable to use it,” says Madasu.
He goes on to argue that AI research is a non-profit, public undertaking that anyone can participate in and improve upon. “That is how ChatGPT works. It was made free to all and then collected data from users to continually improve itself,” he says. It is essential that these datasets be available to all, because progress thrives on open participation.
Moreover, designing models for these languages poses another challenge, as it requires an understanding of nuances and linguistic components that differ from English. Madasu emphasises that it is crucial for model designers to comprehend these linguistic aspects and token associations so that the models can handle these languages effectively.
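One concrete way this shows up is in tokenization: the same sentence can break into very different subword tokens across languages. The sketch below, assuming the public xlm-roberta-base tokenizer from Hugging Face, simply illustrates that point and is not tied to Madasu's models.

```python
# Illustrative sketch: the same sentence splits into different subword tokens
# in English and Hindi, which affects how well a model handles each language.
# Uses the public xlm-roberta-base tokenizer as an assumed example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

english = "A man is playing the guitar on stage."
hindi = "एक आदमी मंच पर गिटार बजा रहा है।"

for text in (english, hindi):
    tokens = tokenizer.tokenize(text)
    print(len(tokens), tokens)
```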
But then, are LLMs our only option? Is there no other way to introduce multilingualism into multimodal models? Madasu says the reason LLMs are more popular is that they do not need a supervised way of training. “So you can take large amounts of data for training and it pretty much works for most of the things,” says Madasu.
According to him, unsupervised LLMs do not require data labelling, which is why there is more emphasis on using these models. “Although there are other models available, such as statistical models like Hidden Markov Models and Markov Chain models, they do not function the same way as language models,” he says.
He goes on to explain that these statistical models rely on mathematical formulas to derive the next set of tokens based on the previous set of tokens. “The focus has been on language models because of their adaptability and ability to learn without any supervised training. These models can handle large amounts of data without requiring explicit instruction,” he said.
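To contrast the two approaches, the snippet below sketches a toy first-order Markov chain that predicts the next word purely from bigram counts of the previous word, the kind of statistical next-token model Madasu refers to; the corpus is made up for illustration.

```python
# Toy first-order Markov chain: the next word is sampled from bigram counts
# of the previous word, with no learning beyond counting. Corpus is made up.
import random
from collections import defaultdict

corpus = "the cat sat on the mat the cat ran on the grass".split()

# Count bigram transitions: previous word -> possible next words.
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def generate(start, length=6):
    word, out = start, [start]
    for _ in range(length):
        candidates = transitions.get(word)
        if not candidates:
            break
        word = random.choice(candidates)
        out.append(word)
    return " ".join(out)

print(generate("the"))
```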