At the ‘Google for India 2022’ event, the IISc-backed non-profit foundation ARTPARK (AI & Robotics Technology Park) launched ‘Project Vaani’, its pan-India inclusive language data initiative for open-sourcing datasets. The initiative will amplify the Indian government’s ‘Digital India’ efforts by including more diverse regional and local languages.

Project Vaani aims to compile comprehensive datasets of transcribed text and spoken language from each district in India. These datasets are open-sourced via Vaani’s website and may eventually also be accessible through other platforms like “Bhashini” of MeitY (Ministry of Electronics and Information Technology) in order to advance research and innovation.

Vaani joins the SYSPIN (Synthesizing Speech in Indian Languages) and RESPIN (Recognizing Speech in Indian Languages) programmes, which encompass 9 languages, including Magadhi and Maithili, under the Bhāshā AI umbrella of ARTPARK and IISc.

Language models like GPT-3 can only be trained with enormous computing power and large text samples. Hindi, for example, which has over 40 different dialects, does not have nearly as much text data available.

In addition to the dearth of text data, the fact that Indians predominantly communicate through speech calls for new technological advancements so that machines can transcribe, comprehend, or interpret speech while also accounting for language variations.

Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) need to be revolutionised, and this can only happen through open-source and mission-mode initiatives.

The initiative currently spans 80 districts across 10 states, with plans to include more regions in the future. With more than 150,000 hours of curated speech and 100 million phrases of text in Indian scripts, it will expand the size and diversity of India’s open-sourced language data. In parallel, ARTPARK and IISc will announce challenges for researchers and businesses to create software leveraging these datasets in industries including health, agriculture, and financial inclusion.

At the event, Google also presented a variety of new tools designed to make it simpler for Indians to use the internet with the help of artificial intelligence. Google will integrate the Files app with the government’s Digilocker service, in addition to enhancing the security features of Google Pay. Google’s multi-search feature, which allows users to search using images and text simultaneously, will also include more Indian languages in the next year, starting with Hindi. The feature is currently available in English in India. Google is also introducing an India-first feature where search results pages will be bilingual for users who prefer this.

The post ARTPARK-IISc Launches AI-Powered Project Vaani to Make Internet Language-Inclusive appeared first on Analytics India Magazine.