Soket AI Labs, Along With Google Cloud, Launches India’s First Open-Source Multilingual Foundation Model

Soket AI Labs, an Indian AI research firm, yesterday announced the launch of ‘Pragna-1B,’ India’s first fully open-source multilingual foundation model. Developed in collaboration with Google Cloud, Pragna-1B aims to enable the adoption of generative AI in India by supporting Indian languages such as Hindi, Bengali, and Gujarati, alongside English.

Abhishek Upperwal, Founder of Soket AI Labs, said, “By leveraging Google Cloud, Pragna-1B, despite being trained on fewer parameters, is efficient and delivers performance on language processing tasks comparable to similar-category models.”

He further added, “Tailored specifically for vernacular languages, Pragna-1B offers balanced language representation and enables faster and more efficient tokenization suited for organisations seeking optimised operations and enhanced functionality.”

The model has been designed specifically with Indian contexts in mind, ensuring transparency and clarity for enterprises integrating AI into their operations. Soket AI Labs leveraged Google Cloud’s AI infrastructure to achieve efficiency and cost-effectiveness in the development of Pragna-1B.

Soket AI Labs and Google Cloud plan to deepen their collaboration further by listing Soket’s AI Developer Platform on the Google Cloud Marketplace and the Pragna series of models on the Google Vertex AI model registry. 

This integration will provide developers with a streamlined experience for fine-tuning models, combining Soket’s intuitive interface with the high-performance resources of Vertex AI and TPUs.

The story so far

Soket AI Labs, founded by Abhishek Upperwal in 2019, created ‘Bhasha,’ a series of high-quality datasets designed for training Indian language models. This includes ‘Bhasha-wiki,’ which consists of 44.1 million articles translated from English Wikipedia into six Indian languages, and ‘Bhasha-wiki-indic,’ a refined subset focusing on content relevant to India.

Pragna-1B features a decoder-only Transformer architecture with 1.25 billion parameters and a context length of 2,048 tokens. Trained on approximately 150 billion tokens with a focus on Hindi, Bangla, and Gujarati, it delivers state-of-the-art performance for vernacular languages in a small form factor.
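To make that specification concrete, here is a minimal sketch of how a 1.25-billion-parameter decoder-only checkpoint like this would typically be loaded for inference with the Hugging Face transformers library. The hub id “soketlabs/pragna-1b” is an assumption for illustration, not a confirmed release name.

```python
# Minimal sketch: loading a Pragna-1B-style checkpoint for text generation.
# NOTE: the model id below is a hypothetical placeholder, not a confirmed
# published name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "soketlabs/pragna-1b"  # hypothetical hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "भारत की राजधानी"  # "The capital of India" in Hindi
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```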

In a recent LinkedIn post, Upperwal highlighted the improvements in GPT-4o’s tokenizer, whose vocabulary now spans roughly 200K tokens. However, he noted that Pragna-1B’s tokenizer still outperforms GPT-4o on Kannada, Gujarati, Tamil, and Urdu, serving as a motivation for Soket AI Labs to improve support for Hindi and other Indian languages.
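The kind of comparison Upperwal describes can be reproduced in a few lines: count how many tokens each tokenizer spends on the same Indic sentence, since fewer tokens means cheaper and faster processing. The sketch below assumes the Pragna tokenizer is available under the hypothetical hub id used earlier; “o200k_base” is GPT-4o’s public tiktoken encoding.

```python
# Hedged sketch of a tokenizer-efficiency comparison on Indic text:
# fewer tokens for the same sentence indicates a more efficient tokenizer.
import tiktoken
from transformers import AutoTokenizer

text = "ಕನ್ನಡ ಒಂದು ದ್ರಾವಿಡ ಭಾಷೆ"  # "Kannada is a Dravidian language"

gpt4o_enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's encoding
pragna_tok = AutoTokenizer.from_pretrained("soketlabs/pragna-1b")  # hypothetical id

print("GPT-4o tokens:   ", len(gpt4o_enc.encode(text)))
print("Pragna-1B tokens:", len(pragna_tok.encode(text)))
```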

Soket AI Labs is also experimenting with a Mixture of Experts model, expanding the languages supported and exploring different architectures for increased optimization. 
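For readers unfamiliar with the term, a Mixture of Experts layer routes each token to a small subset of specialised feed-forward “experts,” so model capacity grows without every parameter being active for every token. The sketch below is a generic top-2 routing layer in PyTorch, purely illustrative and not a description of Soket’s design.

```python
# Generic illustration of Mixture-of-Experts routing (not Soket's actual
# architecture): a learned router sends each token to its top-2 experts,
# and their outputs are combined, weighted by the router's probabilities.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); pick the top-k experts per token
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out

layer = MoELayer(dim=64)
tokens = torch.randn(8, 64)
print(layer(tokens).shape)  # torch.Size([8, 64])
```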
