Soket AI Labs

After selecting Sarvam AI as its first startup, the IndiaAI Mission last month chose Soket AI Labs, Gnani AI, and Gan AI as part of its push to build India’s sovereign AI.

While Sarvam has already released a few updates and models, the others haven’t shipped anything yet. The companies are also yet to receive the promised government support in the form of GPUs.

Soket AI Labs, led by CEO Abhishek Upperwal, is quietly building what could become one of India’s most ambitious AI projects: a 120-billion-parameter language model trained from scratch on India-centric datasets under its Project EKA. But the journey to this number is anything but linear.

The company plans to keep it open-source and optimise it for sectors such as defence, healthcare, and education.

“It will take time. It took us a year to get the proposal approved,” Upperwal told AIM. “It won’t be a one-shot deal. We will scale it up to 120 billion parameters and will have to scale it little by little.”  

Upperwal said they plan to make this accessible to all, from researchers to startup founders. The team is building in public under the COOM framework, publishing transparent updates and committing to energy-efficient, culturally representative training practices.

What’s the Roadmap?

The team is taking a phased approach, beginning with models as small as 1-2 billion parameters and gradually scaling to 7 billion and then 30 billion. These early iterations will be used to test architecture and data alignment—critical steps before pouring massive compute into the final models. 

Upperwal said the 7-billion-parameter model would most likely be ready in five to six months, and that the team could scale to 120 billion parameters by the tenth month.

“We have already done a 1 billion model. So we have an idea. We can scale it to 7 billion,” said Upperwal, speaking about the Pragna-1B model released last year. 

Soket will iterate in stages, not chasing leaderboard scores, but building something reliable from the ground up. 

“I think, from a sovereignty point of view, defence would be an important aspect because defence cannot use DeepSeek. If they go to use DeepSeek, they will show Arunachal Pradesh as part of China,” said Upperwal, pointing to the geopolitical risks of using foreign models, especially those from China.

Besides, security concerns make cloud-based models unsuitable for defence. Soket’s plan is to deploy models in secure, air-gapped environments with on-device capabilities. In education, they are already working with AI CoEs aiming to digitise archives, books, and curriculum content, and to collaborate with ministries and academic institutions.

A New Data Foundation for India

At the heart of Soket’s strategy is an unprecedented data effort focused on Indian languages, which have historically been underrepresented in large AI models. The team splits its data strategy into two categories: data that already exists and data that has yet to be created.

“We are applying OCR to documents. We are applying ASR models on videos and audio. We are extracting content from that,” Upperwal said.
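Soket has not published its extraction code. A minimal sketch of that kind of pipeline, assuming off-the-shelf tools (Tesseract via pytesseract for OCR, and OpenAI’s open-source Whisper for ASR) rather than Soket’s actual stack, could look like this:

```python
# Illustrative sketch only; Soket's actual extraction pipeline is not public.
# Assumes pytesseract (Tesseract OCR with Hindi traineddata installed) and
# the openai-whisper package.
import pytesseract
import whisper
from PIL import Image

def ocr_page(image_path: str) -> str:
    """Pull text out of a scanned page; lang='hin' selects Hindi traineddata."""
    return pytesseract.image_to_string(Image.open(image_path), lang="hin")

def transcribe_audio(audio_path: str) -> str:
    """Run ASR on an educational recording with Whisper's multilingual model."""
    model = whisper.load_model("small")  # small multilingual checkpoint
    result = model.transcribe(audio_path, language="hi")
    return result["text"]

if __name__ == "__main__":
    print(ocr_page("scanned_textbook_page.png"))      # hypothetical file
    print(transcribe_audio("lecture_recording.mp3"))  # hypothetical file
```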

The data strategy is intensely India-focused. Pretraining will be done on regional knowledge—government sites, legal records, school curricula—alongside global corpora like scientific papers and code. 

Post-training datasets will cover domain-specific reasoning tasks, including law and agriculture, while evaluation will involve creating new benchmarks for Indian languages and sectors where current tests fall short.

In a recent roadmap blog, Soket AI said, “We’re borrowing DeepSeek’s recipe initially, but we’re modifying it heavily—right down to CUDA kernels and progressive context windows.”

Much of this effort is being conducted in partnership with IIT Gandhinagar, focusing on everything from Indic websites to handwritten PDFs, and even transcribing educational videos. In addition, Soket is generating synthetic data through translation and augmentation strategies, especially for domains like science and mathematics.
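The exact augmentation recipe has not been shared. As an illustration, translation-based augmentation often looks like the sketch below, which assumes a public English-to-Hindi model (Helsinki-NLP/opus-mt-en-hi on Hugging Face), not a model Soket has confirmed using:

```python
# Illustrative sketch of translation-based augmentation; the checkpoint
# (Helsinki-NLP/opus-mt-en-hi) is an assumption, not Soket's confirmed choice.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

english_corpus = [
    "The acceleration of a body is proportional to the net force acting on it.",
    "A prime number has exactly two distinct positive divisors.",
]

# Translate English science/maths sentences into Hindi to grow the Indic set.
synthetic_hindi = [o["translation_text"] for o in translator(english_corpus)]
for src, tgt in zip(english_corpus, synthetic_hindi):
    print(src, "->", tgt)
```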

Upperwal said the team will be able to generate 5-6 trillion unique tokens from Indic languages alone, including code. With other domains added, Soket expects to build a total corpus of 20 trillion tokens, a foundation large enough to train a world-class multilingual model.

“When Common Crawl was done, a lot of Indic websites were not archived… Indic script was ignored.” To correct that, Soket is developing its own data classification systems and language identifiers to preserve and elevate Indic content throughout the training process.
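Those classifiers are not yet public. A hedged sketch of how such an Indic-content filter is typically bootstrapped, here using fastText’s freely available lid.176.bin language-identification model as a stand-in, is shown below:

```python
# Illustrative sketch; Soket is building its own classifiers, which are not
# public. This uses fastText's open LID model as a stand-in.
# Model file: https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
import fasttext

lid = fasttext.load_model("lid.176.bin")

# ISO 639-1 codes for a subset of scheduled Indian languages.
INDIC = {"hi", "bn", "ta", "te", "mr", "gu", "kn", "ml", "pa", "or", "as"}

def is_indic(text: str, threshold: float = 0.5) -> bool:
    """Keep a document only if its top predicted language is Indic."""
    labels, probs = lid.predict(text.replace("\n", " "))  # fastText rejects newlines
    lang = labels[0].replace("__label__", "")
    return lang in INDIC and probs[0] >= threshold

print(is_indic("भारत एक विशाल देश है"))  # True: detected as Hindi
print(is_indic("The quick brown fox"))    # False: detected as English
```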

Compute: Scaling in the Cloud, Piece by Piece

Building such a model from scratch demands enormous computational power. Through the government-backed initiative, Soket has requested up to 2,000 GPUs, a mix of NVIDIA H100s and other accelerators. While the government has not allocated the GPUs yet, the startup expects phase-wise access to begin early next week.

Although the compute will be cloud-based, Upperwal said with a laugh that he hopes to set up local experimentation infrastructure. “We need at least one NVIDIA DGX box that the entire team can share and then start building, optimising, deploying algorithms, testing, and scaling.”

The recent release of the Sarvam-M model drew criticism online for its choice of architecture and perceived performance. But Upperwal believes that bashing is just part of the process. “People will bash. But it’s okay as I have seen technologies, which people don’t typically believe in at the very beginning, become successful later,” Upperwal said.

He lauded Sarvam’s data curation efforts and sees open-sourcing as critical for the community. “It’s not about the model, it’s not even about the downloads… Look at the work.”

At Soket, the team is also bootstrapping synthetic datasets using other models, acknowledging that in low-resource environments, pragmatic reuse of existing models is often necessary. “Say, for example, in our case too, we have been looking at different licensed models to create some synthetic data out of these models. Otherwise, how will you actually do it?”
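He did not name the models or prompts involved. Purely as an illustration, such bootstrapping often amounts to prompting a permissively licensed open model for domain-specific pairs; the checkpoint below (Qwen2.5-0.5B-Instruct) is an assumption, not Soket’s choice:

```python
# Hedged sketch of bootstrapping synthetic data from a permissively licensed
# open model; the checkpoint named here is an assumption, not Soket's choice.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

prompt = (
    "Write one question and its answer, in Hindi, about crop rotation, "
    "suitable for training an agricultural assistant.\n"
)
out = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.8)
print(out[0]["generated_text"])
```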

He compares the progress of Indic AI to voice AI in India, which only took off a year after early efforts began. As more developers understand these models, adoption will follow. Until then, he encourages treating these projects as research accelerators rather than commercial products.

“We want to utilise that model in terms of generating any data or doing translation. That model will not get utilised [fully now],” he said, implying the real value will emerge later.

What’s the Moat of IndiaAI Mission?

Upperwal said even the best global models, including GPT-4, still falter when it comes to authentic Hindi. Soket has seen hallucinations and incorrect grammar in Hindi even from state-of-the-art APIs.

He added that the company wants to fix grammatical and pronunciation mistakes that often appear in conversation, which even GPT-4o fails to catch.

This, he argued, is the gap Soket aims to fill with cultural authenticity and dialect nuances. 

Even with Hindi, there are different dialects, and Soket AI wants to incorporate these into the models. “If you want to look at these vernacular-related applications, I think we have to emphasise them,” he noted, adding that Indian AI startups could solve these problems better. 
