India’s AI Push Might Be Pointless Without National Language Standardisation
India’s AI ambitions will be incomplete without incorporating its vernacular languages in the ecosystem for the benefit of its vast population. And this endeavour still has an old problem to tackle: the lack of real-world data on Indic languages.
Most of us may have struggled with copying text from a PDF in an Indic language like Hindi, Kannada, or Telugu, as it gets pasted as boxes and symbols. The problem may seem related to fonts, but it runs deeper than that, dating back to 1988.
The foundation itself is broken, according to Vivekanand Pani, co-founder and CTO of Reverie Language Technologies, who argued that the effort and investments in sovereign models and data collection will remain limited unless Indian language computing standards are fixed.
“No engagement, no data. No data, no AI,” he told AIM.
Internet and Indic Languages
Indian languages were flourishing on computers before the internet gained popularity. Voter ID cards, land records, railway charts — all of it was happening in local languages using robust national standards like ISCII, which was created in 1988.
C-DAC trained people through PACE programs across India. Desktop publishing in Indian languages became an industry. “The standards were robust, intuitive, and non-ambiguous. That’s why adoption grew,” Pani recalled.
But the moment the internet took over, everything broke. The web adopted Unicode. Microsoft introduced half-baked Hindi support in Windows 2000 with Unicode and OpenType fonts, ignoring Indian standards. Other languages lagged. Ambiguities crept in: the same word written on one device looked different on another.
Search didn’t work. Documents stopped being interoperable. People started relying on PDFs, which froze text into unsearchable images. Engagement collapsed.
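To make the search problem concrete, here is a minimal Python sketch of one well-known kind of encoding ambiguity in Unicode, which may or may not be among the specific cases Pani has in mind. The Devanagari letter क़ can be stored either as a single code point or as क followed by a nukta sign. The two look identical on screen, but a plain string comparison treats them as different text. The helper names and the sample word are illustrative assumptions, not drawn from any real search system.

```python
import unicodedata

# Two ways to store the same Devanagari letter "qa" (क़):
PRECOMPOSED = "\u0958"        # one code point: DEVANAGARI LETTER QA
DECOMPOSED = "\u0915\u093c"   # two code points: KA followed by the NUKTA sign


def naive_contains(text: str, query: str) -> bool:
    """Plain substring search that compares raw code points."""
    return query in text


def normalized_contains(text: str, query: str) -> bool:
    """Substring search after normalising both sides to Unicode NFC."""
    return unicodedata.normalize("NFC", query) in unicodedata.normalize("NFC", text)


# The word क़लम ("pen"), stored two different ways (hypothetical sample data).
document = PRECOMPOSED + "\u0932\u092e"   # typed with the single-code-point letter
query = DECOMPOSED + "\u0932\u092e"       # typed as KA + NUKTA

naive_hit = naive_contains(document, query)            # raw comparison misses the word
normalized_hit = normalized_contains(document, query)  # normalised comparison finds it

print(naive_hit, normalized_hit)
```

Normalisation resolves only the sequences that Unicode itself defines as canonically equivalent. It does not help with text held in legacy font encodings or frozen into scanned images.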
“Internet usage in India was already picking up in cyber cafes in the 2000s. People got used to it in English. But in Indian languages, the experience was broken… the new generation never discovered how to type or search in their language. So engagement never happened,” Pani said.
Tech Trouble or Policy Problem?
As India bets big on AI, it is scanning and generating a lot of data in Indic languages to improve its AI models. Many believe this synthetic data is the magical path forward for Indic models. But how long can companies keep doing that while ignoring the old, rich, and valuable data in Indian libraries and old PDFs?
Pani said the IndiaAI Mission, the sovereign models, and the massive data collection drives are all built on shaky ground. Without fixing the basics of language standards, he added, all of it is “money wasted on synthetic, stale, and unnatural data.”
Pani, along with co-founder Arvind Pani and colleague Swati Shukla Bhaskar, took the issue to the Prime Minister’s Office, where they explained that Indian computing standards are controlled by global companies like Microsoft and the Unicode Consortium, and that these standards were never designed for the complexities of Indian scripts. The PM asked them to present to his office, which they did.
Meetings were arranged with the principal scientific advisor, the IT secretary, and the DST secretary. A committee was formed under Pushpak Bhattacharyya in May 2024 to evaluate encoding, fonts, search performance, and language technology applications for all 23 official languages of India.
This is exactly what Pani had wished would happen all this while. But one and a half years later, nothing has moved and “we don’t have a fixed character set for Hindi,” he said, adding, “the committees are the same old people defending the same broken standards that were created.”
When AIM reached out to Bhattacharyya for comment, he said that he is no longer the chair of the committee. Though there has been no formal announcement of the change, he said the chairman is now Abhay Karandikar, who is also the secretary to the Government of India in the Department of Science and Technology.
AIM also reached out to Karandikar for more details about the committee and its progress but had not received a response at the time of writing.
Pani said he has watched committees drag their feet on fixing standards. “We met PMO officials. They understood immediately. They even asked us: if this is so obvious, why doesn’t everyone say it? We told them, everyone knows it, but they think it’s beyond their power. It’s not a technology problem, it’s a policy problem.”
Engagement Over Synthetic Data
Pani said that people believe that “the reason we don’t have Indian language models is because there is no data. So they say let’s collect data. But the real question is, why is there no data? India has one of the largest internet user bases, but one of the least engaging ones. If people don’t write, search, and express in their languages, where will the data come from?”
Instead, India has gone down the synthetic data route. “Synthetic data is stale and expensive. The day you stop paying, it goes stale. Natural data comes from engagement. If engagement doesn’t happen, this will never be sustainable,” he said.
That absence of engagement is the root cause of India’s AI data problem.
India’s most valuable language data — publishing, DTP, government documents — is locked in PDFs, unusable for modern AI as they were scanned in ISCII.
The core issue is that Indian language standardisation was never democratically decided, said Pani. “When Unicode was adopted, nobody questioned it. Whatever version they picked up, it was implemented for the country. Unicode never consulted anyone in India.”
Ironically, the new committees set up to “fix” standards are highly democratic — but not in a way that helps. “They brought in linguists, Microsoft, Samsung, Adobe, everyone under the sun. In the meetings it was like Reverie on one side, and the rest of the world on the other,” Pani said.
The frustration is evident. “People defend broken standards, give strange answers like ‘AI needs noise’. But this noise is not human noise, it’s noise created by flawed standards. And because of it, we have lost three decades of engagement,” he added.
For Pani, the answer is clear: fix the standards, teach them in schools, and use the same standards across devices. Only then will Indians engage in their languages online. Otherwise, all our sovereign AI dreams are nothing but synthetic data dressed up as progress.
The post India’s AI Push Might Be Pointless Without National Language Standardisation appeared first on Analytics India Magazine.



