In recent years, large language models (LLMs) have unified a wide range of text-based tasks under a single architecture. One model can code, translate, summarise and generate with remarkable fluidity. This unification, however, has not yet reached the world of speech.

Existing models are either fast but inaccurate or accurate but slow, largely due to high audio token rates of around 50 tokens per second and fixed input padding, both of which drive up compute costs. Kalpa Labs, a speech-focused AI startup, is building fast, multilingual, real-time speech-to-text models that address the latency and inefficiency of systems like Whisper.
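
A quick back-of-the-envelope calculation shows why this matters. The sketch below assumes a Whisper-style fixed 30-second input window together with the 50-tokens-per-second rate quoted above; both figures are used purely for illustration, not as a description of any particular deployment.

```python
# Back-of-the-envelope illustration of the padding overhead described above.
# Assumptions for illustration only: a fixed 30-second input window (Whisper-style)
# and an audio token rate of 50 tokens per second (the rate quoted in the article).

WINDOW_SECONDS = 30        # assumed fixed input window
TOKENS_PER_SECOND = 50     # audio token rate quoted above

def encoder_tokens(speech_seconds: float, fixed_window: bool) -> int:
    """Tokens the encoder must process for one utterance."""
    seconds = WINDOW_SECONDS if fixed_window else speech_seconds
    return int(seconds * TOKENS_PER_SECOND)

for speech in (2, 5, 15):
    padded = encoder_tokens(speech, fixed_window=True)
    trimmed = encoder_tokens(speech, fixed_window=False)
    print(f"{speech:>2}s of speech: {padded} tokens padded vs {trimmed} unpadded "
          f"({padded / trimmed:.0f}x overhead)")
```

On these assumptions, a two-second utterance costs the encoder 1,500 tokens instead of 100, a fifteen-fold overhead spent mostly on silence.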

Kalpa Labs aims to reduce audio token rates, eliminate unnecessary padding with configurable “register” tokens and use sparse architectures such as mixture-of-experts. The goal is to combine the speed of smaller models with the accuracy of larger ones, improving real-time transcription and the user experience across languages.
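
Mixture-of-experts is a standard way to make a large model cheaper per token: a router sends each token to only one (or a few) of several expert sub-networks, so most parameters stay idle on any given step. The snippet below is a generic, minimal top-1-routed layer in PyTorch for illustration only; it is not Kalpa Labs’ architecture, and every size and name in it is illustrative.

```python
# Minimal sketch of a top-1 routed mixture-of-experts layer (generic textbook idea).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model: int = 256, d_ff: int = 512, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # scores each token for every expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token is routed to its single best expert,
        # so only a fraction of the parameters are active per token (the "sparse" part).
        gate = F.softmax(self.router(x), dim=-1)      # (tokens, num_experts)
        weight, expert_idx = gate.max(dim=-1)         # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: 10 "audio tokens" of width 256, each processed by exactly one expert.
y = TinyMoE()(torch.randn(10, 256))
```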

“If you look at speech models right now, they’re still very fragmented,” Prashant Shishodia, co-founder and CEO of Kalpa Labs, told AIM in an exclusive interaction. “There are specialised models for transcription, text-to-speech, voice cloning, but nothing that can do all of these seamlessly. Our proposition is to take speech models from the 2019 era into the 2025 era of LLMs.”

Key Differentiator

The ambition is clear: build unified speech models that can handle speech-to-text, text-to-speech, voice cloning and even audio editing within a single framework. “Right now, none of the open-source or private models do audio editing. That’s a key capability we’re aiming to unlock,” he added.

Kalpa Labs is solving for both functionality and human-likeness in interaction. Current speech systems, even the most advanced, cannot hold a natural, full-duplex conversation, one in which the system listens and speaks at the same time.

The startup’s vision is to make AI capable of holding long, truly human-like conversations, ones that are responsive, interruptible and sensitive to nuance. “The next frontier is not just expressive voices. It’s about building systems that behave like humans in a real call,” the team noted.

Kalpa Labs has recently pivoted into this space following earlier explorations. “We recently pivoted to this idea, and we’ll be releasing a beta version in the next few weeks,” the co-founder shared. “We are positioning ourselves at the foundational layer.”

YC Acceptance and Global Network

Kalpa Labs has been accepted into the Y Combinator Fall 2025 batch, one of only a handful of India-based teams to make the cut.

“YC has invested heavily in voice AI companies, and we wanted to tap into that network,” Shishodia said. “Many of our potential customers, companies already using ElevenLabs for text-to-speech or Deepgram for transcription, are part of that ecosystem. Accessing them through YC was invaluable.”

However, the process wasn’t without hurdles. From regulatory complexities in India to visa challenges, the team navigated multiple obstacles just to get in. “By the time we receive the YC cheque, the batch will be over,” he noted with a wry smile. “But despite the challenges, it was worth it.”

Scaling the Technology

From the outset, Kalpa Labs has been building for scale. All models will be provided via APIs, but the team is also considering open-sourcing edge-capable models.

“There’s no good speech-to-text model that I can run on my Mac today. We want to change that,” Shishodia said. “For sensitive data, people don’t want to rely on cloud services. We’re building edge models with accuracy on par with the largest Whisper models, but runnable locally.”

On the compute side, the approach is pragmatic. “Right now, we have credits to cover us, but to get to state-of-the-art, we’ll need to train extremely large models. The way forward is to train big, then distil down to smaller, more efficient versions.”
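
The “train big, then distil” approach typically means training a compact student model to match a large teacher’s output distribution while still learning from the reference transcripts. Below is a minimal sketch of the standard distillation loss; the temperature and mixing weight are illustrative defaults, and nothing here describes Kalpa Labs’ actual training recipe.

```python
# Generic knowledge-distillation loss (Hinton et al.-style), shown for illustration.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: the student mimics the large teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the reference transcript tokens.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```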

Benchmarks and the Illusion of Progress

Speech AI benchmarks have long been a contested measure of progress. “A few years ago, people claimed speech-to-text had surpassed human accuracy. That wasn’t true,” he insisted. “Models were overfitting to benchmarks. In the wild, with noisy environments, multilingual code-mixing and low-quality microphones, accuracy drops drastically.”

Kalpa Labs is therefore focused on real-world performance, not benchmark gaming. “The whole field needs a refresh in speech benchmarks,” the team argued. “Otherwise we’re fooling ourselves into thinking we’ve solved the problem.”

Tackling Multilingual Complexity

One of Kalpa Labs’ most significant differentiators is its focus on Indian languages and code-mixed speech.

“During tests, models perform well on formal, textbook-style Hindi, but fail on the informal, Romanised chat Hindi that people actually use,” he explained. To fix this, Kalpa Labs is partnering with companies to license large-scale, in-the-wild Indian datasets.

“Without that, building better Indian models is a lost cause,” Shishodia said. “We believe our work can be a step change for Indian languages, and at the same time, competitive at a global level.”

Technical Core: Transformers and Feedback Loops

At the heart of Kalpa’s models are autoregressive transformers, predicting audio chunks sequentially. “We’re experimenting with predicting larger chunks to make the models faster,” Shishodia shared.
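
The speed-up from predicting larger chunks is easy to see schematically: if each forward pass emits k tokens instead of one, the number of sequential model calls drops by roughly a factor of k. The toy loop below uses a placeholder model purely to illustrate that counting argument; it is not how Kalpa’s decoder works.

```python
# Schematic autoregressive decoding that emits a chunk of tokens per step.
# The "model" here is a stand-in; the point is only the number of sequential calls.
from typing import Callable, List

def generate(model: Callable[[List[int]], List[int]],
             prompt: List[int], total_tokens: int, chunk_size: int = 1) -> List[int]:
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < total_tokens:
        tokens.extend(model(tokens)[:chunk_size])  # one forward pass yields chunk_size tokens
        steps += 1
    print(f"chunk_size={chunk_size}: {steps} sequential model calls")
    return tokens

# Toy stand-in model that always proposes a fixed chunk of next tokens.
dummy = lambda ctx: [0] * 8
generate(dummy, prompt=[1, 2, 3], total_tokens=64, chunk_size=1)  # 64 calls
generate(dummy, prompt=[1, 2, 3], total_tokens=64, chunk_size=8)  # 8 calls
```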

The company also emphasises continuous feedback loops. Drawing inspiration from platforms like Midjourney, it wants users’ editing choices to improve the models directly. “Audio editing creates a natural multi-turn feedback cycle. That’s a powerful way to learn what users really want.”

What’s in Store for Kalpa Labs? 

Kalpa Labs is ambitious but realistic about timelines. “AI timelines are short; it’s very hard to predict five years out,” he reflected. “Our immediate goal is to reach state-of-the-art in speech-to-text and text-to-speech within six months, competing with players like ElevenLabs and Deepgram. Beyond that, we want to push the frontier towards truly human-like speech AI.”

Kalpa Labs is chasing a bold vision: to do for speech what GPTs did for text. From unifying fragmented tasks to fixing real-world conversational flaws, from building edge-capable models to solving India’s multilingual complexity, the startup is betting that the next frontier of AI isn’t just about what machines say, but how naturally they say it.

“The next step is not just more expressive voices. It’s making machines talk like humans, in real conversations, in real settings, in real time,” Shishodia summed it up.
