Who We Are
Vodex is an AI-native voice automation company building the next generation of speech intelligence for real-world, high-stakes conversations, especially in industries like mortgage, insurance, and collections. At our core, we're a research-driven team of engineers, scientists, and builders dedicated to solving one of the hardest problems in AI: making machines talk, sound, and think like humans over telephony infrastructure.
Our founders bring decades of experience in AI/ML, speech synthesis, and NLP, having worked with foundational technologies long before they were mainstream. From early experiments in 2020 using NVIDIA Tacotron 2 to developing fallback NLU stacks based on FAISS and Haystack-RAG with BLOOM in 2021, Vodex has always stood at the edge of what’s possible in voice automation.
Why We Built Our Own TTS
Even though Vodex was never intended to be a TTS provider, we built our own Text-to-Speech model out of necessity, not ambition.
Why? Because existing TTS solutions lacked naturalness, expressiveness, and, most critically, control. In real sales or support conversations, the ability to laugh, sigh, pause, or express uncertainty isn’t optional; it’s essential. Off-the-shelf models often sounded robotic or collapsed entirely in narrowband settings (e.g., 8kHz telephony), making them unusable for production-grade, agentic AI systems.
So, we went back to the drawing board.
Our pretrained TTS model is based on the Orpheus core architecture and trained from scratch on over 21,000 hours of diverse and expressive data. Unlike conventional implementations, we integrated our Zen-Tokenizer at the audio tokenization layer, which gives us superior control, expressiveness, and compatibility with both narrowband (8kHz) and wideband (16kHz) voice pipelines.
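To make that concrete, here is a minimal sketch of the general LM-over-codec-tokens pattern that Orpheus-style systems follow: a language model predicts discrete audio token ids, and a codec tokenizer decodes them back into a waveform. Every name and parameter below (ToyCodec, the frame length, the codebook size) is an illustrative stand-in, not our actual API.

```python
import numpy as np

class ToyCodec:
    """Illustrative stand-in for a neural audio codec tokenizer such as
    Zen-Tokenizer: fixed-length frames are snapped to the nearest entry
    in a codebook. A real codec uses a learned encoder and RVQ stacks."""
    def __init__(self, codebook_size: int = 1024, frame_len: int = 160):
        # 160 samples = 20 ms per frame at an 8kHz sample rate
        rng = np.random.default_rng(0)
        self.codebook = rng.standard_normal((codebook_size, frame_len)).astype(np.float32)
        self.frame_len = frame_len

    def encode(self, wav: np.ndarray) -> np.ndarray:
        n = len(wav) // self.frame_len * self.frame_len
        frames = wav[:n].reshape(-1, self.frame_len)
        # nearest-codeword lookup: one discrete token id per audio frame
        dists = ((frames[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)

    def decode(self, token_ids: np.ndarray) -> np.ndarray:
        return self.codebook[token_ids].reshape(-1)

def synthesize(text: str, lm_generate, codec: ToyCodec) -> np.ndarray:
    """Text in, waveform out: the LM predicts codec token ids from the
    text, and the codec decoder turns those ids back into audio."""
    audio_token_ids = lm_generate(text)  # stands in for autoregressive generation
    return codec.decode(np.asarray(audio_token_ids))

# e.g. wav = synthesize("hello", lambda t: [3, 14, 15], ToyCodec())
```

The key design point is that expressiveness lives in the token stream: whatever the tokenizer can represent (laughs, sighs, pauses), the language model can learn to generate.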
Voices That Feel Human
Our current model supports three distinctive speakers:
Shreya – The most soft-spoken and emotionally expressive voice in our benchmarks. Ideal for empathetic and human-like agentic use cases.
Shweta – Trained to sound like a professional audiobook narrator, offering a smooth, composed, and clear storytelling tone.
Astha – A latency-optimized voice that preserves expressiveness while minimizing response time, developed as an internal experiment to reduce overhead without sacrificing naturalness.
We’ve achieved a Time-to-First-Byte (TTFB) of just 189ms on average, making our model one of the fastest expressive TTS systems available for real-time deployment. We're actively refining the codebook structure within Zen-Tokenizer and testing distributed deployment techniques to bring that latency down further, targeting sub-80ms response times in production environments.
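For readers who want to sanity-check numbers like these themselves, here is a hedged sketch of how one might measure TTFB against any streaming TTS endpoint; the URL and payload are hypothetical placeholders, not a documented Vodex API.

```python
import time
import requests  # third-party: pip install requests

def measure_ttfb_ms(url: str, payload: dict) -> float:
    """Time from sending the request to receiving the first audio byte."""
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        next(resp.iter_content(chunk_size=1))  # blocks until the first byte arrives
    return (time.perf_counter() - start) * 1000.0

# Hypothetical usage:
# print(measure_ttfb_ms("https://tts.example.com/stream",
#                       {"text": "Hello there!", "voice": "Shreya"}))
```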
Introducing Zen-Tokenizer: A Breakthrough in Audio Tokenization
Today, we're excited to announce a major milestone: the Zen-Tokenizer, our proprietary Neural Audio Codec tokenizer designed from the ground up for speech-to-speech agent architectures.
Zen-Tokenizer can preserve up to 92% of perceptual audio quality even when operating at an 8kHz sample rate, where the Nyquist-Shannon sampling theorem caps representable frequency content at 4kHz, the fundamental limit that has traditionally hampered expressive audio synthesis at low bandwidths.
Built for both narrowband (8kHz) and wideband (16kHz) audio, Zen-Tokenizer is optimized for telephony-grade deployments, unlike most tokenizers, which are built for clean, high-fidelity lab environments. This makes it a foundational building block for the agentic voice stack of the future.
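To see why the sample rate matters so much, here is a back-of-the-envelope calculation under assumed frame and codebook sizes (illustrative values only, not Zen-Tokenizer's actual configuration), relating sample rate to the Nyquist limit, token rate, and per-codebook bitrate:

```python
import math

FRAME_LEN = 320       # assumed samples per codec frame
CODEBOOK_SIZE = 1024  # assumed entries per codebook (10 bits per token)

for sample_rate in (8_000, 16_000):
    nyquist_hz = sample_rate / 2          # highest representable frequency
    tokens_per_s = sample_rate / FRAME_LEN
    kbps = tokens_per_s * math.log2(CODEBOOK_SIZE) / 1000
    print(f"{sample_rate} Hz: Nyquist {nyquist_hz:.0f} Hz, "
          f"{tokens_per_s:.0f} tokens/s, {kbps:.2f} kbit/s per codebook")
# Real codecs stack several residual codebooks (RVQ), multiplying the bitrate.
```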
What’s Next
We’ve watched and learned from pioneers like Kyutai (Moshi’s full-duplex and Mimi tokenizer work), the SNAC neural codec, Canopy Labs, and Sesame with their CSM stack, as well as papers like Carson et al. (2025). These teams inspire us, but we also realized something critical: existing research pipelines are too slow for truly agentic, voice-first AI systems.
So instead of waiting, we started building.
In early internal experiments, we’ve had success with a multimodal architecture, one that takes text or speech as input and generates both text and speech as output. This is not just exciting; it’s foundational to the future of human-AI interaction.
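To give a flavor of what that framing can look like in token space, here is one hedged sketch: text and audio tokens share a single vocabulary, so one decoder can consume either modality and emit both. All ids and offsets below are invented for illustration and say nothing about our actual scheme.

```python
# Invented vocabulary layout for illustration only.
TEXT_VOCAB = 32_000                    # assumed text vocabulary size
SOA, EOA = TEXT_VOCAB, TEXT_VOCAB + 1  # start/end-of-audio sentinel tokens
AUDIO_BASE = TEXT_VOCAB + 2            # codec ids remapped above text + sentinels

def interleave(text_ids: list[int], audio_ids: list[int]) -> list[int]:
    """One shared sequence: the text, then the matching speech tokens
    wrapped in sentinels, so a single decoder learns to emit both modalities."""
    return text_ids + [SOA] + [AUDIO_BASE + a for a in audio_ids] + [EOA]

print(interleave([17, 942, 5], [3, 511, 88]))
# -> [17, 942, 5, 32000, 32005, 32513, 32090, 32001]
```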
As part of our commitment to the open research community, we’re preparing to open source our pretrained TTS model (note: not fine-tuned), built with extensive enhancements in training, generation, and token control using Zen-Tokenizer.
We plan to release this model in the coming weeks. If you're interested in early access or want to be notified, we invite you to request it by clicking below.
Request access
Join the Journey
Agentic AI in voice isn’t just a buzzword for us; it’s a long-term mission. If you're building, researching, or dreaming in this space, come collaborate with us. At Vodex, we believe the next interface isn't a screen. It’s a voice. Feel free to reach out to us at contact@vodex.ai.