Honestly, I was a bit surprised when I saw this data: Xiaomi has already reached this level in speech synthesis.
OmniVoice is a project that Xiaomi's next-generation Kaldi team (k2-fsa) has just open-sourced: a zero-shot text-to-speech model that supports over 600 languages and achieves state-of-the-art results on multiple metrics. More importantly, it is fully open source, with both the code and pretrained models available on GitHub and Hugging Face.

How strong is the accuracy of Chinese recognition?
First, a data point: on the Seed-TTS Chinese test set, OmniVoice's word error rate (WER) is only 0.84%.
To put that number in perspective: on multilingual benchmarks it surpasses mainstream commercial models such as ElevenLabs v2 and MiniMax, leading on both speaker similarity (SIM-o) and WER.
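For readers unfamiliar with the metric, WER is the token-level edit distance between a reference transcript and the recognized transcript of the synthesized audio, divided by the reference length; for Chinese it is typically computed over characters. A minimal sketch (this is the standard definition, not OmniVoice's evaluation code):

```python
def wer(reference, hypothesis):
    """Levenshtein distance over tokens, normalized by reference length."""
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = edits to turn the first i ref tokens into the first j hyp tokens
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(r)

# One wrong character out of five -> 20% WER
print(wer("今天天气好", "今天天汽好"))  # 0.2
```

A 0.84% WER means fewer than one character in a hundred is misrecognized when the synthesized speech is transcribed back.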
Honestly, ElevenLabs has long been the benchmark in speech synthesis; for an open-source project from Xiaomi to beat it on these metrics is no small thing.
What is the experience of being 40 times faster?
OmniVoice's real-time factor (RTF) is as low as 0.025. Simply put, RTF = 1 means synthesis runs exactly as fast as real-time playback; RTF = 0.025 means synthesis is 40 times faster than real time.
In other words, one minute of speech can be synthesized in just 1.5 seconds. That is hugely valuable for scenarios that need large volumes of generated audio, such as audiobooks, voice assistants, and game dubbing.
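The arithmetic above follows directly from the definition RTF = synthesis_time / audio_duration:

```python
# Real-time factor: RTF = synthesis_time / audio_duration.
# RTF = 0.025 means producing 1 s of audio takes 0.025 s of compute.

def synthesis_time(audio_seconds, rtf):
    """Seconds of compute needed to synthesize the given audio length."""
    return audio_seconds * rtf

print(synthesis_time(60, 0.025))  # 1.5 -> a 1-minute clip in 1.5 s
print(1 / 0.025)                  # 40.0 -> 40x faster than real time
```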
What are the differences in technical architecture?
OmniVoice uses a discrete, non-autoregressive architecture in the style of a diffusion language model.
The core advantage of this design is that it generates speech from text in a single stage, bypassing the intermediate semantic-token step used in traditional pipelines. The process is simpler, but quality is not compromised.
It also uses a full-codebook random masking strategy combined with initialization from a pretrained LLM. Together, these make training more efficient and improve the clarity and intelligibility of the output.
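To make the masking idea concrete, here is a minimal sketch of mask-and-predict training data preparation. This is my own illustration of the general technique, not OmniVoice's code; the MASK token id and mask ratio are assumptions:

```python
import random

MASK = -1  # hypothetical mask token id

def random_mask(codes, mask_ratio, seed=0):
    """Randomly replace a fraction of discrete speech tokens with MASK.

    'codes' is a list of codebook indices for one utterance. In
    mask-and-predict training, the model learns to recover the masked
    positions from the text plus the unmasked tokens; at inference it
    starts from a fully masked sequence and fills it in parallel,
    rather than predicting one token at a time autoregressively.
    """
    rng = random.Random(seed)
    masked = list(codes)
    n_mask = int(len(codes) * mask_ratio)
    for i in rng.sample(range(len(codes)), n_mask):
        masked[i] = MASK
    return masked

codes = [17, 4, 201, 88, 53, 9, 140, 66]
print(random_mask(codes, 0.5))  # half the tokens replaced by MASK
```

Parallel infilling over all positions is what makes the non-autoregressive design fast enough to reach an RTF of 0.025.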
3 seconds of audio can clone your voice
This is the feature I find most interesting: zero-shot voice cloning.
Only 3-10 seconds of reference audio is needed to clone a high-quality voice. The voice can also be customized through a natural-language description: gender, age, tone, accent, dialect, even a whispered style.
Imagine uploading a 5-second recording and telling it "use this voice, but younger, with a bit of a southern accent", and it generates a voice that matches. That is a lot of fun to play with.
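As a sketch of what such a call might look like: the function name, parameters, and filenames below are purely hypothetical, not the actual OmniVoice API. The point is the shape of the inputs: target text, a short reference clip, and an optional style description.

```python
def synthesize(text, reference_wav=None, style_prompt=None):
    """Hypothetical zero-shot cloning call; a real model would return audio.

    Here we just assemble the request to show the three inputs involved.
    """
    return {
        "text": text,
        "reference_wav": reference_wav,  # 3-10 s clip to clone the voice from
        "style_prompt": style_prompt,    # natural-language voice description
    }

req = synthesize(
    "欢迎使用语音克隆。",
    reference_wav="my_5s_recording.wav",
    style_prompt="same voice, but younger, with a slight southern accent",
)
print(req["reference_wav"])  # my_5s_recording.wav
```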
Digital protection of minority and endangered languages
Coverage of over 600 languages is OmniVoice's biggest highlight.
For minority and endangered languages, this technology matters a great deal. Traditional speech synthesis requires large amounts of annotated data at very high cost; OmniVoice needs only a small number of samples to generate high-quality speech.
This means languages spoken by only a few thousand people also have a chance to be digitized and preserved: not just a technical breakthrough, but a contribution to cultural preservation.
There are also some handy smaller features
It supports nonverbal tags, such as [laugh] for laughter
It supports pronunciation correction: pinyin or phonetic symbols give precise control over pronunciation, which is especially useful for Chinese and dialects
These details make it not just "able to speak", but able to speak well and speak accurately.
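The bracketed-tag convention can be handled with a tiny parser. This helper is my own illustration; the [laugh] syntax comes from the feature list above, but how the model itself tokenizes these tags is not documented here:

```python
import re

def split_tags(script):
    """Split a script into plain-text spans and nonverbal tags like [laugh]."""
    parts = re.split(r"(\[[a-z]+\])", script)
    return [p for p in parts if p]  # drop empty spans

print(split_tags("That was hilarious [laugh] truly."))
# ['That was hilarious ', '[laugh]', ' truly.']
```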
What does open source mean?
Both the code and the models are on GitHub and Hugging Face, and developers can deploy them locally or integrate them into their own applications.
For speech synthesis, an open-source model at SOTA level dramatically lowers the barrier to entry. Small teams and individual developers no longer need to spend heavily on commercial APIs; they can deploy and run it themselves.
This may spark a new wave of voice applications. Audiobooks, virtual streamers, game voice acting, language-learning tools... there is plenty of room for imagination.