Microsoft's open-source speech model VibeVoice takes off: 27K GitHub stars and support for 90 minutes of long audio

Microsoft has made a big move this time.

A few days ago, Microsoft open-sourced a family of speech AI models called VibeVoice, covering speech recognition (ASR) and text-to-speech (TTS). The project took off as soon as it landed on GitHub, quickly gaining 27K stars, which puts it among the hottest AI open-source projects.

Why is it so popular? Because it tackles several hard problems in voice AI.

Three models, each with its own strengths, covering the full range of scenarios

VibeVoice is not a single model but a family. Its three core members each have their own role:

VibeVoice ASR-7B: speech-to-text, capable of processing 60 minutes of audio in one pass. The output is not just text: it also tells you who is speaking, when it was said, and what was said, as structured output that can be used directly. It supports over 50 languages and is well suited to long-audio scenarios such as meeting minutes and podcast transcription.

VibeVoice TTS-1.5B: text-to-speech, capable of generating 90 minutes of continuous audio in one pass. Its most impressive feature is support for natural conversation among four different speakers, along with simulated pauses, emphasis, and emotional transitions. It is a strong fit for podcasts, audiobooks, and multi-character dialogue content.

VibeVoice Realtime-0.5B: real-time TTS, with a first-audio latency of only 300 milliseconds. It suits scenarios that require an immediate response, such as real-time voice assistants and live dubbing.

Together, these three models cover most of the mainstream needs of voice AI. Long-audio processing, multi-speaker consistency, and real-time low latency are three pain points that traditional voice AI has not solved, and VibeVoice can be seen as an answer to them.
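
To make the "who, when, what" structured output more concrete, here is a minimal Python sketch of how speaker-tagged, timestamped segments could be turned into meeting minutes. The `Segment` fields and the `format_minutes` and `talk_time` helpers are hypothetical names chosen for illustration; they are not VibeVoice's actual output schema or API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized utterance (hypothetical layout, not the real schema)."""
    speaker: str   # who is speaking
    start: float   # when it starts, in seconds
    end: float     # when it ends, in seconds
    text: str      # what was said


def format_minutes(segments):
    """Render segments as '[MM:SS] speaker: text' lines, ordered by start time."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        minutes, seconds = divmod(int(seg.start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


def talk_time(segments):
    """Sum the speaking time in seconds for each speaker."""
    totals = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


if __name__ == "__main__":
    demo = [
        Segment("Speaker 2", 65.0, 70.5, "Agreed."),
        Segment("Speaker 1", 0.0, 4.0, "Let's begin."),
    ]
    print(format_minutes(demo))
    print(talk_time(demo))
```

Whatever the real output format looks like, keeping the speaker, the timestamps, and the text as separate fields is what allows downstream tools such as meeting summaries or subtitle export to use the result directly without re-parsing free text.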

Open source under the MIT license, free local deployment

This may be the most exciting part for developers.

VibeVoice is released under the MIT license, supports local deployment, and requires no cloud subscription fees. This means you can run it entirely on your own servers without worrying about API call fees, data leakage, or service interruptions.

For enterprise users, this matters a great deal. Voice data often involves sensitive information such as meeting content, customer conversations, and internal communication, and there are always concerns when it is sent to third-party cloud services. Local deployment addresses this by keeping the data entirely under your own control.

Moreover, the MIT license is one of the most permissive open-source licenses, and it allows commercial use. This time Microsoft has gone genuinely open source, not the "source is available to look at" kind.

Briefly taken down, then restored with safety mechanisms

There was one small incident: the project was briefly taken down over potential misuse risks. This is understandable, since speech synthesis technology does carry a risk of abuse, such as forging speech or creating fake audio.

Microsoft later brought the project back online with safety mechanisms such as embedded audio watermarks and audible disclaimers. This reflects a responsible approach to AI development that balances openness against risk, and it was handled well.

Developers can now obtain the model weights on GitHub and Hugging Face, and can quickly try them out through Colab. The community is also actively contributing: for example, an optimized fork for Apple Silicon has been released, making the models more convenient for Mac users.

Developers have already built a practical tool

The biggest benefit of open source is that the community helps extend the project. Developers have already built a voice input method called Vibing on top of VibeVoice ASR-7B, and it supports macOS and Windows.

According to user feedback, recognition speed and accuracy are both good, and everyday voice input has become noticeably more efficient. Open source has greatly shortened the distance between model and application.

This illustrates a point: good open-source projects are not only technically strong but also easy to use. That developers could quickly turn VibeVoice into a practical tool suggests that its interface design and documentation are in good shape.

What does this mean? The barrier to entry for voice AI has dropped significantly

First, high-performance voice AI is no longer exclusive to the giants. Previously, long-audio processing and multi-speaker conversation meant either building from scratch (at a high cost) or calling cloud service APIs (also at a considerable cost). Now that an open-source solution is available, small and medium-sized teams and individual developers can use top-tier technology.

Second, local deployment becomes possible. For enterprises with data security requirements, this is a significant benefit: there is no longer any need to agonize over whether to send voice data to a third party.

Third, voice AI applications will reach real-world use faster. The barrier to innovation in scenarios such as content creation, accessibility tools, voice interaction, and meeting recording has been lowered, and more VibeVoice-based applications are likely to emerge.

What happens next? The "Stable Diffusion moment" of voice AI

By analogy, VibeVoice may become the "Stable Diffusion" of voice AI: open source, powerful, and a catalyst for a surge of community activity.

After Stable Diffusion was open-sourced, image generation applications exploded in popularity. If VibeVoice takes the same path, voice AI applications will see a similar wave. Podcast production tools, meeting assistants, multilingual translation, audiobook generation, voice games... there is plenty of room for imagination.

Of course, voice AI is more sensitive than image generation and carries a higher risk of abuse. Microsoft has added watermarks and disclaimers, but it remains to be seen how the community uses the models and whether anyone misuses them. Overall, though, the benefits of open source outweigh the drawbacks, and technological progress and risk management can go hand in hand.

HM.AI
