HM.AI: Mistral AI releases Voxtral Transcribe 2: a new benchmark for voice transcription and speaker separation

On February 5, 2026, Mistral AI officially launched Voxtral Transcribe 2, which includes two models: Mini Transcribe V2 and Realtime. The release adds speaker diarization and sub-200ms latency aimed at real-time applications. The announcement tweet received nearly 300,000 views within hours of its release, drawing widespread community attention.

Key points of the official release

The Voxtral Transcribe 2 series marks a significant leap forward for Mistral AI in audio processing. According to the official announcement, the series is not a single model but a model family that covers both batch processing and real-time streaming, aiming to address latency problems and privacy concerns in enterprise voice workflows. Mistral AI has made edge deployment and a privacy-first approach its core differentiators, seeking to carve out a new position in a market dominated by OpenAI Whisper.

Model Family and Positioning

Voxtral Transcribe 2 includes two models optimized for different scenarios:

Voxtral Mini Transcribe V2: Designed for pre-recorded audio, with an emphasis on accuracy and cost-effectiveness. According to Mistral AI, it provides industry-leading transcription accuracy at low cost, which suits non-real-time scenarios such as meeting archiving and media subtitle generation.

Voxtral Realtime: The highlight of this release, designed for live streaming, voice assistants, and real-time translation. Its architecture supports streaming input, and latency can be configured to below 200 milliseconds (sub-200ms), which is close to the natural response threshold of human conversation and removes the lag of the older record-first, transcribe-later workflow.

Enterprise-level key capabilities

To address common pain points in enterprise deployments, Mistral AI has built several advanced functions into the new models, so users can run a complete workflow without relying on third-party plugins:

Speaker Diarization: The model identifies who is speaking and generates transcribed text with speaker labels. This is crucial for meeting minutes and multi-person interview analysis, where traditional models output a single undifferentiated block of text that does not distinguish between speakers.

Context Biasing: Users can pass up to 100 specific words or phrases (such as names, technical terms, or product codes) through the API. The model gives priority to these entries when transcribing, which reduces the tendency of general-purpose models to mishear rare words in specialized domains.

Word-level timestamps: The model provides precise start and end times for each word, laying the data foundation for automatic video subtitle alignment and audio content search.

Long-form audio support: A single request can handle up to 3 hours of audio, enough to cover the vast majority of marathon meetings or lengthy interviews, which greatly simplifies the chunking logic developers have to write.
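The two limits mentioned above (up to 100 bias terms and up to 3 hours of audio per request) can be checked on the client side before a request is sent. The following Python sketch is illustrative only: the function names `prepare_bias_terms` and `plan_chunks` are hypothetical and not part of any Mistral SDK, and splitting long audio into equal parts is an assumption. Only the two limit values come from the announcement.

```python
import math

MAX_BIAS_TERMS = 100          # limit stated in the announcement
MAX_AUDIO_SECONDS = 3 * 3600  # 3-hour per-request limit stated in the announcement


def prepare_bias_terms(terms):
    """Trim whitespace, drop empty and duplicate terms, and enforce the cap.

    Duplicates are detected case-insensitively; the first spelling is kept.
    (Hypothetical helper, not an official API.)
    """
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        key = term.casefold()
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    if len(cleaned) > MAX_BIAS_TERMS:
        raise ValueError(
            f"{len(cleaned)} bias terms given, but at most {MAX_BIAS_TERMS} are allowed"
        )
    return cleaned


def plan_chunks(duration_seconds):
    """Return (start, end) offsets in seconds, each span within the 3-hour limit.

    Audio that already fits in one request yields a single span.
    Longer audio is divided into the smallest number of equal spans.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    count = math.ceil(duration_seconds / MAX_AUDIO_SECONDS)
    size = duration_seconds / count
    return [(i * size, min((i + 1) * size, duration_seconds)) for i in range(count)]


if __name__ == "__main__":
    print(prepare_bias_terms([" Voxtral ", "voxtral", "", "Mistral"]))
    print(plan_chunks(4 * 3600))  # a 4-hour recording needs two requests
```

In practice a splitter would probably cut at silences rather than at fixed offsets, so that no word is divided between two requests; the equal split here only shows the arithmetic.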

Language Coverage and Privacy Deployment

In terms of language support, Voxtral Transcribe 2 covers 13 major languages, including Chinese, English, French, German, Japanese, Korean, Spanish, Russian, Italian, Portuguese, Dutch, Arabic, and Hindi. This native multilingual capability enables it to directly serve the global business of multinational corporations.

In terms of deployment flexibility, Mistral AI has once again followed through on its commitment to openness. The Voxtral Realtime model weights have been open-sourced under the Apache 2.0 license. This means developers can not only call the API but also download the model and deploy it on servers and on edge devices such as AI PCs and even high-end mobile devices. Because the data never leaves the local environment, this offers a compliance-friendly option for financial, healthcare, and government scenarios with extremely high privacy requirements.

In terms of pricing, Mistral continues to maintain a highly competitive strategy:

Mini Transcribe V2 API: priced at $0.003/minute, designed to drive the digitization of large-scale audio data at an extremely low cost.

Realtime API: priced at $0.006/minute. Although this is twice the price of the batch model, it remains highly cost-effective for real-time interactive scenarios.
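To put the per-minute prices in perspective, the short Python sketch below estimates the cost of a given amount of audio. The two prices are the ones quoted above; the dictionary keys and the function name `estimate_cost` are illustrative labels chosen for this example, not official model identifiers.

```python
from decimal import Decimal

# Per-minute prices quoted in the article (USD). Keys are illustrative labels.
PRICE_PER_MINUTE = {
    "mini_transcribe_v2": Decimal("0.003"),
    "realtime": Decimal("0.006"),
}


def estimate_cost(minutes, model):
    """Return the estimated cost in USD for `minutes` of audio."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    # Decimal avoids binary floating-point rounding noise in money arithmetic.
    return Decimal(str(minutes)) * PRICE_PER_MINUTE[model]


if __name__ == "__main__":
    archive_minutes = 1000 * 60  # a 1,000-hour audio archive
    for model in PRICE_PER_MINUTE:
        print(f"{model}: ${estimate_cost(archive_minutes, model):.2f}")
```

At these rates, a 1,000-hour archive comes to $180 with the batch model and $360 with the Realtime model.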

Analysis of typical application scenarios

Based on the official technical specifications, Voxtral Transcribe 2 can support the following five core scenarios, improving business efficiency and user experience:

Intelligent meeting minutes and archiving: Using speaker diarization, the system can automatically generate script-like dialogue records (e.g. "Speaker A: Need to confirm next week's progress"). Combined with word-level timestamps, users can click on the text to jump directly to the corresponding recording clip, greatly improving the efficiency of post-meeting review.

Ultra-low-latency voice agents: What does sub-200ms latency mean? In human conversation, the natural pause between turns is usually 200-500ms. Voxtral Realtime brings transcription delay below this range, so AI voice assistants can listen and answer more like real people, reducing the awkward thinking pause of earlier voice interactions and giving machines the ability to interject and respond quickly.

Real-time call center assistance: While a customer service call is in progress, the Realtime model can generate a live text stream and work with a backend LLM to analyze customer sentiment, extract key intents, and recommend scripts to agents in real time. This ability to analyze while the customer is speaking can significantly shorten average handle time (AHT).

Live media subtitles: Combining multilingual support with low latency, Voxtral can be used to generate real-time multilingual subtitles for live news broadcasts or sports events. Through the context biasing feature, television stations can import athlete rosters or place names in advance to keep specialized terms accurate in the subtitles.

Compliance auditing and monitoring: Accurate text records are the foundation of compliance for financial transactions and law enforcement recordings. The 3-hour audio support keeps long recordings continuous, while on-premises deployment allows sensitive data to be audited and analyzed without being uploaded to the cloud.
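As an illustration of the meeting-minutes scenario, the Python sketch below turns word-level output with speaker labels into a script-like record. The input shape (a list of dictionaries with `word`, `start`, `end`, and `speaker` keys) is an assumption made for this example; the actual response format of the Voxtral API may differ, and the function names are hypothetical.

```python
def format_timestamp(seconds):
    """Format a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def group_turns(words):
    """Merge consecutive words from the same speaker into one turn.

    `words` is assumed to be ordered by time; each item is a dict with the
    hypothetical keys "word", "start", "end", and "speaker".
    """
    turns = []
    for item in words:
        if turns and turns[-1]["speaker"] == item["speaker"]:
            # Same speaker is still talking: extend the current turn.
            turns[-1]["words"].append(item["word"])
            turns[-1]["end"] = item["end"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": item["speaker"],
                "start": item["start"],
                "end": item["end"],
                "words": [item["word"]],
            })
    return turns


def render_script(words):
    """Return one line per turn, e.g. '[00:00:01] Speaker A: ...'."""
    lines = []
    for turn in group_turns(words):
        text = " ".join(turn["words"])
        lines.append(f"[{format_timestamp(turn['start'])}] {turn['speaker']}: {text}")
    return lines


if __name__ == "__main__":
    sample = [
        {"word": "Need", "start": 1.2, "end": 1.5, "speaker": "Speaker A"},
        {"word": "confirmation.", "start": 1.5, "end": 2.1, "speaker": "Speaker A"},
        {"word": "Agreed.", "start": 3725.0, "end": 3725.6, "speaker": "Speaker B"},
    ]
    for line in render_script(sample):
        print(line)
```

Because each turn keeps its start and end offsets, a player interface could use them to seek the recording when a line is clicked.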