Voxtral: Mistral's Open-Weight Voice Synthesis

Q: What is the difference between Voxtral TTS and ElevenLabs?

Human listeners preferred Voxtral TTS in 68.4% of comparisons against ElevenLabs Flash v2.5, and its API costs roughly 73% less. The key difference: Voxtral is open-weight and deployable on-premises, whereas ElevenLabs is cloud-only. For companies with data sovereignty requirements, that makes Voxtral one of the few options at this quality level.

Your business needs voice in its applications: a phone assistant, audio training modules, a voice chatbot for customer support, or simply making your content accessible to people with visual impairments. Until now, the options came down to American cloud services (ElevenLabs, Google TTS, Amazon Polly) or open-source solutions with mediocre quality.

Voxtral TTS changes that. Launched by Mistral AI in March 2026, it is a 4-billion-parameter open-weight voice synthesis model that human listeners preferred over ElevenLabs Flash v2.5 in 68.4% of comparisons. It supports 9 languages including French and can run on a laptop with a dedicated GPU. Here is what this means in practice for a business.

What Voxtral TTS Actually Is

Voxtral TTS is the first text-to-speech model from Mistral AI. It is an autoregressive Transformer model with flow-matching, built on the Ministral 3B base.

In plain terms: you give it text, it produces a natural, expressive voice in 9 languages. And unlike standard cloud services, you can download it and run it on your own machines.

Characteristic	Voxtral TTS
Model size	4 billion parameters
Architecture	Autoregressive Transformer + flow-matching (based on Ministral 3B)
Supported languages	9: French, English, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic
Model latency	70 ms (for 10s of audio, 500 characters)
Real-time factor (RTF)	~9.7x
Max audio length	2 minutes natively, unlimited via API (intelligent interleaving)
Voice cloning	Zero-shot and few-shot (from 3 seconds of reference audio)
License	CC BY-NC 4.0 (open-weight) / Commercial API
API price	$0.016/1,000 characters

What Makes Voxtral TTS Different

Open-weight and deployable on-premises

This is the fundamental point. With 4 billion parameters, Voxtral TTS runs on consumer-grade hardware: a recent laptop with a dedicated GPU, a mid-range desktop GPU, or a modest server. The weights are available on Hugging Face.

For a business, this means no text or voice data leaves your infrastructure. The text you convert to speech, whether it is customer data, internal documents, or confidential content, stays on your machines.

Why this matters

A cloud voice synthesis service receives the full text of whatever you want to vocalize. If that is a client contract, a patient record, or an NDA-covered document, that text transits through third-party servers. With Voxtral running on-premises, the text never leaves your network.

Voice cloning with 3 seconds of audio

Voxtral TTS supports zero-shot voice cloning. Provide a 3-second audio sample and the model reproduces the voice, capturing accent, inflections, intonation, and even natural imperfections.

Concrete use cases:

Consistent brand voice: an executive records 3 seconds, and all of the company's audio content speaks with that voice
Client personalization: a voice assistant that adapts to the profile of the person it is speaking with
Accessibility: converting internal documents to audio with a familiar voice for teams

9 languages with cross-lingual support

Voxtral TTS natively handles French, English, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. Cross-lingual support means you can clone a French voice and have it read English text while preserving the characteristics of the original voice.

For a company that operates internationally, this is a real advantage: one brand voice, multiple languages.

70 ms latency

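A model latency of 70 ms (quoted for 10 s of audio from 500 characters of text) places Voxtral TTS in the real-time category. Since a real-time factor of ~9.7x implies roughly one second of compute for the full 10-second clip, the 70 ms figure is best read as the delay before audio starts rather than the time to generate the whole clip. Either way, that is fast enough for fluid voice conversations, not just batch audio generation.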
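The real-time factor from the characteristics table can be turned into wall-clock time with one division. The sketch below assumes RTF means seconds of audio produced per second of compute, which is the reading consistent with "~9.7x"; the function name is illustrative, not part of any Mistral tooling.

```python
def synthesis_seconds(audio_seconds: float, rtf: float = 9.7) -> float:
    """Estimate the compute time needed to synthesize a clip.

    Assumes RTF = audio duration / compute time, so higher is faster.
    """
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


if __name__ == "__main__":
    # Compute time for a 10-second clip at the article's ~9.7x figure.
    print(f"{synthesis_seconds(10.0):.2f} s")
```

For a voice assistant, the practical takeaway is that replies can be streamed: audio playback can begin long before the full clip has been generated.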

5 Concrete Use Cases for Businesses

1. Automated voice customer support

Combine Voxtral TTS with Voxtral Transcribe (Mistral's speech-to-text model) and a Mistral LLM for reasoning, and you get a complete voice pipeline: the customer speaks, the system understands, reasons, and responds with a natural voice. The whole thing can be hosted in France.

The advantage over existing solutions: when the pipeline is self-hosted, customer conversations, which often contain personal data, account numbers, and sensitive complaints, never pass through a third-party server.

2. Training and e-learning

Convert written training materials into audio modules, in the trainer's voice or a brand voice. No more booking a recording studio every time a module gets updated. The trainer records 3 seconds, and Voxtral generates the rest.

For companies with frequently changing procedures (manufacturing, construction, logistics), this is a significant time saver.

3. Accessibility of internal documents

Make procedures, internal memos, or reports accessible to employees with visual impairments or who are on the move. Voxtral TTS can convert any text document to professional-quality audio, in multiple languages, directly from your infrastructure.

4. Embedded voice assistants

With only 4B parameters, Voxtral TTS can run on embedded devices that carry a suitable GPU: reception kiosks, industrial equipment, in-vehicle systems. It is one of the few professional-quality TTS models that does not require a cloud connection.

5. Audio marketing content and podcasts

Generate audio versions of your blog articles, newsletters, or product sheets. With voice cloning, the content keeps your brand voice. This is a straightforward way to reach an audience that prefers listening over reading, without investing in traditional audio production.

Pricing and Deployment Options

Voxtral TTS offers two usage modes, depending on your needs and constraints.

Option	Mistral API	Self-hosted (open-weight)
Price	$0.016/1,000 characters	Free (infrastructure cost only)
License	Commercial (included in API)	CC BY-NC 4.0 (non-commercial) or Mistral agreement
Hardware required	None (Mistral cloud)	1 GPU (recent laptop, mid-range GPU, or server)
Data sovereignty	Data in France (Mistral servers)	Data 100% on your infrastructure
Max audio length	Unlimited (automatic interleaving)	2 minutes natively (configurable)
Ideal for	Fast start, variable volumes	Sensitive data, high volumes, full sovereignty
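Because the self-hosted model handles about 2 minutes of audio natively, longer texts have to be split before synthesis. The sketch below shows one simple approach, cutting at sentence boundaries under a character budget. It is not Mistral's interleaving mechanism, which this article does not describe. The 1,500-character default is an assumption based on a rough rule of thumb of about 15 characters per second of speech, and `split_for_tts` is an illustrative name.

```python
import re


def split_for_tts(text: str, max_chars: int = 1500) -> list[str]:
    """Split text into chunks of at most max_chars, cutting at sentence ends.

    max_chars is an assumed budget, not a documented model limit.
    A single sentence longer than max_chars is hard-wrapped as a last resort.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-wrap sentences that cannot fit in one chunk on their own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the audio segments concatenated. Splitting at sentence ends keeps the joins at natural pauses, which is usually less audible than cutting mid-sentence.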

In practice for an SME

Start with the API to validate your use case. At $0.016/1,000 characters, converting a 5,000-character article costs $0.08. If volumes grow or data sovereignty becomes critical, switch to self-hosting, keeping in mind that commercial use of the open weights requires an agreement with Mistral AI. The model is the same in both cases.
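The arithmetic behind that decision is easy to script. The sketch below computes the API cost for a given text length and the monthly volume at which self-hosting would break even. The per-character price comes from the pricing table; the monthly infrastructure cost is a hypothetical input you supply, and the $400 value is only a placeholder.

```python
API_PRICE_PER_1K_CHARS = 0.016  # USD, from the pricing table


def api_cost_usd(num_chars: int,
                 price_per_1k: float = API_PRICE_PER_1K_CHARS) -> float:
    """Cost of synthesizing num_chars characters through the API."""
    return num_chars / 1000 * price_per_1k


def breakeven_chars_per_month(monthly_infra_usd: float,
                              price_per_1k: float = API_PRICE_PER_1K_CHARS) -> float:
    """Monthly character volume above which self-hosting is cheaper.

    monthly_infra_usd is a hypothetical all-in hosting cost you supply.
    """
    return monthly_infra_usd / price_per_1k * 1000


if __name__ == "__main__":
    print(f"${api_cost_usd(5_000):.2f}")             # the 5,000-character article
    print(f"{breakeven_chars_per_month(400):,.0f}")  # placeholder: $400/month of GPU hosting
```

The break-even figure covers infrastructure only. It leaves out engineering time and any fee attached to a commercial license agreement, both of which push the real threshold higher.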

Voxtral TTS vs. ElevenLabs vs. Amazon Polly

To position Voxtral TTS in the landscape, here is a comparison on the criteria that matter most for a business.

Criterion	Voxtral TTS (Mistral)	ElevenLabs	Amazon Polly
Voice quality	68.4% human preference vs ElevenLabs Flash v2.5	Market reference	Decent, less natural voices
On-premises deployment	Yes (open-weight)	No (cloud only)	No (AWS cloud)
Voice cloning	Yes (3s reference)	Yes (higher quality)	No
Languages	9 languages	29+ languages	30+ languages
Price	$0.016/1,000 chars	~$0.06/1,000 chars	~$0.004/1,000 chars (standard voices)
Data sovereignty	France / self-hosted	USA (subject to CLOUD Act)	USA (subject to CLOUD Act)
Complete voice ecosystem	Yes (with Voxtral Transcribe + Mistral LLM)	Partial (TTS only)	Partial (AWS integration)

Our analysis: Voxtral TTS is the best quality/price/sovereignty trade-off on the market in 2026. ElevenLabs remains ahead on language coverage and on the finesse of high-end voice cloning. Amazon Polly's standard voices cost about a quarter as much but sound noticeably less natural. For a company that prioritizes data sovereignty or cost control, Voxtral is the obvious choice.
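The relative price gaps quoted in this article follow directly from the table. The sketch below reproduces them, using the table's approximate per-1,000-character prices as its only inputs; the dictionary keys and function name are illustrative.

```python
# Prices per 1,000 characters in USD, copied from the comparison table.
# The ElevenLabs and Polly figures are the table's approximate values.
PRICES = {
    "Voxtral TTS": 0.016,
    "ElevenLabs": 0.06,
    "Amazon Polly (standard)": 0.004,
}


def savings_vs(baseline: str, candidate: str = "Voxtral TTS") -> float:
    """Fractional saving of candidate relative to baseline.

    A negative result means the candidate is more expensive.
    """
    return 1 - PRICES[candidate] / PRICES[baseline]


if __name__ == "__main__":
    for name in ("ElevenLabs", "Amazon Polly (standard)"):
        print(f"vs {name}: {savings_vs(name):+.0%}")
```

Against ElevenLabs the saving comes out at roughly 73%, matching the figure quoted elsewhere in the article. Against Polly's standard voices the sign flips, since Voxtral costs about four times as much per character.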

The Complete Voice Pipeline with Mistral

Voxtral TTS does not operate in isolation. Mistral AI offers a complete voice ecosystem that lets you build speech-to-speech applications without depending on any other vendor.

Voxtral Transcribe: speech-to-text (transcribes voice to text)
Mistral LLM (Small, Large, or other): understanding, reasoning, response generation
Voxtral TTS: text-to-speech (converts the response back to voice)

This pipeline is integrated into Mistral's Le Chat via voice mode. But you can also deploy it on your own servers to build custom voice assistants connected to your internal data via a self-hosted RAG architecture.
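The three stages above chain together in a few lines. The sketch below is deliberately vendor-neutral: each stage is passed in as a plain callable, and the stand-in functions in the demo are placeholders, not Mistral APIs. In a real deployment, each callable would wrap your own speech-to-text, LLM, and TTS endpoints.

```python
from typing import Callable

# Each stage is injected as a plain callable so the sketch stays vendor-neutral.
Transcriber = Callable[[bytes], str]   # audio in   -> transcript
Responder = Callable[[str], str]       # transcript -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio out


def voice_turn(audio_in: bytes, transcribe: Transcriber,
               respond: Responder, synthesize: Synthesizer) -> bytes:
    """Run one speech-to-speech turn: STT, then LLM, then TTS."""
    transcript = transcribe(audio_in)
    reply = respond(transcript)
    return synthesize(reply)


if __name__ == "__main__":
    # Stand-in stages for illustration only; real ones would call your
    # speech-to-text, LLM, and TTS deployments.
    fake_stt = lambda audio: audio.decode("utf-8")
    fake_llm = lambda text: f"You said: {text}"
    fake_tts = lambda text: text.encode("utf-8")
    print(voice_turn(b"hello", fake_stt, fake_llm, fake_tts))
```

Injecting the stages keeps the orchestration testable without a GPU and lets you swap the hosted API for a self-hosted model without touching the pipeline logic.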

How to Get Started with Voxtral TTS

Try it via Le Chat: the voice mode of Le Chat uses Voxtral TTS. This is the fastest way to assess voice quality
Try the API: create an account at console.mistral.ai, get an API key, and test with a few requests. The cost is negligible for a prototype
Evaluate voice cloning: provide a 3-second sample and compare against your reference voice. Cloning quality depends on sample clarity
For self-hosting: download the weights from Hugging Face and follow the deployment documentation. A GPU with 8 GB of VRAM is enough to get started
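Before picking hardware for self-hosting, a back-of-the-envelope estimate of weight memory helps. The sketch below multiplies parameter count by bytes per parameter and nothing more. It ignores activations, caches, and any audio codec components, and the precisions shown are assumptions, since this article does not say which precision the released weights use.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 4e9  # 4 billion parameters, per the article
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(params, bits):.1f} GB")
```

At 16-bit precision the weights alone come to about 8 GB, which would leave little headroom on an 8 GB card. Whether the 8 GB guideline assumes lower-precision weights is not stated here, so it is worth testing on your target GPU before committing.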

Limitations to Know

CC BY-NC 4.0 license for the open-weight version

The open-weight version is licensed for non-commercial use. For commercial on-premises deployment, you need to sign an agreement with Mistral AI; the alternative is the paid API, which runs on Mistral's servers rather than yours. This is not Apache 2.0, unlike several of Mistral's text models.

9 languages, not 30

If your business operates in East Asian markets (Chinese, Japanese, Korean) or in other languages not covered by the model, Voxtral TTS will not be sufficient for now. ElevenLabs covers a much broader spectrum.

Variable cloning quality

Zero-shot cloning with 3 seconds works well for a clear reference voice with no background noise. In less ideal conditions (phone recording, ambient noise), quality degrades. For the best results, plan for a clean recording.
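A cheap first check on a reference recording is its length. The sketch below reads a WAV file with Python's standard `wave` module and verifies it meets the 3-second minimum cited in this article. It checks duration only, not noise or clarity, and it works only for uncompressed PCM WAV files. The function names are illustrative.

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # minimum reference length cited in the article


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """Check that a reference recording meets the minimum duration."""
    return wav_duration_seconds(path) >= minimum
```

Rejecting short samples at upload time avoids spending synthesis budget on clones that are unlikely to sound right. Assessing background noise would need a separate signal-level check.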

No fine-grained emotion control

Voxtral TTS captures natural expressiveness but does not let you precisely control emotions (joy, sadness, urgency) the way some commercial models do. The model reproduces the tone of the reference voice; it does not modify it on command.

Frequently Asked Questions

Is Voxtral TTS free?

The weights are available for free on Hugging Face under the CC BY-NC 4.0 license (non-commercial). For commercial use, the Mistral API charges $0.016 per 1,000 characters, approximately 73% cheaper than ElevenLabs Flash v2.5.

Which languages does Voxtral TTS support?

9 languages: French, English, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. Cross-lingual support lets you clone a voice in one language and read text in another.

Can you clone a voice with Voxtral TTS?

Yes. 3 seconds of audio is enough to reproduce a voice with its accent, inflections, and intonation. Quality depends on the clarity of the reference sample. A clean recording with no background noise gives the best results.

Can you deploy Voxtral TTS on your own servers?

Yes. With 4 billion parameters, the model runs on consumer-grade hardware (8 GB VRAM minimum). It is one of the few professional-quality TTS models deployable on-premises, ideal for companies with data sovereignty requirements.

Is Voxtral TTS suitable for customer support?

Yes. The 70 ms latency and multilingual support make it a serious candidate for voice assistants. Combined with Voxtral Transcribe and a Mistral LLM, you get a complete voice pipeline that can be hosted entirely in France.

What is the difference between Voxtral TTS and ElevenLabs?

Human listeners preferred Voxtral TTS in 68.4% of comparisons against ElevenLabs Flash v2.5, and its API costs roughly 73% less. The key difference: Voxtral is open-weight and deployable on-premises, whereas ElevenLabs is cloud-only. For data sovereignty, Voxtral is one of the few options at this quality level.

Go Further

Le Chat by Mistral, the French AI assistant: discover the interface that integrates Voxtral TTS in voice mode
Self-hosted RAG with Mistral: build a voice assistant connected to your internal data
Fine-tuning Mistral on your data: adapting Mistral models to your specific business context
Deploying an LLM to production: infrastructure and best practices for self-hosting

Integrate AI voice into your business

Voxtral TTS opens professional-quality voice synthesis to SMEs. Integrating it into your business applications with your own data is our specialty.

Book a Free AI Audit
