SeamlessM4T: Advancing Multilingual Speech Translation

Researchers from Meta AI recently proposed SeamlessM4T, a single model supporting speech-to-speech translation (S2ST), text-to-speech translation (T2ST), and text-to-text translation (T2TT) for up to 100 languages. Leveraging vast amounts of audio data and self-supervised speech representations, SeamlessM4T is reported to outperform prior models on several translation tasks.

Study: SeamlessM4T: Revolutionizing Multilingual Speech Translation. Image credit: metamorworks/Shutterstock


Creating a Babel Fish, a universal speech translation tool, remains a challenging endeavor for scientists. While text-based models have expanded translation capabilities, unified speech-to-speech models lag behind. Conventional systems rely on multiple subsystems, which hinders scalability. To bridge this gap, researchers developed SeamlessM4T, which achieves strong results in speech-to-text translation (S2TT), improves translation quality and safety, and has been open-sourced to support further work.

Prioritizing speech in machine translation

Machine translation (MT) has primarily focused on text because text is abundant and easy to handle. However, speech is distinct, with its own grammar, registers, and expressive qualities, and it fosters stronger social bonds than text-based communication. Current speech translation models have limitations: cascaded systems chain several subsystems, while direct S2TT models cover only a limited set of languages. At the time of the study, AudioPaLM was the state of the art, bridging the gap between text and speech translation.

The current study aims to create a unified large model capable of handling speech and text translation tasks, expand language coverage, and ensure systematic evaluations for safe and equitable performance. It seeks to bridge the translation gap between high- and low-resource languages, making translation technology accessible to all.

SeamlessAlign: Automatically creating aligned data for speech

Creating an effective multilingual and multimodal translation system such as SeamlessM4T demands substantial resources spanning multiple languages and modalities. While some human-annotated translation resources are freely accessible, they often cover a limited set of languages or specific domains. Collections involving the speech modality, such as a diverse, multilingual S2TT corpus (CoVoST) and multilingual TEDx, also exist. However, there is currently no open dataset that matches the scale of initiatives such as Whisper, which have shown exceptional performance.

To address this, parallel data mining offers an alternative to closed data, with broader language coverage and larger corpus sizes. The prevailing approach encodes sentences from various languages and modalities into a shared fixed-size embedding space (SONAR), then mines extensive monolingual corpora through pairwise comparisons, keeping the instances whose embeddings are similar enough to be treated as parallel.

This method, initially introduced with the multilingual LASER space, has been scaled to 200 languages and to the speech modality through teacher-student training. The SeamlessAlign dataset was generated using this parallel data mining technique. It is the most extensive open dataset for multimodal translation to date, totaling 470,000 hours. The researchers introduced several enhancements, including an improved speech language identification (LID) model, increased language coverage, and a substantial increase in raw audio data.
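
The mining step described above (embed, compare, keep similar pairs) can be illustrated with a minimal sketch. It is not the SeamlessAlign pipeline: the three-dimensional vectors are made up, plain cosine similarity with a mutual-nearest-neighbour check stands in for whatever similarity criterion the researchers used, and the names cosine, mine_pairs, and the 0.9 threshold are assumptions introduced for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_pairs(src, tgt, threshold=0.9):
    """Return (src_id, tgt_id, score) for mutual nearest neighbours.

    src and tgt map item ids to embeddings that are assumed to live in
    one shared space. The threshold value is an arbitrary assumption.
    """
    # Pairwise comparison of every source item with every target item.
    scores = {(s, t): cosine(u, v)
              for s, u in src.items() for t, v in tgt.items()}
    best_for_src = {s: max(tgt, key=lambda t: scores[(s, t)]) for s in src}
    best_for_tgt = {t: max(src, key=lambda s: scores[(s, t)]) for t in tgt}
    pairs = []
    for s, t in best_for_src.items():
        # Keep a pair only if each side picks the other and the score is high.
        if best_for_tgt[t] == s and scores[(s, t)] >= threshold:
            pairs.append((s, t, scores[(s, t)]))
    return pairs

# Toy embeddings, invented for this example.
speech_en = {"en_audio_1": [0.9, 0.1, 0.0], "en_audio_2": [0.0, 0.2, 0.9]}
text_fr = {
    "fr_text_1": [0.0, 0.1, 1.0],
    "fr_text_2": [1.0, 0.0, 0.1],
    "fr_text_3": [0.5, 0.5, 0.5],
}

mined = mine_pairs(speech_en, text_fr)
for src_id, tgt_id, score in mined:
    print(src_id, tgt_id, round(score, 3))
```

At real scale, exhaustive pairwise comparison is impractical, so a mining system would rely on approximate nearest-neighbour search; the brute-force loop here only shows the logic.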

SeamlessM4T models

Significant advancements in direct S2TT models have been witnessed recently. These models have achieved parity with cascaded models in specific scenarios, such as constrained data, in-domain settings, and specific language pairs. However, the landscape has evolved with the emergence of massively multilingual translation models and weakly supervised automatic speech recognition (ASR) models. This shift renders previous comparisons obsolete, highlighting the significant lag of direct models compared to robust cascaded models.

SeamlessM4T aims to close the gap between direct and cascaded models for S2TT across many languages and modalities. It does so by building a strong direct X2T model, one that translates both speech and text into text. This model combines a robust speech representation learning model with a massively multilingual T2TT model.

Additionally, SeamlessM4T explores S2ST with UnitY, a two-pass model. It first generates text and then predicts discrete acoustic units. Unlike cascaded systems built from separate components, UnitY's components can be optimized together, which helps with the errors and mismatches that arise between them. It uses an intermediate semantic representation to handle different sources and targets. The vocoders used for speech synthesis are trained separately.
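
The two-step flow described above can be sketched as a simple data flow. This is only an illustration of the ordering (text first, units second), not UnitY itself: the lookup tables, unit ids, and function names are hypothetical, and plain strings stand in for whatever intermediate representation the real model passes between its two passes.

```python
# Toy stand-ins: every table and function here is hypothetical and is
# not part of SeamlessM4T or UnitY.
TOY_TRANSLATIONS = {"bonjour": "hello", "monde": "world"}
TOY_UNITS = {"hello": [12, 7, 7, 31], "world": [5, 19, 2]}

def first_pass_text(source_tokens):
    """First pass: produce target text (here, a word-by-word lookup)."""
    return [TOY_TRANSLATIONS[tok] for tok in source_tokens]

def second_pass_units(target_tokens):
    """Second pass: predict discrete units from the first-pass output."""
    units = []
    for tok in target_tokens:
        units.extend(TOY_UNITS[tok])
    return units

def translate(source_tokens):
    """Return both outputs: the intermediate text and the unit sequence."""
    text = first_pass_text(source_tokens)
    units = second_pass_units(text)
    return text, units

text, units = translate(["bonjour", "monde"])
print(text)
print(units)
```

A separately trained vocoder would then turn the unit sequence into a waveform; that step is omitted here.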

The SeamlessM4T model consists of four core building blocks: a massively multilingual T2TT model (SeamlessM4T-NLLB), a speech representation learning model utilizing unlabeled speech audio data (w2v-BERT 2.0), a text-to-unit sequence-to-sequence model (T2U), and a multilingual HiFi-GAN unit vocoder for speech synthesis from units.

To achieve its objectives, SeamlessM4T employs multi-task UnitY models that integrate components from the first three building blocks. These models undergo fine-tuning in three stages, starting with an X2T model with an English target and culminating in a versatile multitask UnitY system capable of T2TT, S2TT, S2ST, and ASR tasks.

The model description covers different stages of the SeamlessM4T architecture, including initial training with w2v-BERT 2.0, creating text, preparing data for speech-to-speech translation, training for text-to-unit conversion, and the final fine-tuning stage. SeamlessM4T's performance is evaluated using standard automatic metrics and compared to state-of-the-art speech translation models, showcasing its strengths in various translation tasks.

To evaluate the model, the researchers used BLASER 2.0, a metric that accommodates both speech and text. Human assessments focused on how well speaker intent and audio quality were retained. The model showed superior robustness to background noise and speaker variation, as evidenced by the BLEU-SNR and WER-SNR curves.
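
As a rough illustration of what a WER-SNR curve plots, the sketch below computes word error rate (WER), the word-level edit distance divided by the reference length, for transcripts at a few signal-to-noise ratios. The transcripts and SNR values are invented for the example and are not results from the study; the function name word_error_rate is likewise an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

reference = "the cat sat on the mat"

# Made-up system outputs at decreasing SNR (in dB), for illustration only.
toy_hypotheses = {
    20: "the cat sat on the mat",
    10: "the cat sat on mat",
    0: "a cat sat in mat",
}

wer_by_snr = {snr: word_error_rate(reference, hyp)
              for snr, hyp in toy_hypotheses.items()}
for snr, wer in wer_by_snr.items():
    print(f"SNR {snr:>2} dB  WER {wer:.3f}")
```

A more robust model is one whose WER rises more slowly as the SNR drops; a BLEU-SNR curve follows the same idea with BLEU as the quality measure.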

Responsible AI

In line with responsible system development, the focus is on evaluating added toxicity and bias, crucial for safe system deployment. Fair translation outputs, devoid of bias, are essential. Toxicity analysis utilizes a new metric, ASR-ETOX, while gender bias assessment is based on masculine and feminine references. Results show variations across languages and datasets. Additionally, the study investigates gender representation in datasets, finding an overrepresentation of masculine terms. However, this approach has limitations, including the reliance on word lists and linguistic gender cues for bias detection.
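
A word-list check for added toxicity might look like the sketch below. It assumes, as the name ASR-ETOX suggests, that speech is first transcribed and the transcript is then matched against per-language word lists, and it treats toxicity as "added" when the output contains a listed term while the input contains none. The word lists are harmless placeholders and the function names are hypothetical; this is not the metric's actual implementation.

```python
import re

# Placeholder word lists; real lists are per-language and curated.
TOXIC_FR = {"vilainmot"}
TOXIC_EN = {"badword"}

def toxic_terms(text, wordlist):
    """Return the set of listed terms that occur in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tok for tok in tokens if tok in wordlist}

def has_added_toxicity(source_text, output_text, source_list, target_list):
    """True if the output has a listed term while the source has none."""
    in_source = toxic_terms(source_text, source_list)
    in_output = toxic_terms(output_text, target_list)
    return bool(in_output) and not in_source

# For speech, source_text and output_text would be ASR transcripts.
added = has_added_toxicity("bonjour le monde", "hello badword world",
                           TOXIC_FR, TOXIC_EN)
preserved = has_added_toxicity("bonjour vilainmot", "hello badword",
                               TOXIC_FR, TOXIC_EN)
print(added, preserved)
```

The sketch also makes the limitation noted above concrete: anything not on the list, or toxic only in context, goes undetected.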


In summary, SeamlessM4T addresses the limitations of existing speech translation systems. It supports ASR, T2TT, S2TT, T2ST, and S2ST for multiple languages. Developed using extensive audio data and self-supervised speech representations, SeamlessM4T outperforms previous models in various translation tasks, including S2TT and S2ST, and is open-sourced for further advancement. Additionally, SeamlessM4T demonstrates reduced toxicity and improved robustness, marking significant progress in responsible AI.

Written by

Dr. Sampath Lonka

Dr. Sampath Lonka is a scientific writer based in Bangalore, India, with a strong academic background in Mathematics and extensive experience in content writing. He has a Ph.D. in Mathematics from the University of Hyderabad and is deeply passionate about teaching, writing, and research. Sampath enjoys teaching Mathematics, Statistics, and AI to both undergraduate and postgraduate students. What sets him apart is his unique approach to teaching Mathematics through programming, making the subject more engaging and practical for students.


Please use the following format to cite this article in your essay, paper or report:

    Lonka, Sampath. (2023, September 22). SeamlessM4T: Advancing Multilingual Speech Translation. AZoAi.

