Efficient Monotonic Multihead Attention (EMMA) for Real-time Speech-to-Text Translation

In an article recently posted to the Meta Research website, researchers introduced Efficient Monotonic Multihead Attention (EMMA), a simultaneous translation model with numerically stable and unbiased monotonic alignment estimation. The work focused on reducing machine translation latency, which matters most when translation must keep pace with a live speaker.

Study: Efficient Monotonic Multihead Attention (EMMA) for Real-time Speech-to-Text Translation. Image credit: Antonio Guillem/Shutterstock


Simultaneous translation aims to minimize the latency of a machine translation system and is crucial for real-time applications such as international conferences and personal travel. Unlike traditional offline models, which process an entire input sentence before generating output, simultaneous models operate on partial input sequences. They incorporate a policy that decides, at each step, whether to read more input or write the next piece of the translation. Monotonic attention-based policies, particularly Transformer-based Monotonic Multihead Attention (MMA), have excelled in text-to-text translation tasks.
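The read/write behaviour described above can be illustrated with a small sketch. It is not EMMA's learned policy: the policy used here is a fixed wait-k rule (stay k source tokens ahead of the output), and `run_policy`, `wait_k`, and `toy_writer` are hypothetical names introduced only for this illustration. The "translation" step is a placeholder that uppercases source words.

```python
from typing import Callable, List, Tuple

def run_policy(
    source: List[str],
    policy: Callable[[int, int], str],
    write_token: Callable[[List[str], int], str],
    target_len: int,
) -> List[Tuple[str, str]]:
    """Interleave READ and WRITE actions until target_len tokens are written."""
    actions = []
    num_read, num_written = 0, 0
    while num_written < target_len:
        source_done = num_read == len(source)
        if not source_done and policy(num_read, num_written) == "READ":
            # Consume one more source token.
            num_read += 1
            actions.append(("READ", source[num_read - 1]))
        else:
            # Emit one target token using only the source prefix read so far.
            token = write_token(source[:num_read], num_written)
            num_written += 1
            actions.append(("WRITE", token))
    return actions

def wait_k(k: int) -> Callable[[int, int], str]:
    """Rule-based policy (assumption for this sketch): keep k tokens of lead."""
    def policy(num_read: int, num_written: int) -> str:
        return "READ" if num_read - num_written < k else "WRITE"
    return policy

def toy_writer(prefix: List[str], index: int) -> str:
    """Placeholder for a translation model: uppercase the index-th source word."""
    return prefix[index].upper()

trace = run_policy(["a", "b", "c"], wait_k(2), toy_writer, target_len=3)
for action, token in trace:
    print(action, token)
```

In this toy setup the target length equals the source length, so the placeholder writer always finds a word to emit. A learned policy such as monotonic attention would replace `wait_k` with a decision computed from the model's states.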

In simultaneous translation, the model begins translating before the speaker completes the sentence. Monotonic attention models learn their read/write policies through alignment estimation during training, which makes them well suited to this scenario. However, MMA performs suboptimally when applied to speech input. The problems stem from numerical instability and bias in alignment estimation, and from significant variance in alignment, particularly in later parts of sentences, caused by the continuous nature of encoder states.

To address these challenges, the present paper introduced EMMA, offering a novel, numerically stable, and unbiased monotonic alignment estimation, proving effective in both simultaneous text-to-text and speech-to-text translation tasks. The model also introduced strategies for reducing monotonic alignment variance and included regularization of latency. Furthermore, the training scheme was enhanced by fine-tuning from a pre-trained offline model.


The researchers presented EMMA in three parts: numerically stable estimation, alignment shaping, and simultaneous fine-tuning. They then described how the model runs at inference time on streaming input. The monotonic alignment estimate, denoted α, was derived for a single attention head, with the same estimation applied to every attention head in the Transformer-based MMA. The discussion focused on the infinite lookback variant of monotonic attention.

  • Numerically Stable Estimation:
    EMMA addressed numerical instability in alignment estimation with a new closed-form estimate built from a transition matrix. Because this form does not require the problematic denominator, the estimate is both numerically stable and unbiased.
  • Alignment Shaping:
    Latency regularization was introduced to prevent the model from learning a trivial policy during training, such as reading the entire input before writing anything. Expected delays were estimated from the alignment, and a latency regularization term was added to the loss function. Additionally, an alignment variance reduction strategy was proposed, introducing an enhanced stepwise probability network and a variance loss term in the objective function.
  • Simultaneous Fine-tuning:
    Simultaneous fine-tuning was introduced as a method to enhance adaptability and leverage recent advancements in large foundational translation models. The simultaneous model was initialized from an offline encoder-decoder model, and only the decoder and policy network were optimized during training, on the assumption that the generative components closely resembled those of the offline model.
  • Streaming Inference:
    For streaming speech input, the inference pipeline used SimulEval, updating the encoder with each new speech chunk and running the decoder to generate partial text translations based on the policy. This streaming inference process ensured real-time translation for applications like simultaneous speech-to-text translation.
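To make the transition-matrix idea from the first bullet concrete, here is a minimal sketch in plain Python. The exact formulation is an assumption reconstructed from the description above, not the authors' code: `p[i][j]` is taken to be the stepwise probability of stopping at source position j for target step i, the alignment before the first target step is assumed to be one-hot on the first source position, and the function names are hypothetical. The point to notice is that the estimate uses only products and sums, with no division.

```python
from typing import List

def transition_matrix(p_row: List[float]) -> List[List[float]]:
    """T[m][n] = product of (1 - p_row[l]) for l in [m, n), for m <= n; else 0."""
    size = len(p_row)
    T = [[0.0] * size for _ in range(size)]
    for m in range(size):
        running = 1.0  # empty product when n == m
        for n in range(m, size):
            T[m][n] = running
            running *= 1.0 - p_row[n]
    return T

def estimate_alignment(p: List[List[float]]) -> List[List[float]]:
    """Estimate alpha[i][j] as p[i][j] * (alpha[i-1] @ T_i)[j], with no denominator."""
    src_len = len(p[0])
    prev = [1.0] + [0.0] * (src_len - 1)  # assumed initial alignment
    alpha = []
    for p_row in p:
        T = transition_matrix(p_row)
        # Probability mass that reaches position n without stopping earlier.
        moved = [
            sum(prev[m] * T[m][n] for m in range(src_len))
            for n in range(src_len)
        ]
        row = [p_row[n] * moved[n] for n in range(src_len)]
        alpha.append(row)
        prev = row
    return alpha

# Two target steps over three source positions (made-up probabilities).
alpha_example = estimate_alignment([[0.5, 0.5, 1.0], [0.0, 0.5, 1.0]])

# A stepwise probability of exactly 1.0 is handled without any division.
alpha_extreme = estimate_alignment([[1.0, 0.3, 0.2]])
```

Building the full matrix costs quadratic memory per target step, which is acceptable for a sketch; a tensor implementation would batch these products. Each row sums to one here only because the last stepwise probability in every row is 1.0.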

EMMA offered numerically stable alignment estimation, introduced strategies for alignment shaping and simultaneous fine-tuning, and facilitated streaming inference for real-time applications.
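The alignment-shaping terms can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's exact objective: the expected delay of a target token is taken to be the mean source position under its alignment distribution, the latency term is simplified to the average expected delay (a latency metric such as Average Lagging could be used instead), and the weights `lambda_latency` and `lambda_variance` as well as all function names are hypothetical.

```python
from typing import List

def expected_delays(alpha: List[List[float]]) -> List[float]:
    """Mean source position (1-based) for each target step."""
    return [sum((j + 1) * a for j, a in enumerate(row)) for row in alpha]

def alignment_variances(alpha: List[List[float]]) -> List[float]:
    """Variance of the source position for each target step."""
    variances = []
    for row in alpha:
        mean = sum((j + 1) * a for j, a in enumerate(row))
        second_moment = sum((j + 1) ** 2 * a for j, a in enumerate(row))
        variances.append(second_moment - mean ** 2)
    return variances

def shaped_loss(
    nll: float,
    alpha: List[List[float]],
    lambda_latency: float,
    lambda_variance: float,
) -> float:
    """Translation loss plus latency and variance regularizers (simplified)."""
    delays = expected_delays(alpha)
    latency_term = sum(delays) / len(delays)
    variance_term = sum(alignment_variances(alpha))
    return nll + lambda_latency * latency_term + lambda_variance * variance_term

# Made-up alignment for two target steps over three source positions.
alpha_rows = [[0.5, 0.25, 0.25], [0.0, 0.375, 0.625]]
delays = expected_delays(alpha_rows)
loss = shaped_loss(2.0, alpha_rows, lambda_latency=0.1, lambda_variance=0.1)
```

Raising `lambda_latency` pushes the alignment mass toward earlier source positions, while the variance term penalizes alignments that are spread over many positions.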

Experimental Setup

The proposed models for speech-to-text translation were evaluated using the SimulEval toolkit, focusing on quality (detokenized BiLingual Evaluation Understudy (BLEU)) and latency (Average Lagging). The simultaneous fine-tuning strategy was followed, initializing the simultaneous model from an offline translation model. Two experimental configurations, bilingual and multilingual, were established for the speech-to-text task. The bilingual setup involved training models for each language direction (Spanish-English and English-Spanish), while the multilingual task demonstrated adaptation from an existing large-scale multilingual model, SeamlessM4T.
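For readers unfamiliar with Average Lagging, the following sketch computes the standard token-level definition: the average number of source tokens by which the system lags behind an ideal translator that stays perfectly in step with the speaker. It is not SimulEval's implementation, which for speech input measures delays in time rather than in tokens; the function name and the fallback for an unfinished source are assumptions made for this example.

```python
from typing import List

def average_lagging(delays: List[int], source_len: int) -> float:
    """Token-level Average Lagging.

    delays[i] is the number of source tokens read when target token i
    (zero-based) was written.
    """
    if not delays or source_len <= 0:
        raise ValueError("need at least one target token and one source token")
    target_len = len(delays)
    gamma = target_len / source_len  # target-to-source length ratio
    total = 0.0
    for i, g in enumerate(delays):
        # With zero-based i, i / gamma is where an ideal system would be.
        total += g - i / gamma
        if g >= source_len:
            # Stop at the first target token written after the full source is read.
            return total / (i + 1)
    # Assumed fallback: average over all target tokens if the source was
    # never fully read.
    return total / target_len

# A wait-2 style schedule on a 3-token source and 3-token target.
al_wait2 = average_lagging([2, 3, 3], source_len=3)

# An offline system reads everything before writing the first token.
al_offline = average_lagging([3, 3, 3], source_len=3)
```

With equal source and target lengths, a wait-k schedule yields a lag of k tokens, and the offline system's lag equals the full source length.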

Models were initialized and fine-tuned for both the bilingual and multilingual scenarios, leveraging pre-trained models for efficient training. The bilingual setup used a pre-trained wav2vec 2.0 encoder and mBART decoder, while the multilingual setup initialized the model with the S2T part of an offline SeamlessM4T model.

In related work, recent research has emphasized the neural end-to-end approach for speech-to-text tasks, aiming for simplicity and efficiency. Initial attempts showed a quality decrease compared to cascade approaches, but subsequent studies improved performance with additional layers. Transformer models, successful in text translation, have been applied to speech translation, achieving quality and training speed improvements.

Simultaneous translation policies fall into three categories: predefined context-free rule-based policies, learnable flexible policies trained with reinforcement learning, and models using monotonic attention. Monotonic attention, with closed-form expected attention, has shown advancements in online decoding efficiency and translation quality.


In conclusion, the study introduced EMMA for simultaneous speech-to-text translation. EMMA combined numerically stable alignment estimation with alignment shaping and simultaneous fine-tuning, achieving state-of-the-art performance. Experimental evaluations of quality and latency demonstrated the model's efficacy in bilingual and multilingual setups. The adaptation of Transformer-based monotonic attention proved crucial for real-time, context-aware speech translation in diverse linguistic scenarios.


Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.


Please use the following format to cite this article in your essay, paper or report:

  • APA

    Nandi, Soham. (2023, December 06). Efficient Monotonic Multihead Attention (EMMA) for Real-time Speech-to-Text Translation. AZoAi. Retrieved on May 27, 2024 from https://www.azoai.com/news/20231206/Efficient-Monotonic-Multihead-Attention-(EMMA)-for-Real-time-Speech-to-Text-Translation.aspx.

