Enhanced Speech-Emotion Analysis Using Multi-Stage Machine Learning

In a paper published in the journal Scientific Reports, researchers introduced a groundbreaking method to refine speech-emotion analysis in human-machine interaction. Their approach is a multi-stage process comprising pre-processing, feature description using spectro-temporal modulation (STM) and entropy features, feature extraction via a convolutional neural network (CNN), and classification using a combination of the gamma classifier (GC) and error-correcting output codes (ECOC).

Study: Enhanced Speech-Emotion Analysis Using Multi-Stage Machine Learning. Image credit: metamorworks/Shutterstock

Evaluation on the Berlin Emotional Speech Database and the Sharif Emotional Speech Database (ShEMO) yielded average accuracies of 93.33% and 85.73%, respectively, surpassing the existing methods used for comparison. The results suggest substantial potential for improving emotion recognition accuracy in speech across diverse applications.


Speech is a fundamental mode of human interaction, implicitly conveying emotions from speaker to listener. As AI expands, systems for human-machine interaction require the ability to process speech, enhancing feedback and comprehensibility. Extracting sentiment from speech is valuable in various domains like online support, lie detection, and customer feedback analysis.

However, speech analysis for emotion recognition has received less attention than text processing due to complexities and noise challenges. Recognizing emotional states within speech involves deciphering complex combinations of basic emotions. Efficiently extracting speech features to interpret emotional patterns and addressing the challenge of numerous emotional states are pivotal.

Speech Emotion Recognition Method

This research uses two datasets, ShEMO and Berlin Emotional, to focus on emotion analysis within speech signals. From the Berlin dataset, which comprises 535 speech signals, the researchers employed a subset of 150 samples without background noise. The dataset features six emotional categories, among them Anger, Hatred, Fear, Joy, and Sadness.

The ShEMO dataset comprises 3000 speech signals collected from radio shows, covering emotions such as Anger, Fear, Happiness, Neutral State, Sadness, and Surprise. After 874 samples were excluded for being too short, experiments proceeded on the remaining 2126 samples. Evaluations used tenfold cross-validation, with 90% of samples for training and 10% for testing in each iteration.
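The tenfold protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say whether samples were shuffled or stratified by emotion, so the shuffling, the fixed seed, and the helper name `k_fold_indices` are hypothetical choices.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Assumption: samples are shuffled once with a fixed seed; the source
    does not state how the folds were actually formed.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 2126 ShEMO samples split into 10 folds: six folds of 213 and four of 212.
folds = list(k_fold_indices(2126, k=10))
```

Each sample lands in the test set exactly once, so every iteration trains on roughly 90% of the data and tests on the remaining 10%.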

The proposed method for speech emotion recognition involves multiple steps: preprocessing the speech signal, describing features using spectro-temporal modulation and entropy, CNN-based feature extraction, and classification via a combination of GC and ECOC models. The preprocessing stage consists of converting signals to a single (mono) channel, resampling them to a common sampling frequency of 16 kHz, and normalizing them to zero mean and unit variance.
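The three preprocessing steps can be illustrated with a small standard-library sketch. It is not the authors' MATLAB code: the function names are hypothetical, channels are assumed to be averaged, and resampling uses plain linear interpolation without an anti-aliasing filter, which a real pipeline would normally include.

```python
import math

def to_mono(frames):
    """Average the channels of each frame: [(left, right), ...] -> [m, ...]."""
    return [sum(frame) / len(frame) for frame in frames]

def resample_linear(signal, src_rate, dst_rate=16000):
    """Resample by linear interpolation (simplification: no low-pass filter)."""
    if src_rate == dst_rate or len(signal) < 2:
        return list(signal)
    n_out = int(round(len(signal) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        lo = min(int(pos), len(signal) - 1)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out

def standardize(signal):
    """Scale to zero mean and unit variance (all zeros if the signal is flat)."""
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((x - mean) ** 2 for x in signal) / len(signal))
    if std == 0:
        return [0.0] * len(signal)
    return [(x - mean) / std for x in signal]

def preprocess(frames, src_rate):
    return standardize(resample_linear(to_mono(frames), src_rate))

# Synthetic stereo tone: 0.1 s at 32 kHz, used only as stand-in input.
stereo = [(math.sin(2 * math.pi * 440 * t / 32000),
           0.5 * math.sin(2 * math.pi * 440 * t / 32000))
          for t in range(3200)]
clean = preprocess(stereo, src_rate=32000)
```

After this step every recording has the same channel count, sampling rate, and scale, so later features are comparable across speakers and recordings.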

Following this, feature description combines two kinds of features. Entropy features (approximate entropy and sample entropy) characterize general traits of the speech signal. Spectro-temporal modulation features model the human auditory system and capture temporal modulation based on the Auditory Spectrogram (AS), which is obtained through transformations that simulate the processing stages of the human auditory system. A CNN then extracts features from the spectro-temporal modulation representation, and the resulting feature vector is combined with the entropy features for subsequent emotion recognition.
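To make the entropy features more concrete, here is a minimal sketch of sample entropy, one of the two entropy measures named above. The embedding dimension `m = 2` and tolerance `r = 0.2` times the standard deviation are conventional defaults assumed for illustration; the article does not report the values the researchers used.

```python
import math
import random

def sample_entropy(signal, m=2, r_factor=0.2):
    """Sample entropy: -ln(A / B), where B counts pairs of length-m templates
    within tolerance r (Chebyshev distance) and A does the same for length
    m + 1. Self-matches are excluded. m and r_factor are assumed defaults.
    """
    n = len(signal)
    mean = sum(signal) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in signal) / n)
    r = r_factor * std

    def count_matches(length):
        # Use the same n - m starting points for both template lengths.
        templates = [signal[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                dist = max(abs(a - b)
                           for a, b in zip(templates[i], templates[j]))
                if dist <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")   # undefined when no matches are found
    return -math.log(a / b)

# A perfectly regular signal versus uniform noise (synthetic stand-ins).
regular = [0.0, 1.0] * 50
rng = random.Random(0)
noise = [rng.random() for _ in range(200)]
se_regular = sample_entropy(regular)
se_noise = sample_entropy(noise)
```

The regular signal is fully predictable, so its sample entropy is zero, while the noise scores higher. This is the sense in which entropy summarizes the overall irregularity of a speech signal.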

The classification phase integrates ECOC and GC models, offering a comprehensive solution for multi-class problems like emotion recognition in speech. The ECOC method encodes each class as a binary code word, and the code words form the rows of a code matrix. Each matrix column corresponds to a binary classifier trained on the labels that column assigns to the classes. The GC uses the generalized gamma operator, tackling multi-class problems by assigning unique binary codes to classes, employing a stopping parameter, and using a modified Johnson-Möbius coding for pattern classification.
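The ECOC part of this scheme can be sketched as follows. The code matrix below is hypothetical (six code words taken from a 7-bit Hamming code), and the binary classifier outputs are simulated rather than produced by a gamma classifier, so the sketch shows only the encoding and decoding logic, not the researchers' model.

```python
# Hypothetical 7-bit code words (rows) for six emotion classes. Each column
# defines the binary task that one classifier would be trained on.
CODE_MATRIX = {
    "anger":     (0, 0, 0, 1, 1, 1, 1),
    "fear":      (0, 0, 1, 0, 0, 1, 1),
    "happiness": (0, 1, 0, 0, 1, 0, 1),
    "neutral":   (1, 0, 0, 0, 1, 1, 0),
    "sadness":   (0, 0, 1, 1, 1, 0, 0),
    "surprise":  (0, 1, 0, 1, 0, 1, 0),
}

def hamming(a, b):
    """Number of positions at which two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def column_targets(labels, column):
    """Binary training targets for the classifier attached to one column."""
    return [CODE_MATRIX[label][column] for label in labels]

def decode(bits):
    """Return the class whose code word is nearest to the predicted bits."""
    return min(CODE_MATRIX, key=lambda c: hamming(CODE_MATRIX[c], bits))

# Smallest distance between any two code words in this matrix.
min_distance = min(hamming(CODE_MATRIX[a], CODE_MATRIX[b])
                   for a in CODE_MATRIX for b in CODE_MATRIX if a != b)

# Simulate seven classifier outputs for a "fear" sample with one wrong bit.
noisy = list(CODE_MATRIX["fear"])
noisy[4] ^= 1
predicted = decode(noisy)
```

Because any two code words in this matrix differ in at least three positions, a single wrong binary decision still decodes to the correct class. That error-correcting margin is the motivation for ECOC in multi-class problems.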

Throughout these steps, the proposed method progresses from preprocessing the speech signals to extracting relevant features using entropy and spectro-temporal modulation. It involves using CNN-based feature extraction and employing a combination of ECOC and GC models to achieve accurate emotion recognition in speech.

Proposed Method Evaluation Summary

Using MATLAB 2016a, the researchers evaluated the proposed method's accuracy and classification quality over tenfold cross-validation. The evaluation used speech signals from the ShEMO and Berlin Emotional datasets described earlier. It was conducted separately for each dataset and compared the method's accuracy against other sentiment extraction methods across the 10 cross-validation iterations.

Across most iterations, the proposed method demonstrated superior accuracy compared to others using the same extracted features. The higher accuracy stems from the performance of the classification model, which combines ECOC and GC for emotion recognition. The method's effectiveness was notable in both datasets, achieving an average accuracy of 93.33% for Berlin and 85.73% for ShEMO.

The method excels in identifying emotional states, although it faces challenges recognizing surprise emotions in the ShEMO dataset. Moreover, precision, recall, and F-Measure showcase better separation of emotional states, confirming its higher effectiveness than previous methodologies. Furthermore, the proposed method outperforms a prior study's accuracy by 2.26% in recognizing anger and neutral emotional states. This improvement reinforces the efficacy of the proposed method, particularly in enhancing recognition accuracy compared to earlier techniques.
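For reference, the precision, recall, and F-Measure figures mentioned above are computed per emotion class from true and predicted labels. The sketch below uses a tiny made-up label set purely to show the arithmetic; the labels are not results from the study, and the function name is hypothetical.

```python
def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f_measure)} from two label lists."""
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)   # correctly found
        fp = sum(t != c and p == c for t, p in pairs)   # wrongly assigned c
        fn = sum(t == c and p != c for t, p in pairs)   # missed c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        metrics[c] = (precision, recall, f_measure)
    return metrics

# Made-up labels for illustration only.
y_true = ["anger", "anger", "neutral", "neutral", "sadness", "sadness"]
y_pred = ["anger", "neutral", "neutral", "neutral", "sadness", "anger"]

scores = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case "neutral" has perfect recall but a precision of 2/3, because one "anger" sample was mislabeled as neutral. Per-class metrics expose this kind of confusion, which a single accuracy figure hides.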


To sum up, this paper introduces a novel method utilizing machine learning techniques to recognize emotions in speech. It leverages entropy features and spectro-temporal modulation for speech characterization, employing a CNN for feature extraction.

Additionally, it introduces a new model that combines GC and ECOC for precise feature classification and emotional state recognition. Collectively, these approaches yield higher accuracy and efficiency than previous methodologies. The method's evaluation, conducted using the Berlin and ShEMO datasets, demonstrates its prowess, achieving average accuracies of 93.33% and 85.73%, respectively. These results showcase an improvement of at least 2.1% over prior methods. Furthermore, by integrating spectro-temporal modulation and entropy, this method achieves a 2.26% increase in emotional state recognition accuracy compared to existing techniques.

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Chandrasekar, Silpaja. (2023, November 24). Enhanced Speech-Emotion Analysis Using Multi-Stage Machine Learning. AZoAi. Retrieved on February 24, 2024 from https://www.azoai.com/news/20231124/Enhanced-Speech-Emotion-Analysis-Using-Multi-Stage-Machine-Learning.aspx.

  • MLA

    Chandrasekar, Silpaja. "Enhanced Speech-Emotion Analysis Using Multi-Stage Machine Learning". AZoAi. 24 February 2024. <https://www.azoai.com/news/20231124/Enhanced-Speech-Emotion-Analysis-Using-Multi-Stage-Machine-Learning.aspx>.

  • Chicago

    Chandrasekar, Silpaja. "Enhanced Speech-Emotion Analysis Using Multi-Stage Machine Learning". AZoAi. https://www.azoai.com/news/20231124/Enhanced-Speech-Emotion-Analysis-Using-Multi-Stage-Machine-Learning.aspx. (accessed February 24, 2024).

  • Harvard

    Chandrasekar, Silpaja. 2023. Enhanced Speech-Emotion Analysis Using Multi-Stage Machine Learning. AZoAi, viewed 24 February 2024, https://www.azoai.com/news/20231124/Enhanced-Speech-Emotion-Analysis-Using-Multi-Stage-Machine-Learning.aspx.


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of AZoAi.