This study presents a comprehensive approach to emotion recognition in speech using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The method draws on several state-of-the-art deep learning models known for their strengths in pattern recognition and audio processing. The RAVDESS dataset comprises a diverse set of audio files featuring emotional expressions by professional actors, meticulously categorized by modality, emotion, intensity, and other attributes. These data are used to train and evaluate a range of deep learning architectures, including the convolutional networks AlexNet, ResNet, InceptionNet, VGG16, and VGG19, recurrent neural network (RNN) models such as LSTM, and Transformer models. The results indicate that the Transformer model achieves higher accuracy, precision, recall, and F1 score on the emotion classification task than the other models. This study not only deepens the understanding of subtle emotional nuances in spoken language but also establishes new benchmarks for applying diverse neural network types to emotion recognition from audio. By providing detailed comparisons among models, this research advances emotion recognition technology, strengthening its applications in human-computer interaction, psychotherapy, and the entertainment industry, and paving the way for further development of multimodal emotion recognition systems.
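For context on the categorization and metrics mentioned above, the sketch below is a minimal, illustrative Python example (not the study's released code): it assumes the standard RAVDESS filename convention of dash-separated numeric fields encoding modality, vocal channel, emotion, intensity, and other attributes, and uses scikit-learn to compute the reported evaluation metrics. The file name, helper names, and dummy predictions are hypothetical.

```python
# Illustrative sketch only: parse emotion labels from RAVDESS-style filenames
# and compute accuracy, precision, recall, and F1 score with scikit-learn.
# Assumed filename layout (per RAVDESS documentation):
#   Modality-VocalChannel-Emotion-Intensity-Statement-Repetition-Actor.wav
from pathlib import Path
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(path: str) -> str:
    """Return the emotion label encoded in the third field of a RAVDESS filename."""
    fields = Path(path).stem.split("-")
    return EMOTIONS[fields[2]]

def report_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 score."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

if __name__ == "__main__":
    # Hypothetical example: one audio-only speech file and dummy predictions.
    print(emotion_from_filename("03-01-06-01-02-01-12.wav"))  # -> "fearful"
    print(report_metrics(["happy", "sad", "angry"], ["happy", "sad", "happy"]))
```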