Articles

Found 7 Documents
Journal: Bulletin of Electrical Engineering and Informatics

Multimodal music emotion recognition in Indonesian songs based on CNN-LSTM, XLNet transformers
Sams, Andrew Steven; Zahra, Amalia
Bulletin of Electrical Engineering and Informatics Vol 12, No 1: February 2023
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v12i1.4231

Abstract

Music carries emotional information and allows the listener to feel the emotions it contains. This study proposes a multimodal music emotion recognition (MER) system using Indonesian song and lyrics data. In the proposed multimodal system, the audio data use the mel spectrogram feature, while the lyrics features are extracted through the XLNet tokenizing process. A convolutional long short-term memory network (CNN-LSTM) performs the audio classification task, and an XLNet transformer performs the lyrics classification task. Each classifier outputs probability weights and a predicted label over positive, neutral, and negative emotions, and the two outputs are combined using the stacking ensemble method. The combined outputs are then used to train an artificial neural network (ANN) model to obtain the best probability weighting. The multimodal system achieves its best performance with an accuracy of 80.56%. The results show that the multimodal method of recognizing musical emotions performs better than the single-modal method. In addition, hyperparameter tuning can affect the performance of multimodal systems.
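As a rough illustration of the stacking step described in the abstract, the sketch below concatenates the class probabilities produced by the two base classifiers and trains a small ANN meta-learner on them. The layer sizes, training data, and variable names are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the stacking ensemble, assuming each branch already
# outputs class probabilities over (positive, neutral, negative).
import numpy as np
from tensorflow.keras import layers, models

n_classes = 3  # positive, neutral, negative

# Stand-ins for the base-model outputs on the training set:
# audio_probs from the CNN-LSTM, lyric_probs from the XLNet classifier.
audio_probs = np.random.rand(100, n_classes)
lyric_probs = np.random.rand(100, n_classes)
y_train = np.random.randint(0, n_classes, 100)  # stand-in labels

# Stacking: concatenate the two probability vectors per song and train
# a small ANN meta-learner on the result.
stacked = np.concatenate([audio_probs, lyric_probs], axis=1)

meta = models.Sequential([
    layers.Input(shape=(2 * n_classes,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),
])
meta.compile(optimizer="adam",
             loss="sparse_categorical_crossentropy",
             metrics=["accuracy"])
meta.fit(stacked, y_train, epochs=20, verbose=0)
```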
Classifying possible hate speech from text with deep learning and ensemble on embedding method
Caprisiano, Ebenhaiser Jonathan; Ramadhansyah, Muhammad Hafizh; Zahra, Amalia
Bulletin of Electrical Engineering and Informatics Vol 13, No 3: June 2024
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v13i3.6041

Abstract

Hate speech can be defined as the use of language to express hatred towards another party. Twitter is one of the most widely used social media platforms. In addition to posting their own content, users can provide feedback through comments, and some users, intentionally or unintentionally, leave negative comments. Even though there are regulations prohibiting hate speech, such comments still appear. Using the deep learning method with the long short-term memory (LSTM) model, a classifier of possible hate speech in Twitter messages is built. With the ensemble method, term frequency times inverse document frequency (TF-IDF) combined with global vectors (GloVe) achieves 86% accuracy, better than the stand-alone word-to-vector (Word2Vec) method, which only reaches 80%. From these results, it can be concluded that the ensemble method improves accuracy over the stand-alone method and can yield better deep learning performance than using one embedding method alone.
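The abstract does not spell out how TF-IDF and GloVe are fused, so the sketch below shows one common combination: weighting each token's GloVe vector by its TF-IDF score and averaging, giving one feature vector per tweet that a downstream classifier (such as the LSTM) could consume. The toy texts, labels, and random stand-in vectors are illustrative only.

```python
# One plausible TF-IDF + GloVe fusion (the paper's exact scheme is not
# given): TF-IDF-weighted average of the GloVe vectors in each text.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["you are awful", "have a nice day", "awful hateful words"]
labels = [1, 0, 1]  # 1 = possible hate speech (toy labels)

# Stand-in for pretrained GloVe vectors (normally loaded from glove.*.txt).
dim = 4
rng = np.random.default_rng(0)
glove = {w: rng.normal(size=dim)
         for w in "you are awful have nice day hateful words".split()}

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
vocab = tfidf.get_feature_names_out()

def embed(doc_row):
    """TF-IDF-weighted average of the GloVe vectors present in one text."""
    vec, total = np.zeros(dim), 0.0
    for idx in doc_row.indices:
        word, weight = vocab[idx], doc_row[0, idx]
        if word in glove:
            vec += weight * glove[word]
            total += weight
    return vec / total if total else vec

features = np.vstack([embed(X[i]) for i in range(X.shape[0])])
# `features` would then be fed to the downstream classifier.
print(features.shape)  # (3, 4)
```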
Speech emotion recognition with optimized multi-feature stack using deep convolutional neural networks
Fadhil, Muhammad Farhan; Zahra, Amalia
Bulletin of Electrical Engineering and Informatics Vol 13, No 6: December 2024
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v13i6.6044

Abstract

Human emotion in communication plays a significant role and can influence how the context of a message is perceived by others. Speech emotion recognition (SER) is an intriguing field of study to explore because human-computer interaction (HCI) technologies implemented today, such as virtual assistants, rarely consider the emotion contained in human speech. One of the most widely used approaches to SER is to extract speech features such as the mel frequency cepstral coefficients (MFCC), mel-spectrogram, spectral contrast, tonnetz, and chromagram from the signal and use a one-dimensional (1D) convolutional neural network (CNN) as the classifier. This study shows the impact of combining an optimized multi-feature stack with an optimized 1D deep CNN model. The proposed model achieves an accuracy of 90.10% when classifying 8 different emotions on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset.
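The five features named in the abstract correspond to standard librosa extractors. The sketch below builds a plausible multi-feature stack by time-averaging each feature and concatenating them, then defines a small 1D CNN over the stacked vector; the exact stacking and optimization in the paper may differ, and the layer sizes are assumptions.

```python
# A plausible multi-feature stack: each feature is averaged over time and
# the results are concatenated into one 1D vector per clip.
import numpy as np
import librosa
from tensorflow.keras import layers, models

def extract_feature_stack(path):
    y, sr = librosa.load(path, sr=22050)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1)
    tonnetz = np.mean(librosa.feature.tonnetz(
        y=librosa.effects.harmonic(y), sr=sr), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    # 40 + 128 + 7 + 6 + 12 = 193 values per clip.
    return np.concatenate([mfcc, mel, contrast, tonnetz, chroma])

# Illustrative 1D CNN over the stacked feature vector.
model = models.Sequential([
    layers.Input(shape=(193, 1)),
    layers.Conv1D(64, 5, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(8, activation="softmax"),  # 8 RAVDESS emotion classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```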
Enhancing speech emotion recognition with deep learning using multi-feature stacking and data augmentation
Al Mukarram, Khasyi; Mukhlas, M. Anang; Zahra, Amalia
Bulletin of Electrical Engineering and Informatics Vol 13, No 3: June 2024
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v13i3.6049

Abstract

This study evaluates the effectiveness of data augmentation on one-dimensional (1D) convolutional neural network (CNN) and transformer models for speech emotion recognition (SER) on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset. The results show that data augmentation has a positive impact on emotion classification accuracy. Techniques such as noising, pitching, stretching, shifting, and speeding are applied to increase data variation and overcome class imbalance. The 1D CNN model with data augmentation achieved 94.5% accuracy, while the transformer model with data augmentation performed even better at 97.5%. This research is expected to contribute insights for developing accurate emotion recognition methods that use data augmentation with these models to improve classification accuracy on the RAVDESS dataset. Further research can explore larger and more diverse datasets and alternative model approaches.
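The five augmentations named in the abstract have straightforward waveform-level implementations; the sketch below gives one illustrative version of each, with parameter values chosen as examples rather than taken from the paper.

```python
# Illustrative waveform-level versions of the five augmentations.
import numpy as np
import librosa

def noising(y, noise_factor=0.005):
    # Add low-amplitude Gaussian noise.
    return y + noise_factor * np.random.randn(len(y))

def pitching(y, sr, n_steps=2):
    # Shift the pitch up by n_steps semitones without changing duration.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def stretching(y, rate=1.1):
    # Stretch/compress time without changing pitch.
    return librosa.effects.time_stretch(y, rate=rate)

def shifting(y, shift=1600):
    # Roll the waveform by `shift` samples.
    return np.roll(y, shift)

def speeding(y, factor=1.2):
    # Naive resampling-based speed-up (also raises pitch).
    idx = np.round(np.arange(0, len(y), factor)).astype(int)
    return y[idx[idx < len(y)]]
```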
Multimodal speech emotion recognition optimization using genetic algorithm
Michael, Stefanus; Zahra, Amalia
Bulletin of Electrical Engineering and Informatics Vol 13, No 5: October 2024
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v13i5.7409

Abstract

Speech emotion recognition (SER) is a technology that can detect emotions in speech. Various methods have been used in developing SER, such as convolutional neural networks (CNNs), long short-term memory (LSTM), and multilayer perceptrons. However, beyond model selection, other techniques are often needed to improve SER performance, namely optimization methods. This paper compares manual hyperparameter tuning using grid search (GS) with hyperparameter tuning using a genetic algorithm (GA) on the LSTM model to demonstrate the performance increase in the multimodal SER model after optimization. The improvements in accuracy, precision, recall, and F1 score obtained by hyperparameter tuning using GA (HTGA) are 2.83%, 0.02, 0.05, and 0.04, respectively. Thus, HTGA obtains better results than the baseline hyperparameter tuning method using GS.
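As an illustration of GA-based hyperparameter tuning, the sketch below runs a compact genetic algorithm (selection, crossover, mutation) over three hypothetical LSTM hyperparameters. The `evaluate` function is a placeholder that would normally train the model and return validation accuracy, and the search space is an assumption, since the paper's space is not given here.

```python
# Compact genetic algorithm over an illustrative LSTM hyperparameter space.
import random

SPACE = {
    "units":   [32, 64, 128, 256],
    "dropout": [0.1, 0.2, 0.3, 0.5],
    "lr":      [1e-2, 1e-3, 1e-4],
}

def random_individual():
    return {k: random.choice(v) for k, v in SPACE.items()}

def evaluate(ind):
    # Placeholder fitness: train the LSTM with `ind` and return val accuracy.
    return random.random()

def crossover(a, b):
    # Uniform crossover: each gene comes from one of the two parents.
    return {k: random.choice([a[k], b[k]]) for k in SPACE}

def mutate(ind, rate=0.2):
    # Resample each gene from the search space with probability `rate`.
    return {k: (random.choice(SPACE[k]) if random.random() < rate else v)
            for k, v in ind.items()}

pop = [random_individual() for _ in range(10)]
for _ in range(5):  # generations
    scored = sorted(pop, key=evaluate, reverse=True)
    elite = scored[:4]  # keep the best individuals
    children = [mutate(crossover(*random.sample(elite, 2)))
                for _ in range(len(pop) - len(elite))]
    pop = elite + children

best = max(pop, key=evaluate)
print(best)
```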
Enhancing detection of zero-day phishing email attacks in the Indonesian language using deep learning algorithms
Roesmiatun Purnamadewi, Yasinta; Zahra, Amalia
Bulletin of Electrical Engineering and Informatics Vol 14, No 1: February 2025
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v14i1.8759

Abstract

Email phishing is a manipulative technique aimed at compromising information security and user privacy. To overcome the limitations of traditional detection methods, such as blacklists, this research proposes a phishing detection model that leverages natural language processing (NLP) and deep learning technologies to analyze Indonesian email headers. The primary objective is to detect zero-day phishing attacks more efficiently by focusing on the unique linguistic and cultural context of the Indonesian language, enabling models capable of recognizing phishing attack patterns that differ from those in other language contexts. Four models are tested, combining Indonesian bidirectional encoder representations from transformers (IndoBERT) and FastText feature extraction techniques with convolutional neural network (CNN) and long short-term memory (LSTM) deep learning algorithms. The results indicate that the combination of FastText and CNN achieved the highest performance in accuracy, precision, and F1-score metrics, each at 98.4375%. Meanwhile, the FastText model with LSTM showed the best recall, with a score of 98.9583%. The research suggests exploring email content more deeply, or integrating header and content analysis, in future studies to further improve accuracy and effectiveness in phishing email detection.
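A minimal sketch of the best-performing pairing (FastText features into a 1D CNN), assuming a small FastText model trained on tokenized headers and per-token vectors padded to a fixed length; the toy headers, labels, and layer sizes are illustrative, not the paper's setup.

```python
# Illustrative FastText + 1D CNN pipeline over tokenized email headers.
import numpy as np
from gensim.models import FastText
from tensorflow.keras import layers, models

headers = [["penawaran", "khusus", "klik", "tautan"],
           ["rapat", "tim", "besok", "pagi"]]
labels = np.array([1, 0])  # 1 = phishing (toy labels)

ft = FastText(sentences=headers, vector_size=32, window=3, min_count=1)

max_len = 8
def to_matrix(tokens):
    # Per-token FastText vectors, zero-padded to a fixed sequence length.
    vecs = [ft.wv[t] for t in tokens][:max_len]
    vecs += [np.zeros(32)] * (max_len - len(vecs))
    return np.stack(vecs)

X = np.stack([to_matrix(h) for h in headers])

cnn = models.Sequential([
    layers.Input(shape=(max_len, 32)),
    layers.Conv1D(64, 3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
cnn.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=["accuracy"])
cnn.fit(X, labels, epochs=5, verbose=0)
```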
The use of generative adversarial network as a domain adaptation method for cross-corpus speech emotion recognition
Farhan Fadhil, Muhammad; Zahra, Amalia
Bulletin of Electrical Engineering and Informatics Vol 14, No 1: February 2025
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v14i1.8339

Abstract

Research on speech emotion recognition (SER) is growing rapidly. However, SER still faces the cross-corpus problem: performance degrades when a single SER model is tested in a different domain. This study shows the impact of implementing a generative adversarial network (GAN) model to adapt speech data from different domains and performs emotion classification on the speech features using a 1D convolutional neural network (CNN) model. The study found that the domain adaptation approach using a GAN model could improve the accuracy of emotion classification on speech data from two different domains, the Ryerson audio-visual database of emotional speech and song (RAVDESS) corpus and the EMO-DB corpus, by 10.88% to 28.77%, with the highest average performance increase across three different class balancing methods reaching 18.433%.
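The paper's exact GAN variant is not described in the abstract, so the sketch below shows a generic adversarial feature-adaptation loop: a generator maps source-domain feature vectors toward the target-domain distribution while a discriminator tries to tell them apart, after which the adapted features would go to the 1D CNN classifier. Feature dimensions, data, and architectures are all assumptions.

```python
# Generic adversarial feature adaptation (illustrative, not the paper's
# exact GAN): align source-domain features with the target distribution.
import numpy as np
from tensorflow.keras import layers, models

dim = 128  # stand-in feature dimension
src = np.random.randn(256, dim).astype("float32") + 1.0  # e.g., RAVDESS features
tgt = np.random.randn(256, dim).astype("float32")        # e.g., EMO-DB features

generator = models.Sequential([
    layers.Input(shape=(dim,)),
    layers.Dense(dim, activation="relu"),
    layers.Dense(dim),
])
discriminator = models.Sequential([
    layers.Input(shape=(dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# The discriminator is frozen only inside the combined model, so the
# combined (GAN) update trains the generator alone.
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

ones, zeros = np.ones((256, 1)), np.zeros((256, 1))
for _ in range(200):
    fake = generator.predict(src, verbose=0)
    # Discriminator step: target features are "real", adapted ones "fake".
    discriminator.train_on_batch(np.vstack([tgt, fake]),
                                 np.vstack([ones, zeros]))
    # Generator step: make adapted source features look like target ones.
    gan.train_on_batch(src, ones)

adapted = generator.predict(src, verbose=0)  # then classify with the 1D CNN
```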