Journal: Bulletin of Electrical Engineering and Informatics

Spoken language identification using i-vectors, x-vectors, PLDA and logistic regression
Ahmad Iqbal Abdurrahman; Amalia Zahra
Bulletin of Electrical Engineering and Informatics Vol 10, No 4: August 2021
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v10i4.2893

Abstract

In this paper, i-vectors and x-vectors are used to extract features from speech signals in three local Indonesian languages, namely Javanese, Sundanese, and Minang, to help a classifier identify the language being spoken. Probabilistic linear discriminant analysis (PLDA) is used as the baseline classifier, and logistic regression is used as well because prior studies show that it outperforms PLDA in classifying speech data. Once extracted, the features are classified with these classifiers. In the experiment, the test data were segmented into three durations: 3, 10, and 30 seconds. The study is expanded by testing multiple parameters of the i-vector and x-vector methods and comparing the performance of PLDA and logistic regression as classifiers. With PLDA as the classifier, the x-vector scores better than the i-vector for every segment duration; with logistic regression, however, the i-vector still achieves better accuracy than the x-vector.
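
As a rough illustration of the classification stage this abstract describes (not the authors' implementation), the sketch below classifies pre-extracted embeddings with scikit-learn. Linear discriminant analysis stands in for PLDA, since scikit-learn ships no PLDA, and the 512-dimensional "x-vectors" with three-class labels are synthetic placeholders for real Javanese, Sundanese, and Minang utterance embeddings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Placeholder "x-vectors": 200 embeddings per language, class means shifted apart.
X = rng.normal(size=(600, 512)) + np.repeat(np.eye(3, 512) * 3.0, 200, axis=0)
y = np.repeat([0, 1, 2], 200)  # 0 = Javanese, 1 = Sundanese, 2 = Minang

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("LDA (PLDA stand-in)", LinearDiscriminantAnalysis())]:
    clf.fit(X_tr, y_tr)
    print(f"{name}: {accuracy_score(y_te, clf.predict(X_te)):.3f}")
```

In a real pipeline the embeddings would come from an upstream i-/x-vector extractor (e.g., Kaldi or SpeechBrain) rather than from random draws.
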
Spoken language identification on 4 Indonesian local languages using deep learning
Panji Wijonarko; Amalia Zahra
Bulletin of Electrical Engineering and Informatics Vol 11, No 6: December 2022
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v11i6.4166

Abstract

Language identification is at the forefront of assistance in many applications, including multilingual speech systems, spoken language translation, multilingual speech recognition, and human-machine interaction via voice. The identification of Indonesian local languages using spoken language identification technology has enormous potential to advance tourism and digital content in Indonesia. The goal of this study is to identify four Indonesian local languages, Javanese, Sundanese, Minangkabau, and Buginese, using deep learning classification techniques: artificial neural network (ANN), convolutional neural network (CNN), and long short-term memory (LSTM). The mel-frequency cepstral coefficient (MFCC) is the feature selected for audio data extraction. The results showed that the LSTM model had the highest accuracy for each speech duration (3 s, 10 s, and 30 s), followed by the CNN and ANN models.
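
The pipeline this abstract outlines, MFCC features feeding a recurrent classifier, can be sketched as follows. librosa and Keras are assumed tooling here, not necessarily what the authors used, and the hyperparameters (13 coefficients, 64 LSTM units) are illustrative.

```python
import librosa
import tensorflow as tf

NUM_LANGS = 4  # Javanese, Sundanese, Minangkabau, Buginese

def mfcc_features(path, sr=16000, n_mfcc=13, duration=3.0):
    """Load a clip, pad/trim it to a fixed duration, return (frames, n_mfcc) MFCCs."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = librosa.util.fix_length(y, size=int(sr * duration))
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 13)),  # any number of frames, 13 MFCCs each
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(NUM_LANGS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Swapping the LSTM layer for Conv1D or Dense layers would yield the CNN and ANN variants the study compares.
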
Multi-feature stacking order impact on speech emotion recognition performance
Yoga Tanoko; Amalia Zahra
Bulletin of Electrical Engineering and Informatics Vol 11, No 6: December 2022
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v11i6.4287

Abstract

One of the biggest challenges in implementing speech emotion recognition (SER) is producing a model that performs well and is lightweight. One approach is to use a one-dimensional convolutional neural network (1D CNN) and combine several handcrafted features. 1D CNNs are mostly used for time series data, where the order of information plays an important role; in this case, the order of the stacked features also plays an important role. In this work, the impact of changing that order is analyzed. This work proposes brute-forcing all possible orderings of five features, mel-frequency cepstral coefficient (MFCC), Mel-spectrogram, chromagram, spectral contrast, and tonnetz, then using a 1D CNN as the model architecture and benchmarking performance on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset. The results show that changing the feature order can affect overall classification accuracy, per-emotion accuracy, and model size. The best model achieves 79.17% accuracy for classifying 8 emotion classes with the following order: spectral contrast, tonnetz, chromagram, Mel-spectrogram, and MFCC. Finding a suitable order can increase accuracy by up to 16.05% and reduce model size by up to 96%.
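
The brute-force search over stacking orders can be illustrated as below; this is a sketch under assumed librosa parameters, not the paper's code, and each of the 5! = 120 permutations would feed a separately trained and scored 1D CNN.

```python
from itertools import permutations
import numpy as np
import librosa

def extract(y, sr):
    """The five per-frame feature blocks named in the abstract,
    listed here in the best order the paper reports."""
    return {
        "contrast": librosa.feature.spectral_contrast(y=y, sr=sr),
        "tonnetz": librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
        "chroma": librosa.feature.chroma_stft(y=y, sr=sr),
        "mel": librosa.feature.melspectrogram(y=y, sr=sr),
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40),
    }

y, sr = librosa.load(librosa.ex("trumpet"))  # stand-in clip; the paper uses RAVDESS
feats = extract(y, sr)
for order in permutations(feats):            # 5! = 120 candidate orders
    x = np.concatenate([feats[k] for k in order], axis=0)  # (features, frames)
    # ...train and evaluate a 1D CNN on x, keeping the best-scoring order...
```
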
Stacking ensemble learning for optical music recognition
Francisco Calvin Arnel Ferano; Amalia Zahra; Gede Putra Kusuma
Bulletin of Electrical Engineering and Informatics Vol 12, No 5: October 2023
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v12i5.5129

Abstract

The development of music culture has resulted in a problem called optical music recognition (OMR). OMR is a computer vision task that explores algorithms and models for recognizing musical notation. This study proposes a stacking ensemble learning model for the OMR task on common western musical notation (CWMN). The ensemble uses four deep convolutional neural network (DCNN) models, namely ResNeXt50, Inception-V3, RegNetY-400MF, and EfficientNet-V2-S, as base classifiers. This study also analyses which technique is most appropriate as the ensemble's meta-classifier, evaluating several machine learning techniques: support vector machine (SVM), logistic regression (LR), random forest (RF), K-nearest neighbor (KNN), decision tree (DT), and Naïve Bayes (NB). Six publicly available OMR datasets, HOMUS_V2, Rebelo1, Rebelo2, Fornes, OpenOMR, and PrintedMusicSymbols, are combined, downsampled, and used to test the proposed model. The proposed ensemble outperforms the model built in the previous study, achieving outstanding accuracy and F1-scores with best values of 97.51% and 97.52%, respectively, both obtained with the LR meta-classifier.
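
A simplified sketch of the stacking step follows: each base DCNN's class-probability outputs on held-out data are concatenated into meta-features for a logistic regression meta-classifier. The base networks are stubbed with random probabilities here; in practice they would be the trained models the abstract lists, and the symbol-class count is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_val, n_classes, n_base = 1000, 32, 4  # 32 symbol classes is a placeholder

# Stub for per-model predicted probabilities on a held-out validation split.
base_probs = [rng.dirichlet(np.ones(n_classes), size=n_val) for _ in range(n_base)]
y_val = rng.integers(0, n_classes, size=n_val)

meta_X = np.hstack(base_probs)  # shape: (n_val, n_base * n_classes)
meta_clf = LogisticRegression(max_iter=1000).fit(meta_X, y_val)
print(meta_clf.predict(meta_X[:5]))
```

The meta-classifier thus learns which base model to trust for which symbol class, which is the usual motivation for stacking over simple probability averaging.
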
Data augmentation and enhancement for multimodal speech emotion recognition
Jonathan Christian Setyono; Amalia Zahra
Bulletin of Electrical Engineering and Informatics Vol 12, No 5: October 2023
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v12i5.5031

Abstract

A fundamental human need is interaction with one another, for example through conversation or speech. It is therefore valuable to analyze speech with computer technology to determine emotions. The speech emotion recognition (SER) method detects emotions in speech by examining its various aspects; SER is a supervised method for deciding the emotion class of an utterance. This research proposes a multimodal SER model using one of the deep learning-based enhancement techniques, the attention mechanism. Additionally, this research addresses the imbalanced-dataset problem in the SER field by using generative adversarial networks (GANs) as a data augmentation technique. The proposed model achieved an excellent evaluation performance of 0.96 (96%) for the proposed GAN configuration. This work shows that the GAN method in the multimodal SER model can enhance performance and create a balanced dataset.
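
To make the augmentation idea concrete, here is a minimal sketch (assumed architecture sizes, Keras as tooling; not the paper's configuration) of a GAN over fixed-size feature vectors whose trained generator would synthesize extra samples for under-represented emotion classes.

```python
import tensorflow as tf

FEAT_DIM, NOISE_DIM = 128, 32  # illustrative sizes

generator = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NOISE_DIM,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(FEAT_DIM),                 # synthetic feature vector
])

discriminator = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(FEAT_DIM,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # real vs. generated
])

# Adversarial training on real minority-class features is omitted for brevity.
# Once trained, the generator oversamples the minority class:
noise = tf.random.normal((256, NOISE_DIM))
synthetic = generator(noise)  # appended to the minority class to balance the set
```
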