Emotions in speech are a fundamental element of human interaction and play an important role in decision-making, learning, and everyday communication. Research on speech emotion recognition remains active, with many researchers seeking models with better performance. In this research, we combine data augmentation techniques (Add Noise, Time Stretch, and Pitch Shift) to enlarge the Javanese Speech Emotion Database (Java-SED). Mel Frequency Cepstral Coefficients (MFCC) are used for feature extraction; a Convolutional Neural Network (CNN) model is then built, with a Multilayer Perceptron (MLP) applied to classify human emotions from speech. We produced eight experimental models from different combinations of the augmentation techniques. The CNN model parameters include 40 input neurons, four hidden layers with varying neuron counts, ReLU activation functions, L2 regularization, dropout, Adam optimization, and a ModelCheckpoint callback that saves the best model based on validation loss. In the evaluation, the model trained with the combination of Add Noise, Time Stretch, and Pitch Shift achieved the highest performance: 96.43% accuracy, 96.43% recall, 96.57% precision, 96.48% F1-score, and a kappa of 95.71%.
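The abstract gives only the high-level pipeline, so the Python sketch below illustrates one plausible reading of it: the three augmentations, 40-coefficient MFCC features, and a four-hidden-layer network with ReLU, L2 regularization, dropout, Adam, and a ModelCheckpoint on validation loss. The layer widths, dropout rate, augmentation magnitudes, checkpoint filename, and number of emotion classes are placeholder assumptions, not the authors' actual settings.

```python
# Minimal sketch of the described pipeline; specific magnitudes and layer
# widths below are assumptions, as the abstract reports only the structure.
import numpy as np
import librosa
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint

def augment(y, sr):
    """Return the three augmented variants of one waveform."""
    noisy = y + 0.005 * np.random.randn(len(y))                  # Add Noise (amplitude assumed)
    stretched = librosa.effects.time_stretch(y, rate=0.9)        # Time Stretch (rate assumed)
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # Pitch Shift (steps assumed)
    return [noisy, stretched, shifted]

def extract_mfcc(y, sr, n_mfcc=40):
    """40 MFCC coefficients averaged over time -> one 40-dim feature vector,
    matching the model's 40 input neurons."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc.T, axis=0)

def build_model(num_classes=8):  # number of emotion classes assumed
    """Four hidden layers with varying neuron counts, ReLU, L2, and dropout."""
    model = Sequential([
        Input(shape=(40,)),
        Dense(256, activation="relu", kernel_regularizer=l2(0.001)),
        Dropout(0.3),
        Dense(128, activation="relu", kernel_regularizer=l2(0.001)),
        Dropout(0.3),
        Dense(64, activation="relu", kernel_regularizer=l2(0.001)),
        Dropout(0.3),
        Dense(32, activation="relu", kernel_regularizer=l2(0.001)),
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=Adam(), loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Keep only the weights with the lowest validation loss, as described.
checkpoint = ModelCheckpoint("best_model.keras", monitor="val_loss",
                             save_best_only=True)
```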