Speech emotion recognition (SER) is the study of machines' ability to comprehend patterns in voice data using a range of methods and features. However, its use remains limited due to the inherent difficulty machines face in accurately discerning emotions. This research applied a widely used method, the convolutional neural network (CNN), developed here to achieve high accuracy, with spectrogram features chosen for their capacity to capture frequency information in the RAVDESS dataset. The dataset comprised 2068 voice samples classified into five emotion classes: angry, afraid, happy, sad, and neutral. All data were augmented with noise injection, pitch shifting, time shifting, time stretching, and speed increases and decreases to replicate real-world conditions. Training explored several hyperparameters: learning rate, dropout rate, kernel initializer, weight decay, optimizer, epochs, and batch size. The resulting CNN achieved its best score of 0.7449 with the parameter values {'weight_decay': 1e-07, 'optimizer': 'adamw', 'learning_rate': 0.001, 'kernel_initializer': 'he_normal', 'dropout_rate': 0.5, 'epochs': 100, 'batch_size': 48}. The model demonstrated an overall accuracy of 75.85% on the training data and 51.64% on the test data, indicating that it recognized existing patterns but had difficulty generalizing to new data. Nevertheless, the ROC curve values indicate that the model can differentiate voice data into its respective classes, with values of 0.84 for angry, 0.79 for afraid, 0.83 for happy, 0.80 for sad, and 0.90 for neutral.
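The augmentation steps named above (noise, shifting, stretching, and speed changes) can be sketched in plain NumPy. This is an illustrative sketch only, not the authors' implementation: function names and parameter values are assumptions, and pitch shifting is omitted because it typically requires a dedicated DSP library (e.g. librosa).

```python
import numpy as np

def add_noise(signal, noise_factor=0.005, rng=None):
    """Inject Gaussian noise scaled by noise_factor (assumed scale)."""
    rng = rng or np.random.default_rng(0)
    return signal + noise_factor * rng.standard_normal(len(signal))

def time_shift(signal, shift):
    """Circularly shift the waveform by `shift` samples."""
    return np.roll(signal, shift)

def change_speed(signal, rate):
    """Resample by linear interpolation: rate > 1 speeds up (shorter output),
    rate < 1 slows down (longer output)."""
    n_out = int(len(signal) / rate)
    old_idx = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(signal)), signal)

# Toy waveform standing in for a RAVDESS clip
x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1000))
noisy = add_noise(x)
shifted = time_shift(x, 100)
fast = change_speed(x, 1.5)   # higher speed: fewer samples
slow = change_speed(x, 0.75)  # lower speed: more samples
```

In practice each augmented copy would then be converted to a spectrogram (e.g. via a short-time Fourier transform) before being fed to the CNN, since the study uses spectrogram features as input.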
Copyright © 2024