The scarcity of Indonesian speech emotion corpora hinders the development of Speech Emotion Recognition (SER) systems, despite growing demand in sectors such as customer service and human-computer interaction. To address this, we developed the Maleo Emotion Audio Corpus, a collection of three-second audio clips sourced from YouTube and labeled with seven emotions (angry, neutral, disgusted, sad, happy, afraid, and surprised). The audio data underwent preprocessing, feature extraction (Mel-frequency cepstral coefficients (MFCC), zero-crossing rate (ZCR), energy, spectral roll-off, and spectral flux), and augmentation. The classifier is a 1D Convolutional Neural Network (CNN) with four convolutional layers, adapted to the features extracted from the three-second clips. The model achieved 94.48% accuracy on the test data. Performance was balanced across classes, with per-class F1-scores ranging from 0.87 for 'sad' to 0.98 for 'neutral', indicating that no single class dominated the results. These findings indicate that the corpus and model architecture can reliably recognize emotions from Indonesian speech in a locally relevant context. The Maleo Emotion collection is available at https://doi.org/10.57967/hf/6144.
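As an illustration of the frame-level descriptors named above, the following is a minimal NumPy sketch of ZCR, energy, spectral roll-off, and spectral flux for a single three-second clip. It is not the authors' pipeline: the sampling rate (22050 Hz), frame length (2048), hop size (512), Hann window, and 85% roll-off threshold are all assumptions chosen for illustration, and the function names are hypothetical. MFCC is omitted because it additionally requires a mel filterbank, which is usually taken from an audio library.

```python
import numpy as np


def frame_signal(x, frame_len=2048, hop=512):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def extract_features(x, sr=22050, frame_len=2048, hop=512, rolloff_pct=0.85):
    """Return an array of shape (4, n_frames): ZCR, energy, roll-off, flux.

    All parameter defaults are illustrative assumptions, not values
    reported for the Maleo Emotion Audio Corpus.
    """
    frames = frame_signal(x, frame_len, hop)

    # Zero-crossing rate: fraction of adjacent sample pairs whose sign differs.
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

    # Short-time energy: mean squared amplitude of each frame.
    energy = np.mean(frames ** 2, axis=1)

    # Magnitude spectrum of each Hann-windowed frame.
    mag = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)

    # Spectral roll-off: lowest frequency at which the cumulative magnitude
    # reaches rolloff_pct of the frame's total magnitude.
    cumulative = np.cumsum(mag, axis=1)
    threshold = rolloff_pct * cumulative[:, -1:]
    rolloff = freqs[np.argmax(cumulative >= threshold, axis=1)]

    # Spectral flux: Euclidean distance between successive magnitude spectra.
    # The first frame has no predecessor, so its flux is set to zero.
    flux = np.zeros(len(frames))
    flux[1:] = np.sqrt(np.sum(np.diff(mag, axis=0) ** 2, axis=1))

    return np.stack([zcr, energy, rolloff, flux], axis=0)
```

Under these assumed settings, a three-second clip yields a feature matrix with one row per descriptor and one column per frame, which is the kind of sequence a 1D CNN can convolve over along the time axis.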
Copyright © 2026