Speech emotion classification has become an important technology in the era of artificial intelligence. However, research on the Indonesian language remains limited, and existing methods rely predominantly on conventional machine learning approaches that reach a maximum accuracy of only 90%. These traditional methods struggle to capture the complex temporal dependencies and bidirectional contextual patterns inherent in emotional speech, particularly given Indonesian prosodic characteristics. To address this limitation, this study combines Mel-Frequency Cepstral Coefficient (MFCC) feature extraction with a Bidirectional Long Short-Term Memory (BiLSTM) model and audio augmentation for Indonesian speech emotion classification. The IndoWaveSentiment dataset contains 300 audio recordings from 10 respondents across five emotion classes: neutral, happy, surprised, disgusted, and disappointed. Audio augmentation at a 2:1 ratio using five methods expanded the dataset to 900 samples. MFCC feature extraction produced 40 coefficients per frame, which were processed by a BiLSTM architecture with two bidirectional layers (256 and 128 units). The model was trained with the Adam optimizer and early stopping. The results show a best accuracy of 93.33%, with a precision of 93.7%, recall of 93.3%, and F1-score of 93.3%. The "surprised" class achieved perfect performance (100%), while "happy" had the lowest accuracy (88.89%). These results surpass previous benchmarks on the same dataset obtained with Random Forest (90%) and Gradient Boosting (85%). This study demonstrates the effectiveness of combining MFCC features, a BiLSTM model, and audio augmentation in capturing Indonesian speech emotion characteristics for the development of voice-based emotion recognition systems.
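
To make the described pipeline concrete, the following is a minimal sketch, assuming librosa and Keras (TensorFlow) as the tooling. The 40 MFCC coefficients, the two bidirectional layers (256 and 128 units), the Adam optimizer, and early stopping follow the abstract; the sampling rate, frame padding length, batch size, patience, and data-loading details are illustrative assumptions, and the five augmentation methods are not shown since the abstract does not name them.

```python
# Sketch of the MFCC + BiLSTM pipeline described in the abstract.
# Values marked "assumed" are illustrative, not taken from the paper.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

N_MFCC = 40        # number of MFCC coefficients, as stated in the abstract
MAX_FRAMES = 200   # assumed fixed frame count for padding/truncation
N_CLASSES = 5      # neutral, happy, surprised, disgusted, disappointed

def extract_mfcc(path, sr=16000):
    """Load one audio file and return a (MAX_FRAMES, N_MFCC) MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)  # sr=16000 is an assumption
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T  # (frames, 40)
    # Pad or truncate to a fixed number of frames so samples can be batched.
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

def build_model():
    """Two stacked bidirectional LSTM layers (256 and 128 units)."""
    model = models.Sequential([
        layers.Input(shape=(MAX_FRAMES, N_MFCC)),
        layers.Bidirectional(layers.LSTM(256, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(128)),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()

# Training with early stopping; X_train/y_train would hold the MFCC
# matrices and labels of the augmented 900-sample dataset (hypothetical
# variable names, shown for illustration only).
# stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
#                                restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, batch_size=32, callbacks=[stop])
```

Stacking a sequence-returning bidirectional layer before a second one that collapses the sequence is a common way to let the model read each utterance in both temporal directions before classification, which is the property the abstract credits for surpassing the Random Forest and Gradient Boosting baselines.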