This study addresses key challenges in Indonesian sentiment analysis related to preprocessing, labeling strategies, and class imbalance. It compares the performance of BiLSTM and IndoBERT on user reviews collected from Tokopedia. The dataset was labeled both manually and automatically, then processed under three preprocessing schemes. Both models were trained with tuned hyperparameters and imbalance-handling techniques and evaluated over twenty repetitions of stratified five-fold cross-validation, with performance assessed using balanced accuracy and F1-score. IndoBERT achieved the highest results, reaching balanced accuracy of up to 0.85 and F1-scores of up to 0.83, while BiLSTM reached up to 0.78 and 0.76, respectively. Applying class weighting and focal loss improved model performance by approximately 2% to 11% over the baseline. BiLSTM was more efficient to train, requiring only 1 to 2.5 minutes per epoch versus IndoBERT's 2.6 to 3.6 minutes. Although manual labeling remained superior at capturing contextual nuance and emotional cues, GPT-based labeling showed strong agreement with the human annotations. A four-way ANOVA revealed that all main factors and several interactions significantly influenced classification outcomes. Overall, BiLSTM offers faster training, whereas IndoBERT delivers higher predictive accuracy.
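The abstract names class weighting and focal loss as the imbalance-handling techniques. As a minimal sketch of the general idea (not the paper's implementation; the function name, toy probabilities, and weight values below are illustrative assumptions), focal loss down-weights easy, well-classified examples via a (1 − p_t)^γ factor while a per-class weight α_c boosts the minority class:

```python
import numpy as np

def focal_loss(probs, labels, alpha, gamma=2.0):
    """Mean focal loss for multi-class predictions.

    probs:  (N, C) predicted class probabilities
    labels: (N,) integer class ids
    alpha:  (C,) per-class weights (e.g. from inverse class frequency)
    gamma:  focusing parameter; gamma=0 recovers weighted cross-entropy
    """
    pt = probs[np.arange(len(labels)), labels]  # probability of the true class
    return float(np.mean(-alpha[labels] * (1.0 - pt) ** gamma * np.log(pt)))

# Toy imbalanced example: class 1 is the minority, so it gets a larger weight.
probs = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.3, 0.7]])
labels = np.array([0, 0, 1])
alpha = np.array([0.25, 0.75])  # hypothetical class weights
loss = focal_loss(probs, labels, alpha)
```

With γ = 2, the two confidently classified class-0 reviews contribute almost nothing, so the loss is dominated by the harder minority-class example; this is the mechanism behind the reported 2% to 11% gains over an unweighted baseline.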