Purpose: This study aims to develop and realistically evaluate a reliable model for identifying helpful online reviews, particularly for Indonesian-language texts, which are often informal and challenging to process. Methods: The study addresses several key challenges in predicting review helpfulness: the relative effectiveness of numerical features derived from metadata compared with traditional text representations (TF-IDF, FastText) on noisy data; the impact of severe class imbalance; and the limitations of standard validation compared with time-based validation. To address these challenges, we built an XGBoost model and evaluated various feature combinations. A hybrid approach combining SMOTE and scale_pos_weight was applied to handle class imbalance, and the best configuration was further assessed using time-based validation to better simulate real-world conditions. Results: The model based on numerical features consistently outperformed the text-based model, achieving a peak macro F1-score of 0.7214. It also surpassed the IndoBERT baseline (F1-score = 0.6400) and the RCNN FastText baseline (F1-score = 0.5317), indicating that simpler feature-driven models can provide more reliable predictions on noisy review data. Time-based validation further revealed a performance decline of up to 8.06%, confirming the presence of concept drift and showing that standard validation tends to yield overly optimistic estimates. Novelty: The main contribution of this research lies in offering a robust evaluation methodology and demonstrating the superiority of metadata-based approaches in this context. By quantifying performance degradation through temporal validation, this study provides a more realistic benchmark for real-world applications and highlights the critical importance of regular model retraining.
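The sketch below is a minimal illustration, not the authors' reported pipeline, of how a hybrid SMOTE + scale_pos_weight XGBoost classifier can be evaluated with time-ordered splits and a macro F1-score. The synthetic metadata features, the SMOTE sampling ratio, the number of temporal folds, and the XGBoost hyperparameters are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact configuration):
# hybrid SMOTE + scale_pos_weight with XGBoost, evaluated on chronological folds.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for numerical metadata features of reviews,
# ordered chronologically (row index = posting time).
n = 5000
X = np.column_stack([
    rng.poisson(3, n),          # e.g. number of attached images (assumed feature)
    rng.integers(1, 6, n),      # star rating (assumed feature)
    rng.exponential(80, n),     # review length in characters (assumed feature)
])
y = (rng.random(n) < 0.12).astype(int)   # ~12% "helpful": severe class imbalance

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Partially oversample the minority class on the training fold only,
    # then let scale_pos_weight compensate for the remaining imbalance.
    X_res, y_res = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X_tr, y_tr)
    spw = (y_res == 0).sum() / (y_res == 1).sum()

    model = XGBClassifier(
        n_estimators=300,
        max_depth=6,
        learning_rate=0.1,
        scale_pos_weight=spw,
        eval_metric="logloss",
    )
    model.fit(X_res, y_res)
    scores.append(f1_score(y_te, model.predict(X_te), average="macro"))

print(f"Time-based macro F1 per fold: {np.round(scores, 4)}")
```

Because each test fold lies strictly after its training fold in time, the per-fold macro F1 scores make any drift-related degradation visible, which is the kind of decline the abstract quantifies at up to 8.06% relative to standard validation.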