Indonesian social media platforms carry large volumes of unverified health information. Yet research on Indonesian health-text mining rarely addresses disease-based classification; most studies focus on sentiment or general topic categorization. This study proposes a multi-class classification approach that combines IndoBERT embeddings with gradient-boosting classifiers (XGBoost and LightGBM) to categorize tweets into three classes: diabetes, hypertension, and heart disease. The dataset comprises 4,075 tweets collected from X (formerly Twitter). Preprocessing consists of text cleaning, anonymization, and normalization, followed by extraction of 768-dimensional IndoBERT embeddings. Experiments are run in Google Colab (Intel Xeon CPU, 13 GB RAM, optional NVIDIA T4 GPU) using stratified five-fold cross-validation. The IndoBERT × LightGBM pipeline performs best, with an accuracy of 0.8526 and a macro-averaged F1-score of 0.8527, outperforming the IndoBERT × XGBoost pipeline (accuracy 0.8325, macro-averaged F1-score 0.8326). Feature-importance analysis indicates that contextual terms related to blood sugar, the heart, and blood pressure strongly influence the predictions. Overall, the proposed method provides an effective baseline for monitoring health-related text and supporting disease-oriented analytics on Indonesian-language social media.
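The evaluation protocol named above (stratified five-fold splitting, accuracy, and macro-averaged F1 over three classes) can be illustrated with a minimal sketch. It uses only the Python standard library and toy labels; the label strings, the helper names `stratified_folds`, `accuracy`, and `macro_f1`, and the round-robin fold assignment are assumptions made for illustration, not the authors' implementation, which may well rely on library routines instead.

```python
import random
from collections import defaultdict

# Hypothetical label strings for the three disease classes.
CLASSES = ["diabetes", "hypertension", "heart_disease"]


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so each fold keeps similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's shuffled indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores, F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 10 samples per class, split into 5 stratified folds.
toy_labels = [c for c in CLASSES for _ in range(10)]
folds = stratified_folds(toy_labels, k=5)

# Toy predictions for one held-out fold of 12 samples (not real results).
y_true = ["diabetes"] * 4 + ["hypertension"] * 4 + ["heart_disease"] * 4
y_pred = (
    ["diabetes"] * 3 + ["hypertension"]      # one diabetes tweet misclassified
    + ["hypertension"] * 4                   # all hypertension tweets correct
    + ["heart_disease"] * 3 + ["diabetes"]   # one heart-disease tweet misclassified
)

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, CLASSES)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")
```

Because macro averaging weights every class equally, accuracy and macro F1 stay close when classes are balanced and errors are spread evenly, which is consistent with the near-identical pairs of scores reported above.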
Copyright © 2026