The spread of hoaxes on social media has become a systemic threat that can trigger opinion polarization, mass panic, and social instability. Previous research has focused primarily on detecting hoaxes through classification, while efforts to predict how far a hoax will spread remain limited. This study develops a machine learning model that predicts the propagation level of hoax content on social media (low, medium, or high) and identifies the factors that contribute most to its virality. The dataset was collected from the TurnBackHoaks and MAFINDO repositories and comprises 2,500 Indonesian-language hoax items published in 2022-2023. The extracted features comprised text features (TF-IDF and sentiment scores), temporal features (upload time), and early engagement features (numbers of likes, shares, and comments within the first hour). Three algorithms were compared: Logistic Regression, Random Forest, and XGBoost, with class imbalance handled using SMOTE. XGBoost achieved the best performance with a macro-averaged F1-score of 0.82, outperforming Random Forest (0.79) and Logistic Regression (0.70). SHAP analysis showed that early engagement (shares and likes within the first hour) was the dominant predictor, followed by content emotionality and nighttime uploads. The model achieved a recall of 0.85 on the high-spread class, indicating its potential for integration into early warning systems operated by social media platforms and fact-checking organizations. This research contributes to the development of predictive approaches to disinformation mitigation and to the strengthening of digital literacy in Indonesia.
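To make the reported metrics concrete, the following is a minimal sketch of how a macro-averaged F1-score and a per-class recall can be computed for the three propagation levels. It is an illustration only: the function names `per_class_scores` and `macro_f1` are hypothetical, and the labels are made-up toy data, not the study's dataset or its evaluation code.

```python
LABELS = ("low", "medium", "high")  # the three propagation levels


def per_class_scores(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall, f1)}, computed one-vs-rest."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)


# Toy labels for illustration only (assumed, not taken from the study).
y_true = ["low", "low", "low", "low", "medium", "medium", "high", "high"]
y_pred = ["low", "low", "low", "medium", "medium", "high", "high", "high"]

scores = per_class_scores(y_true, y_pred)
print("high-spread recall:", scores["high"][1])
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

Because the macro average weights every class equally, weak performance on a rare high-spread class lowers the score as much as weak performance on a common class, which is why it suits imbalanced problems like this one.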
Copyright © 2025