Manipulative Black Hat SEO (BSEO) tactics, particularly the hidden injection of illicit content such as online gambling, significantly threaten the integrity of search engines. The issue is especially urgent in Indonesia, where attackers frequently compromise government domains (.go.id). By September 2023, over 9,000 such sites had been infiltrated using stealthy defacement and semantic confusion, highlighting a gap in existing detection systems, which rely on single-dimensional features or ignore real-world class imbalance. To address this, we propose an ensemble-learning-based detection system that combines Random Forest (RF) and Support Vector Machine (SVM) classifiers, supported by multi-dimensional feature engineering over URLs, meta tags, hidden CSS/HTML elements, and high-risk keywords (e.g., “slot”, “judi”). Our manually annotated dataset comprises 582 .go.id URLs with a natural 4:1 class imbalance, mitigated via Random Oversampling during training. Evaluation on a balanced test set (146 samples) shows 93.8% ensemble accuracy, 99.6% AUC-ROC, and, most critically, 100% recall for the Black Hat class, i.e., no false negatives on the test set. The system also incorporates an internal “override logic” that flags evasion tactics such as cloaking or hidden keyword injection, enhancing interpretability. Unlike deep learning alternatives that require large datasets and substantial computational resources, our approach balances performance, efficiency, and transparency, making it suitable for deployment by national cybersecurity agencies. This work advances both academic research and practical defense capabilities against sophisticated BSEO threats targeting public-sector websites.
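To make the multi-dimensional features and the “override logic” more concrete, the following is a minimal Python sketch and not the implementation evaluated in this work. The function names `extract_features` and `override_flag`, the regular expressions, and the two rules are illustrative assumptions; only the keywords “slot” and “judi” come from the text. In the described system, features of this kind would be passed to the RF and SVM classifiers, which the sketch omits.

```python
import re

# Only "slot" and "judi" are taken from the text; a real list would be longer.
HIGH_RISK_KEYWORDS = ("slot", "judi")

# Assumed inline-style patterns that hide content from human visitors.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0",
    re.IGNORECASE,
)
# Any HTML tag, used to strip markup and keep text.
TAG = re.compile(r"<[^>]+>")
# An element with a double-quoted inline style attribute (simplified; no nesting).
STYLED_BLOCK = re.compile(
    r'<(\w+)[^>]*style\s*=\s*"([^"]*)"[^>]*>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)
# The content attribute of a <meta> tag.
META_CONTENT = re.compile(r'<meta[^>]*content\s*=\s*"([^"]*)"', re.IGNORECASE)


def count_keywords(text):
    """Count whole-word occurrences of the high-risk keywords."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(keyword) + r"\b", lowered))
        for keyword in HIGH_RISK_KEYWORDS
    )


def extract_features(url, html):
    """Build a small feature dictionary from a URL and its raw HTML."""
    hidden_text = " ".join(
        TAG.sub(" ", body)
        for _tag, style, body in STYLED_BLOCK.findall(html)
        if HIDDEN_STYLE.search(style)
    )
    meta_text = " ".join(META_CONTENT.findall(html))
    page_text = TAG.sub(" ", html)  # all element text, hidden or not
    return {
        "url_keywords": count_keywords(url),
        "meta_keywords": count_keywords(meta_text),
        "hidden_keywords": count_keywords(hidden_text),
        "page_keywords": count_keywords(page_text),
    }


def override_flag(features):
    """Hypothetical override rules: return a reason string, or None."""
    if features["hidden_keywords"] > 0:
        return "hidden keyword injection"
    if features["meta_keywords"] > 0 and features["page_keywords"] == 0:
        return "meta-only keywords (possible cloaking)"
    return None


sample_html = (
    '<html><head><meta name="description" content="Situs judi slot online"></head>'
    "<body><h1>Dinas Kesehatan</h1>"
    '<div style="display:none"><a href="/x">slot gacor judi</a></div>'
    "</body></html>"
)
features = extract_features("https://example.go.id/berita", sample_html)
reason = override_flag(features)
```

Returning a human-readable reason alongside the classifier output is one way such a rule layer could support interpretability: an analyst sees which evasion pattern fired, not just a score. A production system would likely use a real HTML parser rather than regular expressions.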
Copyrights © 2025