ABSTRAK: Penelitian ini mengkaji efektivitas algoritma Naive Bayes dan Support Vector Machine dalam mengklasifikasikan ujaran kebencian dan teks kasar berbahasa Indonesia, dengan tujuan memahami keunggulan dan pola kegagalan setiap model. Metode penelitian menggunakan dataset tweet beranotasi yang melalui pra-pemrosesan khusus dan ekstraksi fitur TF-IDF. Kinerja model Multinomial Naive Bayes (MNB) dan Linear Support Vector Classifier (LinearSVC) dievaluasi secara ketat menggunakan holdout test set, validasi silang, dan uji signifikansi statistik. Hasil penelitian mengonfirmasi superioritas signifikan LinearSVC atas MNB, dengan selisih akurasi 8,7%. Analisis mendalam mengungkap bahwa meskipun model sangat akurat mengidentifikasi konten bersih (clean), tantangan utama terletak pada klasifikasi teks kasar (abusive) yang sering tertukar dengan ujaran kebencian (hate speech). Pola kesalahan ini menunjukkan batas kabur antara kedua kategori, mengindikasikan spektrum kontinu ketimbang kelas diskrit. Diskusi menegaskan bahwa keunggulan LinearSVC berasal dari kemampuannya menangani data berdimensi tinggi dan kompleksitas linguistik khas media sosial Indonesia. Temuan ini menyoroti perlunya pendekatan pemodelan yang lebih peka terhadap nuansa bahasa dan gradasi keparahan konten. Implikasi studi mendorong penyempurnaan skema anotasi data, eksplorasi metode seperti klasifikasi multi-label atau regresi ordinal, serta pengembangan sistem moderasi hibrid yang memprioritaskan area abu-abu antara teks kasar dan ujaran kebencian untuk keadilan dan akurasi yang lebih baik.
KATA KUNCI: klasifikasi teks; ujaran kebencian; bahasa kasar; Support Vector Machine; Naive Bayes; media sosial Indonesia.
COMPARATIVE ANALYSIS OF NAIVE BAYES AND SUPPORT VECTOR MACHINE ALGORITHMS IN CLASSIFYING HATE SPEECH AND ABUSIVE INDONESIAN TEXT
ABSTRACT: This study examines how effectively Naive Bayes and Support Vector Machine algorithms classify hate speech and abusive text in Indonesian, with the aim of understanding the strengths and failure patterns of each model. The method uses an annotated tweet dataset that undergoes specialized preprocessing and TF-IDF feature extraction. The Multinomial Naive Bayes (MNB) and Linear Support Vector Classifier (LinearSVC) models are evaluated using a holdout test set, cross-validation, and statistical significance testing. The results confirm that LinearSVC significantly outperforms MNB, with an accuracy difference of 8.7%. In-depth analysis reveals that although the models identify clean content very accurately, the main challenge lies in classifying abusive text, which is often confused with hate speech. This error pattern indicates a blurred boundary between the two categories and suggests a continuous spectrum rather than discrete classes. The discussion attributes LinearSVC's advantage to its ability to handle high-dimensional data and the linguistic complexity typical of Indonesian social media. These findings highlight the need for modeling approaches that are more sensitive to linguistic nuance and to gradations in content severity. The study's implications encourage refining data annotation schemes, exploring methods such as multi-label classification or ordinal regression, and developing hybrid moderation systems that prioritize the gray area between abusive text and hate speech for greater fairness and accuracy.
KEYWORDS: text classification; hate speech; abusive language; Support Vector Machine; Naive Bayes; Indonesian social media.
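The abstract mentions statistical significance testing on paired model predictions but does not name the test. As an illustration only, the sketch below assumes an exact McNemar test, a common choice for comparing two classifiers on the same holdout set; the function name `mcnemar_exact` and the ten toy labels are hypothetical and are not taken from the study's data or results.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions (assumed test).

    Returns (b, c, p_value), where
      b = items model A classifies correctly and model B misclassifies,
      c = items model B classifies correctly and model A misclassifies.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all three sequences must have the same length")

    b = c = 0
    for truth, a, s in zip(y_true, pred_a, pred_b):
        a_ok, s_ok = (a == truth), (s == truth)
        if a_ok and not s_ok:
            b += 1
        elif s_ok and not a_ok:
            c += 1

    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0

    # Under H0 each discordant pair favours either model with probability 0.5,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy labels: clean / abusive / hate (not the study's data).
y_true = ["clean", "clean", "abusive", "hate", "abusive",
          "hate", "clean", "abusive", "hate", "abusive"]
pred_mnb = ["clean", "clean", "hate", "hate", "hate",
            "abusive", "clean", "hate", "hate", "hate"]
pred_svc = ["clean", "clean", "abusive", "abusive", "abusive",
            "hate", "clean", "abusive", "hate", "abusive"]

b, c, p_value = mcnemar_exact(y_true, pred_mnb, pred_svc)
print(f"MNB-only correct: {b}, LinearSVC-only correct: {c}, p = {p_value:.5f}")
```

The test looks only at items on which the two models disagree, so it suits a single shared holdout set. With so few toy examples the p-value is far from significant, which is expected; a real comparison would use the full test set.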
Copyrights © 2026