This research addresses the challenge of analyzing large textual datasets along three dimensions: the prevalence of harmful language, the sentiment expressed, and the thematic content. The study follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) and its six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Toxicity analysis yields average toxicity levels ranging from 0.00404 to 0.03878, with maximum values up to 0.66151, indicating varying degrees of harmful language. Sentiment analysis finds that 60% of the sentiments expressed are positive, 30% are neutral, and 10% are negative. Topic modeling extracts twelve distinct themes, adding interpretive depth to the dataset. A support vector machine (SVM) used with the Synthetic Minority Over-sampling Technique (SMOTE) achieves an accuracy of 91.41% ± 1.66%, with 832 true negatives and 689 true positives, supporting the model's reliability. Based on these findings, stakeholders are advised to implement robust content moderation strategies that limit the spread of harmful language and foster a safer online environment, and to use the sentiment and topic insights for informed decision-making. This interdisciplinary approach strengthens data analysis capabilities and provides actionable insights for addressing societal challenges and advancing scholarly discourse.
Copyright © 2024