This study presents a systematic literature review (SLR) of automatic cyberbullying detection across multiple media modalities, including text, images, and videos, covering work published between 2020 and 2025. Unlike previous SLRs that focused only on textual or unimodal data, this review provides a comprehensive synthesis of multimodal approaches that integrate linguistic, visual, and audiovisual cues. Following the PRISMA framework, 4,272 records were screened, yielding 120 studies for full analysis. The findings reveal a sharp increase in publications in 2025, driven by advances in large language models (LLMs), multimodal transformers, and heightened global attention to online safety. Quantitatively, 69% of the studies focused on text-based detection, 21% on multimodal (text–image) detection, and 10% on video-based approaches. Within NLP pipelines, CNN, SVM, BERT, and LSTM remain the most commonly used models, while emerging hybrid frameworks (e.g., ResNet–BiLSTM) show promising performance. Previous studies were often limited by the absence of real-time detection capabilities, unaddressed fairness concerns, and a lack of explainable AI. This SLR addresses those gaps by synthesizing methodological trends, highlighting ethical challenges, and identifying opportunities for the future integration of explainable and human-centered AI. Practically, the study provides a structured reference for researchers, policymakers, and social media platforms seeking to design fair, transparent, and adaptive cyberbullying detection systems.