Extractive text summarization is a fundamental approach to tackling information overload, yet its quality depends heavily on the pre-processing stage. Despite this crucial role, there is no consensus on the optimal pre-processing scenario for the Indonesian language, which has a complex morphological structure. This study aims to fill that research gap by systematically analyzing the impact of seven pre-processing scenarios on four summarization methods: three graph-based methods (LexRank, TextRank, DivRank) and one topic-relevance method (Cosine Similarity against the title). Using a corpus of 3,000 Indonesian news articles and ROUGE evaluation metrics, the results yield two key findings. First, the Cosine Similarity method significantly outperforms all graph-based methods, achieving the highest F1-Measure scores on ROUGE-1 (0.5073), ROUGE-2 (0.4018), and ROUGE-L (0.4574), underscoring the important role of the title in news texts. Second, a comprehensive pre-processing scenario comprising Case Folding, Punctuation Removal, Tokenization, Normalization, Negation Handling, Stopword Removal, and Stemming proves the most effective at improving the performance of all algorithms. These findings provide empirical evidence and a practical recommendation: combining a title-relevance approach with proper text normalization is the most effective strategy for optimizing extractive text summarization in Indonesian.
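The title-relevance approach described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: each sentence is scored by the cosine similarity of its term-frequency vector against the title's, after a simplified pre-processing pipeline (case folding, punctuation removal, tokenization, stopword removal). The stopword list and example text are hypothetical samples, and Indonesian stemming (e.g. via a library such as Sastrawi) is omitted for brevity.

```python
import math
import re

# Tiny illustrative sample of Indonesian stopwords (hypothetical subset)
STOPWORDS = {"di", "ke", "dan", "yang", "untuk", "pada", "dengan"}

def preprocess(text):
    # Case folding, punctuation removal, tokenization, stopword removal
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return [t for t in text.split() if t not in STOPWORDS]

def tf_vector(tokens):
    # Raw term-frequency vector as a dict
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(title, sentences, k=1):
    # Rank sentences by similarity to the title; keep the top k,
    # restoring original document order in the output summary
    tvec = tf_vector(preprocess(title))
    scored = [(cosine(tf_vector(preprocess(s)), tvec), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:k]
    return [s for _, _, s in sorted(top, key=lambda x: x[1])]
```

A graph-based method such as LexRank would instead build a sentence-similarity graph from these same cosine scores and rank sentences by centrality; the title-relevance variant shown here replaces that global ranking with a single comparison against the title.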
Copyright © 2025