Enhancing Arabic Extractive Summarization with TF-IDF-Weighted AraBERT Sentence Embeddings and Semantic Clustering
R. Naji, Wadeea; Suresha; Fahd A. Ghanem
The Indonesian Journal of Computer Science Vol. 14 No. 5 (2025): The Indonesian Journal of Computer Science
Publisher : AI Society & STMIK Indonesia

DOI: 10.33022/ijcs.v14i5.4999

Abstract

The growing volume of textual content across digital platforms, including social media, news, and education, has made it difficult for users to extract useful information efficiently. Automatic Text Summarization (ATS) has therefore become an essential tool for distilling large amounts of information while preserving the core ideas. Progress in Arabic ATS remains limited by the scarcity of annotated datasets, the lack of Arabic-specific NLP tools, and the high computational cost of large language models (LLMs). In addition, traditional methods often fail to capture sentence-level semantics, which limits summary quality. To address this, we propose a scalable, unsupervised framework that uses TF-IDF-weighted AraBERT embeddings to generate rich sentence representations. To capture document structure, sentences are grouped using k-means clustering. From each cluster, we select the most representative sentences by their similarity to the cluster centroid, then apply Maximal Marginal Relevance (MMR) as a post-processing redundancy filter that removes sentences too similar to those already selected. Experimental evaluation on the EASC dataset shows that our weighted AraBERT model outperforms traditional embedding techniques such as FastText and unweighted AraBERT, achieving significant improvements across multiple ROUGE metrics.
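
The two core steps named in this abstract, TF-IDF-weighted pooling of token vectors and MMR-based redundancy filtering, can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions: toy 2-D static vectors stand in for AraBERT's contextual token embeddings, each sentence is treated as one document when computing IDF, the k-means step is omitted so the document centroid serves as the MMR query, and the trade-off value `lam=0.5` is arbitrary. All function and variable names are hypothetical.

```python
import math
from collections import Counter

def idf_weights(tokenized_sentences):
    """Smoothed IDF per token, treating each sentence as one document (assumption)."""
    n = len(tokenized_sentences)
    df = Counter(tok for sent in tokenized_sentences for tok in set(sent))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}

def weighted_sentence_vector(tokens, token_vectors, idf):
    """TF-IDF-weighted average of the token vectors of one sentence."""
    tf = Counter(tokens)
    dim = len(next(iter(token_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        w = (count / len(tokens)) * idf[tok]
        weight_sum += w
        for i, x in enumerate(token_vectors[tok]):
            total[i] += w * x
    return [x / weight_sum for x in total]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(vectors, query, k, lam=0.5):
    """Greedy MMR: balance relevance to `query` against similarity to picks so far."""
    selected = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected),
                             default=0.0)
            return lam * cosine(vectors[i], query) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data (hypothetical): two near-duplicate sentences and one distinct sentence.
sentences = [["economy", "grows"], ["economy", "grows", "fast"], ["team", "wins"]]
token_vectors = {  # stand-ins for AraBERT token embeddings
    "economy": [1.0, 0.0], "grows": [0.9, 0.1], "fast": [0.8, 0.2],
    "team": [0.0, 1.0], "wins": [0.1, 0.9],
}

idf = idf_weights(sentences)
sentence_vectors = [weighted_sentence_vector(s, token_vectors, idf) for s in sentences]
centroid = [sum(col) / len(sentence_vectors) for col in zip(*sentence_vectors)]
selected = mmr_select(sentence_vectors, centroid, k=2)
print(selected)
```

With this toy input, MMR picks one of the two near-duplicate sentences first and then prefers the distinct third sentence over the remaining duplicate. In the framework the abstract describes, the same selection would run per k-means cluster, with the cluster centroid as the query.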

IPTSS: Intelligent Preprocessing and Multi-Representation Analysis for Social Media Text Summarization with Clustering-Based Enhancement
A. Ghanem, Fahd; C. Padma, M.; R. Naji, Wadeea
The Indonesian Journal of Computer Science Vol. 15 No. 1 (2026): The Indonesian Journal of Computer Science
Publisher : AI Society & STMIK Indonesia

DOI: 10.33022/ijcs.v15i1.5086

Abstract

Social media platforms generate massive volumes of noisy, informal short texts, creating significant challenges for automatic text summarization. This paper presents IPTSS (Intelligent Preprocessing and Transformation System for Social Media Summarization), a unified framework that integrates intelligent preprocessing, multi-representation text modeling, and clustering-based extractive summarization into a single end-to-end pipeline. IPTSS incorporates (1) a four-stage intelligent preprocessing pipeline for redundancy elimination, platform-noise removal, out-of-vocabulary normalization, and linguistic standardization; (2) a multi-representation analysis layer spanning statistical, distributional, and transformer-based models; and (3) a hybrid TF-IDF-weighted BERT representation that fuses corpus-specific lexical importance with contextual semantic information. Summarization is performed through clustering-based representative selection with redundancy control to ensure topical diversity and coverage. Extensive experiments on large-scale datasets collected from X (formerly Twitter) across the Monkeypox, COVID-19 Vaccine, and Climate Change domains show that preprocessing alone yields a 25.8% improvement in ROUGE-1, while moving from Bag-of-Words to Sentence-BERT representations produces a 38.4% gain. The proposed hybrid representation improves performance by a further 7.0% over the best single-representation baseline, achieving the highest scores across all ROUGE metrics. The optimal configuration (Fuzzy C-Means + IPTSS Hybrid) reaches ROUGE-1 = 0.528, outperforming state-of-the-art statistical, graph-based, crisis-specific, neural, and optimization-based methods. Cross-dataset validation confirms strong generalizability, with low performance variance (CV ≈ 2.5%) across heterogeneous domains without dataset-specific tuning.
These results demonstrate that effective social media summarization is driven primarily by preprocessing quality and hybrid representation design rather than algorithmic complexity alone, establishing IPTSS as a robust, scalable, and generalizable framework for large-scale social media extractive summarization.
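
For readers unfamiliar with the two evaluation quantities quoted in this abstract, the sketch below shows how a ROUGE-1 score and a cross-dataset variation figure could be computed. It makes two assumptions: that ROUGE-1 means clipped unigram overlap (the abstract does not say whether recall or F1 is reported, so the function returns precision, recall, and F1), and that "CV" denotes the coefficient of variation, i.e. standard deviation divided by mean. The token lists and scores are illustrative toy values, not results from the paper, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def rouge1(candidate_tokens, reference_tokens):
    """Clipped unigram overlap between a candidate and a reference summary."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())  # per-token minimum counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

def coefficient_of_variation(scores):
    """Population standard deviation divided by the mean."""
    return pstdev(scores) / mean(scores)

# Toy example (not data from the paper).
candidate = "vaccine rollout starts today".split()
reference = "vaccine rollout starts next week".split()
precision, recall, f1 = rouge1(candidate, reference)
print(precision, recall, f1)

# Hypothetical per-domain ROUGE-1 scores, used only to show the calculation.
toy_scores = [0.50, 0.52, 0.54]
cv = coefficient_of_variation(toy_scores)
print(f"CV = {cv:.1%}")
```

A low coefficient of variation means the per-domain scores sit close to their mean, which is the sense in which the abstract reports stable performance across domains.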