The growing volume of textual content across digital platforms, including social media, news, and education, has made it difficult for users to extract useful information efficiently. Automatic Text Summarization (ATS) has therefore become an essential tool for distilling large amounts of information while preserving the core ideas. Progress in Arabic ATS remains limited due to the scarcity of annotated datasets, the lack of Arabic-specific NLP tools, and the high computational cost of large language models (LLMs). Additionally, traditional methods often fail to capture sentence-level semantics, limiting summary quality. To address this, we propose a scalable, unsupervised framework that uses TF-IDF-weighted AraBERT embeddings to generate rich sentence representations. To further capture document structure, sentences are grouped using k-means clustering. From each cluster, we identify the most representative sentences using centroid similarity and apply Maximal Marginal Relevance (MMR) as a post-processing redundancy-removal step to eliminate sentences that are too similar. Experimental evaluation on the EASC dataset demonstrates that our weighted AraBERT model outperforms traditional embedding techniques such as FastText and unweighted AraBERT, achieving significant improvements across multiple ROUGE metrics.
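To make the pipeline concrete, the sketch below walks through the four stages named above: TF-IDF-weighted AraBERT sentence embeddings, k-means clustering, centroid-based candidate selection, and MMR re-ranking. It is a minimal illustration, assuming AraBERT is loaded via Hugging Face transformers; the checkpoint name `aubmindlab/bert-base-arabertv2`, the cluster count, the MMR trade-off `lam`, and the unit fallback weight for tokens absent from the TF-IDF vocabulary are assumptions for illustration, not the paper's reported configuration.

```python
# Minimal sketch of the summarization pipeline, assuming AraBERT is loaded via
# Hugging Face transformers. The checkpoint name, cluster count, MMR lambda,
# and the IDF fallback for subword/special tokens are illustrative choices.
import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "aubmindlab/bert-base-arabertv2"  # assumed AraBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def tfidf_weighted_embeddings(sentences):
    """TF-IDF-weighted average of AraBERT token embeddings per sentence."""
    tfidf = TfidfVectorizer().fit(sentences)
    vocab, idf = tfidf.vocabulary_, tfidf.idf_
    vectors = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0].numpy()  # (tokens, dim)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        # subwords/special tokens missing from the TF-IDF vocab get weight 1.0
        w = np.array([idf[vocab[t]] if t in vocab else 1.0 for t in tokens])
        vectors.append((w[:, None] * hidden).sum(axis=0) / w.sum())
    return np.vstack(vectors)

def summarize(sentences, n_clusters=4, lam=0.7, budget=3):
    """Cluster sentences, take the centroid-nearest sentence per cluster,
    then re-rank the candidates with Maximal Marginal Relevance (MMR)."""
    X = tfidf_weighted_embeddings(sentences)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    # one candidate per cluster: the sentence closest to its centroid
    candidates = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        sims = cosine_similarity(X[idx], km.cluster_centers_[c:c + 1]).ravel()
        candidates.append(int(idx[np.argmax(sims)]))
    C = X[candidates]
    rel = cosine_similarity(C, X.mean(axis=0, keepdims=True)).ravel()
    sim_cc = cosine_similarity(C)
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < budget:
        def mmr(i):  # balance relevance to the document against redundancy
            red = sim_cc[i, selected].max() if selected else 0.0
            return lam * rel[i] - (1 - lam) * red
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    # return the chosen sentences in their original document order
    return [sentences[candidates[i]]
            for i in sorted(selected, key=lambda i: candidates[i])]
```

Calling `summarize(sentences)` on a list of pre-split Arabic sentences returns the selected summary sentences in document order; in practice, sentence splitting, cluster count, and the MMR budget would be tuned to the corpus.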