The Indonesian Journal of Computer Science
Vol. 14 No. 6 (2025): The Indonesian Journal of Computer Science

Unsupervised Clustering of Vietnamese Positive and Negative News Using PhoBERT and DBSCAN

Dinh, Long
Le, An



Article Info

Publish Date
30 Dec 2025

Abstract

The proliferation of digital media has made detecting and analyzing sentiment trends in Vietnamese news content increasingly important. This paper proposes an unsupervised learning approach for clustering Vietnamese news articles into positive and negative sentiment categories. The model combines headline and content features encoded with PhoBERT, a Vietnamese-optimized language model, and clusters them with DBSCAN. Headlines are encoded with PhoBERT-base (768 dimensions) and content with PhoBERT-large (1024 dimensions); the two embeddings are concatenated and reduced to 64 dimensions via UMAP before clustering. KeyPhoBERT extracts representative keywords to enhance interpretability. Evaluated on 1,180 manually annotated articles from university social media (inter-annotator agreement: Cohen's kappa of 0.83), the model achieves an F1-score of 94.37%, an Adjusted Rand Index of 0.87, and a Normalized Mutual Information of 0.81. Comparison with a BERTopic baseline demonstrates the effectiveness of our approach for Vietnamese sentiment clustering without requiring labeled training data.
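
The clustering and scoring steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the tiny 2-dimensional vectors are made-up stand-ins for the PhoBERT headline and content embeddings, the UMAP reduction is skipped entirely, and the `eps` and `min_samples` values, the helper names, and the choice to leave noise points out of scoring are all assumptions made for illustration.

```python
import math
from collections import Counter


def concat_features(headline_vecs, content_vecs):
    """Join each article's headline vector and content vector into one vector."""
    return [h + c for h, c in zip(headline_vecs, content_vecs)]


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    # eps-neighbourhood of every point (a point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the pair-counting contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(math.comb(c, 2) for c in cells.values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / math.comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy vectors: 4 "positive" articles, 4 "negative", 1 outlier.
headline_vecs = [
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [0.1, 0.1],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0], [5.1, 5.1],
    [20.0, -20.0],
]
content_vecs = [
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1],
    [-3.0, -3.0], [-3.1, -3.0], [-3.0, -3.1], [-3.1, -3.1],
    [9.0, 9.0],
]
gold = ["pos"] * 4 + ["neg"] * 4 + ["neg"]

features = concat_features(headline_vecs, content_vecs)
labels = dbscan(features, eps=0.5, min_samples=3)

# Assumption: noise points (-1) are excluded from scoring; the abstract does
# not state how the paper treats them.
kept = [i for i, lab in enumerate(labels) if lab != -1]
ari = adjusted_rand_index([gold[i] for i in kept], [labels[i] for i in kept])

print(labels)
print(ari)
```

On this toy input the two tight groups form clusters 0 and 1, the outlier is marked as noise (-1), and the ARI over the clustered points is 1.0. With real embeddings, the neighbourhood radius and minimum cluster size would need tuning, and a library implementation would be the more practical choice.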

Copyrights © 2025






Journal Info

Abbrev

ijcs

Publisher

Subject

Computer Science & IT; Electrical & Electronics Engineering; Engineering

Description

The Indonesian Journal of Computer Science (IJCS) is a bimonthly peer-reviewed journal published by AI Society and STMIK Indonesia. IJCS issues are published at the end of February, April, June, August, October, and December. The scope of IJCS includes general computer science, information ...