Seminar Nasional Aplikasi Teknologi Informasi (SNATI)
2007

Comparison of Word and Phrase Features in the Clustering Performance of Indonesian-Language Text Documents

Amir Hamzah
Adhi Susanto
F. Soesianto
Jazi Eko Istyanto



Article Info

Publish Date
03 Nov 2009

Abstract

Text document clustering has been studied intensively because of its important role in text mining and information retrieval. Word-based clustering techniques that use the vector space model always suffer from high dimensionality caused by the large number of words. Although extracting words in the preprocessing phase is simple, a collection can be viewed not only as a set of words but also as a set that partly consists of multi-word phrases. Separating a phrase into its parts can destroy the actual meaning of the phrase. Therefore, to maintain the context of its words, a phrase must be kept as a phrase. It is assumed that adding phrases to words as clustering features will improve performance. This paper compares word-based and phrase-based clustering. Three clustering models were chosen: hierarchical, partitional, and hybrid. Four similarity techniques (GroupAverage, CompleteLink, SingleLink, and ClusterCenter) were tried for the hierarchical model, K-Means and Bisecting K-Means for the partitional model, and Buckshot for the hybrid model. Collections of 200 to 800 manually categorized news texts were used to test these algorithms, with the F-measure as the criterion of clustering performance. This value is derived from recall and precision and measures how well the algorithms classify the collections correctly. The results show that adding phrases, or simply word pairs, slightly improves clustering performance, although the improvement is not statistically significant.

Keywords: word-based document clustering, phrase-based document clustering, clustering performance
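The two ideas in the abstract, adding word pairs to the word features and scoring a clustering with an F-measure built from recall and precision, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names are hypothetical, tokenization is reduced to lowercase whitespace splitting, and the F-measure uses a commonly used class-weighted best-match definition, which is an assumption because the abstract does not give the exact formula.

```python
from collections import Counter, defaultdict


def word_and_pair_features(text):
    """Count single words plus adjacent word pairs (a simple stand-in for phrases).

    Assumption: lowercase whitespace tokenization, with no stemming or stopword removal.
    """
    words = text.lower().split()
    pairs = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return Counter(words + pairs)


def clustering_f_measure(labels, clusters):
    """Class-weighted best-match F-measure of a clustering against manual categories.

    Assumption: for class i and cluster j, precision = n_ij / n_j and
    recall = n_ij / n_i; each class takes its best F over all clusters,
    and the classes are averaged with weights n_i / n.
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(labels, clusters))  # n_ij for every non-empty pair

    best = defaultdict(float)
    for (cls, clu), n_ij in overlap.items():
        precision = n_ij / cluster_sizes[clu]
        recall = n_ij / class_sizes[cls]
        f = 2 * precision * recall / (precision + recall)
        best[cls] = max(best[cls], f)

    return sum(class_sizes[c] / n * best[c] for c in class_sizes)


if __name__ == "__main__":
    print(word_and_pair_features("harga minyak naik"))
    print(clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 1]))
```

In this sketch a word pair such as "harga_minyak" becomes one extra feature next to its component words, so the feature space grows while some phrase context is preserved. A perfect clustering scores 1.0, and lower values indicate a mismatch with the manual categories.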

Copyrights © 2007