Garuda - Garba Rujukan Digital

Jurnal Technoscientia

Technoscientia Vol 1 No 2 February 2009

Amir Hamzah (Teknik Informatika, IST AKPRIND Yogyakarta)

Publish Date
01 Feb 2009

Text document clustering has been studied intensively because of its important role in text mining and information retrieval. Word-based clustering techniques that use the vector space model always suffer from high dimensionality caused by the large number of words. Although extracting words in the preprocessing phase is simple, a collection can be viewed not only as a set of words but also as a set of phrases, some of which consist of more than one word. Splitting a phrase into its component words can destroy its actual meaning, so to preserve the context of the words a phrase must be kept as a phrase. It is assumed that adding phrases to words as clustering features will improve performance. This paper compares word-based and phrase-based clustering. Two clustering models were chosen: hierarchical and partitional. Four similarity techniques (Group Average, Complete Link, Single Link, and Cluster Center) were tried for hierarchical clustering, and K-Means, Bisecting K-Means, and Buckshot were tried for partitional clustering. A document collection of 200-800 manually categorized news texts was used to test these algorithms, with the F-measure as the criterion of clustering performance. This value is derived from Recall and Precision and measures how well the algorithms classify the collection correctly. Results show that adding phrases, or simply word pairs, slightly improves clustering performance, although the improvement is not statistically significant.
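The abstract does not state which variant of the F-measure was used. As an illustration only, the sketch below implements one common definition for flat clusterings: for each manual category, take the best F-score over all clusters, then average those scores weighted by category size. The function name `clustering_f_measure` and its arguments are hypothetical and are not taken from the paper.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted best-match F-measure (assumed variant, not the paper's)."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("at least one document is required")

    class_sizes = Counter(true_labels)               # n_i: documents per category
    cluster_sizes = Counter(cluster_ids)             # n_j: documents per cluster
    overlap = Counter(zip(true_labels, cluster_ids)) # n_ij: shared documents

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue  # no shared documents, so F is 0 for this pair
            precision = n_ij / n_j
            recall = n_ij / n_i
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
        # Weight each category's best score by its share of the collection.
        total += (n_i / n) * best
    return total
```

Under this definition a clustering that reproduces the manual categories exactly scores 1.0, while putting every document into a single cluster scores lower because precision drops for every category.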
