Text document clustering has been studied intensively because of its important role in text mining and information retrieval. In the vector space model, two main clustering approaches are commonly used: hierarchical algorithms and partitional algorithms. A hierarchical algorithm produces a deterministic result known as a dendrogram, but its weakness is high time and memory complexity. A partitional algorithm, on the other hand, has linear time and memory complexity, although its clustering result depends on the initial clusters. The aim of this research was to compare experimentally the performance of several hierarchical and partitional techniques applied to text documents written in Bahasa Indonesia. Five similarity techniques, i.e. UPGMA, CSI, IST, SL and CL, were chosen from the hierarchical algorithms, whereas K-Means, Bisecting K-Means and Buckshot were chosen from the partitional ones. The test collections consisted of 200 to 800 Indonesian news texts that had been categorized manually, and the F-measure was used to evaluate clustering performance. This value is derived from Recall and Precision and measures how correctly an algorithm classifies the documents in a collection. The results showed that Bisecting K-Means, a variant of the partitional algorithm, performed comparably with the two best hierarchical techniques, i.e. UPGMA and CL, but with much lower time complexity.
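As an illustration of how an F-measure can be derived from Recall and Precision for a clustering result, the following is a minimal sketch. It assumes the class-weighted, best-match definition commonly used in document clustering evaluation; the abstract does not state the exact formula used in the study, and the function name `clustering_f_measure` and its arguments are hypothetical.

```python
from collections import Counter


def clustering_f_measure(labels_true, labels_pred):
    """Overall F-measure of a clustering against manual categories.

    Assumed definition (not confirmed by the abstract): for each manual
    category, take the best F value over all clusters, then average these
    best values weighted by category size.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    class_sizes = Counter(labels_true)      # documents per manual category
    cluster_sizes = Counter(labels_pred)    # documents per cluster
    overlap = Counter(zip(labels_true, labels_pred))  # shared documents

    best_f = {cls: 0.0 for cls in class_sizes}
    for (cls, clu), n_ij in overlap.items():
        recall = n_ij / class_sizes[cls]        # share of the category found
        precision = n_ij / cluster_sizes[clu]   # share of the cluster that fits
        f = 2 * precision * recall / (precision + recall)
        best_f[cls] = max(best_f[cls], f)

    # Weight each category's best F by its share of the collection.
    return sum(class_sizes[cls] / n * best_f[cls] for cls in class_sizes)


if __name__ == "__main__":
    manual = ["sport", "sport", "politics", "politics"]
    clusters = [0, 0, 0, 1]
    print(clustering_f_measure(manual, clusters))
```

A clustering that reproduces the manual categories exactly scores 1.0 under this definition, and the score drops as categories are split across clusters or merged into one.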
Copyright © 2007