JUTEI (Jurnal Terapan Teknologi Informasi)
Vol 6 No 2 (2022): Jurnal Terapan Teknologi Informasi

Penerapan Simhash dan Hamming distance dalam Deteksi kemiripan Teks Berita (Application of Simhash and Hamming Distance in Detecting News Text Similarity)

Mayesti Anggelina (Informatika, Universitas Kristen Duta Wacana)
Lucia Dwi Krisnawati (Informatika, Universitas Kristen Duta Wacana)
Danny Sebastian (Informatika, Universitas Kristen Duta Wacana)



Article Info

Publish Date
31 Oct 2022

Abstract

Text reuse is the reuse of existing written sources to create a new text. The degree of reuse ranges from duplicate and near-duplicate to topically similar text. Although some forms of text reuse are acceptable, their existence makes searching inefficient and wastes storage. To overcome this problem, a textual similarity detection system is needed. This study focuses on detecting text similarity by applying the Simhash algorithm, which is used to create document fingerprints. These fingerprints serve as document features through which the degree of text similarity can be compared. The similarity of a suspicious text to the source documents is then measured by Hamming distance. Focusing on duplicate and near-duplicate detection, the experiments show that the recall of duplicate detection reaches 80%, meaning that the system is capable of retrieving most of the duplicate sources of a suspicious document.
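
The fingerprint-and-compare idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the tokenization, hash function, feature weighting, or fingerprint length used in the study, so everything below is an assumption made for illustration: lowercase word unigrams, unweighted features, MD5 truncated to 64 bits, and the helper names `tokenize`, `token_hash`, `simhash`, and `hamming_distance` are hypothetical.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint length; the abstract does not specify one


def tokenize(text):
    """Split text into lowercase word tokens (assumed feature set)."""
    return re.findall(r"\w+", text.lower())


def token_hash(token):
    """Map a token to a 64-bit integer using the first 8 bytes of MD5 (assumed hash)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text):
    """Build a Simhash fingerprint: each token votes +1 or -1 on every bit position."""
    weights = [0] * BITS
    for token in tokenize(text):
        h = token_hash(token)
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set when the votes for it are positive.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Under these assumptions, identical texts produce identical fingerprints and therefore a Hamming distance of 0, while near-duplicates are expected to differ in only a few bit positions. A detector would compare the distance between a suspicious document and each source against a threshold; the threshold used in the study is not given in the abstract.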

Copyrights © 2022






Journal Info

Abbrev

jurnal

Publisher

Subject

Computer Science & IT

Description

Jurnal Terapan Teknologi Informasi (JUTEI) is a journal focusing on the theory, practice, and methodology of all aspects of Information Technology and Computer Science, as well as productive and innovative ideas related to new technology and applied sciences. This journal is managed by the Faculty of ...