JURNAL MEDIA INFORMATIKA BUDIDARMA
Vol 8, No 3 (2024): July 2024

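Analisis Perbandingan Metode Similarity untuk Kemiripan Dokumen Bahasa Indonesia pada Deteksi Kemiripan Teks Bahasa Indonesia [Comparative Analysis of Similarity Methods for Indonesian Document Similarity in Indonesian Text Similarity Detection]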

Pawestri, Sheraton
Suyanto, Yohanes



Article Info

Publish Date
26 Jul 2024

Abstract

Easy access to information brings diverse benefits, including the ability to develop models that detect similarities between documents, plagiarism-checking systems, automatic summarization, and classification. These benefits make similarity detection between documents an important research area. However, studies on similarity detection specifically for Indonesian-language documents are still relatively few, and their performance can still be improved. Therefore, this research conducts a comparative analysis of the performance of Doc2Vec against the Jaccard Coefficient, Cosine Similarity, and Euclidean Distance in detecting the similarity of documents with Indonesian text. Three datasets are used in this analysis: the first consists of 200 news articles from Google News, the second comes from IndoNLU, and the third from TaPaCo. The findings show that, on average, Cosine Similarity performs better than the Jaccard Coefficient and Euclidean Distance. The best performance was an accuracy of 0.98, precision of 0.84, recall of 0.95, and F1 score of 0.89, with the model built in 10.56 seconds, obtained using the Cosine Similarity algorithm on the Google News dataset. This is because Doc2Vec is better suited to datasets with higher dimensions than to datasets that contain only a few words.
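To make the three classical measures named in the abstract concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the abstract does not describe their preprocessing, vectorization, or decision thresholds, so the whitespace tokenizer, the raw term-count vectors, and the two example sentences are all assumptions made purely for illustration. Doc2Vec is omitted because it requires a trained embedding model. The final helper checks that the reported F1 score is consistent with the reported precision and recall.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive, assumed tokenizer)."""
    return text.lower().split()


def jaccard_coefficient(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def term_vectors(tokens_a, tokens_b):
    """Build raw term-count vectors over the joint vocabulary (assumed representation)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[t] for t in vocab], [counts_b[t] for t in vocab]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(vec_a, vec_b):
    """Straight-line distance between two vectors; 0.0 means identical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences, not taken from any of the paper's datasets.
doc_a = tokenize("saya suka membaca buku")
doc_b = tokenize("saya suka menulis buku")
vec_a, vec_b = term_vectors(doc_a, doc_b)

print(jaccard_coefficient(doc_a, doc_b))            # 0.6
print(cosine_similarity(vec_a, vec_b))              # 0.75
print(round(euclidean_distance(vec_a, vec_b), 4))   # 1.4142
print(round(f1_score(0.84, 0.95), 2))               # 0.89, matching the abstract
```

Note that Jaccard Coefficient and Cosine Similarity grow with similarity, while Euclidean Distance shrinks, so any comparison across the three needs a per-measure threshold or a conversion from distance to similarity. How the study handled this is not stated in the abstract.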

Copyrights © 2024






Journal Info

Abbrev

mib

Publisher

Subject

Computer Science & IT; Control & Systems Engineering; Electrical & Electronics Engineering

Description

Decision Support System, Expert System, Informatics Technique, Information System, Cryptography, Networking, Security, Computer Science, Image Processing, Artificial Intelligence, Steganography etc (related to informatics and computer ...