Adi Widianto
Universitas Pertiba

Document Similarity Using Term Frequency-Inverse Document Frequency Representation and Cosine Similarity
Adi Widianto; Eka Pebriyanto; Fitriyanti Fitriyanti; Marna Marna
Indonesian Journal of Data Science, IoT, Machine Learning and Informatics Vol 4 No 2 (2024): August
Publisher : Research Group of Data Engineering, Faculty of Informatics

DOI: 10.20895/dinda.v4i2.1589

Abstract

Document similarity is a fundamental task in natural language processing and information retrieval, with applications ranging from plagiarism detection to recommendation systems. In this study, we use term frequency-inverse document frequency (TF-IDF) weighting to represent documents in a high-dimensional vector space, capturing their distinctive content while reducing the influence of common terms. We then use cosine similarity, which is based on the angle between two TF-IDF vectors, to measure the similarity between pairs of documents. To evaluate the effectiveness of our approach, we conducted experiments on the Document Similarity Triplets Dataset, a benchmark dataset specifically designed for assessing document similarity techniques. In our experiments, the approach achieved an accuracy of 93.6% using a bigram-only representation. However, we observed false predictions when paired documents shared similar terms but differed in meaning, revealing a weakness of the TF-IDF approach. To address this limitation, future research could augment document representations with semantic features. Incorporating semantic information, such as word embeddings or contextual embeddings, could improve the model's ability to capture nuanced semantic relationships between documents, and thereby improve accuracy in cases where term overlap does not adequately indicate similarity.
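The pipeline described in the abstract can be illustrated with a minimal sketch using only the Python standard library. Several details here are assumptions, since the abstract does not specify them: the whitespace tokenizer, the smoothed IDF formula, fitting IDF on the three documents of a single triplet, and the rule that a triplet counts as correct when the anchor is more similar to the positive document than to the negative one. The function names (`bigrams`, `tfidf_vectors`, `cosine_similarity`, `triplet_correct`) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def bigrams(text):
    """Lowercase, split on whitespace, and return adjacent word pairs."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    term_lists = [bigrams(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for terms in term_lists for term in set(terms))
    # Smoothed IDF; an assumed variant, the paper's exact formula is not given.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def triplet_correct(anchor, positive, negative):
    """Assumed scoring rule: the anchor should be closer to the positive."""
    a, p, n = tfidf_vectors([anchor, positive, negative])
    return cosine_similarity(a, p) > cosine_similarity(a, n)
```

Under these assumptions, accuracy over a set of triplets would be the fraction for which `triplet_correct` returns `True`. Because the vectors contain only surface bigrams, two documents that share wording but differ in meaning can still score as highly similar, which is the failure mode the abstract reports.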