Journal of Dinda : Data Science, Information Technology, and Data Analytics
Vol 4 No 2 (2024): August

Document Similarity Using Term Frequency-Inverse Document Frequency Representation and Cosine Similarity

Adi Widianto (Universitas Pertiba)
Eka Pebriyanto (Universitas Pertiba)
Fitriyanti Fitriyanti (Unknown)
Marna Marna (Unknown)



Article Info

Publish Date
12 Aug 2024

Abstract

Document similarity is a fundamental task in natural language processing and information retrieval, with applications ranging from plagiarism detection to recommendation systems. In this study, we use term frequency-inverse document frequency (TF-IDF) weighting to represent documents in a high-dimensional vector space, capturing their distinctive content while reducing the influence of common terms. We then measure the similarity between pairs of documents with cosine similarity, which is the cosine of the angle between their TF-IDF vectors. To evaluate the approach, we conducted experiments on the Document Similarity Triplets Dataset, a benchmark designed for assessing document similarity techniques. The approach achieved an accuracy of 93.6% using a bigram-only representation. However, false predictions occurred when paired documents shared similar terms but differed in meaning, which reveals a weakness of the TF-IDF approach. To address this limitation, future research could augment document representations with semantic features such as word embeddings or contextual embeddings. Such features could help the model capture nuanced semantic relationships between documents and improve accuracy in cases where term overlap does not adequately indicate similarity.
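The pipeline described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation. Several details are assumptions made for illustration: the whitespace tokenizer, the smoothed IDF formula ln((1 + N) / (1 + df)) + 1, the toy triplets, and the scoring rule, under which a triplet (anchor, similar document, dissimilar document) counts as correct when the anchor's cosine similarity to the similar document exceeds its similarity to the dissimilar one. All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Lowercase, split on whitespace, and return the word n-grams (bigrams by default)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fit_idf(docs, n=2):
    """Smoothed IDF per n-gram: ln((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc, n)))  # count each term once per document
    total = len(docs)
    return {term: math.log((1 + total) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(doc, idf, n=2):
    """Sparse TF-IDF vector as a dict; terms missing from the IDF table are dropped."""
    tf = Counter(ngrams(doc, n))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def triplet_accuracy(triplets, n=2):
    """Fraction of triplets where the anchor is closer to the similar document."""
    corpus = [doc for triplet in triplets for doc in triplet]
    idf = fit_idf(corpus, n)
    correct = 0
    for anchor, similar, dissimilar in triplets:
        a, s, d = (tfidf_vector(doc, idf, n) for doc in (anchor, similar, dissimilar))
        if cosine(a, s) > cosine(a, d):
            correct += 1
    return correct / len(triplets)


# Toy data invented for illustration; not taken from the benchmark dataset.
toy_triplets = [
    ("the cat sat on the mat", "the cat sat on the sofa", "stock prices fell sharply today"),
    ("stock prices rose sharply today", "stock prices fell sharply today", "the dog sat on the mat"),
]

print(triplet_accuracy(toy_triplets))  # prints 1.0 on this toy data
```

The second toy triplet also hints at the weakness noted in the abstract: "rose" and "fell" have opposite meanings, yet the two sentences share most of their bigrams, so a purely term-based representation scores them as similar.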

Copyrights © 2024






Journal Info

Abbrev

dinda

Publisher

Subject

Computer Science & IT

Description

Journal of Dinda : Data Science, Information Technology, and Data Analytics serves as a publication medium for research results in the fields of Data Science, Information Technology, and Data Analytics, but is not limited to these fields. It is published twice a year, in February and August. The journal is managed by ...