Digital technology has reshaped many fields, including education and the management of scientific documents. Easy access to online journals, however, also increases the potential for plagiarism, so an automated system that detects document similarity quickly and accurately is needed. This study develops a plagiarism detection system based on Cosine Similarity and Bidirectional Encoder Representations from Transformers (BERT). The research stages are text preprocessing, term weighting with Term Frequency–Inverse Document Frequency (TF-IDF), Cosine Similarity computation, BERT model training, and model performance evaluation. The results show that integrating BERT with TF-IDF outperforms BERT alone. In the 10:90 data split scenario, the BERT model with TF-IDF achieved the highest accuracy, 0.9621, with a precision of 0.8141, a recall of 0.7302, and an F1-score of 0.8022. By comparison, the BERT model without TF-IDF achieved an accuracy of only 0.8529. Applying Cosine Similarity with a threshold of 0.6 also proved effective in separating plagiarized from non-plagiarized documents. These findings indicate that combining BERT with TF-IDF improves the accuracy of plagiarism detection by capturing semantic context and term weights together.
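The TF-IDF weighting and thresholded Cosine Similarity steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, and the helper names (tokenize, tfidf_vectors, cosine_similarity, is_plagiarized) are assumptions chosen for illustration, and the BERT stage is omitted entirely. Only the 0.6 threshold comes from the abstract.

```python
import math
import re
from collections import Counter

THRESHOLD = 0.6  # cut-off value reported in the abstract


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption), so terms found in every document
    # still keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_plagiarized(source, suspect, corpus=()):
    """Flag the pair when its similarity reaches THRESHOLD.

    `corpus` holds optional extra documents used only to estimate IDF.
    """
    vecs = tfidf_vectors([source, suspect, *corpus])
    return cosine_similarity(vecs[0], vecs[1]) >= THRESHOLD
```

Under these assumptions, is_plagiarized(source_text, suspect_text) returns True when the two TF-IDF vectors have a cosine similarity of at least 0.6. Two documents with no shared terms score 0.0, and identical documents score approximately 1.0.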
Copyrights © 2025