This study aims to develop a system for detecting similarities in thesis titles and content to prevent plagiarism and support student originality. The high level of similarity in final projects is a significant concern in academic environments. Two text vectorization methods, Latent Semantic Analysis (LSA) and Doc2Vec, were compared to measure document similarity. Results showed that LSA achieved a very high cosine similarity (99.94%) due to dimensionality reduction that preserved semantic correlations. In contrast, Doc2Vec produced lower similarity scores, with 7.17% for PV-DM and 39.07% for PV-DBOW, indicating richer text representations. This study adopted the CRISP-DM model, which includes Business Understanding, Data Understanding, Data Preparation, Modelling, and Evaluation. The model is expected to strengthen academic integrity and encourage valuable scientific contributions.
Copyrights © 2024