Knowledge extraction can be pursued through several approaches. Traditional approaches rely on lexical representations such as TF-IDF, which can be combined with classic similarity metrics such as cosine similarity and the Dice coefficient. In addition to this lexical approach, this study applies a more modern form of representation: contextual representation based on transformer embeddings. The data used in this study are abstracts of students' theses. The findings show that contextual embedding changes the behavior of similarity scores: the analysis revealed an average ranking shift of 6.70 positions, weak rank correlations (Spearman = 0.22; Kendall = 0.146), and high top-rank alignment as measured by NDCG (0.97), indicating structural differences in the similarity orderings produced by lexical and contextual representations. The large gap between the rankings produced by the two representations stems from their fundamentally different working mechanisms, so the choice of representation should be guided by the characteristics of the data to be processed. Given the character of the text in academic documents, contextual representation based on transformer embeddings is the more suitable choice, since its contextual understanding prevents word-variation substitutions from evading plagiarism detection when a semantic-based representation approach is applied.
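The lexical side of the comparison described above can be sketched in a few self-contained functions: TF-IDF weighting, the cosine and Dice similarity metrics, and the Spearman rank correlation used to compare two similarity rankings. This is a minimal illustrative sketch; the toy documents and all function names are assumptions for demonstration, not the study's actual data or code, and Kendall's tau and NDCG are computed analogously in practice.

```python
# Illustrative sketch (not the study's code): TF-IDF vectors, cosine and
# Dice similarity, and Spearman rank correlation between two rankings.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (term -> weight) for each tokenized doc."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def dice(a, b):
    """Dice coefficient on the term sets of two tokenized docs."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def spearman(rank_a, rank_b):
    """Spearman rho from two rankings (permutations of the same items)."""
    n = len(rank_a)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    d2 = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy corpus standing in for thesis abstracts (assumed data).
docs = [
    "plagiarism detection in thesis abstracts".split(),
    "semantic similarity of academic documents".split(),
    "tf idf lexical representation of text".split(),
    "transformer embeddings capture context".split(),
]
vecs = tfidf_vectors(docs)
# Rank the remaining documents by lexical similarity to document 0; the
# contextual ranking from transformer embeddings would be compared to this
# ordering via spearman() above.
ranking = sorted(range(1, len(docs)),
                 key=lambda i: cosine(vecs[0], vecs[i]), reverse=True)
```

In the study's setting, one such ranking would come from TF-IDF cosine scores and the other from cosine scores over transformer embeddings; a weak Spearman value between them signals the kind of structural reordering the abstract reports.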
Copyright © 2026