Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
Vol 10 No 2 (2026): April - In progress

Behavioral Analysis of Semantic Similarity Metrics under Transformer-Based Representations

Musthofa Galih Pradana (Universitas Pembangunan Nasional Veteran Jakarta)
Nindy Irzavika (Universitas Pembangunan Nasional Veteran Jakarta)
Nurhuda Maulana (Universitas Pembangunan Nasional Veteran Jakarta)
Syaila Ananta Karenina (Universitas Pembangunan Nasional Veteran Jakarta)
Salma Ashiila Rabbani (Universitas Pembangunan Nasional Veteran Jakarta)



Article Info

Publish Date
20 Apr 2026

Abstract

Knowledge extraction can follow several approaches. Traditional approaches rely on lexical representations such as TF-IDF, which can be combined with classic similarity metrics such as cosine similarity and the Dice coefficient. In addition to the lexical approach, this study applies these similarity metrics to a more modern representation: contextual representations based on transformer embeddings. The data consist of abstracts from students' theses. The findings show that contextual embeddings change the behavior of similarity values. The analysis shows an average ranking shift of 6.70 positions. The tests show a weak rank correlation (Spearman = 0.22; Kendall = 0.146) together with high top-rank alignment as measured by NDCG (0.97), which indicates structural differences in the similarity ordering between lexical and contextual representations. The ranking gap between the two representations is large because their underlying mechanisms differ substantially, so the choice of representation should be guided by the characteristics of the data to be processed. For text in academic documents, contextual representations based on transformer embeddings are the more suitable choice, because contextual understanding makes it harder for word variation to evade plagiarism detection when a semantic-based representation is applied.
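The lexical side of the comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the weighted form of the Dice coefficient, and all function names are assumptions made for illustration. The ranking helpers show how an average ranking shift and a Spearman correlation could be computed between two similarity orderings, for example one from TF-IDF and one from transformer embeddings (the embedding step itself is not shown).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) from whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def _dot(a, b):
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def cosine_similarity(a, b):
    """Cosine similarity: a.b / (|a| |b|)."""
    norm_a = math.sqrt(_dot(a, a))
    norm_b = math.sqrt(_dot(b, b))
    return _dot(a, b) / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dice_similarity(a, b):
    """Dice coefficient for weighted vectors (assumed form): 2 a.b / (|a|^2 + |b|^2)."""
    denom = _dot(a, a) + _dot(b, b)
    return 2 * _dot(a, b) / denom if denom else 0.0


def ranks(scores):
    """Rank positions (1 = most similar); ties are broken by original index."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def mean_rank_shift(rank_a, rank_b):
    """Average absolute difference in rank position between two rankings."""
    return sum(abs(x - y) for x, y in zip(rank_a, rank_b)) / len(rank_a)


def spearman(rank_a, rank_b):
    """Spearman's rho for two rankings without ties."""
    n = len(rank_a)
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

In use, each candidate document would be scored against a query once per representation, the two score lists would be passed through `ranks`, and the resulting rankings would be compared with `mean_rank_shift` and `spearman`. Kendall's tau and NDCG, which the study also reports, are omitted here to keep the sketch short.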

Copyrights © 2026






Journal Info

Abbrev

RESTI

Publisher

Subject

Computer Science & IT Engineering

Description

Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) is intended as a venue for scholarly work comprising research results, ideas, and critical-analytical studies on research in Systems Engineering, Informatics Engineering/Information Technology, Informatics Management, and Information Systems. As part of the spirit ...