Measuring the semantic similarity of research titles is a crucial component of maintaining academic originality and preventing topic duplication in higher education. However, embeddings from IndoBERT, a pretrained Indonesian language model, are known to suffer from anisotropy, which causes many titles to receive high similarity scores despite being semantically distinct. This study aims to improve the quality of IndoBERT embeddings through Ditto Whitening and to evaluate the effect on research title similarity measurement. The dataset comprises 7,785 undergraduate thesis titles collected from six disciplinary domains. Title embeddings were obtained with mean pooling and L2 normalization and were compared before and after whitening. An intrinsic evaluation assessed embedding isotropy, the cosine similarity distribution, global bias toward the mean vector, and hubness, supported by visualizations of the embedding space using t-SNE, UMAP, and cosine similarity heatmaps. The results show substantial improvements in embedding quality: Cosine Pair Mean fell from 0.559 to −0.000145, MeanCos-to-Mean fell from 0.748 to 0.0068, and Hubness Skew fell from 1.60 to 0.68. The isotropy of the embeddings also increased markedly, reflecting a more uniform distribution of vectors. These findings indicate that Ditto Whitening effectively improves the isotropy of IndoBERT embeddings, which in turn supports more accurate research title similarity detection and academic document retrieval, and thereby topic management and research quality assurance in higher education.
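The kind of before/after comparison described above can be illustrated with a minimal sketch. It rests on several assumptions: it uses generic covariance (SVD) whitening rather than the Ditto Whitening procedure evaluated in the study, it replaces IndoBERT title embeddings with synthetic vectors that share a common offset to imitate anisotropy, and the function names and the hubness definition (skewness of the k-occurrence counts, with k = 10) are illustrative choices, not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale every row to unit length.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def fit_whitening(x, eps=1e-9):
    # Generic covariance whitening (an assumption, not the paper's exact method):
    # returns the mean and a matrix w such that (x - mu) @ w has identity covariance.
    mu = x.mean(axis=0, keepdims=True)
    cov = np.cov(x - mu, rowvar=False)
    u, s, _ = np.linalg.svd(cov)
    w = u / np.sqrt(s + eps)  # same as u @ diag(1 / sqrt(s))
    return mu, w

def mean_pairwise_cosine(x):
    # Average cosine similarity over all distinct pairs of rows.
    x = l2_normalize(x)
    n = x.shape[0]
    sims = x @ x.T
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

def mean_cosine_to_mean(x):
    # Average cosine similarity between each row and the mean direction.
    x = l2_normalize(x)
    m = x.mean(axis=0)
    m = m / max(np.linalg.norm(m), 1e-12)
    return float((x @ m).mean())

def hubness_skew(x, k=10):
    # Skewness of the k-occurrence counts: how often each row appears
    # among the k nearest neighbours (by cosine) of the other rows.
    x = l2_normalize(x)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)  # a row is not its own neighbour
    nn = np.argpartition(-sims, k, axis=1)[:, :k]
    counts = np.bincount(nn.ravel(), minlength=x.shape[0])
    c = counts - counts.mean()
    return float((c ** 3).mean() / (c ** 2).mean() ** 1.5)

# Synthetic stand-in for mean-pooled title embeddings: Gaussian noise plus
# a shared offset, so most vectors point in roughly the same direction.
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 32)) + 3.0

mu, w = fit_whitening(raw)
before = l2_normalize(raw)
whitened = (raw - mu) @ w
after = l2_normalize(whitened)

for name, metric in [("cosine pair mean", mean_pairwise_cosine),
                     ("mean cos to mean", mean_cosine_to_mean),
                     ("hubness skew", hubness_skew)]:
    print(f"{name}: before={metric(before):.4f} after={metric(after):.4f}")
```

On the synthetic data, centering and rescaling remove the shared offset, so the pairwise cosine mean and the similarity to the mean vector both drop toward zero. How hubness changes depends on the data, so the sketch reports it without implying a direction.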
Copyright © 2026