The shift toward a knowledge-based economy underscores the importance of intellectual property (IP) management. However, conventional keyword-based search often fails to capture the semantic relationships between concepts in documents, particularly complex ones such as patents and copyrights. This study proposes a topic modeling approach based on Latent Dirichlet Allocation (LDA) to improve the relevance and accuracy of information retrieval over IP data. The study built 76 models across four scenarios, defined by whether language translation was applied and whether n-gram tokenization was applied, with the number of topics ranging from 1 to 19 in each scenario. The best model from each of the four scenarios yielded a coherence score between 0.4411 and 0.4581. Evaluation using Mean Average Precision (MAP) on the top 10 retrieved documents showed that the model without translation and with unigram tokenization (10 topics) performed best, with an average MAP of 78%. The findings indicate that neither language translation nor n-gram tokenization significantly affects the coherence score. However, the unigram models, which omit bigram and trigram combinations, returned more semantically relevant search results as measured by MAP. Models built on automatically translated text scored lower on MAP than models built without translation.
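To make the evaluation metric concrete, the following is a minimal sketch of MAP over the top 10 documents. It is not the study's implementation: the function names are hypothetical, relevance judgments are assumed to be binary, and average precision is assumed to be normalized by the number of relevant documents found within the top k (other normalizations exist and would give different values).

```python
from typing import Sequence


def average_precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """AP@k for one query.

    relevance[i] is True when the document at rank i + 1 was judged relevant.
    Assumption: AP is normalized by the number of relevant hits in the top k.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]], k: int = 10) -> float:
    """Mean of AP@k over all queries; returns 0.0 when there are no queries."""
    if not runs:
        return 0.0
    return sum(average_precision_at_k(run, k) for run in runs) / len(runs)
```

For example, a query whose ranked results are judged relevant, not relevant, relevant has AP = (1/1 + 2/3) / 2, and MAP is the mean of such values across all test queries.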
Copyright © 2025