This study evaluates the effectiveness of ClueAI, a large Japanese language model adapted to the medical domain, on the task of predicting Japanese medical text. The study is motivated by the limitations of general-purpose language models, including multilingual models such as multilingual BERT, in handling the linguistic complexity and specialized terminology of Japanese medical texts. The methodology consists of fine-tuning the ClueAI model on the MedNLP corpus, with MeCab-based tokenization performed through the Fugashi library. Evaluation uses the perplexity metric, which measures how well the model predicts text probabilistically and thus reflects its generalization ability. The results show that the medical-domain-adapted ClueAI achieves lower perplexity than the multilingual BERT baseline and better captures the context and sentence structure of medical texts. MeCab-based tokenization is shown to contribute substantially to prediction accuracy through more precise morphological analysis. However, the model still exhibits weaknesses in handling complex syntactic structures such as passive sentences and nested clauses. The study concludes that domain adaptation improves performance, although limited linguistic generalization remains a challenge. Further research is recommended to explore models that are more sensitive to syntactic structure, to expand the variety of medical corpora, and to apply other Japanese language models to broader medical NLP tasks such as clinical entity extraction and classification.
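To illustrate the tokenization step, the following is a minimal sketch of MeCab-based morphological analysis through the Fugashi library; the sample sentence and the default dictionary are assumptions for illustration, not details taken from the study.

    from fugashi import Tagger

    tagger = Tagger()  # assumes a default MeCab dictionary (e.g. UniDic) is installed

    # Illustrative sentence: "The patient has a history of hypertension and diabetes."
    text = "患者は高血圧と糖尿病の既往歴がある。"
    for word in tagger(text):
        # each token exposes its surface form and morphological features,
        # which supports the finer-grained segmentation described above
        print(word.surface, word.feature.pos1)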
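For reference, the perplexity reported here follows the standard definition for a model assigning probabilities to a token sequence w_1, ..., w_N (for masked models such as BERT, a pseudo-perplexity variant computed over masked positions is commonly used instead):

    \mathrm{PPL}(w_1,\dots,w_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\right)

Lower values indicate that the model assigns higher probability to the observed text.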