Jo, Jaechoon
Unknown Affiliation

Published: 2 Documents

Articles

Verification of a Dataset for Korean Machine Reading Comprehension with Numerical Discrete Reasoning over Paragraphs
Kim, Gyeongmin; Jo, Jaechoon
JOIV : International Journal on Informatics Visualization Vol 6, No 2-2 (2022): A New Frontier in Informatics
Publisher : Society of Visual Informatics

DOI: 10.30630/joiv.6.2-2.1120

Abstract

Numerical reasoning in machine reading comprehension (MRC) has shown significant performance improvements in recent years. However, because research has been restricted to specific languages, low-resource languages are not considered, and MRC studies on such languages remain limited. In addition, methods that rely on information extracted within the span of a paragraph have limitations in answering questions that require actual reasoning. To overcome these shortcomings, this study establishes a dataset for training Korean question answering (QA) models that not only answer within the span of a passage but also perform numerical reasoning over passages and questions. Its efficacy was verified by training a model on it. We recruited eight annotators to tag the ground-truth labels; they annotated 920, 115, and 115 passages in the train, dev, and test sets, respectively. A simple yet sophisticated automatic inter-annotation tool was created, effectively reducing the inaccuracies and errors humans introduce during data construction; the tool used the common KoBERT and KoELECTRA models. We defined four general conditions and six conditions that humans must inspect, and fine-tuned the pre-trained language models with a numerically aware architecture. KoELECTRA and NumNet+ with KoELECTRA were fine-tuned, and experiments under identical hyperparameter settings showed that NumNet+ with KoELECTRA outperformed the other models by more than 1.3 points. Our research contributes to Korean MRC research and offers insight into MRC models capable of numerical reasoning.
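The abstract describes an automatic inter-annotation tool that reduces human error during data construction. The paper's actual tool also uses KoBERT/KoELECTRA predictions; as a minimal sketch of the underlying idea, a consistency check that flags passages where two annotators disagree (so only the flagged subset needs human inspection) might look like the following. All function names and data here are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch: flag passages where two annotators' answers disagree,
# so a human only needs to inspect the flagged subset. This is an assumption
# about how an inter-annotation consistency check could work, not the
# authors' actual tool.

def normalize(answer: str) -> str:
    """Lightweight normalization before comparing answer strings."""
    return " ".join(answer.strip().lower().split())

def find_disagreements(annotations_a: dict, annotations_b: dict) -> list:
    """Return IDs of passages whose two annotations do not match exactly."""
    return [
        pid for pid in annotations_a
        if pid in annotations_b
        and normalize(annotations_a[pid]) != normalize(annotations_b[pid])
    ]

# Toy example: only "p2" genuinely differs; "p3" differs just in whitespace.
ann_a = {"p1": "42", "p2": "Seoul", "p3": " 7 years "}
ann_b = {"p1": "42", "p2": "Busan", "p3": "7 years"}

print(find_disagreements(ann_a, ann_b))  # → ['p2']
```

Normalizing before comparison keeps trivial whitespace or casing differences from being flagged, which is the kind of human-introduced noise the abstract says the tool was built to reduce.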
Enhancing Code Similarity with Augmented Data Filtering and Ensemble Strategies
Kim, Gyeongmin; Kim, Minseok; Jo, Jaechoon
JOIV : International Journal on Informatics Visualization Vol 6, No 3 (2022)
Publisher : Society of Visual Informatics

DOI: 10.30630/joiv.6.3.1259

Abstract

Although COVID-19 has severely affected the global economy, information technology (IT) employees managed to perform most of their work from home. Telecommuting and remote work have increased the demand for IT services in various market sectors, including retail, entertainment, education, and healthcare. Consequently, computer and information experts are also in demand. However, producing IT experts is difficult during a pandemic owing to limitations such as reduced enrollment of international students. Research on increasing software productivity is therefore essential; this study proposes a code similarity determination model that uses augmented data filtering and ensemble strategies. The algorithm is the first automated development system for increasing software productivity that addresses the current situation: a worldwide shortage of software experts. Pre-training dramatically improves performance in various downstream natural language processing (NLP) tasks. Unlike general-purpose pre-trained language models (PLMs), CodeBERT and GraphCodeBERT are PLMs that have learned both natural and programming languages; hence, they are suitable as code similarity determination models. The data filtering process consists of three steps: (1) data deduplication, (2) intersection deletion, and (3) an exhaustive search. The best matching 25 (BM25) and BM25 with length normalization (BM25L) algorithms were used to construct positive and negative pairs. Model performance was evaluated using a 5-fold cross-validation ensemble technique. Experiments quantitatively demonstrate the effectiveness of the proposed method. Moreover, we expect this method to be optimal for increasing software productivity in various NLP tasks.
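The abstract names a three-step filtering pipeline for the augmented data: deduplication, intersection deletion, and an exhaustive search. The paper does not spell out the implementation here; the following is a minimal sketch under the assumptions that "intersection deletion" removes training samples that also appear in the evaluation split (to avoid leakage) and that the exhaustive search compares all remaining pairs for near-duplicates. Function names, the similarity measure, and the threshold are illustrative, not the authors' code:

```python
# Hypothetical sketch of the three filtering steps from the abstract:
# (1) deduplicate samples, (2) delete the intersection with another split,
# (3) exhaustively search remaining pairs for near-duplicates.
from difflib import SequenceMatcher

def deduplicate(samples):
    """Step 1: drop exact duplicates while preserving order."""
    seen, out = set(), []
    for s in samples:
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

def delete_intersection(train, test):
    """Step 2: remove train samples that also occur in the test split."""
    test_set = set(test)
    return [s for s in train if s not in test_set]

def exhaustive_near_dup_search(samples, threshold=0.9):
    """Step 3: compare every pair; drop later near-duplicate samples."""
    kept = []
    for s in samples:
        if all(SequenceMatcher(None, s, k).ratio() < threshold for k in kept):
            kept.append(s)
    return kept

train = ["def add(a,b): return a+b", "def add(a,b): return a+b",
         "def add(a, b): return a + b", "def mul(a,b): return a*b"]
test = ["def mul(a,b): return a*b"]

filtered = exhaustive_near_dup_search(delete_intersection(deduplicate(train), test))
print(filtered)
```

Running the toy pipeline leaves a single `add` variant: the exact duplicate is removed in step 1, the sample shared with the test split in step 2, and the whitespace-only variant in step 3. A real code-similarity corpus would substitute a stronger similarity measure (such as BM25 scores over tokenized code) for `SequenceMatcher`.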