Lontar Komputer: Jurnal Ilmiah Teknologi Informasi
Vol. 13, No. 3 (December 2022)

Balinese Script Recognition Using Tesseract Mobile Framework

Gede Indrawan (Universitas Pendidikan Ganesha)
Ahmad Asroni (Department of Electrical Engineering and Computer Science, Universitas Pendidikan Ganesha)
Luh Joni Erawati Dewi (Department of Electrical Engineering and Computer Science, Universitas Pendidikan Ganesha)
I Gede Aris Gunadi (Department of Electrical Engineering and Computer Science, Universitas Pendidikan Ganesha)
I Ketut Paramarta (Department of Balinese Language Education, Universitas Pendidikan Ganesha)

Article Info

Publish Date
25 Nov 2022

Abstract

One of the main factors behind the decline in the use of Balinese Script is that Balinese people are reluctant to learn it, because the script is relatively complicated to recognize, and are therefore less interested in reading it. Computer technology can help through automatic character recognition, known as Optical Character Recognition (OCR). Developing an OCR application for Balinese Script is an effort, from the technology side, to help preserve the script and to provide a means of education about it. In this study, the application was developed with the Tesseract OCR engine in four stages: (1) preparing the dataset, (2) generating the dataset using the Web Scraping method, (3) training the OCR engine on the generated dataset, and (4) implementing the resulting language model in a mobile application. The results indicate that generating the dataset with the Web Scraping method can be a better choice when training requires a large dataset, compared with several previous studies of non-Latin character recognition. Those studies used the jTessBox tools, which were time-consuming because characters had to be selected one by one to build the dataset. The best language model was trained on a hierarchical combination of character, word, sentence, and paragraph datasets and achieved a coincidence rate of 66.67%. The more diverse and hierarchically structured the datasets used, the higher the coincidence rate.
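As a rough illustration of two ideas in the abstract, the Python sketch below combines dataset levels into one hierarchical training corpus and scores recognition output against ground truth. It is not the authors' implementation. The function names `build_hierarchical_corpus` and `coincidence_rate` are hypothetical, the sample strings are Latin placeholders rather than real Balinese Script data, and the scoring rule (percentage of samples whose recognized text exactly matches the expected text) is an assumption, since the abstract does not define how the coincidence rate is computed.

```python
from typing import Dict, List, Sequence

# Assumed ordering of dataset levels, from smallest to largest unit.
LEVEL_ORDER = ["character", "word", "sentence", "paragraph"]


def build_hierarchical_corpus(levels: Dict[str, Sequence[str]]) -> List[str]:
    """Merge dataset levels in LEVEL_ORDER, dropping blanks and duplicates."""
    seen = set()
    corpus: List[str] = []
    for level in LEVEL_ORDER:
        for line in levels.get(level, []):
            text = line.strip()
            if text and text not in seen:
                seen.add(text)
                corpus.append(text)
    return corpus


def coincidence_rate(expected: Sequence[str], recognized: Sequence[str]) -> float:
    """Percentage of samples recognized exactly (assumed definition)."""
    if len(expected) != len(recognized):
        raise ValueError("expected and recognized must have the same length")
    if not expected:
        return 0.0
    matches = sum(
        1 for e, r in zip(expected, recognized) if e.strip() == r.strip()
    )
    return 100.0 * matches / len(expected)


if __name__ == "__main__":
    # Placeholder samples only; real data would be Balinese Script text.
    levels = {
        "character": ["ha", "na", "ca"],
        "word": ["hana", "ca"],  # "ca" duplicates a character-level entry
        "sentence": ["hana caraka"],
    }
    print(build_hierarchical_corpus(levels))
    # ['ha', 'na', 'ca', 'hana', 'hana caraka']

    expected = ["ha", "na", "ca", "hana"]
    recognized = ["ha", "na", "ca", "hara"]  # one misrecognized sample
    print(coincidence_rate(expected, recognized))  # 75.0
```

Under this assumed definition, a rate of 66.67% would correspond to two of every three test samples being recognized exactly. The study's actual metric may be computed differently, for example per character rather than per sample.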

Copyrights © 2022

Journal Info

Abbrev

lontar

Publisher

Subject

Computer Science & IT

Description

Lontar Komputer [ISSN Print 2088-1541] [ISSN Online 2541-5832] is a journal that focuses on the theory, practice, and methodology of all aspects of technology in the field of computer science and engineering as well as productive and innovative ideas related to new technology and information ...