Video transcriptions can be generated automatically from the video maker's speech in its original language, but transcription quality depends on the quality of the audio signal and the speaker's natural voice. In this study, Deep Speech is used to predict letters from acoustic recognition without relying on language rules. The Common Voice multilingual corpus helps Deep Speech transcribe Indonesian. However, this corpus does not cover the specific topic of urban agriculture, so an additional corpus is needed to build acoustic and language models for the urban-agriculture domain. A total of 15 popular videos with closed captions and nine e-books on horticulture (fruits, vegetables, and medicinal plants) were curated. The video data were extracted into audio and transcriptions meeting the training-data specifications, while the agricultural text data were transformed into a language model used to score recognition results during prediction. The evaluation shows that the number of training epochs improves transcription performance. Applying the language model score during prediction improved WER performance, as it helped interpret words with agricultural terms. Another finding is that the model was unable to predict short words in informal varieties located at the end of a sentence.
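The abstract evaluates transcription quality with WER (word error rate). As a minimal illustration of how that metric is computed, the sketch below implements the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The example sentences are hypothetical and not taken from the paper's data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical Indonesian example: one substitution among four words -> WER 0.25.
print(wer("menanam sayuran di pekarangan", "menanam sayur di pekarangan"))
```

A lower WER indicates a closer match to the reference transcription; a domain language model lowers WER by favoring in-domain words (here, agricultural terms) during decoding.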
Copyright © 2022