Yayat Sudaryat
Universitas Pendidikan Indonesia, Indonesia



Constructing a Part-of-Speech Tagging based on Lexicon and Rule-based for Sundanese Corpus
Ade Sutedi; Ayu Latifah; Novan Rodiansyah; Yayat Sudaryat
Jurnal Teknik Informatika (Jutif) Vol. 7 No. 3 (2026): JUTIF Volume 7, Number 3, June 2026
Publisher : Informatika, Universitas Jenderal Soedirman

DOI: 10.52436/1.jutif.2026.7.3.5361

Abstract

Part-of-speech (POS) tagging is the process of annotating each word in a sentence with its word class (noun, verb, adjective, etc.), and it serves as a basis for natural language processing and artificial intelligence applications. This study develops a word-class corpus and a set of word-class annotation rules for Sundanese, a language with limited resources. The experiments were conducted on an annotated corpus of 104,696 tokens collected from Sundanese dictionaries, Sundanese literature (Carita Pondok, Guguritan, Mantra, Pupujian, Sisindiran, Sajak, and Wawacan), Babasan and Paribasa, and the social media platform X (Twitter). Annotation was carried out in several stages: manual annotation based on cross-lingual transfer from Indonesian POS tags to Sundanese POS tags, followed by adjustment according to the word-class rules of Sundanese. The results of the study are a POS-annotated corpus of Sundanese word-tag pairs and a basic rule-based model, which is compared with HMM and CRF models. The rule-based model achieves an F1-score of 0.867 and the CRF model an F1-score of 0.889, while the HMM model attains the highest F1-score, 1.000. Analysis of the POS distributions shows that nouns (KB) consistently dominate across all models, reflecting the noun-rich nature of Sundanese literary texts. The analysis also highlights the challenges of handling unknown words and the need for richer annotated resources, issues that are related to tag interoperability with Universal POS standards. This research contributes to the development of NLP resources for low-resource languages and provides a methodological foundation for future Sundanese NLP applications.
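
To make the idea of a lexicon plus rule-based tagger concrete, the following is a minimal Python sketch of one possible design: look each token up in a word-tag lexicon, fall back to affix rules, and finally assign a default tag to unknown words. It is not the authors' implementation. The lexicon entries, the affix rules, and the tag labels KK (verb) and KS (adjective) are assumptions made for illustration; only the noun tag KB appears in the abstract. Defaulting unknown words to KB is likewise an assumption, motivated only by the reported dominance of nouns.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon. Only the tag KB (noun) is named in the abstract;
# KK (verb) and KS (adjective) are assumed labels used for illustration.
LEXICON: Dict[str, str] = {
    "imah": "KB",
    "budak": "KB",
    "indit": "KK",
    "alus": "KS",
}

# Illustrative affix rules (assumptions, not the paper's actual rule set).
SUFFIX_RULES: List[Tuple[str, str]] = [("keun", "KK")]
PREFIX_RULES: List[Tuple[str, str]] = [("di", "KK")]

# Fallback tag for words matched by neither the lexicon nor any rule.
DEFAULT_TAG = "KB"


def tag_token(token: str) -> str:
    """Tag one token: lexicon lookup, then affix rules, then the default."""
    word = token.lower()
    if word in LEXICON:
        return LEXICON[word]
    for suffix, tag in SUFFIX_RULES:
        # Require a non-empty stem before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return tag
    for prefix, tag in PREFIX_RULES:
        # Require a stem of at least three characters after the prefix.
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            return tag
    return DEFAULT_TAG


def tag_sentence(sentence: str) -> List[Tuple[str, str]]:
    """Split on whitespace and tag each token independently."""
    return [(tok, tag_token(tok)) for tok in sentence.split()]


if __name__ == "__main__":
    print(tag_sentence("Budak indit"))
```

A tagger of this shape is sensitive to unknown words, since everything outside the lexicon and rules collapses to the default tag; this is consistent with the unknown-word challenge the abstract mentions.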