Lingua : Journal of Linguistics and Language
Vol. 3 No. 3 (2025): September 2025

Genre Aware Language Modeling for Indonesian Academic Writing: Building and Evaluating IndoSciBERT

Aribowo, Eric Kunto (Unknown)
Prima, Anggra (Unknown)



Article Info

Publish Date
30 Sep 2025

Abstract

This study introduces a genre-annotated academic corpus for Indonesian and evaluates IndoSciBERT, a domain-specific NLP model trained on this resource. To address the scarcity of rhetorical datasets in low-resource languages, we compiled a 52,300-document corpus from DOAJ- and SINTA-indexed journals (2015–2025) and annotated 5,200 paragraphs using the CARS and Argumentative Zoning frameworks. We used GROBID for PDF-to-TEI conversion, TEITOK for annotation, and SIPEBI/KBBI for spelling normalization. IndoSciBERT was then fine-tuned for rhetorical classification and benchmarked against IndoBERT on rhetorical classification tasks. It achieved an F1 score of 0.82 and an accuracy of 84.2%, outperforming the baseline model and distinguishing rhetorical moves reliably. These results affirm the value of domain-specific modeling for educational applications. The annotated corpus supports genre analysis, pedagogy, and automated writing feedback, and it establishes a foundation for inclusive NLP. In particular, this work offers a sustainable path to enhancing academic literacy in Bahasa Indonesia through intelligent, genre-aware tools.
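
The abstract reports an F1 score and an accuracy for rhetorical classification without stating how they were computed. As an illustration only, the sketch below shows one common way to derive both metrics from gold and predicted labels. It assumes macro-averaged F1, which the abstract does not confirm, and the label names and toy data are hypothetical stand-ins for CARS moves, not values from the study.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of paragraphs whose predicted label matches the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores (assumed averaging scheme)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean of P and R
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels loosely named after the three CARS moves.
gold = ["territory", "territory", "niche", "niche", "occupy", "occupy"]
pred = ["territory", "niche", "niche", "niche", "occupy", "territory"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Macro averaging weights every rhetorical move equally regardless of frequency. If the reported score were micro- or weighted-averaged instead, the per-label scores would be combined differently.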

Copyrights © 2025






Journal Info

Abbrev

lingua

Publisher

Subject

Language, Linguistics, Communication & Media

Description

Lingua : Journal of Linguistics and Language (ISSN 3032-3304, online), published by Indonesian Scientific Publication, is a leading scholarly journal that follows a rigorous peer-review process and is committed to open-access publication. Established to advance the field of ...