This study introduces a genre-annotated academic corpus for Indonesian and evaluates IndoSciBERT, a domain-specific language model trained on it. To address the scarcity of rhetorically annotated datasets for low-resource languages, we compiled a corpus of 52,300 documents from DOAJ- and SINTA-indexed journals (2015–2025) and annotated 5,200 paragraphs using the CARS (Create a Research Space) and Argumentative Zoning frameworks. The pipeline used GROBID for PDF-to-TEI conversion, TEITOK for annotation, and SIPEBI/KBBI for spelling normalization. IndoSciBERT was then fine-tuned for rhetorical classification and benchmarked against IndoBERT on the same tasks. It achieved an F1 score of 0.82 and an accuracy of 84.2%, outperforming the IndoBERT baseline in distinguishing rhetorical moves. These results support the value of domain-specific modeling for educational applications. The annotated corpus supports genre analysis, pedagogy, and automated writing feedback, and it lays a foundation for inclusive NLP. In particular, this work offers a sustainable path toward improving academic literacy in Bahasa Indonesia through intelligent, genre-aware tools.
Copyright © 2025