This study presents a max-margin-based approach to sentence boundary segmentation in Indonesian paragraphs, a persistent challenge in natural language processing (NLP) applications. Conventional rule-based and sequential methods often fail to disambiguate punctuation marks, particularly around abbreviations, numerical expressions, hierarchical sentence structures, and direct quotations. To overcome these limitations, this research formulates sentence segmentation as a paragraph parsing task, which lets the model capture both local boundary cues and global structural patterns within a paragraph. A manually annotated corpus of 12,000 paragraphs drawn from news articles, public documents, and academic texts was developed to cover diverse linguistic structures and punctuation variations. The proposed model combines local punctuation features, structural constraints from the Indonesian EYD (Ejaan Yang Disempurnakan) spelling standard, and global paragraph coherence in a max-margin discriminative parsing framework. On the test set, the model achieves a precision of 0.93, a recall of 0.91, and an F1-score of 0.92, significantly outperforming a rule-based baseline. Error analysis further shows improvements on ambiguous cases such as abbreviations, numerical formatting, and direct quotations with nested punctuation. The findings demonstrate that a structured max-margin approach delivers more reliable sentence boundary segmentation and can benefit downstream NLP tasks that depend on accurate sentence-level text processing.
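To make the max-margin formulation concrete, the following is a minimal sketch of a structured hinge-loss update for boundary selection on one toy Indonesian paragraph. It is not the model described above: the feature names, the abbreviation list, the Hamming cost, the brute-force search over candidate boundaries, and the learning rate are all illustrative assumptions, and only local features are used, whereas the study also draws on EYD constraints and global paragraph coherence.

```python
from itertools import product

# Hypothetical toy abbreviation list (an assumption, not from the study).
ABBREVIATIONS = {"dr", "prof", "jl", "no"}

def candidates(tokens):
    """Indices of tokens that end in sentence-final punctuation."""
    return [i for i, tok in enumerate(tokens) if tok[-1] in ".?!"]

def local_features(tokens, i):
    """Illustrative local features for a candidate boundary after token i."""
    word = tokens[i].rstrip(".?!").lower()
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "bias": 1.0,
        "is_abbrev": float(word in ABBREVIATIONS),
        "is_number": float(word.isdigit()),
        "next_capitalized": float(nxt[:1].isupper()),
        "is_last_token": float(i == len(tokens) - 1),
    }

def phi(tokens, boundaries):
    """Joint feature vector: sum of local features over the chosen boundaries."""
    total = {}
    for i in boundaries:
        for name, value in local_features(tokens, i).items():
            total[name] = total.get(name, 0.0) + value
    return total

def score(weights, tokens, boundaries):
    """Linear score of a whole segmentation."""
    return sum(weights.get(k, 0.0) * v for k, v in phi(tokens, boundaries).items())

def argmax(weights, tokens, gold=None):
    """Best boundary set by brute force; adds a Hamming cost when gold is given
    (loss-augmented inference). Only feasible for a handful of candidates."""
    cands = candidates(tokens)
    best, best_value = frozenset(), float("-inf")
    for mask in product([0, 1], repeat=len(cands)):
        pred = frozenset(c for c, keep in zip(cands, mask) if keep)
        value = score(weights, tokens, pred)
        if gold is not None:
            value += len(pred ^ gold)  # symmetric difference = Hamming cost
        if value > best_value:
            best, best_value = pred, value
    return best

def hinge_step(weights, tokens, gold, lr=1.0):
    """One subgradient step on the structured hinge loss; returns the loss."""
    pred = argmax(weights, tokens, gold)
    loss = score(weights, tokens, pred) + len(pred ^ gold) - score(weights, tokens, gold)
    if loss > 0:
        phi_gold, phi_pred = phi(tokens, gold), phi(tokens, pred)
        for name in set(phi_gold) | set(phi_pred):
            delta = phi_gold.get(name, 0.0) - phi_pred.get(name, 0.0)
            weights[name] = weights.get(name, 0.0) + lr * delta
    return loss

def split_sentences(tokens, boundaries):
    """Cut the token list after each predicted boundary index."""
    sentences, start = [], 0
    for i in sorted(boundaries):
        sentences.append(" ".join(tokens[start:i + 1]))
        start = i + 1
    if start < len(tokens):
        sentences.append(" ".join(tokens[start:]))
    return sentences

# Toy paragraph: "Dr." is an abbreviation, "10." ends the first sentence.
tokens = "Dr. Budi tiba pukul 10. Ia membawa 3 buku.".split()
gold = frozenset({4, 8})  # boundaries after "10." and "buku."
weights = {}
for _ in range(10):
    last_loss = hinge_step(weights, tokens, gold)
predicted = argmax(weights, tokens)
print(split_sentences(tokens, predicted))
# -> ['Dr. Budi tiba pukul 10.', 'Ia membawa 3 buku.']
```

On this single example the hinge loss reaches zero within a few updates, and the learned weights reject the period after "Dr." while accepting the one after "10.". A realistic system would replace the exhaustive enumeration with an efficient inference procedure and train on many annotated paragraphs.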
Copyright © 2025