Background: Standard and non-standard Indonesian sentences have traditionally been clearly distinguished, but the ubiquity of digital communication has increasingly blurred the boundary between them. This convergence introduces lexical ambiguity into formal contexts and complicates automated text classification. Objective: This study aims to make Support Vector Machine (SVM) classification more robust to these linguistic irregularities by combining TF-IDF vectorization with a targeted directional augmentation strategy. Methods: A corpus of 5,394 labeled sentences was partitioned under a strict anti-leak grouping strategy to prevent semantic leakage among the training, validation, and test sets. To resolve decision-boundary overlaps that the baseline model often missed, manual directional augmentation was applied to ambiguous sentence structures, enriching the training distribution and its linguistic diversity. Results: Directional augmentation refined the model's decision margins. The baseline model achieved a test accuracy of 94.39%; the augmented model generalized better to unseen groups, raising validation accuracy from 96.11% to 97.39% and test accuracy from 94.39% to 96.16%. Conclusion: These findings indicate that structurally enriching the dataset mitigates overfitting and improves sensitivity. However, because manual intervention scales poorly, future research should prioritize automated augmentation techniques and contextual embeddings to capture deeper linguistic nuances.
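The abstract does not state how the anti-leak grouping was implemented. As a minimal sketch only, the idea can be illustrated by assuming each sentence carries a group identifier (for example, an ID shared by a sentence and its near-duplicates or augmented variants) and assigning whole groups, never individual sentences, to a split. The function name `group_split`, the `group_ids` input, and the split fractions are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict


def group_split(group_ids, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole groups to train/val/test so no group spans two sets.

    group_ids: one (hypothetical) group label per sentence.
    Fractions apply to the number of groups, not the number of sentences.
    Returns a dict mapping split name to a sorted list of sentence indices.
    """
    # Collect the sentence indices belonging to each group.
    by_group = defaultdict(list)
    for idx, gid in enumerate(group_ids):
        by_group[gid].append(idx)

    # Sort before shuffling so the result depends only on the seed.
    groups = sorted(by_group)
    random.Random(seed).shuffle(groups)

    # Reserve at least one group each for test and validation.
    n_test = max(1, round(len(groups) * test_frac))
    n_val = max(1, round(len(groups) * val_frac))
    assignment = {
        "test": groups[:n_test],
        "val": groups[n_test:n_test + n_val],
        "train": groups[n_test + n_val:],
    }

    # Expand each split's groups back into sentence indices.
    return {
        name: sorted(i for g in gs for i in by_group[g])
        for name, gs in assignment.items()
    }


# Toy example: ten sentences belonging to seven groups.
group_ids = ["g1", "g1", "g2", "g3", "g3", "g4", "g5", "g5", "g6", "g7"]
splits = group_split(group_ids, val_frac=0.2, test_frac=0.2, seed=42)
for name, indices in splits.items():
    print(name, indices)
```

Under this assumption, any augmented sentence would inherit the group ID of the sentence it was derived from, so augmentation could not place related sentences on both sides of a split boundary.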
Copyrights © 2026