Authorship attribution (AA), a core task in computational linguistics, seeks to identify the author of a text from its stylistic patterns. Many existing methods are effective but face a trade-off between classification accuracy and computational cost, especially on large datasets. This study systematically evaluates word-level string kernels as an efficient and accurate solution for AA. We investigate three string kernels (Spectrum, Presence Bits, and Intersection) paired with three machine learning classifiers (Support Vector Machine, Random Forest, and XGBoost). The models were tested on three feature sets designed to isolate the stylistic contribution of noun phrases alongside word n-grams. The best configuration, a Support Vector Machine with a Spectrum kernel over a feature set of word n-grams and noun phrases, achieves approximately 95% classification accuracy on the test set. This result underscores the role of phrasal-level syntactic information in capturing an author's unique voice. Most significantly, the word-level approach yields a four- to six-fold reduction in model training time compared to a strong character-level baseline, while maintaining superior or competitive accuracy. We conclude that word-level string kernels offer a practical framework for authorship attribution that balances high performance with computational efficiency. The method's scalability makes it well suited to real-world applications, including digital forensics, plagiarism detection, and large-scale textual analysis.
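To make the three kernels concrete, the following minimal sketch computes each one over word n-gram counts for a pair of texts. It uses the standard formulations from the string-kernel literature: Spectrum sums products of n-gram counts, Presence Bits counts shared distinct n-grams, and Intersection sums the minimum of the two counts. The function names, the lowercased whitespace tokenization, and the choice of n = 2 are assumptions made for illustration. The study's actual preprocessing, normalization, and noun-phrase features are not specified here and are not reproduced.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams in a text (assumed tokenization: lowercase, split on whitespace)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def spectrum_kernel(a, b):
    """Sum over shared n-grams of the product of their counts in the two texts."""
    return sum(count * b[gram] for gram, count in a.items() if gram in b)


def presence_bits_kernel(a, b):
    """Number of distinct n-grams that occur in both texts, ignoring frequency."""
    return sum(1 for gram in a if gram in b)


def intersection_kernel(a, b):
    """Sum over shared n-grams of the smaller of the two counts."""
    return sum(min(count, b[gram]) for gram, count in a.items() if gram in b)


# Two toy "documents" (hypothetical examples, not data from the study).
doc_s = "the cat sat on the mat the cat"
doc_t = "the cat ran the cat sat"

grams_s = word_ngrams(doc_s, 2)
grams_t = word_ngrams(doc_t, 2)

# Shared bigrams: ("the", "cat") occurs twice in each text, ("cat", "sat") once in each.
print(spectrum_kernel(grams_s, grams_t))       # 2*2 + 1*1 = 5
print(presence_bits_kernel(grams_s, grams_t))  # 2 shared distinct bigrams
print(intersection_kernel(grams_s, grams_t))   # min(2,2) + min(1,1) = 3
```

In a full pipeline, pairwise values like these would fill a Gram matrix over all documents, which could then be passed to a classifier that accepts precomputed kernels (for example, scikit-learn's `SVC` with `kernel="precomputed"`). Whether the study used that route is not stated.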
Copyright © 2025