Journal : Scientific Journal of Informatics

Integrating C4.5 and K-Nearest Neighbor Imputation with Relief Feature Selection for Enhancing Breast Cancer Diagnosis
Purwinarko, Aji; Budiman, Kholiq; Widiyatmoko, Arif; Sasi, Fitri Arum; Hardyanto, Wahyu
Scientific Journal of Informatics Vol. 12 No. 1: February 2025
Publisher : Universitas Negeri Semarang

DOI: 10.15294/sji.v12i1.21673

Abstract

Purpose: Breast cancer remains a significant cause of mortality among women, so accurate diagnostic methods are needed. Traditional classification models often lose accuracy because of missing values and irrelevant features. This study improves breast cancer classification by combining the C4.5 algorithm with K-Nearest Neighbor (KNN) imputation and Relief feature selection, which improves data integrity and classification performance. Methods: The Wisconsin Breast Cancer Database (WBCD) was used to evaluate the proposed method. KNN imputation filled in missing values, and Relief selected the most relevant features. C4.5 was trained on data splits of 70:30, 80:20, and 90:10 and evaluated on accuracy, precision, recall, and F1-score. Result: The proposed method achieved a highest classification accuracy of 98.57%, surpassing several existing models: PSO-C4.5 (96.49%), EBL-RBFNN (98.40%), Gaussian Naïve Bayes (97.50%), and t-SNE (98.20%), for gains of 2.08, 0.17, 1.07, and 0.37 percentage points, respectively. These results confirm its effectiveness in handling missing values and selecting relevant features. Novelty: Unlike prior studies that addressed missing values and feature selection separately, this research integrates both techniques, enhancing classification accuracy and computational efficiency. The findings suggest that this approach provides a reliable breast cancer diagnosis method. Future work could explore deep learning integration and validation on larger datasets to improve generalizability.
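
The two preprocessing steps named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the toy data, the function names `knn_impute` and `relief_weights`, the choice of k = 2, the distance measure, and the decision to visit every instance in Relief (rather than a random sample) are all assumptions made for illustration. The C4.5 classifier itself is not shown.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Assumption: distance is the root-mean-square difference over the
    features that both rows have observed.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which feature j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

def relief_weights(X, y):
    """Basic two-class Relief on complete numeric data.

    A feature gains weight when it differs on the nearest miss (other class)
    and loses weight when it differs on the nearest hit (same class).
    Assumption: every class has at least two instances.
    """
    n, d = len(X), len(X[0])
    span = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    w = [0.0] * d
    for i in range(n):
        hits = [X[m] for m in range(n) if m != i and y[m] == y[i]]
        misses = [X[m] for m in range(n) if y[m] != y[i]]
        hit = min(hits, key=lambda r: dist(X[i], r))
        miss = min(misses, key=lambda r: dist(X[i], r))
        for j in range(d):
            w[j] += (diff(X[i], miss, j) - diff(X[i], hit, j)) / n
    return w

# Hypothetical toy data: features 0 and 1 track the class, feature 2 is noise.
rows = [
    [1.0, 2.0, 5.0], [1.2, 2.1, 1.0], [None, 1.9, 3.0], [0.9, 2.2, 4.0],
    [5.0, 8.0, 2.0], [5.2, 8.1, 5.0], [4.8, None, 1.0], [5.1, 7.9, 3.0],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

filled = knn_impute(rows, k=2)
weights = relief_weights(filled, labels)
ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
print("imputed rows:", filled[2], filled[6])
print("feature ranking (best first):", ranked)
```

In a pipeline of this shape, the top-ranked features would then be passed to the classifier; how many features the authors retained is not stated in the abstract.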
Improving Sentiment Analysis with a Context-Aware RoBERTa–BiLSTM and Word2Vec Branch
Hardyanto, Wahyu; Aryani, Nila Prasetya; Andestian, Defin; Sugiyanto; Setyaningrum, Wahyu; Mardiansyah, M Fadil; Islam, Muhamad Anbiya Nur; Purwinarko, Aji
Scientific Journal of Informatics Vol. 12 No. 4: November 2025
Publisher : Universitas Negeri Semarang

DOI: 10.15294/sji.v12i4.35918

Abstract

Purpose: We improve the accuracy of Twitter/X sentiment analysis with a hybrid model combining Word2Vec and the Robustly Optimized BERT Pretraining Approach (RoBERTa). Twitter/X text is noisy (slang, out-of-vocabulary words) and ambiguous, which degrades the performance of pre-trained transformers, while Word2Vec is limited to local context. Studies integrating the two remain scarce. The idea is that Word2Vec handles slang and novel vocabulary well (distributional semantics), while RoBERTa excels at contextual meaning, so combining the two offsets each model's weaknesses. Methods: The Sentiment140 dataset contains 1.6 million balanced tweets. The split is stratified, and Word2Vec is trained solely on the training data. RoBERTa is used pre-trained: frozen in the first stage, then fine-tuned on some layers in the second stage. The Word2Vec and RoBERTa vectors are concatenated and processed by a Bidirectional Long Short-Term Memory (BiLSTM) network with sigmoid activation. Training uses TensorFlow and the Adam optimizer, with dropout and early stopping. The decision threshold is optimized during validation. Result: The hybrid model achieved an accuracy of 88.09%, an F1-score of 88.09%, and an Area Under the Curve (AUC) of approximately 95.19% for the Receiver Operating Characteristic (ROC). No overfitting was observed, and the hybrid model outperformed both single-model baselines. The confusion matrix and ROC curve corroborate the findings. Novelty: The novelty lies in fusing distributional and contextual representations through a structured fusion mechanism. Limitations: The model is computationally demanding, and hyperparameter tuning has not yet been extensive. Further directions: Systematic hyperparameter search and cross-validation across other large sentiment datasets to assess generalization.
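
Two steps from the Methods, concatenating the two representations and tuning the decision threshold on validation data, can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the function names `fuse` and `best_threshold`, the assumption that both branches produce vectors aligned token by token, the choice of accuracy as the tuning criterion, and the toy scores are all hypothetical. The BiLSTM and the RoBERTa fine-tuning are not shown.

```python
def fuse(word2vec_vecs, roberta_vecs):
    """Concatenate per-token vectors from the two branches.

    Assumption: both inputs hold one vector per token, in the same order.
    """
    if len(word2vec_vecs) != len(roberta_vecs):
        raise ValueError("both branches must cover the same tokens")
    return [list(a) + list(b) for a, b in zip(word2vec_vecs, roberta_vecs)]

def best_threshold(scores, labels):
    """Return the (threshold, accuracy) pair that maximizes validation accuracy.

    Candidate thresholds are the distinct sigmoid scores themselves; a score
    at or above the threshold is predicted positive. Ties keep the lowest
    threshold.
    """
    best_t, best_acc = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical 2-d Word2Vec and 3-d RoBERTa vectors for a two-token tweet.
fused = fuse([[0.1, 0.2], [0.3, 0.4]],
             [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]])
print(len(fused), len(fused[0]))

# Hypothetical validation scores from the sigmoid output and their true labels.
val_scores = [0.10, 0.35, 0.40, 0.45, 0.60, 0.80]
val_labels = [0, 0, 1, 1, 1, 1]
threshold, accuracy = best_threshold(val_scores, val_labels)
print(threshold, accuracy)
```

On this toy validation set the default cut-off of 0.5 would misclassify the two positives scored 0.40 and 0.45, which is the kind of error threshold tuning is meant to reduce. The chosen threshold should then be held fixed when scoring the test split.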