Chi-Square Feature Selection with Pseudo-Labelling in Natural Language Processing Afriyani, Sintia; Surono, Sugiyarto; Solihin, Iwan Mahmud
JTAM (Jurnal Teori dan Aplikasi Matematika) Vol 8, No 3 (2024): July
Publisher : Universitas Muhammadiyah Mataram

DOI: 10.31764/jtam.v8i3.22751

Abstract

This study evaluates how effectively Chi-Square feature selection improves the classification accuracy of linear Support Vector Machine, K-Nearest Neighbors, and Random Forest classifiers in natural language processing (NLP), and introduces Pseudo-Labelling to improve semi-supervised classification performance. The work matters for NLP because accurate feature selection can significantly improve model performance by reducing data noise and focusing on the most relevant information, while Pseudo-Labelling helps make the most of unlabelled data, which is particularly useful when labelled data is scarce. The methodology involves collecting relevant datasets, applying the Chi-Square method to select significant features, and then applying Pseudo-Labelling to train semi-supervised models. The dataset consists of public comments on the 2024 Presidential General Election, obtained by scraping Twitter. It includes a variety of comments and opinions about the presidential candidates, covering political views, support, and criticism. The experimental results show a significant improvement in classification accuracy to 0.9200, with precision of 0.8893, recall of 0.9200, and F1-score of 0.8828. Integrating Pseudo-Labelling markedly improves semi-supervised classification performance, suggesting that combining Chi-Square and Pseudo-Labelling can improve classification systems in various NLP applications. This opens up opportunities to develop more efficient methodologies for improving classification accuracy and effectiveness in NLP tasks, especially with linear Support Vector Machine, K-Nearest Neighbors, and Random Forest classifiers as well as semi-supervised learning.
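The Chi-Square filtering step described in this abstract can be illustrated with a minimal sketch. It assumes binary term-presence features and binary class labels, and it uses the standard Pearson chi-square test of independence on a 2x2 table, which may differ from the exact variant the authors used. The names `chi_square_score`, `select_top_k`, and the toy data are hypothetical.

```python
from typing import List, Sequence


def chi_square_score(feature_present: Sequence[int], labels: Sequence[int]) -> float:
    """Pearson chi-square statistic of a binary feature against a binary label."""
    n = len(labels)
    # Observed 2x2 table: rows = feature absent/present, columns = class 0/1.
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature_present, labels):
        observed[f][y] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][c] + observed[1][c] for c in range(2)]
    score = 0.0
    for r in range(2):
        for c in range(2):
            # Expected count if the feature and the class were independent.
            expected = row_totals[r] * col_totals[c] / n
            if expected > 0:
                score += (observed[r][c] - expected) ** 2 / expected
    return score


def select_top_k(columns: List[Sequence[int]], labels: Sequence[int], k: int) -> List[int]:
    """Return the indices of the k features with the highest chi-square score."""
    scores = [chi_square_score(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy data (hypothetical): six documents, two candidate terms.
labels = [1, 1, 1, 0, 0, 0]
term_columns = [
    [1, 0, 1, 0, 1, 0],  # term spread across both classes
    [1, 1, 1, 0, 0, 0],  # term that occurs only in class 1
]
top = select_top_k(term_columns, labels, k=1)
```

A term that occurs in only one class scores much higher than a term spread across both, so it is the one retained.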
Distance Functions Study in Fuzzy C-Means Core and Reduct Clustering Eliyanto, Joko; Surono, Sugiyarto
Jurnal Ilmiah Teknik Elektro Komputer dan Informatika Vol. 7 No. 1 (2021): April
Publisher : Universitas Ahmad Dahlan

DOI: 10.26555/jiteki.v7i1.20516

Abstract

Fuzzy C-Means is a distance-based clustering method that applies the concept of fuzzy logic. The clustering proceeds iteratively to minimize an objective function, which is the sum of the products of each data point's distance to its closest cluster centroid and its membership degree. As the iterations proceed, the objective function should keep decreasing. This research examines whether commonly used distance functions satisfy this hypothesis, in order to determine the most suitable distance for Fuzzy C-Means clustering. Seven distance functions were applied to the same datasets: five standard datasets and two random datasets were used to test Fuzzy C-Means clustering performance. Accuracy, purity, and the Rand Index were used to measure the quality of the resulting clusters. The results show that the distance functions producing the best-quality clusters are the Euclidean, Average, Manhattan, Minkowski, Minkowski-Chebyshev, and Canberra distances. These six distances satisfied the basic hypothesis about the behavior of the objective function in Fuzzy C-Means clustering. The only distance that did not satisfy it was the Chebyshev distance.
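The hypothesis tested in this abstract, that the objective function should not increase from one iteration to the next, can be checked with a small sketch. It assumes the textbook Fuzzy C-Means formulation (objective J = sum of u_ij^m * d(x_i, c_j)^2, with weighted-mean centroid updates), which is not necessarily the exact setup used in the paper. The function names and the toy points are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]
Distance = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chebyshev(a: Point, b: Point) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def update_memberships(points, centroids, dist: Distance, m: float = 2.0) -> List[List[float]]:
    """Standard FCM membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        # Clamp distances to avoid division by zero when a point sits on a centroid.
        d = [max(dist(p, c), 1e-12) for c in centroids]
        memberships.append([1.0 / sum((dj / dk) ** power for dk in d) for dj in d])
    return memberships


def update_centroids(points, memberships, n_clusters: int, m: float = 2.0) -> List[List[float]]:
    """Weighted-mean centroid update (the optimal update for Euclidean distance)."""
    dim = len(points[0])
    centroids = []
    for j in range(n_clusters):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centroids.append(
            [sum(w * p[k] for w, p in zip(weights, points)) / total for k in range(dim)]
        )
    return centroids


def fcm_objective(points, centroids, memberships, dist: Distance, m: float = 2.0) -> float:
    return sum(
        memberships[i][j] ** m * dist(p, c) ** 2
        for i, p in enumerate(points)
        for j, c in enumerate(centroids)
    )


def run_fcm(points, centroids, dist: Distance, iterations: int = 10, m: float = 2.0) -> List[float]:
    """Run FCM and return the objective value recorded after each iteration."""
    history = []
    for _ in range(iterations):
        memberships = update_memberships(points, centroids, dist, m)
        centroids = update_centroids(points, memberships, len(centroids), m)
        history.append(fcm_objective(points, centroids, memberships, dist, m))
    return history


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
start = [(1.0, 1.0), (4.0, 4.0)]
euclid_history = run_fcm(points, start, euclidean)
cheby_history = run_fcm(points, start, chebyshev)
```

With Euclidean distance each update step is optimal for the objective, so `euclid_history` is non-increasing. For other distances the weighted-mean centroid update is kept here only as a simplifying assumption, so no such guarantee applies to `cheby_history`; inspecting it is one way to probe the behavior the abstract reports.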
Comparative Evaluation of Feature Selection Methods for Heart Disease Classification with Support Vector Machine Bidul, Winarsi J.; Surono, Sugiyarto; Kurniawan, Tri Basuki
Jurnal Ilmiah Teknik Elektro Komputer dan Informatika Vol. 10 No. 2 (2024): June
Publisher : Universitas Ahmad Dahlan

DOI: 10.26555/jiteki.v10i2.28647

Abstract

This study compares the effectiveness of several feature selection techniques for improving the performance of Support Vector Machine (SVM) models in classifying heart disease data, particularly in the context of big data. The main challenge lies in managing large datasets, which calls for feature selection to streamline the analysis. Several feature selection methods were therefore explored to identify the most efficient approach: Logistic Regression-Recursive Feature Elimination (LR-RFE), Logistic Regression-Sequential Forward Selection (LR-SFS), Correlation-based Feature Selection (CFS), and Variance Threshold. Existing research shows that these methods have a strong impact on classification accuracy. In this study, combining the SVM model with LR-RFE, LR-SFS, and Variance Threshold gave the best evaluation results, achieving the highest accuracy of 89%. On the other evaluation metrics, including precision, recall, and F1-score, model performance varied with the feature selection method chosen and the distribution of data used for training and testing. In general, however, LR-RFE-SVM and Variance Threshold-SVM tend to give better evaluation values than LR-SFS-SVM and CFS-SVM. In terms of computation time, SVM classification with Variance Threshold as the feature selection method was the fastest, at 118.1540 seconds, while retaining 23 important features. It is therefore very important to choose a suitable feature selection technique, taking into account both the number of retained features and the computation time. This research underscores the significance of feature selection in addressing big data challenges, particularly in heart disease classification. The study also highlights practical implications for healthcare practitioners and researchers by recommending methods that can be integrated into real-world healthcare settings or existing clinical decision support systems.
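Of the methods compared in this abstract, Variance Threshold is the simplest to illustrate. The sketch below assumes the usual definition (drop every feature whose population variance does not exceed a threshold); the threshold value, function names, and toy rows are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def column_variances(rows: Sequence[Sequence[float]]) -> List[float]:
    """Population variance of each feature column."""
    n = len(rows)
    variances = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        variances.append(sum((v - mean) ** 2 for v in column) / n)
    return variances


def variance_threshold(
    rows: Sequence[Sequence[float]], threshold: float = 0.0
) -> Tuple[List[int], List[List[float]]]:
    """Keep only the features whose variance is strictly above the threshold."""
    variances = column_variances(rows)
    kept = [j for j, v in enumerate(variances) if v > threshold]
    reduced = [[row[j] for j in kept] for row in rows]
    return kept, reduced


# Toy data (hypothetical): a constant feature, a binary feature, a near-constant feature.
rows = [
    [1.0, 0.0, 3.0],
    [1.0, 1.0, 3.1],
    [1.0, 0.0, 2.9],
    [1.0, 1.0, 3.0],
]
kept, reduced = variance_threshold(rows, threshold=0.01)
```

Because the filter never looks at the labels and needs only one pass over the data, it is cheap to compute, which is consistent with it being the fastest method reported above.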
Perbandingan 5 Jarak K-Nearest Neighbor pada Analisis Sentimen [Comparison of 5 K-Nearest Neighbor Distances in Sentiment Analysis] Mujhid, Almuzhidul; Thobirin, Aris; Firdausy, Salma Nadya; Surono, Sugiyarto; Rahmadani, Lanova Ade
Jurnal Ilmiah Matematika Vol 8, No 2 (2021)
Publisher : Universitas Ahmad Dahlan

DOI: 10.26555/konvergensi.v0i0.23170

Abstract

K-Nearest Neighbor (KNN) is an algorithm commonly used for classification. This study uses reviews of the Maxim application on the Google Play Store. Users who have downloaded the Maxim application may post reviews on the Google Play Store to share information with other users. Applying KNN to sentiment analysis of Maxim reviews makes it possible to classify each review as positive, neutral, or negative. The researchers compared five distance measures for KNN: Euclidean, Manhattan, Minkowski, Chebyshev, and Canberra. Testing showed that KNN classification accuracy varies with the distance used: 84 percent for Euclidean, 79 percent for Manhattan, 84 percent for Minkowski, 7 percent for Chebyshev, and 44 percent for Canberra.
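The comparison described in this abstract amounts to running the same KNN classifier with a different distance function plugged in. A minimal sketch follows, assuming numeric feature vectors and majority voting. The Minkowski order `p=3`, the value of `k`, and the toy data are arbitrary assumptions for illustration, since the abstract does not state the settings used.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Point, b: Point) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a: Point, b: Point, p: float = 3.0) -> float:
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def chebyshev(a: Point, b: Point) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def canberra(a: Point, b: Point) -> float:
    # Terms where both coordinates are zero are skipped to avoid 0/0.
    return sum(abs(x - y) / (abs(x) + abs(y)) for x, y in zip(a, b) if abs(x) + abs(y) > 0)


DISTANCES: Dict[str, Callable[[Point, Point], float]] = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "minkowski": minkowski,
    "chebyshev": chebyshev,
    "canberra": canberra,
}


def knn_predict(train_x: List[Point], train_y: List[str], query: Point, k: int, dist) -> str:
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups of feature vectors.
train_x = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
train_y = ["negative", "negative", "negative", "positive", "positive", "positive"]
predictions = {
    name: knn_predict(train_x, train_y, (2.0, 2.0), k=3, dist=fn)
    for name, fn in DISTANCES.items()
}
```

On cleanly separated toy data all five distances agree. Differences such as those reported above appear only on real, high-dimensional text features, where the distances rank neighbours differently.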
Hybrid feature fusion from multiple CNN models with bayesian-optimized machine learning classifiers Rismawati, Dewi; Surono, Sugiyarto; Thobirin, Aris
Computer Science and Information Technologies Vol 6, No 3: November 2025
Publisher : Institute of Advanced Engineering and Science

DOI: 10.11591/csit.v6i3.p315-325

Abstract

Information technology advancements have created big data, necessitating efficient techniques to retrieve helpful information. With its capacity to recognize and categorize patterns in data, especially the growing amount of picture data, deep learning is becoming a viable option. This research aims to develop a medical image classification model using chest X-Ray with four classes, namely Covid-19, Pneumonia, Tuberculosis, and Normal. The proposed method combines the advantages of deep learning and machine learning. Three pre-trained CNN models, VGG16, DenseNet201, and InceptionV3, extract features from images. The features generated from each model are fused to enhance the relevant information. Furthermore, principal component analysis (PCA) was applied to reduce the dimensionality of the features, and Bayesian optimization was used to optimize the hyperparameters of the machine learning algorithms support vector machine (SVM), decision tree (DT), and k-nearest neighbors (k-NN). The resulting classification model was evaluated based on accuracy, precision, recall, and F1-score. The results showed that FF-SVM, which is the proposed model, achieved an accuracy of 98.79% with precision, recall, and F1-score of 98.85%, 98.82%, and 98.84%, respectively. In conclusion, fusing feature extraction from multiple CNN models improved the classification accuracy of each machine-learning model. It provided reliable and accurate predictions for lung image diagnosis using chest X-Ray.
Fuzzy Support Vector Machine Using Function Linear Membership and Exponential with Mahanalobis Distance Sukeiti, Wiwi Widia; Surono, Sugiyarto
JTAM (Jurnal Teori dan Aplikasi Matematika) Vol 6, No 2 (2022): April
Publisher : Universitas Muhammadiyah Mataram

DOI: 10.31764/jtam.v6i2.6912

Abstract

Support vector machine (SVM) is an effective binary classification technique based on the structural risk minimization (SRM) principle, and it is known as one of the most successful classification methods. Real-life data, however, often contain noise and outliers, and noise confuses the SVM when the data are processed. In this research, SVM is extended with a fuzzy membership function to lessen the effect of noise and outliers when solving for the hyperplane. Distance calculation is also considered when determining the fuzzy value, because it is the basis for measuring proximity between data elements; the fuzzy value is generally built from the distance between a point and the center of mass of its actual class. The fuzzy support vector machine (FSVM) uses the Mahalanobis distance with the goal of finding the best hyperplane separating the defined classes. The data are tested under several partition percentages for splitting into training and testing sets. Although FSVM is theoretically able to overcome noise and outliers, the results show that the accuracy of FSVM, namely 0.017170689 and 0.018668421, is lower than the accuracy of the classical SVM method, which is 0.018838348. The fuzzy membership function strongly influences which hyperplane is chosen as best. Determining the correct fuzzy membership is therefore critical in FSVM problems.
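The idea of weighting each training point by its distance to its class center can be sketched as follows. The two membership formulas used here, linear `1 - d / (d_max + delta)` and exponential `2 / (1 + exp(beta * d))`, are forms commonly found in the FSVM literature and are an assumption, since the abstract does not state the paper's formulas. The Mahalanobis distance is implemented for two-dimensional points only, and all names, parameters, and data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def mean_vector(points: Sequence[Point]) -> Tuple[float, float]:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def covariance_2d(points: Sequence[Point], mu: Point) -> Tuple[float, float, float]:
    """Sample covariance entries (s_xx, s_xy, s_yy) of 2-D points."""
    n = len(points)
    sxx = sum((p[0] - mu[0]) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mu[1]) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(x: Point, mu: Point, cov: Tuple[float, float, float]) -> float:
    """sqrt((x - mu)^T S^-1 (x - mu)) using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return math.sqrt((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)


def linear_membership(distances: Sequence[float], delta: float = 1e-6) -> List[float]:
    d_max = max(distances)
    return [1.0 - d / (d_max + delta) for d in distances]


def exponential_membership(distances: Sequence[float], beta: float = 0.5) -> List[float]:
    return [2.0 / (1.0 + math.exp(beta * d)) for d in distances]


# Toy class (hypothetical): four clustered points and one outlier at index 4.
class_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
mu = mean_vector(class_points)
cov = covariance_2d(class_points, mu)
distances = [mahalanobis_2d(p, mu, cov) for p in class_points]
linear = linear_membership(distances)
exponential = exponential_membership(distances)
```

In both schemes the outlier receives the smallest membership, so it would contribute least to the SVM penalty term. How much it is down-weighted depends on `delta` and `beta`, which is one reason the choice of membership function matters as much as the abstract concludes.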
Comparative study of unsupervised anomaly detection methods on imbalanced time series data Hanifa, Riza Aulia; Thobirin, Aris; Surono, Sugiyarto
Jurnal Ilmiah Kursor Vol. 13 No. 2 (2025)
Publisher : Universitas Trunojoyo Madura

DOI: 10.21107/kursor.v13i2.431

Abstract

Anomaly detection in time series data is essential, especially when dealing with imbalanced datasets such as air quality records. This study addresses the challenge of identifying point anomalies, that is, rare and extreme pollution levels, within a highly imbalanced dataset. Failing to detect such anomalies may lead to delayed environmental interventions and poor public health responses. To address this, we present a comparative analysis of three unsupervised learning methods: K-means clustering, Isolation Forest (IForest), and Autoencoder (AE), including its LSTM variant. These algorithms are applied to monthly air quality data collected in 2023 from 2,110 cities across Asia. The models are evaluated using Area Under the Curve (AUC), Precision, Recall, and F1-score to assess their effectiveness in detecting anomalies. Results indicate that the Autoencoder and Autoencoder LSTM outperform the others with an AUC of 98.23%, followed by K-means (97.78%) and IForest (96.01%). The Autoencoder's reconstruction capability makes it highly effective for capturing complex temporal patterns. K-means and IForest also show strong results, offering efficient and interpretable solutions for structured data. This research highlights the potential of unsupervised anomaly detection techniques for environmental monitoring and provides practical insights into handling imbalanced time series data.
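The evaluation metrics named in this abstract can be computed directly from anomaly scores. The sketch below assumes each method outputs one score per record, with higher meaning more anomalous, and that ground-truth labels are available for evaluation only. AUC is computed with the pairwise (Mann-Whitney) formulation; the function names, threshold, and toy scores are hypothetical.

```python
from typing import Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random anomaly scores higher than a random normal point."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def precision_recall_f1(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float, float]:
    """Flag records whose score reaches the threshold, then score the flags."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (hypothetical): one anomaly among five records, one noisy normal record.
labels = [0, 0, 0, 0, 1]
scores = [0.1, 0.2, 0.3, 0.9, 0.8]
auc = roc_auc(labels, scores)
precision, recall, f1 = precision_recall_f1(labels, scores, threshold=0.5)
```

AUC does not depend on a threshold, which makes it convenient for comparing methods on imbalanced data, while precision, recall, and F1 depend on where the cut-off is placed.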
SUSTAINABLE MATERIALS NATURAL COLORS THE SUNGGING WAYANG PROCESS AS A REBRANDING OF THE GENDENG BANGUNJIWO WAYANG ARTISAN COMMUNITY IN BANTUL, YOGYAKARTA Susanto, Moh. Rusnoto; Mariah, Siti; Lukitaningsih, Ambar; Surono, Sugiyarto; Kinanti, Marlita Diyah Wening; Idam, Gabriela; Azis, Septiyan Ibnu; Prasetyo, Haryanto Nur; Damayanti, Fanita; Maulida, Tasya
International Journal of Engagement and Empowerment (IJE2) Vol. 5 No. 3 (2025): International Journal of Engagement and Empowerment
Publisher : Yayasan Education and Social Center

DOI: 10.53067/ije2.v5i3.239

Abstract

This community service project aims to develop natural dyes (sustainable materials) to support the wayang puppet painting process as a rebranding strategy for the artisan community in Gendeng Hamlet, Bangunjiwo, Bantul, Yogyakarta. The implementation method involves a series of activities carried out through a Participatory Action Research (PAR) approach, involving artisans, academics, and facilitators. The stages include training, mentoring, exploration of local materials (such as Indigofera leaves, tingi bark, and turmeric), and applied trials on wayang puppets. The results showed an increase in the artisans' capacity in natural dye extraction and application techniques, a more aesthetically pleasing and environmentally friendly visual quality of the wayang, and the formation of a new community identity under the branding “Wayang Warna Alam” (Natural Color Wayang). This innovation not only increases the added value of wayang products but also strengthens cultural and ecological sustainability and opens up market opportunities in the creative economy.
Impact of Different Kernels on Breast Cancer Severity Prediction Using Support Vector Machine Mahmudah, Kunti; Surono, Sugiyarto; Rusmining, Rusmining; Indriani, Fatma
Journal of Electronics, Electromedical Engineering, and Medical Informatics Vol 8 No 1 (2026): January
Publisher : Department of Electromedical Engineering, POLTEKKES KEMENKES SURABAYA

DOI: 10.35882/jeeemi.v8i1.960

Abstract

Breast cancer poses a critical global health challenge and continues to be one of the most prevalent causes of cancer-related deaths among women worldwide. Accurate and early classification of cancer severity is essential for improving treatment outcomes and guiding clinical decision-making, since timely intervention can significantly reduce mortality rates and enhance patient survival. This study evaluates the performance of Support Vector Machine (SVM) models using four kernel functions (Linear, Polynomial, Radial Basis Function (RBF), and Sigmoid) for breast cancer severity prediction. The impact of feature selection was also examined, using the Random Forest algorithm to select the top features based on Mean Decrease Accuracy (MDA), which serves to reduce redundancy, improve interpretability, and enhance model efficiency. Experimental results show that the RBF kernel consistently outperformed the other kernels, especially in terms of sensitivity, a critical metric in medical diagnostics that reflects the model's ability to identify positive cases correctly. Without feature selection, the RBF kernel achieved an accuracy of 0.9744, a sensitivity of 0.9772, a precision of 0.9722, and an AUC of 0.9968, indicating strong performance across all evaluation metrics. After feature selection, the RBF kernel's accuracy rose to 0.9754, its precision to 0.9742, and its AUC to 0.9975, with sensitivity essentially unchanged at 0.9770, indicating enhanced generalization and reduced overfitting and highlighting the benefits of targeted feature reduction. While the Polynomial kernel yielded the highest precision (up to 0.9799), its lower sensitivity (as low as 0.9237) indicates a greater risk of false negatives, which is particularly concerning in cancer detection. These findings underscore the importance of optimizing both the kernel function and feature selection. The RBF kernel, when combined with targeted feature selection, offers the most balanced and sensitive model, making it highly suitable for breast cancer classification tasks where diagnostic accuracy is vital.
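The four kernels compared in this abstract have standard textbook definitions, sketched below as plain functions on feature vectors. The hyperparameter defaults (`gamma`, `coef0`, `degree`) and the example vectors are arbitrary assumptions, since the abstract does not report the settings used.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def linear_kernel(a: Vector, b: Vector) -> float:
    return dot(a, b)


def polynomial_kernel(
    a: Vector, b: Vector, degree: int = 3, gamma: float = 1.0, coef0: float = 1.0
) -> float:
    return (gamma * dot(a, b) + coef0) ** degree


def rbf_kernel(a: Vector, b: Vector, gamma: float = 0.5) -> float:
    # Similarity decays with squared Euclidean distance and equals 1 for identical points.
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(a: Vector, b: Vector, gamma: float = 0.1, coef0: float = 0.0) -> float:
    return math.tanh(gamma * dot(a, b) + coef0)


# Example vectors (hypothetical).
a = (1.0, 2.0)
b = (3.0, 4.0)
values = {
    "linear": linear_kernel(a, b),
    "polynomial": polynomial_kernel(a, b),
    "rbf": rbf_kernel(a, b),
    "sigmoid": sigmoid_kernel(a, b),
}
```

The RBF kernel depends only on the distance between the two points, so it can model local, non-linear decision boundaries, whereas the linear and polynomial kernels depend on the dot product and grow with the magnitude of the features.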
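Algoritma Support Vector Regression dan Analisis Long Short-Term Memory sebagai Penanganan Missing Data [Support Vector Regression Algorithm and Long Short-Term Memory Analysis for Handling Missing Data] Parinzka, Zellya; Surono, Sugiyarto; Thobirin, Aris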
Jurnal Teknologi Informasi dan Ilmu Komputer Vol 13 No 1: Februari 2026
Publisher : Fakultas Ilmu Komputer, Universitas Brawijaya

DOI: 10.25126/jtiik.2026131

Abstract

Multivariate time series is a type of data often used in fields such as finance, statistics, and health because it can explain complex relationships between variables. However, problems such as missing data often arise and can pose significant challenges in the analysis process, reducing data quality and the accuracy of predictive models. This research aims to address the missing data problem in multivariate time series by using Support Vector Regression (SVR) to impute missing data and Long Short-Term Memory (LSTM) for predictive analysis. SVR is applied to predict missing values based on the relationships between the variables, while LSTM is used to model temporal patterns in the imputed data. Performance evaluation shows that this method can significantly improve data quality and prediction accuracy. With evaluation metrics of RMSE 0.16, MSE 0.03, and MAE 0.13, this integrative method not only offers an effective solution for handling missing data but also helps strengthen the application of machine learning in multivariate time series analysis. Furthermore, this research demonstrates the practical relevance of combining SVR-based imputation with LSTM-based predictive analysis, which can enhance data integrity and produce accurate predictive models, thus potentially supporting data-driven decision-making in broader and more realistic fields, particularly in the analysis of health indicators such as Life Expectancy.
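The three error metrics reported in this abstract are related by definition: RMSE is the square root of MSE. The reported values are consistent with that, since 0.16 squared is 0.0256, which rounds to 0.03. A minimal sketch of the three metrics follows; the function names and toy values are hypothetical.

```python
import math
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values (hypothetical): two exact predictions and one that is off by 2.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
errors = {
    "mse": mse(y_true, y_pred),
    "rmse": rmse(y_true, y_pred),
    "mae": mae(y_true, y_pred),
}
```

MAE is never larger than RMSE, and the gap between them widens when a few predictions are far off, so reporting both gives a rough picture of how the errors are distributed.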