Articles

Comparison of clustering analysis of K-means, K-medoids, and fuzzy C-means methods: case study of school accreditation in west java Hasnataeni, Yunia; Nurhambali, M Rizky; Ardhani, Rizky; Hafsah, Siti; Soleh, Agus M
Journal of Soft Computing Exploration Vol. 6 No. 2 (2025): June 2025
Publisher : SHM Publisher

DOI: 10.52465/joscex.v6i2.575

Abstract

This research aims to analyze school accreditation data in West Java using clustering methods: K-Means, K-Medoids, and Fuzzy C-Means, to identify patterns and groups of schools based on similar characteristics. K-Means, known for its simplicity, suggests an optimal two-cluster solution based on silhouette values but employs three clusters for detailed analysis. K-Medoids, noted for its robustness against outliers, achieves the best clustering with the lowest Davies-Bouldin Index (DBI) of 0.8 and the highest Silhouette Information (SI) value of 0.46. Fuzzy C-Means, which assigns membership degrees to each data point across clusters, performs reasonably well with a DBI of 0.87 and an SI value of 0.40, while K-Means shows the highest DBI of 0.9 and the lowest SI value of 0.39. The findings highlight K-Medoids as the superior method for clustering. Regions with lower educational quality, such as the Bekasi and Cianjur regions, require priority interventions, whereas areas with better quality, such as the Bandung and Bekasi regions, can serve as models. Data-driven approaches, inter-regional collaboration, and continuous monitoring and evaluation are recommended to optimize educational policies and enhance overall educational quality in West Java.
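As a rough illustration of the evaluation described above, the sketch below clusters synthetic data with K-Means and computes silhouette and Davies-Bouldin scores using scikit-learn. The accreditation data are not included, and K-Medoids and Fuzzy C-Means (which need extra packages such as scikit-learn-extra and scikit-fuzzy) are omitted.

```python
# Minimal sketch of the cluster-evaluation approach, on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Higher silhouette and lower Davies-Bouldin indicate better-separated clusters.
    print(k,
          round(silhouette_score(X, labels), 3),
          round(davies_bouldin_score(X, labels), 3))
```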
FUNCTION GROUP SELECTION OF SEMBUNG LEAVES (BLUMEA BALSAMIFERA) SIGNIFICANT TO ANTIOXIDANTS USING OVERLAPPING GROUP LASSO kusnaeni, kusnaeni; Soleh, Agus M; Afendi, Farit M; Sartono, Bagus
BAREKENG: Jurnal Ilmu Matematika dan Terapan Vol 16 No 2 (2022): BAREKENG: Jurnal Ilmu Matematika dan Terapan
Publisher : PATTIMURA UNIVERSITY

DOI: 10.30598/barekengvol16iss2pp721-728

Abstract

Functional groups of sembung leaf metabolites can be detected using FTIR spectrometry by examining the shape of the spectrum at specific peaks that indicate the functional group of a compound. This study involved 35 observations and 1866 explanatory variables (wavelengths). Because the number of explanatory variables exceeds the number of observations, the data are high-dimensional. One method that can be used to analyze high-dimensional data is penalized regression. The overlapping group lasso method is a development of group-based penalized regression that can handle the selection of variable groups whose members overlap. Variable-group selection using the overlapping group lasso found that the functional groups significant for the antioxidant activity of sembung leaves were C=C Unstructured, CN amide, Polyphenol, and SiO2.
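The overlapping group lasso itself is not available in mainstream Python libraries, so the sketch below only illustrates penalized variable selection in a comparable p >> n setting (35 observations, 1866 wavelengths) using an ordinary lasso on simulated spectra; the overlapping grouping structure used in the paper is not reproduced.

```python
# Illustrative sketch of penalized selection when p >> n (ordinary lasso as a
# simplified stand-in for the overlapping group lasso; placeholder data).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 35, 1866                      # observations, wavelengths (as in the paper)
X = rng.normal(size=(n, p))          # simulated spectra
beta = np.zeros(p)
beta[100:110] = 2.0                  # one "functional-group" band carries signal
y = X @ beta + rng.normal(size=n)

lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"{selected.size} wavelengths selected, e.g. {selected[:10]}")
```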
TEXT CLUSTERING ONLINE LEARNING OPINION DURING COVID-19 PANDEMIC IN INDONESIA USING TWEETS Tyas, Maulida Fajrining; Kurnia, Anang; Soleh, Agus Mohamad
BAREKENG: Jurnal Ilmu Matematika dan Terapan Vol 16 No 3 (2022): BAREKENG: Journal of Mathematics and Its Applications
Publisher : PATTIMURA UNIVERSITY

DOI: 10.30598/barekengvol16iss3pp939-948

Abstract

To prevent the spread of the coronavirus, restrictions on social activities were implemented, including school activities, which drew both support and opposition from the community. Opinions about online learning are widely expressed, particularly on Twitter. The tweets obtained can be used to extract information through text clustering, grouping topics about online learning during the pandemic in Indonesia. K-Means is widely used and performs well in text clustering. However, the high dimensionality of textual data can make computation difficult, so a sampling method is proposed. This paper examines whether sampling tweets before clustering yields more efficient results than clustering the whole dataset. After pre-processing, five sample sizes (250, 500, 2,500, 10,000, and 20,000) were drawn from 28,300 tweets for K-Means clustering. Results showed that, over 10 iterations, the three main cluster topics appeared 90%-100% of the time for sample sizes of 2,500, 10,000, and 20,000, whereas sample sizes of 250 and 500 tended to produce them only 20%-60% of the time. This means that using roughly 8% to 35% of the tweets can yield representative clusters with computation about four times faster than using the entire dataset.
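A minimal sketch of the sample-then-cluster idea, assuming a TF-IDF representation and scikit-learn's K-Means; the tweet texts and sample size below are placeholders, not the study's corpus.

```python
# Cluster a random sample of tweets instead of the full corpus (placeholder data).
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tweets = ["belajar online lancar", "kuota internet habis",
          "tugas daring menumpuk", "sinyal susah di desa"] * 500  # placeholder corpus

sample = random.Random(0).sample(tweets, 500)        # work on a sample only
vec = TfidfVectorizer(max_features=1000)
X = vec.fit_transform(sample)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Top TF-IDF terms per cluster give a quick look at the recovered topics.
terms = vec.get_feature_names_out()
for c in range(3):
    top = km.cluster_centers_[c].argsort()[::-1][:3]
    print(c, [terms[i] for i in top])
```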
Performance Comparison of Random Forest and XGBoost Optimized with Cuckoo Search Algorithm for Coconut Milk Adulteration Detection Using FTIR Spectroscopy I Gusti Ngurah, Sentana Putra; Kusman Sadik; Agus Mohamad Soleh; Cici Suhaeni
Journal of Mathematics, Computations and Statistics Vol. 8 No. 2 (2025): Volume 08 Nomor 02 (Oktober 2025)
Publisher : Jurusan Matematika FMIPA UNM

DOI: 10.35580/jmathcos.v8i2.7817

Abstract

Coconut milk has emerged as a strategic food commodity in the global tropical region, with market demand growing at 7.2% per annum since 2021. This increasing demand has led to sophisticated adulteration practices, including dilution with water. Such adulteration not only reduces nutritional value but also poses serious health risks, including food poisoning and allergic reactions. This study developed a detection method combining Fourier Transform Infrared (FTIR) spectroscopy with machine learning: it builds an FTIR-based coconut milk adulteration detection model by optimizing Random Forest (RF) and XGBoost parameters with the Cuckoo Search Algorithm (CSA) and compares the two models in identifying different types of adulterants. We analyzed 719 coconut milk samples (wavelength range 2500-4000 nm) consisting of traditional market products and instant commercial products. The spectral data underwent preprocessing with a combination of Standard Normal Variate (SNV) and Savitzky-Golay (SG) techniques to reduce the effects of noise and light scattering, which significantly improved feature extraction. The results show that CSA-optimized XGBoost achieves superior performance, with 92% accuracy and a 91% F1-score, outperforming Random Forest on all evaluation metrics. The model is particularly strong in precision (98%), indicating an outstanding ability to minimize false positives in adulteration detection; CSA optimization achieved 19.5% higher precision. Stability tests across 30 experimental repetitions show that the XGBoost+CSA combination maintains consistent performance with minimal variance, confirming its reliability for industrial applications. Comparative analysis shows that SNV+SG preprocessing improves baseline model accuracy by 9-12%, while CSA optimization provides an additional performance improvement of 10-15%. This research contributes to food science and safety by providing a rapid, non-destructive adulteration detection solution and demonstrating the effectiveness of CSA in optimizing spectroscopic models. These findings have important implications for strengthening food safety regulations and developing real-time quality control systems in the coconut milk industry.
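The sketch below outlines the preprocessing and classification steps on simulated spectra: SNV implemented directly, Savitzky-Golay smoothing from SciPy, and an XGBoost classifier with hand-picked settings. The Cuckoo Search tuning stage is omitted, and the data, labels, and parameter values are placeholders, not the study's.

```python
# SNV + Savitzky-Golay preprocessing followed by XGBoost, on placeholder spectra.
import numpy as np
from scipy.signal import savgol_filter
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(719, 300))                  # placeholder spectra
y = rng.integers(0, 2, size=719)                 # placeholder pure / adulterated labels

X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)  # SNV
X = savgol_filter(X, window_length=11, polyorder=2, axis=1)             # SG smoothing

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      eval_metric="logloss").fit(X_tr, y_tr)
print("F1:", round(f1_score(y_te, model.predict(X_te)), 3))
```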
Effect of Feature Normalization and Distance Metrics on K-Nearest Neighbors Performance for Diabetes Disease Classification Yusran, Muhammad; Sadik, Kusman; Soleh, Agus M; Suhaeni, Cici
Journal of Mathematics, Computations and Statistics Vol. 8 No. 2 (2025): Volume 08 Nomor 02 (Oktober 2025)
Publisher : Jurusan Matematika FMIPA UNM

DOI: 10.35580/jmathcos.v8i2.8012

Abstract

Diabetes is a global health issue with a steadily increasing prevalence each year. Early detection of the disease is an important step in preventing severe complications. The K-Nearest Neighbors (KNN) algorithm is often used in disease classification, but its performance is strongly influenced by the choice of normalization method and distance metric. This study evaluates the effect of various normalization methods and distance metrics on the performance of the KNN algorithm in diabetes classification. Three normalization methods were employed: z-score normalization, min-max scaling, and median absolute deviation (MAD). In addition, seven distance metrics were assessed: Euclidean, Manhattan, Chebyshev, Canberra, Hassanat, Lorentzian, and Clark. The dataset used is the Pima Indians Diabetes dataset, which consists of 768 observations and 8 features. The data were split into 80% training and 20% test sets, with 5-fold cross-validation used to determine the optimal k value. The results show that the MAD-Canberra combination produces the highest overall accuracy, recall, and F1-score of 87.32%, 82.33%, and 81.94%, respectively. The highest precision was obtained from the Baseline-Hassanat combination at 86.96%, while the lowest performance was observed for the Z-Score-Chebyshev combination, with an F1-score of 58.02%. These results highlight that no single combination universally outperforms the others, underscoring the need for empirical evaluation. Nonetheless, combining MAD normalization with metrics such as Canberra or Hassanat can serve as a strong starting point for developing KNN-based classification systems, especially in medical contexts that are sensitive to misclassification.
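A hedged sketch of one combination from the study, MAD normalization with the Canberra distance in scikit-learn's KNN, using simulated stand-in data of the same shape as the Pima Indians Diabetes dataset (768 × 8); the k range and split are illustrative choices.

```python
# MAD normalization + KNN with the Canberra metric, k chosen by 5-fold CV.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))                    # placeholder features
y = rng.integers(0, 2, size=768)                 # placeholder diabetes labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# MAD scaling: center on the median, scale by the median absolute deviation.
med = np.median(X_tr, axis=0)
mad = np.median(np.abs(X_tr - med), axis=0)
X_tr_n, X_te_n = (X_tr - med) / mad, (X_te - med) / mad

best_k = max(range(3, 16, 2),
             key=lambda k: cross_val_score(
                 KNeighborsClassifier(n_neighbors=k, metric="canberra"),
                 X_tr_n, y_tr, cv=5).mean())
knn = KNeighborsClassifier(n_neighbors=best_k, metric="canberra").fit(X_tr_n, y_tr)
print("best k:", best_k, "test accuracy:", round(knn.score(X_te_n, y_te), 3))
```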
Robust Continuum Regression Study of LASSO Selection and WLAD LASSO on High-Dimensional Data Containing Outliers Daulay, Nurmai Syaroh; Erfiani, Erfiani; Soleh, Agus M
JTAM (Jurnal Teori dan Aplikasi Matematika) Vol 8, No 3 (2024): July
Publisher : Universitas Muhammadiyah Mataram

DOI: 10.31764/jtam.v8i3.23123

Abstract

In research, we often encounter problems of multicollinearity and outliers, which can make coefficient estimates unstable and reduce model performance. Robust Continuum Regression (RCR) addresses multicollinearity by compressing the data into a much smaller number of mutually independent latent variables and then applying robust regression, so that the complexity of the regression model is reduced without losing essential information and parameter estimates become more stable. However, RCR becomes computationally burdensome when the data are very high-dimensional (p >> n), so the dimension must first be reduced through variable selection. The Least Absolute Shrinkage and Selection Operator (LASSO) can do this but is sensitive to outliers, which can lead to errors in selecting the relevant variables. A selection method that is robust to outliers is therefore needed, such as Weighted Least Absolute Deviations with a LASSO penalty (WLAD LASSO), which selects variables by considering the absolute deviations of the residuals. This study aims to overcome multicollinearity and model instability in high-dimensional data while remaining resistant to outliers, combining outlier-resistant RCR with the variable selection capabilities of LASSO and WLAD LASSO to provide a more reliable and efficient solution for complex data analysis. To measure the performance of RCR-LASSO and RCR-WLAD LASSO, simulations were carried out on low-dimensional and high-dimensional data under two scenarios, without outliers (δ = 0%) and with outliers (δ = 10%, 20%, 30%), at correlation levels ρ = 0.1, 0.5, 0.9. The analysis was performed in RStudio (version 4.1.3), using the "MASS" package to generate multivariate normal data, the "glmnet" package for LASSO variable selection, and the "MTE" package for WLAD LASSO variable selection. The simulation results show that RCR-LASSO tends to be superior to RCR-WLAD LASSO in terms of model goodness of fit, although its performance tends to decline as outliers and correlation increase. RCR-LASSO is looser in selecting relevant variables, resulting in a simpler model, but the variables chosen by LASSO are only marginally significant. RCR-WLAD LASSO is stricter in variable selection and only selects significant variables, but it ignores several variables that have a small yet significant impact on the model.
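The paper reports using R ("MASS", "glmnet", "MTE"); the sketch below reproduces only the broad simulation idea in Python: correlated multivariate-normal predictors, a response contaminated with a fraction δ of outliers, and LASSO selection. The WLAD-LASSO and robust continuum regression steps are omitted because they are not in standard Python libraries, and all sizes and values are illustrative.

```python
# Simulate correlated p >> n data with outliers, then run LASSO selection.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, rho, delta = 50, 200, 0.5, 0.10            # high-dimensional case, 10% outliers

Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)-style correlation
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.zeros(p)
beta[:5] = 3.0                                   # five truly relevant predictors
y = X @ beta + rng.normal(size=n)

out = rng.choice(n, size=int(delta * n), replace=False)
y[out] += rng.normal(loc=20, scale=5, size=out.size)   # inject vertical outliers

selected = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
print("selected predictors:", selected)
```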
Comparison of LASSO, Ridge, and Elastic Net Regularization with Balanced Bagging Classifier Nisrina Az-Zahra, Putri; Sadik, Kusman; Suhaeni, Cici; Mohamad Soleh, Agus
Parameter: Jurnal Matematika, Statistika dan Terapannya Vol 4 No 2 (2025): Parameter: Jurnal Matematika, Statistika dan Terapannya
Publisher : Jurusan Matematika FMIPA Universitas Pattimura

DOI: 10.30598/parameterv4i2pp287-296

Abstract

Predicting Drug-Induced Autoimmunity (DIA) is crucial in pharmaceutical safety assessment, as early identification of compounds with autoimmune risk can prevent adverse drug reactions and improve patient outcomes. Classification analysis often faces challenges when the number of predictor variables exceeds the number of observations or when high correlations among predictors lead to multicollinearity and overfitting. Regularization methods, such as Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), and Elastic Net, help stabilize parameter estimation and improve model interpretability. This study focuses on building a binary classification model to predict the risk of DIA using 196 molecular descriptors derived from chemical compound structures. To address class imbalance in the response variable, the Balanced Bagging Classifier (BBC) is combined with regularized logistic regression models. Elastic Net + BBC outperforms other models with the highest accuracy (0.825), followed closely by LASSO + BBC and Ridge + BBC (both 0.816). This integration not only improves classification accuracy but also enhances generalization and the reliable detection of minority class instances, supporting the early identification of autoimmune risks in drug discovery.
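A rough sketch of the Elastic Net + BBC combination, assuming imbalanced-learn's BalancedBaggingClassifier wrapped around an elastic-net-penalized logistic regression; the 196 descriptors and labels below are simulated placeholders, and hyperparameter values are illustrative.

```python
# Elastic-net logistic regression inside a Balanced Bagging ensemble.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from imblearn.ensemble import BalancedBaggingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 196))                      # placeholder molecular descriptors
y = (rng.random(500) < 0.2).astype(int)              # imbalanced DIA labels (placeholder)

base = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)
# Note: older imbalanced-learn versions name this parameter base_estimator.
clf = BalancedBaggingClassifier(estimator=base, n_estimators=10, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf.fit(X_tr, y_tr)
print("accuracy:", round(accuracy_score(y_te, clf.predict(X_te)), 3))
```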
EVALUATING RANDOM FOREST AND XGBOOST FOR BANK CUSTOMER CHURN PREDICTION ON IMBALANCED DATA USING SMOTE AND SMOTE-ENN Andespa, Reyuli; Sadik, Kusman; Suhaeni, Cici; Soleh, Agus M
MEDIA STATISTIKA Vol 18, No 1 (2025): Media Statistika
Publisher : Department of Statistics, Faculty of Science and Mathematics, Universitas Diponegoro

DOI: 10.14710/medstat.18.1.25-36

Abstract

The banking industry faces significant challenges in retaining customers, as churn can critically affect both revenue and reputation. This study introduces a robust churn prediction framework by comparing the performance of XGBoost and Random Forest algorithms under imbalanced data conditions. The novelty of this research lies in integrating the SMOTE and SMOTE-ENN techniques with machine learning algorithms to enhance model performance and reliability on highly imbalanced datasets. Unlike conventional approaches that rely solely on oversampling or undersampling, this study demonstrates that the hybrid combination of XGBoost and SMOTE provides superior predictive accuracy, stability, and efficiency. Hyperparameter optimization using GridSearchCV was conducted to identify the most effective parameter configurations for both algorithms. Model performance was evaluated using the F1-Score and Area Under the Curve (AUC). The results indicate that XGBoost with SMOTE achieved the best performance, with an F1-Score of 0.8730 and an AUC of 0.9828, showing an optimal balance between precision and recall. Feature importance analysis identified Months_Inactive_12_mon, Total_Trans_Amt, and Total_Relationship_Count as the most influential predictors. Overall, this approach outperforms traditional resampling and modeling techniques, providing practical insights for data-driven customer retention strategies in the banking industry.
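The sketch below wires SMOTE and XGBoost into an imbalanced-learn pipeline tuned with GridSearchCV on the F1 score, mirroring the best-performing setup in spirit; the churn features and grid values are placeholders, not the study's.

```python
# SMOTE + XGBoost in an imbalanced-learn pipeline, tuned with GridSearchCV.
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 15))                     # placeholder customer features
y = (rng.random(2000) < 0.15).astype(int)           # ~15% churners (imbalanced)

pipe = Pipeline([("smote", SMOTE(random_state=0)),   # resampling happens only on training folds
                 ("xgb", XGBClassifier(eval_metric="logloss"))])
grid = {"xgb__n_estimators": [100, 200],
        "xgb__max_depth": [3, 5],
        "xgb__learning_rate": [0.05, 0.1]}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
search = GridSearchCV(pipe, grid, scoring="f1", cv=5).fit(X_tr, y_tr)
print("best params:", search.best_params_,
      "test F1:", round(search.score(X_te, y_te), 3))
```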
Deep Learning Image Classification Rontgen Dada pada Kasus Covid-19 Menggunakan Algoritma Convolutional Neural Network Susanti, Leni Anggraini; Soleh, Agus Mohamad; Sartono, Bagus
Jurnal Teknologi Informasi dan Ilmu Komputer Vol 10 No 5: Oktober 2023
Publisher : Fakultas Ilmu Komputer, Universitas Brawijaya

DOI: 10.25126/jtiik.2023107142

Abstract

This research proposes using a Convolutional Neural Network (CNN) with VGGNet-19 and ResNet-50 architectures for COVID-19 diagnosis through chest X-ray image analysis. Modifications were made by comparing dropout regularization values of 50% and 80% for both architectures and changing the classification layer to 4 classes. The models' performance was then compared across dataset sizes. The dataset comprised 21,165 images, with 10% set aside for testing and the remaining 90% divided into training data (80%) and validation data (20%). Model performance was evaluated using repeated 5-fold cross-validation. The training process employed a learning rate of 0.0001, stochastic gradient descent (SGD) optimization, and ten iterations. The results indicate that adding dropout layers with a 50% probability to both architectures effectively addressed overfitting and improved model performance, and that larger dataset sizes achieved better performance with a significant improvement in the model. The classification results show the ResNet-50 architecture achieved an average accuracy of 94.4%, average recall of 94.1%, average precision of 95.5%, average specificity of 97%, and average F1-score of 94.8%, while the VGGNet-19 architecture achieved an average accuracy of 91%, average recall of 89%, average precision of 95.0%, average specificity of 96.8%, and average F1-score of 92.7%. Utilizing these models can assist in identifying the causes of patient mortality and offer valuable information for medical and epidemiological decision-making.
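A minimal Keras sketch of the modified ResNet-50 head described above (50% dropout, 4-class softmax, SGD at learning rate 0.0001); data loading, cross-validation, and the VGGNet-19 variant are omitted, and the frozen-backbone transfer-learning setup is an assumption for illustration.

```python
# ResNet-50 backbone with 50% dropout and a 4-class softmax head.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                       input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                      # use the backbone as a feature extractor

model = models.Sequential([
    base,
    layers.Dropout(0.5),                    # the 50% dropout found to curb overfitting
    layers.Dense(4, activation="softmax"),  # 4 chest X-ray classes
])
model.compile(optimizer=optimizers.SGD(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```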
Classification Modeling with RNN-based, Random Forest, and XGBoost for Imbalanced Data: A Case of Early Crash Detection in ASEAN-5 Stock Markets Siswara, Deri; M. Soleh, Agus; Hamim Wigena, Aji
Scientific Journal of Informatics Vol. 11 No. 3: August 2024
Publisher : Universitas Negeri Semarang

DOI: 10.15294/sji.v11i3.4067

Abstract

Purpose: This research aims to evaluate the performance of several Recurrent Neural Network (RNN) architectures, including Simple RNN, Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM), compared with classic algorithms such as Random Forest and XGBoost, in building classification models for early crash detection in the ASEAN-5 stock markets. Methods: The study examines imbalanced data, which is expected given the rarity of market crashes. It analyzes daily data from 2010 to 2023 across the major stock markets of the ASEAN-5 countries: Indonesia, Malaysia, Singapore, Thailand, and the Philippines. A market crash is the target variable when the primary stock price indices fall below the Value at Risk (VaR) thresholds of 5%, 2.5%, and 1%. Predictors include technical indicators from major local and global markets and commodity markets. The study incorporates 213 predictors with their respective lags (5, 10, 15, 22, 50, 200) and uses a time step of 7, expanding the total number of predictors to 1,491. The challenge of data imbalance is addressed with SMOTE-ENN. Model performance is evaluated using the false alarm rate, hit rate, balanced accuracy, and the precision-recall curve (PRC) score. Result: The results indicate that all RNN-based architectures outperform Random Forest and XGBoost. Among the RNN architectures, the Simple RNN performs best, primarily because of the data's relatively simple characteristics and its focus on short-term information. Novelty: This study extends the range of phenomena observed in previous studies by incorporating different geographical zones and periods as well as methodological adjustments.
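A minimal Keras sketch of the Simple RNN classifier with the input shape implied above (time step 7 over 213 predictors) and a precision-recall AUC metric; the SMOTE-ENN resampling and VaR-based labeling steps are omitted, and the data and layer sizes are placeholders.

```python
# Simple RNN over sequences of 7 time steps x 213 predictors, binary crash output.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

n_samples, time_steps, n_features = 1000, 7, 213
X = np.random.default_rng(0).normal(size=(n_samples, time_steps, n_features)).astype("float32")
y = (np.random.default_rng(1).random(n_samples) < 0.05).astype("float32")  # rare crashes

model = models.Sequential([
    layers.Input(shape=(time_steps, n_features)),
    layers.SimpleRNN(32),                      # swap for layers.GRU / layers.LSTM to compare
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(curve="PR", name="prc")])
model.fit(X, y, epochs=2, batch_size=64, verbose=0)
print(model.evaluate(X, y, verbose=0))
```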