Articles

Performance Evaluation of ARDL Model Stacked with Boosted Ridge Regression on Time Series Data with Multicollinearity Dalimunthe, Amir Abduljabbar; Soleh, Agus Mohamad; Afendi, Farit Mochamad
Indonesian Journal of Statistics and Applications Vol 9 No 1 (2025)
Publisher : Departemen Statistika, IPB University with Forum Perguruan Tinggi Statistika (FORSTAT)

DOI: 10.29244/ijsa.v9i1p136-144

Abstract

Time series data play a vital role in financial and economic studies. Two commonly applied models for such data are Vector Autoregression (VAR) and Autoregressive Distributed Lags (ARDL). However, interdependence among explanatory variables often leads to multicollinearity, which undermines model reliability. This study investigates the effectiveness of the ARDL model integrated with boosted ridge regression as a method to mitigate multicollinearity. Because suitable empirical data are limited, simulated data are generated to support the analysis. The research consists of two stages: synthetic data generation and analysis of the simulated data. Results suggest that ARDL performs well under various multicollinearity conditions, particularly when the training set is sufficiently large and the model structure is correctly specified. For smaller training sets, the ARDL Ridge variant demonstrates improved predictive performance.
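To illustrate why a ridge penalty helps under multicollinearity, the following is a minimal sketch, not the authors' ARDL or boosted ridge procedure. It fits plain ridge regression with two nearly collinear predictors and no intercept by solving (X'X + lam*I) beta = X'y directly. The data values and the function name `fit_ridge_2d` are hypothetical and chosen only for illustration.

```python
def fit_ridge_2d(x1, x2, y, lam):
    """Solve (X'X + lam*I) beta = X'y for two predictors, no intercept."""
    a = sum(v * v for v in x1) + lam          # X'X[0][0] + lam
    b = sum(u * v for u, v in zip(x1, x2))    # X'X[0][1] = X'X[1][0]
    d = sum(v * v for v in x2) + lam          # X'X[1][1] + lam
    r1 = sum(u * v for u, v in zip(x1, y))    # X'y[0]
    r2 = sum(u * v for u, v in zip(x2, y))    # X'y[1]
    det = a * d - b * b
    # Closed-form inverse of the symmetric 2x2 matrix [[a, b], [b, d]].
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)


# Hypothetical data: x2 is almost a copy of x1, so X'X is close to singular.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ols = fit_ridge_2d(x1, x2, y, lam=0.0)    # ordinary least squares
ridge = fit_ridge_2d(x1, x2, y, lam=1.0)  # ridge with an illustrative penalty

print("OLS coefficients:  ", ols)
print("Ridge coefficients:", ridge)
```

With lam > 0 the system is better conditioned, so the two coefficients are pulled toward each other and the overall coefficient norm shrinks. The penalty value 1.0 is arbitrary here; in practice it would be tuned, for example by cross-validation.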
Rotation Double Random Forest Algorithm to Predict The Food Insecurity Status of Households Rais; Agus Mohamad Soleh; Budi Susetyo
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol 8 No 1 (2024): February 2024
Publisher : Ikatan Ahli Informatika Indonesia (IAII)

DOI: 10.29207/resti.v8i1.5540

Abstract

Ensemble tree methods have been shown to handle classification problems well. Their strength lies in the diversity and independence of the individual trees: increasing the diversity of mutually independent decision trees improves model performance. Various studies have therefore proposed ensemble tree-based algorithms that build decision trees independently of each other and on varied inputs. These include random forest (RF), rotation forest (RoF), double random forest (DRF), and, most recently, rotation double random forest (RoDRF). RoDRF rotates or transforms the data to produce greater diversity among the base learners, applying the concept of variable rotation to trees built with the DRF algorithm. Random rotations or transformations on different feature subspaces produce different projections, leading to better generalization or prediction performance. This research compares the performance of RoDRF with the RF, RoF, and DRF models on imbalanced data in a food insecurity case study. Class imbalance is handled with two methods, EasyEnsemble and SMOTE-NC. The results show that the DRF model with EasyEnsemble performs best among the algorithms tested. Although the resulting precision is 0.62274 and the AUC is 0.68501, the model predicts each class equally well. Based on statistical tests, the average AUC values of all algorithms with EasyEnsemble treatment differ significantly from each other. This research also used SHAP to explain the variables that contribute significantly to the household food insecurity status model.
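The resampling idea behind EasyEnsemble can be sketched as follows. This shows only the sampling step (each subset keeps every minority case and draws an equally sized random sample of majority cases); training a learner on each subset and combining them is omitted. The function name, the coding of the minority class as 1, and the class sizes are assumptions for illustration, not details from the paper.

```python
import random


def balanced_subsets(labels, n_subsets, seed=0):
    """Return index lists, each with all minority cases plus an equal-sized
    random draw (without replacement) of majority cases."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]  # assumed minority
    majority = [i for i, y in enumerate(labels) if y == 0]
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))
        subsets.append(sorted(minority + sampled))
    return subsets


# Hypothetical imbalanced labels: 90 majority cases, 10 minority cases.
labels = [0] * 90 + [1] * 10
subsets = balanced_subsets(labels, n_subsets=5)

for k, idx in enumerate(subsets):
    n_minority = sum(labels[i] for i in idx)
    print(f"subset {k}: size={len(idx)}, minority={n_minority}")
```

Because every subset is balanced, a classifier trained on it is not dominated by the majority class, while using several subsets lets the ensemble see most of the majority data.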
LR-GLASSO Method for Solving Multiple Explanatory Variables of the Village Development Index Yunus, M.; Soleh, Agus M; Saefuddin, Asep; Erfiani
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol 8 No 2 (2024): April 2024
Publisher : Ikatan Ahli Informatika Indonesia (IAII)

DOI: 10.29207/resti.v8i2.5656

Abstract

The Sustainable Development Goals (SDGs) describe development that sustains improvement in society's economic, social, and environmental welfare. Kemendes PDTT RI has issued the Village Development Index (VDI) to provide information on the status of village progress and to support village development toward the national SDGs. Modeling with many explanatory variables tends to produce high correlation among them (multicollinearity), coefficient estimates with large variance, and overfitted predictions. LASSO and GLASSO are used to address these problems. Because the response is binary, binary logistic regression (LR) is used, giving LR-LASSO and LR-GLASSO. North Maluku Province has a VDI ranking that tended to fall over 2018-2022. On the basis of the mean and variance of the coefficient estimates and the misclassification error, LR-GLASSO is better than LR-LASSO and LR. LR-GLASSO is recommended for analyzing VDI data because the data have many explanatory variables with relatively high correlation between them. To raise VDI status in Indonesia, especially in North Maluku Province, the Indonesian government is recommended to increase the number of electricity users, food and beverage stores, and other cooperatives. The government also needs to pay attention to villages that are relatively far from the regent's office, food and beverage stalls, and supporting health centers, because these villages still lag behind others, and more than 50% of the villages are underdeveloped. Formulating the Village SDGs around raising VDI status would support achievement of the SDGs nationally.
Comparison of clustering analysis of K-means, K-medoids, and fuzzy C-means methods: case study of school accreditation in west java Hasnataeni, Yunia; Nurhambali, M Rizky; Ardhani, Rizky; Hafsah, Siti; Soleh, Agus M
Journal of Soft Computing Exploration Vol. 6 No. 2 (2025): June 2025
Publisher : SHM Publisher

DOI: 10.52465/joscex.v6i2.575

Abstract

This research analyzes school accreditation data in West Java using three clustering methods, K-Means, K-Medoids, and Fuzzy C-Means, to identify patterns and groups of schools with similar characteristics. For K-Means, known for its simplicity, silhouette values suggest that two clusters are optimal, but three clusters are used for a more detailed analysis. K-Medoids, noted for its robustness against outliers, achieves the best clustering, with the lowest Davies-Bouldin Index (DBI) of 0.8 and the highest Silhouette Information (SI) value of 0.46. Fuzzy C-Means, which assigns each data point a degree of membership in every cluster, performs reasonably well with a DBI of 0.87 and an SI value of 0.40, while K-Means shows the highest DBI of 0.9 and the lowest SI value of 0.39. The findings highlight K-Medoids as the superior clustering method. Regions with lower educational quality, such as Bekasi and Cianjur regions, require priority interventions, whereas areas with better quality, like Bandung and Bekasi regions, can serve as models. Data-driven approaches, inter-regional collaboration, and continuous monitoring and evaluation are recommended to optimize educational policies and enhance overall educational quality in West Java.
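As a reminder of how a silhouette value is computed, here is a minimal sketch for one-dimensional points, using the standard definition s = (b - a) / max(a, b), where a is the mean distance to points in the same cluster and b is the smallest mean distance to any other cluster. The toy data and the function name are hypothetical; the study's actual data and software are not shown in the abstract.

```python
def silhouette_score_1d(points, labels):
    """Mean silhouette coefficient for 1-D points with given cluster labels."""
    clusters = set(labels)
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        if not same:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(same) / len(same)
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in clusters if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [1.0, 2.0, 10.0, 11.0]
good = silhouette_score_1d(points, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_score_1d(points, [0, 1, 0, 1])   # clusters interleaved

print("well-separated labels:", round(good, 3))
print("interleaved labels:   ", round(bad, 3))
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.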
FUNCTION GROUP SELECTION OF SEMBUNG LEAVES (BLUMEA BALSAMIFERA) SIGNIFICANT TO ANTIOXIDANTS USING OVERLAPPING GROUP LASSO kusnaeni, kusnaeni; Soleh, Agus M; Afendi, Farit M; Sartono, Bagus
BAREKENG: Jurnal Ilmu Matematika dan Terapan Vol 16 No 2 (2022): BAREKENG: Jurnal Ilmu Matematika dan Terapan
Publisher : PATTIMURA UNIVERSITY

DOI: 10.30598/barekengvol16iss2pp721-728

Abstract

Functional groups of sembung leaf metabolites can be detected using FTIR spectrometry by examining the shape of the spectrum at specific peaks that indicate the type of functional group in a compound. This study had 35 observations and 1866 explanatory variables (wavelengths). Because the number of explanatory variables exceeds the number of observations, the data are high-dimensional. One method for analyzing high-dimensional data is penalized regression. The overlapping group lasso is an extension of group-based penalized regression that can select both variable groups and the members of overlapping groups of variables. Variable-group selection with the overlapping group lasso found that the functional groups significant for the antioxidant activity of sembung leaves were C=C Unstructured, CN amide, Polyphenol, and SiO2.
TEXT CLUSTERING ONLINE LEARNING OPINION DURING COVID-19 PANDEMIC IN INDONESIA USING TWEETS Tyas, Maulida Fajrining; Kurnia, Anang; Soleh, Agus Mohamad
BAREKENG: Jurnal Ilmu Matematika dan Terapan Vol 16 No 3 (2022): BAREKENG: Journal of Mathematics and Its Applications
Publisher : PATTIMURA UNIVERSITY

DOI: 10.30598/barekengvol16iss3pp939-948

Abstract

To prevent the spread of the coronavirus, restrictions on social activities, including school activities, were implemented, drawing both support and opposition in the community. Opinions about online learning are widely expressed, mainly on Twitter. The tweets obtained can be used to extract information through text clustering, grouping topics about online learning during the pandemic in Indonesia. K-Means is often used and performs well in text clustering. However, the high dimensionality of textual data can make computation difficult, so a sampling method is proposed. This paper examines whether clustering a sample of tweets is more efficient than clustering the whole dataset. After pre-processing, five sample sizes (250, 500, 2500, 10000, and 20000) were drawn from 28300 tweets for K-Means clustering. Results showed that, over 10 iterations, the three main cluster topics appeared 90%-100% of the time with sample sizes of 2500, 10000, and 20000. In contrast, sample sizes of 250 and 500 tended to produce the three main cluster topics only 20%-60% of the time. This means that using around 8% to 35% of the tweets can yield representative clusters with efficient computation, four times faster than using the entire dataset.
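The "around 8% to 35%" figure appears to correspond to the sample sizes 2500 and 10000 relative to the 28300 tweets. A short check of the arithmetic, using only the numbers quoted in the abstract (the variable names are illustrative):

```python
total_tweets = 28300
sample_sizes = [250, 500, 2500, 10000, 20000]

# Percentage of the full dataset represented by each sample size.
fractions = {n: round(100 * n / total_tweets, 1) for n in sample_sizes}

for n, pct in fractions.items():
    print(f"sample size {n:>5}: {pct}% of all tweets")
```

On this reading, 2500 tweets are roughly 8.8% of the data and 10000 tweets roughly 35.3%, which matches the range quoted in the abstract.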
Performance Comparison of Random Forest and XGBoost Optimized with Cuckoo Search Algorithm for Coconut Milk Adulteration Detection Using FTIR Spectroscopy I Gusti Ngurah, Sentana Putra; Kusman Sadik; Agus Mohamad Soleh; Cici Suhaeni
Journal of Mathematics, Computations and Statistics Vol. 8 No. 2 (2025): Volume 08 Nomor 02 (Oktober 2025)
Publisher : Jurusan Matematika FMIPA UNM

DOI: 10.35580/jmathcos.v8i2.7817

Abstract

Coconut milk has emerged as a strategic food commodity in the global tropical region, with market demand growing at 7.2% per annum since 2021. This increasing demand has led to sophisticated adulteration practices, including dilution with water. Such adulteration not only reduces the nutritional value but also poses serious health risks, including food poisoning and allergic reactions. This study developed an innovative detection method combining Fourier Transform Infrared (FTIR) spectroscopy with a sophisticated machine learning algorithm. We analyzed 719 coconut milk samples (wavelength range 2500-4000 nm) consisting of traditional market products and instant commercial products. This study aims to develop an FTIR-based coconut milk adulteration detection model by optimizing RF and XGBoost parameters using CSA and evaluating the comparative performance of the two models in identifying different types of adulterants. The spectral data underwent rigorous preprocessing using a combination of Standard Normal Variate (SNV) and Savitzky-Golay (SG) techniques to overcome the effects of noise and light scattering, which significantly improved feature extraction. The results show that CSA-optimized XGBoost achieves superior performance with 92% accuracy and 91% F1 score, outperforming Random Forest in all evaluation metrics. The model shows particular strength in precision (98%), indicating its outstanding ability to minimize false positives in adulteration detection. Stability tests through 30 experimental repetitions reveal that the combination of XGBoost+CSA maintains consistent performance with minimal variance, confirming its reliability for industrial applications. Comparative analysis shows that the combination of SNV+SG preprocessing improves the accuracy of the baseline model by 9-12%, while CSA optimization provides an additional performance improvement of 10-15%. This research makes significant contributions to food science and safety. 
This study demonstrates the effectiveness of CSA in optimizing spectroscopic models, achieving 19.5% higher precision. It also provides a rapid and non-destructive adulteration detection solution. These findings have important implications for strengthening food safety regulations and developing real-time quality control systems in the coconut milk industry.
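The two preprocessing steps named in the abstract can be sketched in a few lines. Standard Normal Variate (SNV) centers and scales each spectrum by its own mean and standard deviation, and Savitzky-Golay (SG) smoothing replaces each point with a local polynomial fit. The version below uses the classic 5-point quadratic SG smoothing weights (-3, 12, 17, 12, -3)/35 and leaves the two points at each edge unchanged. The window length, polynomial order, edge handling, and the toy spectrum are assumptions, since the abstract does not state the settings used.

```python
import statistics


def snv(spectrum):
    """Standard Normal Variate: center and scale one spectrum by its own
    mean and (sample) standard deviation."""
    mean = statistics.mean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(v - mean) / sd for v in spectrum]


def savgol_5pt(values):
    """5-point quadratic Savitzky-Golay smoothing; edge points are kept."""
    coeffs = (-3, 12, 17, 12, -3)
    out = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(coeffs, window)) / 35
    return out


# Hypothetical absorbance values for a single spectrum.
spectrum = [0.20, 0.25, 0.31, 0.30, 0.42, 0.55, 0.53, 0.60]
processed = savgol_5pt(snv(spectrum))
print([round(v, 3) for v in processed])
```

SNV removes per-sample offset and scale differences such as those caused by light scattering, while SG smoothing reduces high-frequency noise without flattening peaks as much as a plain moving average would.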
Effect of Feature Normalization and Distance Metrics on K-Nearest Neighbors Performance for Diabetes Disease Classification Yusran, Muhammad; Sadik, Kusman; Soleh, Agus M; Suhaeni, Cici
Journal of Mathematics, Computations and Statistics Vol. 8 No. 2 (2025): Volume 08 Nomor 02 (Oktober 2025)
Publisher : Jurusan Matematika FMIPA UNM

DOI: 10.35580/jmathcos.v8i2.8012

Abstract

Diabetes is a global health issue with a steadily increasing prevalence each year. Early detection of the disease is an important step in preventing severe complications. The K-Nearest Neighbors (KNN) algorithm is often used in disease classification, but its performance is highly influenced by the choice of normalization method and distance metric. This study evaluates the effect of various normalization methods and distance metrics on the performance of the KNN algorithm in diabetes classification. Three normalization methods were employed: z-score normalization, min-max scaling, and median absolute deviation (MAD). In addition, seven distance metrics were assessed: Euclidean, Manhattan, Chebyshev, Canberra, Hassanat, Lorentzian, and Clark. The dataset used is Pima Indians Diabetes, which consists of 768 observations and 8 features. The data were split into 80% training data and 20% test data, and 5-fold cross-validation was used to determine the optimal k value. The results show that the MAD-Canberra combination produces the highest overall accuracy, recall, and F1-score, at 87.32%, 82.33%, and 81.94%, respectively. The highest precision, 86.96%, was obtained from the Baseline-Hassanat combination, while the lowest performance was observed for the Z-Score-Chebyshev combination, with an F1-score of 58.02%. These results highlight that no single combination universally outperforms the others, underscoring the need for empirical evaluation. Nonetheless, combining MAD normalization with metrics such as Canberra or Hassanat can serve as a strong starting point for developing KNN-based classification systems, especially in medical contexts that are sensitive to misclassification.
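The three normalization methods and the Canberra distance can be sketched for a single feature column as follows. The formulas are the common textbook definitions; the exact variants used in the study (for example, whether the MAD is rescaled by a consistency constant) are not given in the abstract, so treat these as assumptions. The function names and sample values are illustrative.

```python
import statistics


def z_score(col):
    """Center by the mean and scale by the sample standard deviation."""
    mean, sd = statistics.mean(col), statistics.stdev(col)
    return [(v - mean) / sd for v in col]


def min_max(col):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]


def mad_scale(col):
    """Center by the median and scale by the median absolute deviation."""
    med = statistics.median(col)
    mad = statistics.median(abs(v - med) for v in col)
    if mad == 0:
        raise ValueError("MAD is zero; feature cannot be scaled this way")
    return [(v - med) / mad for v in col]


def canberra(u, v):
    """Canberra distance; terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


feature = [1, 2, 3, 4, 100]  # hypothetical column with one extreme value
print("min-max:", [round(v, 3) for v in min_max(feature)])
print("MAD:    ", mad_scale(feature))
print("Canberra:", canberra([1, 2], [3, 2]))
```

In the min-max output the extreme value squeezes the other four values into a narrow range near 0, whereas median-based scaling keeps them evenly spread. This is one plausible reason a robust scaler can help a distance-based classifier such as KNN.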
Robust Continuum Regression Study of LASSO Selection and WLAD LASSO on High-Dimensional Data Containing Outliers Daulay, Nurmai Syaroh; Erfiani, Erfiani; Soleh, Agus M
JTAM (Jurnal Teori dan Aplikasi Matematika) Vol 8, No 3 (2024): July
Publisher : Universitas Muhammadiyah Mataram

DOI: 10.31764/jtam.v8i3.23123

Abstract

In research, we often encounter multicollinearity and outliers, which can make coefficients unstable and reduce model performance. Robust Continuum Regression (RCR) addresses multicollinearity by compressing the independent variables into a much smaller set of mutually independent latent variables and applying robust regression techniques, so that the complexity of the regression model is reduced without losing essential information and parameter estimates become more stable. However, RCR is computationally difficult when the data have very high dimensions (p>>n), so an initial dimension-reduction step based on variable selection is needed. The Least Absolute Shrinkage and Selection Operator (LASSO) can perform this selection but is sensitive to outliers, which can lead to errors in selecting significant variables. A variable selection method that is robust to outliers is therefore needed, such as Weighted Least Absolute Deviations with a LASSO penalty (WLAD LASSO), which selects variables based on the absolute deviations of the residuals. This approach aims to overcome multicollinearity and model instability in high-dimensional data while remaining resistant to outliers. It leverages the outlier-resistant RCR and the variable selection capabilities of LASSO and WLAD LASSO to provide a more reliable and efficient solution for complex data analysis. To measure the performance of RCR-LASSO and RCR-WLAD LASSO, simulations were carried out on low-dimensional and high-dimensional data under two scenarios, without outliers (δ = 0%) and with outliers (δ = 10%, 20%, 30%), at three levels of correlation (ρ = 0.1, 0.5, 0.9).
The analysis uses RStudio version 4.1.3 with the "MASS" package to generate multivariate normal data, the "glmnet" package for LASSO variable selection, and the "MTE" package for WLAD LASSO variable selection. The simulation results show that RCR-LASSO tends to be superior to RCR-WLAD LASSO in terms of model goodness of fit. However, the performance of RCR-LASSO tends to decrease as outliers and correlations increase. RCR-LASSO tends to be looser in selecting relevant variables, resulting in a simpler model, but the variables chosen by LASSO are only marginally significant. RCR-WLAD LASSO is stricter in variable selection and selects only significant variables, but it ignores several variables that have a small but significant impact on the model.
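The difference between the two selection criteria can be illustrated by evaluating their objective functions on data with one outlier. This is a minimal sketch: the LASSO objective is written in the 1/(2n) residual-sum-of-squares form with an L1 penalty, and the WLAD LASSO objective as a weighted sum of absolute residuals with an L1 penalty. The scaling conventions, the observation weights, and the toy data are assumptions for illustration and are not taken from the paper or from the R packages it mentions.

```python
def residuals(X, y, beta):
    """Residuals y_i - x_i' beta for a design matrix given as row lists."""
    return [yi - sum(xij * bj for xij, bj in zip(row, beta))
            for row, yi in zip(X, y)]


def lasso_objective(X, y, beta, lam):
    """Squared-error loss scaled by 1/(2n) plus an L1 penalty."""
    res = residuals(X, y, beta)
    return (sum(r * r for r in res) / (2 * len(y))
            + lam * sum(abs(b) for b in beta))


def wlad_lasso_objective(X, y, beta, lam, weights):
    """Weighted absolute-error loss plus an L1 penalty."""
    res = residuals(X, y, beta)
    return (sum(w * abs(r) for w, r in zip(weights, res))
            + lam * sum(abs(b) for b in beta))


# Hypothetical data following y = x except for an outlier in the last row.
X = [[1.0], [2.0], [3.0]]
y = [1.0, 2.0, 30.0]
beta = [1.0]
weights = [1.0, 1.0, 0.1]  # assumed: small weight on the suspected outlier

print("LASSO objective:     ", lasso_objective(X, y, beta, lam=0.5))
print("WLAD LASSO objective:", wlad_lasso_objective(X, y, beta, 0.5, weights))
```

At the coefficient that fits the clean points exactly, the squared loss is dominated by the single outlier, while the weighted absolute loss barely registers it. This is the intuition for why the squared-error criterion can be pulled toward outliers during variable selection.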
Comparison of LASSO, Ridge, and Elastic Net Regularization with Balanced Bagging Classifier Nisrina Az-Zahra, Putri; Sadik, Kusman; Suhaeni, Cici; Mohamad Soleh, Agus
Parameter: Jurnal Matematika, Statistika dan Terapannya Vol 4 No 2 (2025): Parameter: Jurnal Matematika, Statistika dan Terapannya
Publisher : Jurusan Matematika FMIPA Universitas Pattimura

DOI: 10.30598/parameterv4i2pp287-296

Abstract

Predicting Drug-Induced Autoimmunity (DIA) is crucial in pharmaceutical safety assessment, as early identification of compounds with autoimmune risk can prevent adverse drug reactions and improve patient outcomes. Classification analysis often faces challenges when the number of predictor variables exceeds the number of observations or when high correlations among predictors lead to multicollinearity and overfitting. Regularization methods, such as Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), and Elastic-Net, help stabilize parameter estimation and improve model interpretability. This study focuses on building a binary classification model to predict the risk of DIA using 196 molecular descriptors derived from chemical compound structures. To address class imbalance in the response variable, the Balanced Bagging Classifier (BBC) is combined with regularized logistic regression models. Elastic Net + BBC outperforms other models with the highest accuracy (0.825), followed closely by LASSO + BBC and Ridge + BBC (both 0.816). This integration not only improves classification accuracy but also enhances generalization and the reliable detection of minority class instances, supporting the early identification of autoimmune risks in drug discovery.
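The relationship among the three regularizers can be seen from the elastic net penalty itself, which contains LASSO and ridge as special cases. The sketch below uses the parameterization lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2), where alpha = 1 gives the LASSO penalty and alpha = 0 gives the ridge penalty. The coefficient vector and penalty strength are hypothetical, and the study's own software and settings are not stated in the abstract.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: a mix of the L1 norm and half the squared L2 norm."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1 - alpha) * l2_squared)


beta = [3.0, -4.0]  # hypothetical coefficients

print("LASSO (alpha=1):        ", elastic_net_penalty(beta, lam=2.0, alpha=1.0))
print("Ridge (alpha=0):        ", elastic_net_penalty(beta, lam=2.0, alpha=0.0))
print("Elastic net (alpha=0.5):", elastic_net_penalty(beta, lam=2.0, alpha=0.5))
```

The L1 part can set coefficients exactly to zero, which gives variable selection, while the L2 part shrinks correlated predictors together. Mixing the two is a common choice when there are many correlated descriptors.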