Articles

Manifold Learning and Undersampling Approaches for Imbalanced Class Sentiment Classification Jumansyah, L. M. Risman Dwi; Soleh, Agus Mohamad; Syafitri, Utami Dyah
Knowledge Engineering and Data Science Vol 7, No 2 (2024)
Publisher : Universitas Negeri Malang

DOI: 10.17977/um018v7i22024p139-151

Abstract

Movie reviews influence audience decisions and thereby help determine a film's success. Automating sentiment classification is essential for efficient public opinion analysis, but it faces challenges such as high-dimensional data and imbalanced class distributions. This study addresses these issues by applying manifold learning techniques, Principal Component Analysis (PCA) and Laplacian Eigenmaps (LE), to reduce data complexity, and undersampling strategies, Random Undersampling (RUS) and EasyEnsemble, to balance the data and improve predictions for both sentiment classes. On reviews of The Raid 2: Berandal, EasyEnsemble achieved the highest average G-Mean of 0.694 using Term Frequency-Inverse Document Frequency (TF-IDF) features with a linear kernel and no dimensionality reduction. RUS provided balanced but inconsistent results, while Random Oversampling (ROS) combined with PCA (retaining 85% cumulative variance) improved predictions for negative reviews. Laplacian Eigenmaps were effective for negative reviews with 500 dimensions but less accurate for positive ones. This study highlights EasyEnsemble's superior performance in addressing class imbalance, though optimization with manifold learning remains challenging.
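As a rough illustration of two ideas named in this abstract, the sketch below shows plain random undersampling and the G-Mean metric (the geometric mean of sensitivity and specificity). It is not the authors' pipeline: the function names, the toy labels, and the fixed seed are assumptions made for the example, and both classes are assumed to be present in the data.

```python
import math
import random


def random_undersample(labels, seed=0):
    """Return sorted indices of a balanced subset, dropping majority items at random."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))  # keep n_min items per class
    return sorted(kept)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumes both classes occur)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy labels: 8 positive reviews, 2 negative reviews.
labels = [1] * 8 + [0] * 2
kept = random_undersample(labels)
balanced = [labels[i] for i in kept]

# A classifier that always predicts the majority class scores a G-Mean of 0,
# which is why the metric is informative on imbalanced data.
always_positive = g_mean([1, 1, 1, 0], [1, 1, 1, 1])
mixed = g_mean([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
```

EasyEnsemble, as usually described, repeats this kind of undersampling several times and combines the classifiers trained on each balanced subset; that step is omitted here.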
Loan Approval Classification Using Ensemble Learning on Imbalanced Data Anadra, Rahmi; Sadik, Kusman; Soleh, Agus M; Astari, Reka Agustia
Enthusiastic : International Journal of Applied Statistics and Data Science Volume 4 Issue 2, October 2024
Publisher : Universitas Islam Indonesia

DOI: 10.20885/enthusiastic.vol4.iss2.art1

Abstract

Loan processing is an important aspect of the financial industry, where the right decisions must be made to approve or reject loans. However, default by loan applicants has become a significant concern for financial institutions. To address this, the study applies ensemble learning with the random forest and Extreme Gradient Boosting (XGBoost) algorithms, and handles imbalanced data using the Synthetic Minority Over-sampling Technique (SMOTE). This research aimed to improve accuracy and precision in credit risk assessment and to reduce human workload. Both algorithms used a dataset of 4,296 records with 13 variables relevant to loan approval decisions. The research process involved data exploration, data preprocessing, data splitting, model training, model evaluation with accuracy, sensitivity, specificity, and F1-score, model selection with 10-fold cross-validation, and identification of important variables. The results showed that XGBoost with imbalanced-data handling had the highest accuracy, 98.52%, with a good balance between sensitivity (98.83%), specificity (98.01%), and F1-score (98.81%). The most important variables in determining loan approval are credit score, loan term, loan amount, and annual income.
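The core step of SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random point on the segment between a minority observation and one of its nearest minority neighbours. The version below is a minimal sketch for numeric features only; the function name, the toy points, and the choice of k are assumptions for illustration and do not reflect the study's actual settings.

```python
import math
import random


def smote_sample(minority, k=2, seed=0):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # k nearest minority neighbours of the chosen base point (Euclidean distance).
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation weight in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Hypothetical minority-class observations with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (5.0, 5.0)]
synthetic = smote_sample(minority)
```

Because the new point is a convex combination of two existing minority points, it always lies inside the bounding box of the minority class.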
Characteristics of Machine Learning-based Univariate Time Series Imputation Method Ramadhani, Dini; Soleh, Agus Mohamad; Erfiani, Erfiani
JUITA: Jurnal Informatika JUITA Vol. 12 No. 2, November 2024
Publisher : Department of Informatics Engineering, Universitas Muhammadiyah Purwokerto

DOI: 10.30595/juita.v12i2.23453

Abstract

Handling missing values in univariate time series analysis poses a challenge, potentially leading to inaccurate conclusions, especially when consecutive missing values occur frequently. Machine Learning-based Univariate Time Series Imputation (MLBUI) methods, utilizing Random Forest Regression (RFR) and Support Vector Regression (SVR), aim to address this challenge. Considering factors such as time series patterns, missing data patterns, and volume, this study explores the performance of MLBUI on simulated Autoregressive Integrated Moving Average (ARIMA) datasets. Various missing data scenarios (6%, 10%, and 14%) and model scenarios (Autoregressive (AR) models: AR(1) and AR(2); Moving Average (MA) models: MA(1) and MA(2); Autoregressive Moving Average (ARMA) models: ARMA(1,1) and ARMA(2,2); and ARIMA models: ARIMA(1,1,1) and ARIMA(1,2,1)) with different standard deviations (0.5, 1, and 2) were examined. Five comparative methods were also used: Kalman StructTS, Kalman Auto-ARIMA, Spline Interpolation, Stine Interpolation, and Moving Average. The findings indicate that MLBUI performs well in imputing consecutive missing values, yielding a Mean Absolute Percentage Error (MAPE) below 10% across all scenarios.
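The evaluation idea, scoring an imputation by MAPE over the positions that were masked, can be sketched as follows. Linear interpolation is used here purely as a stand-in imputer; it is not MLBUI and not one of the five comparison methods. The series, the gap positions, and the function names are hypothetical.

```python
def linear_interpolate(series):
    """Fill runs of None by straight-line interpolation.

    Assumes the first and last values of the series are observed.
    """
    filled = list(series)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            start = i - 1
            end = i
            while filled[end] is None:
                end += 1
            step = (filled[end] - filled[start]) / (end - start)
            for j in range(start + 1, end):
                filled[j] = filled[start] + step * (j - start)
            i = end
        else:
            i += 1
    return filled


def mape(actual, imputed, missing_idx):
    """Mean absolute percentage error over the imputed positions only."""
    errors = [abs((actual[i] - imputed[i]) / actual[i]) for i in missing_idx]
    return 100 * sum(errors) / len(errors)


# Hypothetical series with three consecutive values masked out.
truth = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 15.0, 13.0]
gap = [3, 4, 5]
observed = [None if i in gap else v for i, v in enumerate(truth)]
imputed = linear_interpolate(observed)
score = mape(truth, imputed, gap)
```

Note that MAPE is undefined when a true value is zero, so this sketch assumes a strictly non-zero series.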
BINOMIAL REGRESSION IN SMALL AREA ESTIMATION METHOD FOR ESTIMATE PROPORTION OF CULTURAL INDICATOR Yudistira Yudistira; Anang Kurnia; Agus Mohamad Soleh
Indonesian Journal of Statistics and Applications Vol 2 No 2 (2018)
Publisher : Departemen Statistika, IPB University dengan Forum Perguruan Tinggi Statistika (FORSTAT)

DOI: 10.29244/ijsa.v2i2.63

Abstract

In survey sampling, a sufficient sample size is needed to obtain accurate direct estimates of a parameter, but this is often difficult to achieve in practice. Small Area Estimation (SAE) is an alternative method for estimating parameters when the sample size is inadequate. The method has been widely applied with a variety of models and in many fields of research. Our research focuses on how the SAE method with a binomial regression model can be applied to estimate the proportion of a cultural indicator, specifically the proportion of people who appreciate heritage sites and museums at the regency/city level in West Java Province. The analysis reused the data and variables of previous research so that the results could be compared. The results showed that the binomial regression model can be used to estimate the proportion of a cultural indicator at the regency/city level in Indonesia, with better results than the direct estimation method.
PENENTUAN NILAI AMBANG BATAS SEBARAN PARETO TERAMPAT DENGAN MEASURE OF SURPRISE [Determining the Threshold Value of the Generalized Pareto Distribution with Measure of Surprise] Yumna Karimah; Aji Hamim Wigena; Agus Mohamad Soleh
Indonesian Journal of Statistics and Applications Vol 3 No 2 (2019)
Publisher : Departemen Statistika, IPB University dengan Forum Perguruan Tinggi Statistika (FORSTAT)

DOI: 10.29244/ijsa.v3i2.284

Abstract

Extreme rainfall can result in natural disasters such as floods and landslides, which cause damage and losses to the surrounding environment. Estimating extreme rainfall can help prevent such damage. Estimates of extreme rainfall are based on the Generalized Pareto Distribution (GPD), which requires a threshold value. The threshold can be determined by two methods, namely the Mean Residual Life Plot (MRLP) and the Measure of Surprise (MOS). The purpose of this study is to determine and compare the threshold values obtained from MRLP and MOS. The data used are 10-day and monthly rainfall data. The results indicate that the MOS procedure is shorter and easier than that of MRLP. Based on the cross-validation results, the log-likelihood value of MOS is larger than that of MRLP, so MOS is the better method.
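The quantity behind a mean residual life plot is the mean excess function, the average amount by which observations exceed a candidate threshold u. For data that follow a GPD above some threshold, this function is linear in u beyond that point, which is what the plot is inspected for. The sketch below computes it for a few candidate thresholds; the rainfall values and the function name are hypothetical, and the MOS procedure is not shown.

```python
def mean_excess(data, threshold):
    """Average of (x - threshold) over the observations that exceed the threshold."""
    excesses = [x - threshold for x in data if x > threshold]
    if not excesses:
        raise ValueError("no observations exceed the threshold")
    return sum(excesses) / len(excesses)


# Hypothetical rainfall totals in millimetres.
rainfall = [5, 12, 20, 35, 50, 80, 120, 150]

# Points of the mean residual life curve: (threshold, mean excess).
curve = [(u, mean_excess(rainfall, u)) for u in (10, 30, 60)]
```

In practice the curve is drawn over a fine grid of thresholds, and the analyst picks the lowest threshold above which it looks approximately straight.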
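Klasifikasi Halaman SEO Berbasis Machine Learning Melalui Mutual Information dan Random Forest Feature Importance [Machine Learning-Based SEO Page Classification Using Mutual Information and Random Forest Feature Importance] NURADILLA, SITI; SADIK, KUSMAN; SUHAENI, CICI; SOLEH, AGUS M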
MIND (Multimedia Artificial Intelligent Networking Database) Journal Vol 10, No 1 (2025): MIND Journal
Publisher : Institut Teknologi Nasional Bandung

DOI: 10.26760/mindjournal.v10i1.114-129

Abstract

The SEO optimization process involves many interrelated factors, making it challenging for SEO teams to identify which pages need further improvement. This study proposes a machine learning-based model that is accurate in classifying web pages and efficient in selecting the most relevant features. Feature selection is performed using Mutual Information (MI) and Random Forest Feature Importance (RFFI) to identify key factors for SEO optimization, followed by modeling with Random Forest and Weighted Voting Ensemble (WVE). The model is evaluated using Accuracy, Precision, Recall, and ROC AUC. Results indicate that the Random Forest model with 20 features selected via RFFI delivers the best performance, achieving a ROC AUC of 75.87%, Accuracy of 77.74%, Precision of 60.51%, and Recall of 71.29%. The model effectively distinguishes between pages that require SEO optimization and those that do not.
Keywords: Feature Importance, Random Forest, SEO, Variable Selection, WVE
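Mutual-information feature selection scores each feature by how much knowing its value reduces uncertainty about the class label, then keeps the top-scoring features. The sketch below computes the score for discrete features and ranks two of them; the feature names and values are invented for illustration and are not taken from the study's data.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


# Hypothetical page-level data: 1 = page needs SEO optimization.
needs_seo = [0, 0, 1, 1]
features = {
    "missing_meta_description": [0, 0, 1, 1],  # perfectly informative here
    "random_flag": [0, 1, 0, 1],               # independent of the label
}

scores = {name: mutual_information(col, needs_seo) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Continuous features would need to be binned or handled with a density-based estimator before a score like this can be computed.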
Multilevel Semiparametric Modeling with Overdispersion and Excess Zeros on School Dropout Rates in Indonesia Tarida, Arna Ristiyanti; Djuraidah, Anik; Soleh, Agus Mohamad
JTAM (Jurnal Teori dan Aplikasi Matematika) Vol 9, No 3 (2025): July
Publisher : Universitas Muhammadiyah Mataram

DOI: 10.31764/jtam.v9i3.30102

Abstract

This study aims to identify key factors influencing high school dropout rates in Indonesia by applying advanced statistical modeling that accounts for complex data characteristics. Dropout data often display overdispersion (variability greater than expected) and excess zeros (many students not dropping out), which, if ignored, can bias conclusions. To address this, we compare three parametric models, the Zero-Inflated Poisson Mixed Model (ZIPMM), Zero-Inflated Generalized Poisson Mixed Model (ZIGPMM), and Zero-Inflated Negative Binomial Mixed Model (ZINBMM), with their semiparametric counterparts (SZIPMM, SZIGPMM, SZINBMM). The semiparametric models use B-spline functions to flexibly capture nonlinear relationships between predictors and dropout rates. Model performance was evaluated using the Akaike Information Criterion (AIC) and Root Mean Square Error (RMSE) across 100 simulation repetitions to ensure robustness. Results show that the semiparametric ZIGPMM (SZIGPMM) outperformed the other models, achieving the lowest average AIC (18969.62), suggesting the best trade-off between model fit and complexity. The optimal spline configuration used knot point 2 and order 3, with a Generalized Cross-Validation (GCV) score of 9.4107. Key predictors of dropout include school status (public or private), student-teacher ratio, distance from home to school, parental education level, parental employment status, and number of siblings. These findings provide actionable insights for education policymakers, emphasizing the need to address structural and socioeconomic barriers to reduce dropout rates effectively.
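To make the excess-zeros idea concrete, the sketch below compares a plain Poisson model with a zero-inflated Poisson (ZIP) model by AIC, where AIC = 2k - 2 ln L for a model with k parameters and maximized likelihood L. It is far simpler than the mixed and semiparametric models in the study: there are no random effects, covariates, or splines, the counts are invented, and the ZIP fit uses a coarse grid search, so its AIC is only approximate.

```python
import math


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson pmf: extra mass pi at zero, Poisson(lam) otherwise."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi + (1 - pi) * poisson if k == 0 else (1 - pi) * poisson


def log_likelihood(counts, lam, pi=0.0):
    """Log-likelihood of the counts; pi=0 reduces to the plain Poisson model."""
    return sum(math.log(zip_pmf(k, lam, pi)) for k in counts)


def aic(loglik, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * loglik


# Hypothetical dropout counts per school, with many zeros.
counts = [0] * 14 + [1, 2, 2, 3, 3, 4]

# Plain Poisson: the maximum-likelihood rate is the sample mean.
lam_hat = sum(counts) / len(counts)
aic_poisson = aic(log_likelihood(counts, lam_hat), 1)

# ZIP: approximate the maximum likelihood over a grid of (lam, pi) values.
best_loglik, best_lam, best_pi = max(
    (log_likelihood(counts, l / 20, p / 100), l / 20, p / 100)
    for l in range(1, 101)
    for p in range(0, 100)
)
aic_zip = aic(best_loglik, 2)
```

On data like these, the ZIP model's extra parameter is more than paid for by the improvement in fit, so its AIC comes out lower than the plain Poisson model's.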
Performance Evaluation of ARDL Model Stacked with Boosted Ridge Regression on Time Series Data with Multicollinearity: Evaluasi Kinerja Estimasi Model ARDL stacked with Boosted Ridge Regression pada Data Deret Waktu dengan Multikolinearitas Dalimunthe, Amir Abduljabbar; Soleh, Agus Mohamad; Afendi, Farit Mochamad
Indonesian Journal of Statistics and Applications Vol 9 No 1 (2025)
Publisher : Statistics and Data Science Program Study, IPB University, in collaboration with the Forum Pendidikan Tinggi Statistika Indonesia (FORSTAT) and the Ikatan Statistisi Indonesia (ISI)

DOI: 10.29244/ijsa.v9i1p136-144

Abstract

Time series data play a vital role in financial and economic studies. Two commonly applied models for such data are Vector Autoregression (VAR) and Autoregressive Distributed Lags (ARDL). Nonetheless, interdependence among explanatory variables often leads to multicollinearity, posing challenges for model reliability. This study investigates the effectiveness of the ARDL model integrated with boosted ridge regression as a method to mitigate multicollinearity. Because available empirical data are limited, simulated data are generated to support the analysis. The research consists of two stages: synthetic data generation and analysis of the simulated data. Results suggest that ARDL performs well under various multicollinearity conditions, particularly when the training set is sufficiently large and the model structure is correctly specified. For smaller training sets, the ARDL Ridge variant demonstrates improved predictive performance.
Rotation Double Random Forest Algorithm to Predict The Food Insecurity Status of Households Rais; Agus Mohamad Soleh; Budi Susetyo
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol 8 No 1 (2024): February 2024
Publisher : Ikatan Ahli Informatika Indonesia (IAII)

DOI: 10.29207/resti.v8i1.5540

Abstract

The ensemble tree method has been proven to handle classification problems well. The strength of the ensemble tree technique lies in the diversity and independence of its trees. Increasing the diversity of mutually independent decision trees improves the performance of the model. Various studies have proposed ensemble tree-based models whose algorithms build decision trees independently of each other and from varied inputs. These include random forest (RF), rotation forest (RoF), double random forest (DRF), and, most recently, rotation double random forest (RoDRF). RoDRF rotates or transforms the data with the intent of producing greater diversity among the base learners. It applies the concept of variable rotation to trees based on the DRF algorithm. Random rotations or transformations on different feature subspaces produce different projections, leading to better generalization or prediction performance. This research aims to compare the performance of RoDRF with the RF, RoF, and DRF models on unbalanced data in cases of food insecurity. Class imbalance is handled with two methods, namely EasyEnsemble and SMOTE-NC. The results show that the DRF model with EasyEnsemble produces the best performance among the algorithms tested. Although the resulting precision is 0.62274 and the AUC value is 0.68501, the model can predict each class equally. All algorithms with EasyEnsemble treatment have average AUC values that differ significantly from each other based on statistical test results. This research also used SHAP to explain the variables that contribute significantly to the household food insecurity status model.
LR-GLASSO Method for Solving Multiple Explanatory Variables of the Village Development Index Yunus, M.; Soleh, Agus M; Saefuddin, Asep; Erfiani
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol 8 No 2 (2024): April 2024
Publisher : Ikatan Ahli Informatika Indonesia (IAII)

DOI: 10.29207/resti.v8i2.5656

Abstract

The Sustainable Development Goals (SDGs) are development targets that maintain sustainable improvement in society's economic, social, and environmental welfare. Kemendes PDTT RI has issued the Village Development Index (VDI) to provide information on the status of village progress and to support village development in service of the national SDGs. Modeling with many explanatory variables tends to produce high correlation among them (multicollinearity), coefficient estimates with large variance, and overfitted predictions. LASSO and GLASSO are used to address this. Because the response is a binary category, binary logistic regression (LR) is used, giving LR-LASSO and LR-GLASSO. North Maluku Province has a VDI ranking that tended to fall in 2018-2022. On the basis of the mean and variance of the coefficient estimates and the misclassification errors, LR-GLASSO is better than LR-LASSO and LR. LR-GLASSO is recommended for analyzing VDI data because the data have many explanatory variables and the correlation between them is relatively high. To raise VDI status in Indonesia, especially in North Maluku Province, the Indonesian government is recommended to increase the number of electricity users, food and beverage stores, and other cooperatives. The government also needs to pay attention to villages relatively far from the regent's office, food and beverage stalls, and supporting health centers, because they still need to be developed compared to other villages, and more than 50% of the villages are underdeveloped. Formulating the Village SDGs around raising VDI status will support the achievement of the SDGs nationally.