Articles

Perbandingan Algoritma Pohon dengan Beberapa Skenario Pelabelan untuk Analisis Sentimen pada Aplikasi Milik Pemerintah/BUMN Fitrianto, Anwar; Rizki Manaf, Silmi Anisa; Soleh, Agus Mohamad
JEPIN (Jurnal Edukasi dan Penelitian Informatika) Vol 10, No 1 (2024): Volume 10 No 1
Publisher : Program Studi Informatika

DOI: 10.26418/jp.v10i1.73512

Abstract

The growth of the digital era has produced many innovations aimed at making people's activities easier in various fields, one of which is applications that make those activities more efficient and accessible from anywhere. Applications owned by the government and state-owned enterprises (BUMN), as national-scale organizations, tend not to be widely known, and many have low ratings accompanied by a wide range of user reviews. Sentiment analysis is a suitable approach for analyzing reviews of the selected applications. The data used are reviews of the InfoBMKG, BPOM Mobile, MyIndihome, and MyPertamina applications. The study aims to compare the performance of the double random forest algorithm with other tree-based algorithms, namely decision tree, extra trees, and random forest, based on model accuracy. Data were labeled by application rating, a lexicon-based method, and sentiment scoring, with predictor variables generated from unigram tokenization weighted with TF-IDF. Each observation was categorized into the positive, neutral, or negative class. The results show that the extra trees algorithm combined with sentiment scoring labeling produced the best performance, with average accuracy reaching 80-84% for each selected application.
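As a rough illustration of the unigram TF-IDF weighting mentioned in the abstract above, the sketch below computes weights in plain Python. It assumes one common variant (term frequency normalized by document length, times log(N/df)); the paper's exact formula and preprocessing are not given in the abstract, and the reviews and the `tfidf` helper are made up for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists (unigrams). Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            tok: (cnt / total) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return weights

# Hypothetical reviews, tokenized into unigrams by whitespace.
reviews = ["good app", "bad app", "good good update"]
tokens = [r.lower().split() for r in reviews]
weights = tfidf(tokens)
print(weights[2])
```

Tokens that appear in every document get a weight of zero under this variant, which is why library implementations often use a smoothed IDF instead.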
BHF and copula models in small area estimation for household per capita expenditure in Bogor District BELINDA, NADIRA SRI; NOTODIPUTRO, KHAIRIL ANWAR; SOLEH, AGUS MOHAMAD
Jurnal Natural Volume 24 Number 2, June 2024
Publisher : Universitas Syiah Kuala

DOI: 10.24815/jn.v24i2.37278

Abstract

Small area statistics are required when the sample size is too small to produce estimates with adequate precision. The assumptions underlying Battese-Harter-Fuller (BHF) unit-level models may be unrealistic in some applications. Copula models are an alternative approach when these assumptions are violated. This research compares the performance of BHF and copula models in small area estimation (SAE) for estimating household per capita expenditure at the sub-district level. Household per capita expenditure has a skewed distribution, and because the data contain outliers, an appropriate method for handling outliers is also considered. The Gaussian and Clayton copulas are used. The results showed that BHF performed better than the Gaussian and Clayton copulas, as indicated by a smaller root mean square error (RMSE), with an average of 1.14, compared with 2.71 for the Gaussian copula and 2.63 for the Clayton copula. Furthermore, the coefficient of variation (CV) of BHF was also smaller than that of the Gaussian and Clayton copulas, and the resulting estimates can be categorized as reliable based on a CV of less than 25%.
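The reliability rule in the abstract above can be sketched as follows, assuming the usual small area estimation definition of the CV as RMSE divided by the point estimate, expressed in percent. The 25% cutoff comes from the abstract; the function names and the example numbers are hypothetical.

```python
def coefficient_of_variation(estimate, rmse):
    """CV in percent: 100 * RMSE / estimate (assumed definition)."""
    if estimate == 0:
        raise ValueError("CV is undefined for a zero estimate")
    return 100.0 * rmse / abs(estimate)

def is_reliable(estimate, rmse, cutoff=25.0):
    """True when the CV falls below the cutoff (25% in the abstract)."""
    return coefficient_of_variation(estimate, rmse) < cutoff

# Made-up area-level estimate and RMSE, for illustration only.
print(coefficient_of_variation(50.0, 5.0), is_reliable(50.0, 5.0))
```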
Manifold Learning and Undersampling Approaches for Imbalanced Class Sentiment Classification Jumansyah, L. M. Risman Dwi; Soleh, Agus Mohamad; Syafitri, Utami Dyah
Knowledge Engineering and Data Science Vol 7, No 2 (2024)
Publisher : Universitas Negeri Malang

DOI: 10.17977/um018v7i22024p139-151

Abstract

Movie reviews are crucial in determining a film's success by influencing audience decisions. Automating sentiment classification is essential for efficient public opinion analysis. However, it faces challenges such as high-dimensional data and imbalanced class distributions. This study addresses these issues by applying manifold learning techniques, namely Principal Component Analysis (PCA) and Laplacian Eigenmaps (LE), to reduce data complexity, and undersampling strategies, namely Random Undersampling (RUS) and EasyEnsemble, to balance data and improve predictions for both sentiment classes. On reviews of The Raid 2: Berandal, EasyEnsemble achieved the highest average G-Mean of 0.694 using Term Frequency-Inverse Document Frequency (TF-IDF) features with a linear kernel without dimensionality reduction. RUS provided balanced but inconsistent results, while Random Oversampling (ROS) combined with PCA (85% cumulative variance) improved predictions for negative reviews. Laplacian Eigenmaps were effective for negative reviews with 500 dimensions but less accurate for positive ones. This study highlights EasyEnsemble's superior performance in addressing class imbalance, though optimization with manifold learning remains challenging.
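For readers unfamiliar with the G-Mean reported in the abstract above, a minimal sketch is given here, assuming the standard binary definition: the geometric mean of sensitivity and specificity. The labels are made up and the `g_mean` helper is illustrative, not the authors' code.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Made-up imbalanced example: four positive reviews, two negative.
y_true = [1, 1, 1, 1, 0, 0]
always_positive = [1, 1, 1, 1, 1, 1]
print(g_mean(y_true, always_positive))
```

A classifier that always predicts the majority class scores zero here even though its accuracy is high, which is why G-Mean suits imbalanced data.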
Loan Approval Classification Using Ensemble Learning on Imbalanced Data Anadra, Rahmi; Sadik, Kusman; Soleh, Agus M; Astari, Reka Agustia
Enthusiastic : International Journal of Applied Statistics and Data Science Volume 4 Issue 2, October 2024
Publisher : Universitas Islam Indonesia

DOI: 10.20885/enthusiastic.vol4.iss2.art1

Abstract

Loan processing is an important aspect of the financial industry, where the right decisions must be made to approve or reject a loan. However, default by loan applicants has become a significant concern for financial institutions. To address this, ensemble learning is applied using the random forest and Extreme Gradient Boosting (XGBoost) algorithms. Imbalanced data are handled using the Synthetic Minority Over-sampling Technique (SMOTE). This research aimed to improve accuracy and precision in credit risk assessment to reduce human workload. Both algorithms used a dataset of 4,296 observations with 13 variables relevant to loan approval decisions. The research process involved data exploration, data preprocessing, data splitting, model training, model evaluation with accuracy, sensitivity, specificity, and F1-score, model selection with 10-fold cross-validation, and identification of important variables. The results showed that XGBoost with imbalanced data handling had the highest accuracy of 98.52% and a good balance between sensitivity of 98.83%, specificity of 98.01%, and F1-score of 98.81%. The most important variables in determining loan approval are credit score, loan term, loan amount, and annual income.
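The four evaluation measures named in the abstract above can all be derived from a binary confusion matrix. The sketch below uses the standard definitions; the counts are invented for illustration and are not the paper's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)           # recall on the positive class
    specificity = tn / (tn + fp)           # recall on the negative class
    f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of precision and recall
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts: 10 actual positives, 90 actual negatives.
metrics = classification_metrics(tp=8, fp=5, tn=85, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, accuracy alone can look strong while F1 stays modest, which is one reason to report all four.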
Characteristics of Machine Learning-based Univariate Time Series Imputation Method Ramadhani, Dini; Soleh, Agus Mohamad; Erfiani, Erfiani
JUITA: Jurnal Informatika JUITA Vol. 12 No. 2, November 2024
Publisher : Department of Informatics Engineering, Universitas Muhammadiyah Purwokerto

DOI: 10.30595/juita.v12i2.23453

Abstract

Handling missing values in univariate time series analysis poses a challenge, potentially leading to inaccurate conclusions, especially when consecutive missing values occur frequently. Machine Learning-based Univariate Time Series Imputation (MLBUI) methods, utilizing Random Forest Regression (RFR) and Support Vector Regression (SVR), aim to address this challenge. Considering factors such as time series patterns, missing data patterns, and volume, this study explores the performance of MLBUI on simulated Autoregressive Integrated Moving Average (ARIMA) datasets. Various missing data scenarios (6%, 10%, and 14%) and model scenarios (Autoregressive (AR) models: AR(1) and AR(2); Moving Average (MA) models: MA(1) and MA(2); Autoregressive Moving Average (ARMA) models: ARMA(1,1) and ARMA(2,2); and ARIMA models: ARIMA(1,1,1) and ARIMA(1,2,1)) with different standard deviations (0.5, 1, and 2) were examined. Five comparative methods were also used, namely Kalman StructTS, Kalman Auto-ARIMA, Spline Interpolation, Stine Interpolation, and Moving Average. The findings indicate that MLBUI performs exceptionally well in imputing consecutive missing values, with a mean absolute percentage error (MAPE) of less than 10% across all scenarios.
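The MAPE criterion used in the abstract above can be sketched as below, assuming the usual definition: the mean of absolute errors relative to the true values, in percent, computed only over the positions that were imputed. The values and the `mape` helper are made up for the example.

```python
def mape(actual, imputed):
    """Mean absolute percentage error (in percent) between true and imputed values."""
    if len(actual) != len(imputed) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when a true value is zero")
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, imputed)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical true values at three missing positions and their imputations.
true_values = [100.0, 200.0, 400.0]
imputed_values = [110.0, 190.0, 400.0]
print(mape(true_values, imputed_values))
```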
BINOMIAL REGRESSION IN SMALL AREA ESTIMATION METHOD FOR ESTIMATE PROPORTION OF CULTURAL INDICATOR Yudistira Yudistira; Anang Kurnia; Agus Mohamad Soleh
Indonesian Journal of Statistics and Applications Vol 2 No 2 (2018)
Publisher : Departemen Statistika, IPB University dengan Forum Perguruan Tinggi Statistika (FORSTAT)

DOI: 10.29244/ijsa.v2i2.63

Abstract

In sample surveys, a sufficient sample size is needed to obtain an accurate direct estimator of a parameter, but this is often difficult to achieve in practice. Small Area Estimation (SAE) is an alternative method for estimating parameters when the sample size is inadequate. The method has been widely applied with a variety of models and in many fields of research. Our research focuses on how the SAE method with a binomial regression model can be applied to estimate the proportion of a cultural indicator, specifically the proportion of people who appreciate heritage sites and museums at the regency/city level in West Java Province. Our analysis reused the data and variables of previous research so that the results could be compared. The results showed that the binomial regression model can be used to estimate the proportion of a cultural indicator at the regency/city level in Indonesia, with better results than the direct estimation method.
PENENTUAN NILAI AMBANG BATAS SEBARAN PARETO TERAMPAT DENGAN MEASURE OF SURPRISE Yumna Karimah; Aji Hamim Wigena; Agus Mohamad Soleh
Indonesian Journal of Statistics and Applications Vol 3 No 2 (2019)
Publisher : Departemen Statistika, IPB University dengan Forum Perguruan Tinggi Statistika (FORSTAT)

DOI: 10.29244/ijsa.v3i2.284

Abstract

Extreme rainfall can result in natural disasters such as floods and landslides, which cause damage and losses to the surrounding environment. Estimating extreme rainfall can help prevent such damage. Estimates of extreme rainfall are based on the Generalized Pareto Distribution (GPD), which requires a threshold value. The threshold value can be determined by two methods, namely the Mean Residual Life Plot (MRLP) and the Measure of Surprise (MOS). The purpose of this study is to determine and compare the threshold values of MRLP and MOS. The data used are 10-day and monthly rainfall data. The results indicate that the MOS procedure is shorter and easier than that of MRLP. Based on the cross-validation results, the log-likelihood value of MOS is larger than that of MRLP, so MOS is better than MRLP.
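The quantity behind a mean residual life plot is the mean excess over a candidate threshold: the average of (x - u) over observations x that exceed u. The sketch below computes it for a few thresholds; in practice the values are plotted against u and one looks for the region where the curve is roughly linear. The rainfall values and the `mean_excess` helper are hypothetical, and the MOS procedure from the abstract is not reproduced here.

```python
def mean_excess(data, threshold):
    """Average amount by which observations above the threshold exceed it."""
    excesses = [x - threshold for x in data if x > threshold]
    if not excesses:
        raise ValueError("no observations exceed the threshold")
    return sum(excesses) / len(excesses)

# Made-up 10-day rainfall totals in millimetres.
rainfall = [12.0, 30.5, 48.0, 75.2, 101.3, 20.1, 64.4]
for u in (20.0, 40.0, 60.0):
    print(u, round(mean_excess(rainfall, u), 2))
```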
Study of Spatial Autoregressive Regression With Heteroskedasticity Using the Generalized Method of Moments and Bayesian Approach : Kajian Regresi Spasial Autoregresif dengan Heteroskedastik Menggunakan Generalized Method of Moments dan Pendekatan Bayes Abialam Koesnandy H; Agus Mohamad Soleh; Farit Mochamad Afendi
Indonesian Journal of Statistics and Applications Vol 8 No 1 (2024)
Publisher : Departemen Statistika, IPB University dengan Forum Perguruan Tinggi Statistika (FORSTAT)

DOI: 10.29244/ijsa.v8i1p58-69

Abstract

Spatial dependence and spatial heteroskedasticity are problems in spatial regression. Spatial autoregressive regression (SAR) accounts only for dependence on the spatial lag. Estimating SAR parameters in the presence of heteroskedasticity using maximum likelihood estimation (MLE) yields biased and inconsistent estimators. Alternative methods are the generalized method of moments (GMM) and the Bayesian method. GMM uses a combination of linear and quadratic moment functions simultaneously, so the computation is easier than MLE. The Bayesian method handles heteroskedasticity by modeling the structure of the variance-covariance matrix. Bias is used to evaluate GMM and the Bayesian method in estimating the parameters of a SAR model with heteroskedastic disturbances on simulated data. The results show that the bias of the parameter estimates from both GMM and the Bayesian method is relatively consistent and becomes smaller as the number of observations increases. Both methods are applied to district/city GRDP data in Indonesia. The results show that the GMM method with an Exponential Distance Weights (EDW) matrix produces the minimum variance and the largest pseudo-R2.
Klasifikasi Halaman SEO Berbasis Machine Learning Melalui Mutual Information dan Random Forest Feature Importance NURADILLA, SITI; SADIK, KUSMAN; SUHAENI, CICI; SOLEH, AGUS M
MIND (Multimedia Artificial Intelligent Networking Database) Journal Vol 10, No 1 (2025): MIND Journal
Publisher : Institut Teknologi Nasional Bandung

DOI: 10.26760/mindjournal.v10i1.114-129

Abstract

The SEO optimization process involves many interrelated factors, making it challenging for SEO teams to identify which pages need further improvement. This study proposes a machine learning-based model that is accurate in classifying web pages and efficient in selecting the most relevant features. Feature selection is performed using Mutual Information (MI) and Random Forest Feature Importance (RFFI) to identify key factors for SEO optimization, followed by modeling with Random Forest and Weighted Voting Ensemble (WVE). The model is evaluated using Accuracy, Precision, Recall, and ROC AUC. Results indicate that the Random Forest model with 20 features selected via RFFI delivers the best performance, achieving a ROC AUC of 75.87%, Accuracy of 77.74%, Precision of 60.51%, and Recall of 71.29%. The model effectively distinguishes between pages that require SEO optimization and those that do not.
Keywords: Feature Importance, Random Forest, SEO, Variable Selection, WVE
Multilevel Semiparametric Modeling with Overdispersion and Excess Zeros on School Dropout Rates in Indonesia Tarida, Arna Ristiyanti; Djuraidah, Anik; Soleh, Agus Mohamad
JTAM (Jurnal Teori dan Aplikasi Matematika) Vol 9, No 3 (2025): July
Publisher : Universitas Muhammadiyah Mataram

DOI: 10.31764/jtam.v9i3.30102

Abstract

This study aims to identify key factors influencing high school dropout rates in Indonesia by applying advanced statistical modeling that accounts for complex data characteristics. Dropout data often display overdispersion (variability greater than expected) and excess zeros (many students not dropping out), which, if ignored, can bias conclusions. To address this, we compare three parametric models, the Zero-Inflated Poisson Mixed Model (ZIPMM), Zero-Inflated Generalized Poisson Mixed Model (ZIGPMM), and Zero-Inflated Negative Binomial Mixed Model (ZINBMM), with their semiparametric counterparts (SZIPMM, SZIGPMM, SZINBMM). The semiparametric models use B-spline functions to flexibly capture nonlinear relationships between predictors and dropout rates. Model performance was evaluated using the Akaike Information Criterion (AIC) and Root Mean Square Error (RMSE) across 100 simulation repetitions to ensure robustness. Results show that the semiparametric ZIGPMM (SZIGPMM) outperformed the other models, achieving the lowest average AIC (18969.62), suggesting the best trade-off between model fit and complexity. The optimal spline configuration used knot point 2 and order 3, with a Generalized Cross-Validation (GCV) score of 9.4107. Key predictors of dropout include school status (public or private), student-teacher ratio, distance from home to school, parental education level, parental employment status, and number of siblings. These findings provide actionable insights for education policymakers, emphasizing the need to address structural and socioeconomic barriers to reduce dropout rates effectively.