Articles

Comparison of Error Rate Prediction Methods in Binary Logistic Regression Modeling for Imbalanced Data
Bahri Annur Sinaga; Dodi Vionanda; Dony Permana; Admi Salma
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/86

Abstract

Binary logistic regression is a regression method used in classification modeling. Its performance can be judged by the accuracy of the fitted model, and accuracy can be measured by predicting the error rate. A commonly used approach to error rate prediction is cross-validation, which has three algorithms: leave-one-out, hold-out, and k-fold. Leave-one-out splits the data by the number of observations so that each observation serves as testing data once, but it takes a long time when the number of observations is large. Hold-out is the simplest algorithm: it randomly divides the data into only two parts, so important observations may not end up in the training data. K-fold divides the data into several groups but is not suitable for data with a small number of observations. In practice, real data are often imbalanced. In logistic regression, as the data become increasingly imbalanced, the prediction results approach the number of minority classes. This research compares error rate prediction methods in binary logistic regression modeling with imbalanced data. The study uses three types of data, namely univariate, bivariate, and multivariate, generated with different population means and correlations between independent variables. The results show that k-fold is the most suitable error rate prediction algorithm for binary logistic regression.
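The three splitting schemes named in the abstract can be sketched as index generators. This is a minimal illustration in plain Python, not the authors' code; the function names, the 30% test fraction, and the default k = 5 are assumptions made for the example.

```python
import random

def hold_out_split(n, test_fraction=0.3, seed=0):
    """One random split of indices 0..n-1 into (train, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def k_fold_splits(n, k=5, seed=0):
    """k (train, test) pairs; every index lands in a test fold exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(k)
    ]

def leave_one_out_splits(n):
    """Leave-one-out: k-fold with k equal to the number of observations."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]
```

Leave-one-out needs one model fit per observation, k-fold needs k fits, and hold-out needs one, which matches the trade-offs in running time the abstract describes.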
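Comparison of Error Rate Prediction Methods in C4.5 Algorithm Classification Modeling for Imbalanced Data (Perbandingan Metode Prediksi Laju Galat dalam Pemodelan Klasifikasi Algoritma C4.5 untuk Data Tidak Seimbang)
Yunistika Ilanda; Dodi Vionanda; Yenni Kurniawati; Dina Fitria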
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/89

Abstract

Classification models can be built using the C4.5 algorithm. The prediction accuracy of a model built with C4.5 needs to be assessed using an error rate prediction method. Imbalanced data increase the classification error of the C4.5 algorithm, because the prediction results do not represent the entire data, and worsen the performance of the error rate prediction method. Data with different correlations are also examined to find out whether correlation affects the performance of the error rate prediction method. The purpose of the research is to identify the most suitable error rate prediction method for the C4.5 algorithm in the case of imbalanced data, and to assess the influence of different correlations. The results show that k-fold cross-validation is more suitable for the C4.5 algorithm with imbalanced data than the hold-out (HO) and leave-one-out cross-validation (LOOCV) methods. In addition, high correlation can worsen the performance of error rate prediction methods.
Comparison of Error Rate Prediction Methods in Binary Logistic Regression Model for Balanced Data
Shavira Asysyifa S; Dodi Vionanda; Nonong Amalita; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/90

Abstract

Binary logistic regression is a statistical method used to examine the relationship between a dependent variable and several independent variables, where the dependent variable is split into two categories: one for a successful event and one for a failed event. The performance of binary logistic regression can be judged by the accuracy of the model, and accuracy can be measured by predicting the error rate. One method for predicting the error rate is cross-validation, which works by dividing the data into two parts, testing data and training data. Commonly used cross-validation methods are leave-one-out (LOO), hold-out, and k-fold cross-validation. LOO gives an unbiased estimate of accuracy but takes a long time; hold-out can avoid overfitting and runs faster because it involves no iterations; and k-fold cross-validation yields a smaller predicted error rate. In addition, data with different correlations are used to find out how correlation affects the performance of the error rate prediction methods. This study uses artificially generated, normally distributed data, including univariate, bivariate, and multivariate datasets with various combinations of mean differences and correlations. Considering these factors, the study compares the three cross-validation methods for predicting the error rate in binary logistic regression. It finds that k-fold cross-validation is the most suitable method for predicting errors in binary logistic regression modeling for balanced data.
Classification of Coronary Heart Disease at Semen Padang Hospital using Algorithm Classification And Regression Trees (CART)
Defal Aditya Defran; Atus Amadi Putra; Dodi Vionanda; Tessy Octavia Mukhti
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/104

Abstract

Cardiovascular disease is a degenerative disease caused by decreased function of the heart and blood vessels. One of the most common heart diseases today is coronary heart disease (CHD). The main factors that cause CHD include age, gender, hypertension, blood sugar, and cholesterol. One method that can be used to group CHD cases is classification. Classification And Regression Trees (CART) is a decision tree method that describes the relationship between a response variable and one or more predictor variables. The goal of CART is to obtain accurate groups of data that characterize a classification. Based on the optimal tree, the attribute that is the main characteristic in classifying CHD patients at Semen Padang Hospital is age. Evaluating the classification results with a confusion matrix gave an accuracy of 66.67%, a sensitivity of 56.52% for classifying CHD patients, and a specificity of 84.61% for classifying non-CHD patients.
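The three reported measures follow directly from the cells of a confusion matrix. The sketch below shows the arithmetic; the abstract does not report the underlying counts, so the counts used here are hypothetical, chosen only because they give percentages close to the reported ones.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion matrix counts.

    tp/fn: positive (here: CHD) cases classified correctly/incorrectly.
    tn/fp: negative (here: non-CHD) cases classified correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts (an assumption, not taken from the paper).
metrics = confusion_metrics(tp=13, fn=10, tn=11, fp=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these assumed counts, accuracy is 24/36, sensitivity is 13/23, and specificity is 11/13.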
Prediction Of Bogor City Rainfall Parameters Using Long Short Term Memory (LSTM)
Sherly Amora Jofipasi; Admi Salma; Dodi Vionanda; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/110

Abstract

Bogor is a city with high-intensity and erratic rainfall, so its rainfall needs to be predicted. Rainfall prediction can be done using the LSTM algorithm. Two parameters of the LSTM algorithm are considered here: the number of hidden-layer neurons and the number of epochs. Both greatly affect the prediction results, so it is necessary to determine the best values of each to produce good rainfall predictions for Bogor. The LSTM predictions performed well with an optimal 256 hidden neurons and 150 epochs, giving a MAPE of 1.64%, and the predicted data follow the same pattern as the actual data.
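MAPE (mean absolute percentage error) is the average of the absolute errors expressed as a percentage of the actual values. A minimal sketch of the standard definition follows; the function name and the sample values are illustrative assumptions, not data from the study.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero, which matters for rainfall
    series that contain dry periods.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values: relative errors of 10%, 5%, and 0% average to about 5%.
print(mape([100, 200, 400], [110, 190, 400]))
```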
Comparison of Error Prediction Methods in Classification Modeling with CHAID Methods for Balanced Data
Findri Wara Putri; Dodi Vionanda; Atus Amadi Putra; Fadhilah Fitri
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/116

Abstract

Chi-Squared Automatic Interaction Detection (CHAID) is an exploratory method for classifying data by building classification trees. The classification results are displayed as a tree diagram. After the model is built, its accuracy must be calculated in order to assess its performance. The accuracy can be determined by calculating the prediction error rate of the model. Error rate prediction works by dividing the data into training data and testing data. There are three error rate prediction methods: leave-one-out cross-validation (LOOCV), hold-out, and k-fold cross-validation. These methods divide the data into training and testing data differently, so each has advantages and disadvantages. Therefore, the three error rate prediction methods were compared with the aim of determining the appropriate method for CHAID. This is experimental research that uses simulated data generated in RStudio. The comparison considers several factors, namely the marginal probability matrix and different correlations. The results are examined using boxplots, looking for the lowest median error rate and the lowest variance. This research found that k-fold cross-validation is the most suitable error rate prediction method for CHAID with balanced data.
Comparison of Error Rate Prediction in CART for Imbalanced Data
Lifia Zullani; Dodi Vionanda; Syafriandi Syafriandi; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/117

Abstract

CART is one of the tree-based classification algorithms. A CART model is a tree consisting of a root node, internal nodes, and terminal nodes. The accuracy of a CART model can be calculated by measuring its prediction errors. One common method used to predict error rates is cross-validation. There are three cross-validation algorithms, namely leave-one-out, hold-out, and k-fold cross-validation. These methods divide the data into training and testing data differently, so each has advantages and disadvantages. Every algorithm has its shortcomings: hold-out cannot guarantee that the training set represents the entire dataset; leave-one-out is very time-consuming and computationally demanding because it has to train the model as many times as there are data points; and k-fold requires longer computation time because the training algorithm must be run k times. In practice, the data encountered are often imbalanced, meaning that the classes contain different numbers of observations. In CART, imbalanced data affect the prediction results. This research compares error rate prediction methods in the CART model with imbalanced data. The study uses three types of data, namely univariate, bivariate, and multivariate, generated with different population means and correlations between independent variables. The results indicate that k-fold is the most suitable error rate prediction algorithm for CART with imbalanced data.
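One way to see why class imbalance complicates error rate estimates: a classifier that always predicts the majority class has an error rate equal to the minority proportion, so a low error rate alone can be misleading. The sketch below illustrates this baseline; the function name and the 90/10 class split are assumptions for the example, not settings from the study.

```python
def majority_baseline_error(labels):
    """Error rate of a classifier that always predicts the most frequent class."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    majority = max(counts, key=counts.get)
    return sum(1 for y in labels if y != majority) / len(labels)

# Hypothetical imbalanced sample: 90 observations of class 0, 10 of class 1.
labels = [0] * 90 + [1] * 10
print(majority_baseline_error(labels))  # equals the minority proportion, 10/100
```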
Diagnosis of the type of delivery of pregnant women at Semen Padang Hospital Using the C4.5 Method
Rama Novialdi; Dony Permana; Dodi Vionanda; Fadhilah Fitri
UNP Journal of Statistics and Data Science Vol. 2 No. 1 (2024): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol2-iss1/130

Abstract

The health of the mother and fetus is very important, but there are many challenges and risks associated with pregnancy and childbirth. According to WHO, in 2020 there were 287,000 cases of women dying during pregnancy and childbirth. Factors that affect the type of delivery include the age of the pregnant woman, MGG, systole, diastole, and pulse. One method that can be used to group the types of delivery is classification. C4.5 is one of the methods used to build decision trees. The purpose of C4.5 is to identify the attributes that serve as the main criteria in the classification. Based on the optimal tree, the attribute that is the main criterion in classifying deliveries at Semen Padang Hospital as caesarean section or normal delivery is MGG. Evaluating the classification results with a confusion matrix gave an accuracy of 74%, a sensitivity of 80% for classifying caesarean deliveries, and a specificity of 66.67% for classifying normal deliveries.
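C4.5 ranks candidate attributes by gain ratio, which is information gain divided by the entropy of the split itself. The sketch below covers categorical attributes only (C4.5 handles continuous attributes such as age by searching for thresholds, which is omitted here); the function names and toy values are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(attribute, labels):
    """Gain ratio of a categorical attribute with respect to class labels."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    # Expected entropy of the labels after splitting on the attribute.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy data: the first attribute separates the classes, the second does not.
labels = ["caesarean", "caesarean", "normal", "normal"]
print(gain_ratio(["x", "x", "y", "y"], labels))
print(gain_ratio(["x", "y", "x", "y"], labels))
```

The attribute with the highest gain ratio becomes the split at the root, which is the sense in which an attribute is the "main criterion" of the tree.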
Forecasting Gold Prices in Indonesia using Support Vector Regression with the Grid Search Algorithm
Syahfitrri, Nindi; Nonong Amalita; Dodi Vionanda; Zamahsary Martha
UNP Journal of Statistics and Data Science Vol. 2 No. 1 (2024): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol2-iss1/145

Abstract

Investment is an effort to increase economic growth in Indonesia. A popular investment in the community is gold. The value of gold investments tends to increase but is not immune to price fluctuations, so it is important to forecast the price of gold in Indonesia. One method that can be used for this forecast is Support Vector Regression (SVR). SVR looks for a function whose deviation from the target value is no more than ε for all training data. The best SVR model with a linear kernel was obtained from the parameter combination C = 0.0625 and ε = 0.001, with an RMSE value of 0.19734 and a value of 0.974112. The SVR method is therefore appropriate for forecasting gold prices in Indonesia.
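The role of ε can be made concrete with the ε-insensitive loss that SVR minimizes: residuals inside the ε tube cost nothing, and only the excess beyond ε is penalized (weighted by C in the full objective). The sketch below shows this loss next to RMSE; the function names and sample values are assumptions for illustration and do not reproduce the study's model.

```python
from math import sqrt

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-observation SVR loss: zero inside the epsilon tube, linear outside."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical values: the first residual lies inside the tube, the second outside.
losses = epsilon_insensitive_loss([1.0, 2.0], [1.0005, 2.5], epsilon=0.001)
print(losses)
print(rmse([1.0, 2.0], [1.0005, 2.5]))
```

A grid search, as named in the title, would evaluate a score such as RMSE for each candidate pair of C and ε and keep the best pair.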
Comparison of Modeling Infant Mortality Rate in West Sumatra and West Java Province in 2021 Using Negative Binomial Regression
Afdhal, Afdhal Rezeki; Fadhilah Fitri; Dodi Vionanda; Dony Permana
UNP Journal of Statistics and Data Science Vol. 2 No. 2 (2024): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol2-iss2/156

Abstract

Poisson regression analysis assumes equidispersion, meaning that the variance of the response variable equals its mean. In practice this condition rarely holds, because overdispersion usually occurs (the variance of the response variable is greater than the mean). One way to overcome this problem is to use Negative Binomial regression. The aim of this article is to obtain the best Negative Binomial regression model to handle overdispersion in cases of infant mortality in West Sumatra Province and West Java Province. The Negative Binomial regression model has an AIC value of 192.65 for West Sumatra Province, which is smaller than the AIC value of 283.47 for West Java Province. Based on the fitted Negative Binomial regression equations, the number of health centers (X3) has a significant influence on the infant mortality rate in West Sumatra Province, and the number of medical personnel (X1) has a significant influence on the infant mortality rate in West Java Province.
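A rough first check for overdispersion is to compare the sample variance of the counts with their mean, and AIC follows the standard definition 2k − 2 ln L. The sketch below shows both; it is a crude marginal check (the formal assumption concerns the variance conditional on the predictors), and the function names and counts are hypothetical rather than the study's data.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; values well above 1 hint at overdispersion."""
    return variance(counts) / mean(counts)

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical infant death counts for ten districts (not from the paper).
counts = [2, 3, 1, 14, 0, 9, 4, 20, 1, 6]
print(dispersion_ratio(counts))
```

A ratio far above 1, as with these assumed counts, is the situation in which Negative Binomial regression is preferred over Poisson regression.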