Contact Name
Tessy Octavia Mukhti
Contact Email
tessyoctaviam@fmipa.unp.ac.id
Phone
+6282283838641
Journal Mail Official
tessyoctaviam@fmipa.unp.ac.id
Editorial Address
LPPM Universitas Negeri Padang, Jalan Prof. Dr. Hamka, Air Tawar Barat, Kota Padang, Sumatera Barat 25131
Location
Kota Padang,
Sumatera Barat
INDONESIA
UNP Journal of Statistics and Data Science
ISSN: -     EISSN: 2985-475X     DOI: 10.24036/ujsds
UNP Journal of Statistics and Data Science (UJSDS) is an open access journal (e-journal) launched in 2022 by the Department of Statistics, Faculty of Science and Mathematics, Universitas Negeri Padang. UJSDS publishes scientific articles on various aspects of Statistics, Data Science, and their applications. Articles can be research results, case studies, or literature reviews. All papers are reviewed by peer reviewers consisting of experts and academicians from across universities.
Articles 213 Documents
Comparison of Error Rate Prediction Methods in Classification Modeling with the CHAID Method for Imbalanced Data Seif Adil El-Muslih; Dodi Vionanda; Nonong Amalita; Admi Salma
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/81

Abstract

CHAID (Chi-Square Automatic Interaction Detection) is a decision tree classification algorithm whose results are displayed as a tree diagram. After the model is built, its accuracy must be assessed to evaluate its performance, which can be done by estimating the model's prediction error rate. Three methods are considered: leave-one-out cross-validation (LOOCV), hold-out, and k-fold cross-validation. These methods divide the data into training and testing sets in different ways, so each has advantages and disadvantages. Imbalanced data is data in which the classes have different numbers of observations. In the CHAID method, imbalanced data affects the prediction results: as the data becomes more imbalanced, the prediction result approaches the number of minority classes. Therefore, the three error rate prediction methods were compared to determine which is appropriate for CHAID with imbalanced data. This is an experimental study that uses simulated data generated in RStudio. The comparison considers several factors: the marginal probability matrix, different correlations, and several observation ratios. The results are compared using boxplots, looking at the median error rate and the lowest variance. The study finds that k-fold cross-validation is the most suitable error rate prediction method for CHAID with imbalanced data.
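One plausible reading of the statement about imbalanced data above is that a model which drifts toward always predicting the majority class has an error rate equal to the minority-class proportion. The Python sketch below illustrates only that baseline; it is not the authors' simulation, and the function name and the class ratios are hypothetical.

```python
from collections import Counter

def majority_baseline_error(labels):
    """Error rate of a classifier that always predicts the most common class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(labels)

# Hypothetical class ratios, from balanced to heavily imbalanced.
for n_major, n_minor in [(50, 50), (80, 20), (95, 5)]:
    labels = [0] * n_major + [1] * n_minor
    print(n_major, n_minor, round(majority_baseline_error(labels), 2))
```

Under this reading, a low error rate on imbalanced data can simply reflect the class ratio, which is why the choice of error rate prediction method matters.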
Application of the Self Organizing Maps (SOM) Method for Clustering Based on Indicators of the Need for Social Welfare Services (PPKS) in West Java Province Maulidya Hernanda; Admi Salma; Dodi Vionanda; Zamahsary Martha
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/82

Abstract

The province of West Java in Indonesia has witnessed a rise in its impoverished population. As the most populous province in Indonesia, West Java faces complex social welfare issues. This study conducts a cluster analysis to identify district/city clusters in West Java and to determine the characteristics of each cluster based on the indicators of the Need for Social Welfare Services (PPKS), using the self-organizing maps (SOM) method. SOM is an unsupervised learning method: training requires no supervision (target output) and produces a two-dimensional representation (map) of the input. The analysis yielded 3 clusters. Cluster 1, consisting of 24 districts/cities, had a relatively high average score for each member. Cluster 2, consisting of Cianjur and Karawang districts, showed high social welfare problems compared with the other clusters. Cluster 3, consisting of Bandung regency, showed that the most prominent social welfare problem is the indicator of socio-economic vulnerability of women, with an average of 34,549 cases/year. Based on these results, sound decisions are needed regarding resource allocation, more effective service planning, and the development of better-targeted social welfare programs.
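The unsupervised training described above can be sketched as a single SOM update step in Python. This is a minimal illustration, not the authors' implementation: the Gaussian neighbourhood, the learning rate, the radius, and the 2 x 2 grid are assumptions, and the function names are hypothetical.

```python
import math

def best_matching_unit(weights, x):
    """Index of the map node whose weight vector is closest to input x."""
    dists = [sum((w_k - x_k) ** 2 for w_k, x_k in zip(w, x)) for w in weights]
    return dists.index(min(dists))

def som_update(weights, grid, x, learning_rate=0.5, radius=1.0):
    """One training step: pull the best matching unit and its grid
    neighbours toward x. No target output is used anywhere."""
    bmu = best_matching_unit(weights, x)
    for i, w in enumerate(weights):
        grid_dist2 = sum((a - b) ** 2 for a, b in zip(grid[i], grid[bmu]))
        h = math.exp(-grid_dist2 / (2 * radius ** 2))  # assumed Gaussian neighbourhood
        weights[i] = [w_k + learning_rate * h * (x_k - w_k)
                      for w_k, x_k in zip(w, x)]
    return bmu

# Hypothetical 2 x 2 map with two-dimensional inputs.
grid = [(0, 0), (1, 0), (0, 1), (1, 1)]
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
winner = som_update(weights, grid, [0.9, 0.1])
print(winner, weights[winner])
```

After many such steps over all observations, each observation is assigned to its best matching unit, and nearby nodes on the map hold similar observations, which is what allows clusters to be read off the two-dimensional map.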
Implementation of the Self Organizing Maps (SOM) Method for Grouping Provinces in Indonesia Based on the Earthquake Disaster Impact Ihsan Dermawan; Admi Salma; Yenni Kurniawati; Tessy Octavia Mukhti
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/83

Abstract

Indonesia's geological location causes it to be frequently hit by earthquakes, events that disturb and threaten the safety of life and cause material and non-material losses. The many earthquakes in Indonesia cause casualties, both fatalities and injuries, and destroy the surrounding area and infrastructure, causing property losses. Therefore, it is important to cluster provinces by the impact of earthquake disasters as a disaster mitigation effort, in order to determine the characteristics of each province. The clustering method used is Kohonen Self Organizing Maps (SOM), a technique for visualizing high-dimensional data as a low-dimensional map. The study obtained 3 clusters, each with its own characteristics. The first cluster, with low earthquake impact, consists of 32 provinces. The second cluster, with moderate impact, consists of 1 province characterized by the highest number of missing victims and the highest number of injured victims. The third cluster, with high impact, consists of 1 province whose most prominent characteristics are the highest numbers, relative to the other clusters, of earthquake events, deaths, injured, displaced, damaged houses, damaged educational facilities, damaged health facilities, and damaged worship facilities.
Comparing Classification and Regression Tree and Logistic Regression Algorithms Using 5×2cv Combined F-Test on Diabetes Mellitus Dataset Fashihullisan; Dodi Vionanda; Yenni Kurniawati; Fadhilah Fitri
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/84

Abstract

Classification is the process of finding a model that describes and distinguishes data classes so that it can be used to predict the class of objects whose class labels are unknown. Classification algorithms include classification and regression trees (CART) and logistic regression. The k-fold cross-validation method has a weakness when comparing algorithms: different folds may produce different error predictions, so the results of comparing algorithm performance may also differ. Therefore, for the algorithm comparison problem, the researchers apply the 5×2cv t test and the 5×2cv combined F test. Out of 100 iterations, the 10-fold cross-validation method was consistent only three times, which shows that k-fold cross-validation has poor consistency in comparing CART and logistic regression on the diabetes mellitus data. In addition, the results show that the 5×2cv combined F test is better suited for drawing conclusions when comparing the two algorithms because it produces only one decision, in contrast to the 5×2cv t test, whose 10 test statistics can lead to different decisions, which makes it difficult for researchers to draw conclusions when comparing the CART algorithm and logistic regression.
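To show why the combined F test yields a single decision, the Python sketch below computes the 5×2cv combined F statistic in its standard published form: the sum of all ten squared error rate differences divided by twice the sum of the five per-replication variances. The function name and the input values are hypothetical, and this is not the authors' code.

```python
def combined_5x2cv_f(diffs):
    """Combined 5x2cv F statistic.

    diffs holds five (p1, p2) pairs: the difference in error rate between
    the two algorithms on fold 1 and fold 2 of each 2-fold replication.
    """
    if len(diffs) != 5:
        raise ValueError("expected five replications of 2-fold CV")
    numerator = sum(p1 ** 2 + p2 ** 2 for p1, p2 in diffs)
    variance_sum = 0.0
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        variance_sum += (p1 - mean) ** 2 + (p2 - mean) ** 2
    # Raises ZeroDivisionError if every replication has p1 == p2.
    return numerator / (2 * variance_sum)

# Hypothetical error rate differences (algorithm A minus algorithm B).
example = [(0.03, 0.01), (0.02, -0.01), (0.00, 0.02), (0.01, 0.03), (-0.01, 0.02)]
print(combined_5x2cv_f(example))
```

The statistic is compared against an F distribution with 10 and 5 degrees of freedom (a critical value of roughly 4.74 at the 5% level), so all ten differences contribute to one decision.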
Empirical Study for Algorithms Comparison of Classification and Regression Tree and Logistic Regression Using Combined 5×2cv F Test Fayza Annisa Febrianti; Dodi Vionanda; Yenni Kurniawati; Fadhilah Fitri
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/85

Abstract

Classification is a method for estimating the class of an object based on its characteristics. Several learning algorithms can be applied in classification, such as Classification and Regression Tree (CART) and logistic regression. The main goal of classification is to find the learning algorithm that yields the best classifier. When comparing two learning algorithms, a direct comparison that picks the smaller prediction error rate may be possible when the difference is very clear. When the difference is not clear, a direct comparison is misleading and leads to inadequate conclusions. Therefore, a statistical test is needed to determine whether the difference is real or random. The results of the 5×2cv paired t-test sometimes reject and sometimes fail to reject the hypothesis. This is troubling because a change in the error rate difference used should not affect the test result. Meanwhile, the overall results of the combined 5×2cv F test show that the tests fail to reject the hypothesis. This indicates that CART and logistic regression perform identically in this case.
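The instability of the 5×2cv paired t-test can be seen by computing the statistic with each of the ten error rate differences in the numerator. The Python sketch below follows the standard form of the test (the pooled variance of the five replications in the denominator); the function name and the inputs are hypothetical, and this is not the authors' code.

```python
import math

def paired_5x2cv_t_statistics(diffs):
    """Return the ten possible 5x2cv paired t statistics.

    diffs holds five (p1, p2) pairs of error rate differences. The usual
    test uses only the first difference as numerator; putting each of the
    ten differences there in turn shows how the decision can change.
    """
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    denominator = math.sqrt(sum(variances) / 5)
    return [p / denominator for pair in diffs for p in pair]

# Hypothetical error rate differences (algorithm A minus algorithm B).
example = [(0.06, 0.01), (0.02, -0.01), (0.00, 0.02), (0.01, 0.03), (-0.01, 0.02)]
stats = paired_5x2cv_t_statistics(example)
# Two-sided 5% critical value of the t distribution with 5 degrees of freedom.
print([abs(t) > 2.571 for t in stats])
```

Each statistic is referred to a t distribution with 5 degrees of freedom, so some choices of numerator can reject the hypothesis while others do not.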
Comparison of Error Rate Prediction Methods in Binary Logistic Regression Modeling for Imbalanced Data Bahri Annur Sinaga; Dodi Vionanda; Dony Permana; Admi Salma
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/86

Abstract

Binary logistic regression is a regression analysis used in classification modeling. The performance of binary logistic regression can be seen from the accuracy of the fitted model. Accuracy can be measured by predicting the error rate. One commonly used method of predicting the error rate is cross-validation. There are three algorithms in cross-validation: leave-one-out, hold-out, and k-fold. Leave-one-out makes as many splits as there are observations, so that each observation gets to serve as testing data, but it requires a long analysis time when the number of observations is large. Hold-out is the simplest algorithm: it only divides the data randomly into two parts, so important data may not end up in the training data. K-fold divides the data into several groups, but it is not suitable for data with a small number of observations. In practice, real data is often imbalanced. In logistic regression, when the data is increasingly imbalanced, the prediction results will approach the number of minority classes. This research focuses on comparing error rate prediction methods in binary logistic regression modeling with imbalanced data. The study uses three types of data, namely univariate, bivariate, and multivariate, generated with different population means and different correlations between independent variables. The results show that the k-fold algorithm is the most suitable error rate prediction algorithm for binary logistic regression.
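The three splitting schemes described above can be sketched as index generators in Python. This is a minimal illustration under assumed defaults (a 30% test fraction, 5 folds, fixed seeds), with hypothetical function names; it is not the authors' code.

```python
import random

def hold_out_split(n, test_fraction=0.3, seed=0):
    """One random split into training and testing indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def k_fold_splits(n, k=5, seed=0):
    """k splits; each group of indices serves as testing data exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

def leave_one_out_splits(n):
    """n splits; each observation is the single testing case once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

n = 10
print(len(hold_out_split(n)[1]))               # size of the one test set
print(len(list(k_fold_splits(n, k=5))))        # number of model fits for k-fold
print(len(list(leave_one_out_splits(n))))      # number of model fits for leave-one-out
```

The number of splits equals the number of model fits, which is why leave-one-out becomes slow for large data while hold-out depends on a single random partition.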
Comparison of Error Rate Prediction Methods in Classification Modeling with the C4.5 Algorithm for Imbalanced Data Yunistika Ilanda; Dodi Vionanda; Yenni Kurniawati; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/89

Abstract

A classification model can be built using the C4.5 algorithm. The prediction accuracy of the resulting model needs to be assessed using an error rate prediction method. Imbalanced data increases the classification error of the C4.5 algorithm because the prediction results do not represent the entire data, and it worsens the performance of the error rate prediction method. In addition, data with different correlations are examined to find out whether correlation affects the performance of the error rate prediction method. The purpose of the research is to identify the most suitable error rate prediction method for the C4.5 algorithm with imbalanced data and to assess the influence of different correlations. The results show that the k-fold cross-validation method is more suitable for the C4.5 algorithm with imbalanced data than the hold-out (HO) and LOOCV methods. In addition, high correlation can worsen the performance of error rate prediction methods.
Comparison of Error Rate Prediction Methods in Binary Logistic Regression Model for Balanced Data Shavira Asysyifa S; Dodi Vionanda; Nonong Amalita; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/90

Abstract

Binary logistic regression is a statistical method for examining the relationship between a dependent variable and several independent variables, where the dependent variable is split into two categories: one representing a successful event and one representing a failed event. The performance of binary logistic regression can be seen from the accuracy of the model. Accuracy can be measured by predicting the error rate. One method for predicting the error rate is cross-validation, which works by dividing the data into two parts, testing data and training data. Commonly used cross-validation methods are leave-one-out (LOO), hold-out, and k-fold cross-validation. LOO gives an unbiased estimate of accuracy but takes a long time, hold-out can avoid overfitting and runs faster because it involves no iterations, and k-fold cross-validation gives a smaller error rate prediction. In addition, data with different correlations are used to find out how correlation affects the performance of the error rate prediction method. This study uses artificially generated data with a normal distribution, including univariate, bivariate, and multivariate datasets with various combinations of mean differences and correlations. Considering these factors, the study compares the three cross-validation methods for predicting the error rate in binary logistic regression. It finds that k-fold cross-validation is the most suitable method for predicting errors in binary logistic regression modeling for balanced data.
Application of the Fuzzy Time Series-Markov Chain Method to the Rupiah Exchange Rate Against the US Dollar (USD) Rahmad Revi Fadillah; Dony Permana; Yenni Kurniawati; Admi Salma
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/91

Abstract

The exchange rate plays an important role in evaluating the Indonesian economy because of how strongly it affects the nation's overall financial situation. Future exchange rates can be projected based on their dynamic characteristics. The purpose of this study is to predict the exchange rate of the Indonesian Rupiah (IDR) against the United States Dollar (USD) using the Fuzzy Time Series Markov chain (FTS-MC) method. The researchers apply the FTS-MC approach to analyze the connection between each historical observation and the direction in which it moved in order to forecast future data movements. The rupiah exchange rate against the USD was forecast for January 2 to January 31, 2023, with a MAPE of 2.41% and a forecast accuracy of 97.58%. For up to 8 forecast periods, the values obtained by the FTS-MC approach are close to the actual values, and the forecast for the next period is higher than the current value. The graph of the forecasting results further shows that the FTS-MC approach gives forecast values that fluctuate around IDR 15,800.
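The MAPE reported above is the mean absolute percentage error; a minimal Python sketch of the usual definition follows. The exchange rate values are hypothetical and are not the study's data, and treating forecast accuracy as 100% minus MAPE is an assumption about how the accuracy score was derived.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100 * total / len(actual)

# Hypothetical IDR/USD rates, not the study's data.
actual = [15600.0, 15500.0, 15400.0, 15300.0]
forecast = [15800.0, 15750.0, 15820.0, 15790.0]
error = mape(actual, forecast)
print(round(error, 2), round(100 - error, 2))  # MAPE and assumed accuracy, in percent
```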
Application of the Quantile Regression Method to Data Containing Outliers for the Crime Rate in Jabodetabek Arssita Nur Muharromah; Zamahsary Martha; Dony Permana; Tessy Octavia Mukhti
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/94

Abstract

Crime is increasingly widespread in Indonesia, and the crime rate in Jabodetabek is the second highest in the country. Because the data in this study contain outliers, quantile regression is the appropriate method. Quantile regression is an extension of median regression, or the Least Absolute Deviation (LAD) method, which divides the data into two parts to minimize errors. However, LAD is considered inadequate for modeling, which motivated quantile regression. Quantile regression is useful for overcoming unfulfilled assumptions in classical regression, namely heteroscedasticity, and it can model data that contain outliers. The approach separates the data into certain parts, or quantiles, where differences in the estimated values are suspected. The goodness of the model is measured using the coefficient of determination (R²) at each quantile. In this study, five quantiles were used: 0.05, 0.25, 0.50, 0.75, and 0.95. The analysis shows that the best parameter estimation model is found at the 0.95 quantile, where all independent variables have a significant effect on the dependent variable (crime rate). At the 0.25 and 0.50 quantiles, no independent variable has a significant effect, which may be due to other factors not included in the study that affect each quantile.
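Quantile regression at quantile tau minimizes the check (pinball) loss, which weights positive residuals by tau and negative residuals by 1 - tau. The Python sketch below shows the loss and, for an intercept-only model, that the minimizer is a sample quantile that an outlier barely moves at tau = 0.5. The data are hypothetical and the function names are assumptions, not the authors' code.

```python
def pinball_loss(residual, tau):
    """Check loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return residual * (tau - (1 if residual < 0 else 0))

def best_constant(values, tau):
    """Observed value that minimizes the total check loss
    (an intercept-only quantile regression)."""
    return min(values, key=lambda c: sum(pinball_loss(v - c, tau) for v in values))

# Hypothetical data with one outlier.
values = [1, 2, 3, 4, 100]
print(best_constant(values, 0.5))   # median-type fit
print(sum(values) / len(values))    # the mean, pulled up by the outlier
print(best_constant(values, 0.95))  # upper-quantile fit
```

With covariates, the same loss is minimized over regression coefficients instead of a single constant, once for each quantile considered.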
