Articles

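Comparison of Error Rate Prediction Methods in C4.5 Algorithm Classification Modeling for Imbalanced Data (Perbandingan Metode Prediksi Laju Galat dalam Pemodelan Klasifikasi Algoritma C4.5 untuk Data Tidak Seimbang) Yunistika Ilanda; Dodi Vionanda; Yenni Kurniawati; Dina Fitria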
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/89

Abstract

Classification models can be built with the C4.5 algorithm. The prediction accuracy of a C4.5 model is assessed with an error rate prediction method. Imbalanced data increases the classification error of C4.5 because the predictions do not represent the entire data set, and it also degrades the performance of the error rate prediction method. In addition, data with different correlations are examined to find out whether correlation affects the performance of the error rate prediction method. The purpose of this research is to identify the most suitable error rate prediction method for the C4.5 algorithm on imbalanced data and to assess the influence of different correlations. The results show that K-Fold cross-validation (CV) is the most suitable error rate prediction method for the C4.5 algorithm on imbalanced data, compared with the hold-out (HO) and leave-one-out cross-validation (LOOCV) methods. In addition, high correlation can worsen the performance of error rate prediction methods.
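The point that predictions on imbalanced data may not represent the entire data set can be illustrated with a small sketch. It is not taken from the paper: the class labels, the 95/5 split, and the always-majority classifier are assumptions chosen only to show how a low overall error rate can hide a class that is never predicted correctly.

```python
from collections import Counter

def error_rates(y_true, y_pred):
    """Return (overall error rate, {class: error rate within that class})."""
    total = Counter(y_true)   # number of observations per true class
    wrong = Counter()         # number of misclassified observations per true class
    for t, p in zip(y_true, y_pred):
        if t != p:
            wrong[t] += 1
    overall = sum(wrong.values()) / len(y_true)
    per_class = {c: wrong[c] / total[c] for c in total}
    return overall, per_class

# Hypothetical imbalanced sample: 95 "majority" cases and 5 "minority" cases.
y_true = ["majority"] * 95 + ["minority"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["majority"] * 100

overall, per_class = error_rates(y_true, y_pred)
print(overall)    # 0.05
print(per_class)  # {'majority': 0.0, 'minority': 1.0}
```

The overall error rate is only 5%, yet every minority case is misclassified.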
Comparison of Error Rate Prediction Methods in Binary Logistic Regression Model for Balanced Data Shavira Asysyifa S; Dodi Vionanda; Nonong Amalita; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/90

Abstract

Binary logistic regression is a statistical method for examining the relationship between a dependent variable and several independent variables, where the dependent variable has two categories: one denoting a successful event and one denoting a failed event. The performance of binary logistic regression can be judged from the accuracy of the model, which can be measured by predicting the error rate. One method for predicting the error rate is cross-validation, which works by dividing the data into two parts, testing data and training data. Commonly used cross-validation methods are leave-one-out (LOO), hold-out, and k-fold cross-validation. LOO gives an unbiased estimate of accuracy but takes a long time; hold-out can avoid overfitting and runs faster because it involves no iterations; and k-fold cross-validation gives a smaller error rate prediction. In addition, data with different correlations are used to find out how correlation affects the performance of the error rate prediction methods. This study uses artificially generated, normally distributed data, including univariate, bivariate, and multivariate data sets with various combinations of mean differences and correlations. With these factors in mind, the study compares the three cross-validation methods for predicting the error rate in binary logistic regression. It finds that k-fold cross-validation is the most suitable method for predicting errors in binary logistic regression modeling for balanced data.
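How the three methods divide the data into training and testing parts can be sketched as index splits. This is a minimal illustration, not the authors' procedure: the function names, the 30% test fraction, and the default k are assumptions, and the indices are left unshuffled so the output is easy to inspect (in practice the data would normally be shuffled first).

```python
def hold_out_split(n, test_fraction=0.3):
    """One split: the last part of the indices is held out for testing."""
    n_test = max(1, round(n * test_fraction))
    idx = list(range(n))
    return [(idx[:n - n_test], idx[n - n_test:])]

def k_fold_splits(n, k=5):
    """k splits: each fold is the test set once, the rest is training data."""
    idx = list(range(n))
    folds = [idx[i::k] for i in range(k)]  # interleaved folds of near-equal size
    splits = []
    for fold in folds:
        held_out = set(fold)
        train = [j for j in idx if j not in held_out]
        splits.append((train, fold))
    return splits

def leave_one_out_splits(n):
    """n splits: every observation is the test set exactly once."""
    return k_fold_splits(n, k=n)

for name, splits in [
    ("hold-out", hold_out_split(10)),
    ("k-fold", k_fold_splits(10, k=5)),
    ("leave-one-out", leave_one_out_splits(10)),
]:
    train, test = splits[0]
    print(name, len(splits), "split(s); first split:", len(train), "train /", len(test), "test")
```

The error rate estimate for each method is then the average test error over its splits.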
Classification of Nutrition Problems for Indonesian Toddler With Decision Tree Algorithm C4.5 Nadha Ovella Syaqhasdy; Zamahsary Martha; Nonong Amalita; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/98

Abstract

Excellent human resources are essential for Indonesia's development. National development is the key to improving citizens' quality of life, and a focus on it can have a positive impact on the health and economy of the community. A healthy and educated generation is fundamental to the nation's expected progress, and nutritional status is a significant factor affecting the quality of human resources. Nutritional problems can lead to serious consequences, such as abnormal physical growth, a decline in IQ, and even death. The objective of this research is to analyze the factors that influence the nutritional status of toddlers by classifying each variable using a decision tree. A decision tree is a flowchart resembling a branching tree structure. This study uses the C4.5 algorithm, which can process both numeric and categorical data, handle missing attribute values, and generate easily interpretable rules. The resulting decision tree indicates that the attribute "Stunting < 20%" is a determining factor for acute-chronic malnutrition problems in toddlers. There are 392 districts and cities in Indonesia where the prevalence of stunting among toddlers is less than 20%. The model built with the C4.5 algorithm was evaluated using a confusion matrix, resulting in an accuracy of 99.8% and a kappa value close to 1. This indicates that the model is capable of accurately classifying toddler nutrition problems in Indonesia.
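Accuracy and kappa are both read off the confusion matrix. The sketch below shows one way to compute them, using Cohen's kappa; the matrix values are hypothetical and are not the study's data, and the function name is an assumption.

```python
def accuracy_and_kappa(matrix):
    """Accuracy and Cohen's kappa from a square confusion matrix.

    matrix[i][j] = number of cases with true class i predicted as class j.
    Assumes the cases are not all in a single class (otherwise the
    chance agreement is 1 and kappa is undefined).
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k)) / n  # accuracy
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, given the row and column totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical two-class confusion matrix (not taken from the paper).
acc, kappa = accuracy_and_kappa([[50, 0], [1, 49]])
print(round(acc, 3), round(kappa, 3))  # 0.99 0.98
```

Kappa corrects accuracy for the agreement expected by chance, so a value close to 1 means the agreement is well above what the class frequencies alone would produce.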
Fuzzy Geographically Weighted Clustering Method for Grouping Provinces in Indonesia Based on Welfare Indicators Aspects of Information and Communication Technology (ICT) Hefiani Mustika Hasanah; Dina Fitria; Dony Permana; Zamahsary Martha
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/108

Abstract

The welfare of the people is a task and goal that the Republic of Indonesia must realize. The condition of the welfare of the Indonesian people can be seen in eight areas of Indonesia's welfare indicators. In 2021, these welfare indicators were undergoing a digital transformation in information and communication technology (ICT). However, there was a gap in ICT development due to geographical conditions and the distribution and dynamics of each region's society. Cluster analysis offers a way to set targets for better future decisions. Fuzzy Geographically Weighted Clustering (FGWC) is a fuzzy-logic clustering method that considers geographical and population elements when grouping targets. The research produced three optimum clusters, each with different characteristics based on the ICT-aspect indicators of people's welfare. Cluster 1 has a medium status of ICT indicators of people's welfare and is located in the middle or at the end of the island; provinces in cluster 2 have a low status of ICT indicators of people's welfare with a medium area; and cluster 3 has a high status of ICT indicators of people's welfare with a large area or dense population.
Prediction Of Bogor City Rainfall Parameters Using Long Short Term Memory (LSTM) Sherly Amora Jofipasi; Admi Salma; Dodi Vionanda; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/110

Abstract

Bogor is a city with high-intensity, erratic rainfall, so its rainfall needs to be predicted. Rainfall prediction can be done using the LSTM algorithm. Two parameters of the LSTM algorithm are considered here: the number of hidden-layer neurons and the number of epochs. Both greatly affect the prediction results, so it is necessary to determine the best number of hidden-layer neurons and epochs to produce good predictions of Bogor rainfall. The LSTM predictions performed well with an optimal hidden-layer neuron count of 256 and an optimal epoch count of 150, giving a MAPE of 1.64%, and the predicted data follow the same pattern as the actual data.
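MAPE (mean absolute percentage error) is the average of the absolute errors expressed as a percentage of the actual values. A minimal sketch follows; the numbers are made up for illustration and are not rainfall data from the study.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero (the percentage error is undefined there).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)

# Hypothetical values: errors of 10%, 5%, and 0% average to 5%.
print(round(mape([100, 200, 400], [110, 190, 400]), 2))  # 5.0
```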
Comparison of Error Rate Prediction in CART for Imbalanced Data Lifia Zullani; Dodi Vionanda; Syafriandi Syafriandi; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/117

Abstract

CART is one of the tree-based classification algorithms. A CART model is a tree consisting of root nodes, internal nodes, and terminal nodes. The accuracy of a CART model can be calculated by measuring its prediction errors. One common method for predicting error rates is cross-validation. There are three cross-validation algorithms: leave-one-out, hold-out, and k-fold cross-validation. These methods differ in how they divide the data into training data and testing data, so each has advantages and disadvantages. Every algorithm has its shortcomings: hold-out cannot guarantee that the training set represents the entire data set; leave-one-out is very time-consuming and computationally demanding because it has to train the model as many times as there are data points; and k-fold takes longer to compute because the training algorithm must be run k times. In practice, the data encountered are often imbalanced, meaning that the classes contain different numbers of observations. In CART, imbalanced data affects the prediction results. This research focuses on comparing error rate prediction methods in the CART model with imbalanced data. The study uses three types of data, univariate, bivariate, and multivariate, obtained from differences in population means and correlations between independent variables. The results indicate that the k-fold algorithm is the most suitable error rate prediction algorithm for CART with imbalanced data.
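The computational trade-off described above can be made concrete by counting how many times the model must be trained under each method. This is a rough sketch under stated assumptions: the sample size, k = 10, and the 30% hold-out fraction are illustrative choices, not values from the study.

```python
def training_cost(n, k=10, test_fraction=0.3):
    """Return {method: (number of model fits, training-set size per fit)}.

    The k-fold training size is exact when n is divisible by k.
    """
    n_test = round(n * test_fraction)
    return {
        "hold-out": (1, n - n_test),       # a single fit on the training part
        "k-fold": (k, n - n // k),         # k fits, each leaving out one fold
        "leave-one-out": (n, n - 1),       # n fits, each leaving out one case
    }

for method, (fits, train_size) in training_cost(1000).items():
    print(f"{method}: {fits} fit(s) on {train_size} observations each")
```

For 1000 observations this gives 1 fit for hold-out, 10 for k-fold, and 1000 for leave-one-out.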
Implementation of Backpropagation Artificial Neural Network on Forecasting Export of Palm Oil in Indonesia Adinda Dwi Putri; Dina Fitria; Nonong Amalita; Zilrahmi
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/123

Abstract

Export activities are one of Indonesia's largest sources of revenue, and palm oil is the largest contributor to exports. An increasing volume of palm oil exports can spur economic growth in Indonesia. In this research, palm oil exports from Indonesia are forecast by main destination country using the Artificial Neural Network (ANN) method with the Backpropagation algorithm. The data are palm oil export data for 2012-2022 obtained from the Central Statistics Agency (BPS) website. The optimal architecture for these data is 10-1-3-3-1 with a MAPE of 9.68%; that is, the network uses 10 inputs, 3 hidden layers with 1, 3, and 3 neurons respectively, and 1 output. From this study, it is estimated that 90% of the results of palm oil export forecasting using the ANN method are close to the actual value.
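To make the 10-1-3-3-1 notation concrete, the sketch below counts the trainable parameters such a network would have. It assumes a fully connected feed-forward network with one bias per non-input neuron; the abstract does not state these details, so the counts are illustrative only.

```python
def count_parameters(layer_sizes):
    """Return (weights, biases) for a fully connected feed-forward network.

    Assumes every neuron connects to every neuron in the next layer and
    every non-input neuron has one bias term.
    """
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights, biases

# 10 inputs, hidden layers of 1, 3, and 3 neurons, and 1 output.
print(count_parameters([10, 1, 3, 3, 1]))  # (25, 8)
```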
Classification the Characteristics of Traffic Accident Victims in Pariaman Using the Chi-square Automatic Interaction Detection Algorithm Manja Danova Putri; Dina Fitria; Yenni Kurniawati; Zilrahmi
UNP Journal of Statistics and Data Science Vol. 2 No. 1 (2024): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol2-iss1/127

Abstract

Traffic accidents are incidents that occur when motor vehicles collide on the road, resulting in damage to vehicles and road infrastructure, as well as the potential for material losses, injuries, physical damage, and even death for those involved. Data from the Indonesian National Police show that the number of traffic accident victims between 2010 and 2020 ranged from 147,798 to 197,560 people, with fatalities predominantly occurring among individuals aged 15-34. The high number of traffic accident victims has negative impacts on various aspects of life, ranging from material losses to physical damage to the victims. Classification is a technique used to group objects or data into pre-defined classes or categories based on their attributes or features. One classification method is Chi-Square Automatic Interaction Detection (CHAID). The classification results from this method indicate that the age of the victims and the type of accident are the most significant variables influencing the condition of traffic accident victims. Evaluation of the model using a confusion matrix yielded an accuracy of 92%, which indicates that the model performs well in overall data classification.
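CHAID ranks candidate predictors with chi-square tests of independence between each predictor and the response. The sketch below computes only the test statistic and its degrees of freedom for one contingency table; the counts and the row and column meanings are hypothetical, and the full CHAID procedure (category merging, adjusted p-values) is not shown.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    table[i][j] = observed count for predictor category i and response category j.
    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical table: rows = two age groups, columns = two victim conditions.
print(chi_square_statistic([[30, 10], [10, 30]]))  # (20.0, 1)
```

A larger statistic for the same degrees of freedom means stronger evidence that the predictor and the response are associated.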
Comparison of the C5.0 Algorithm and the CART Algorithm in Stroke Classification Indah Lestari; Dina Fitria; Syafriandi Syafriandi; Admi Salma
UNP Journal of Statistics and Data Science Vol. 2 No. 1 (2024): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol2-iss1/144

Abstract

The C5.0 and CART algorithms are similar in speed and in their handling of categorical and numeric data. However, they differ in that the CART algorithm is binary and handles categorical, numerical, and continuous response variables, producing classification and regression trees, whereas the C5.0 algorithm is non-binary and handles categorical response variables, producing a classification tree. This research aims to classify Kaggle's Stroke Prediction Dataset to find the variables that most influence the risk of stroke and to compare the classification accuracy of the two algorithms. The results show that the CART algorithm has higher accuracy and precision, but lower recall, than C5.0. For CART and C5.0 respectively, accuracy is 77.9% and 77.5%, precision is 89.5% and 83.2%, and recall is 67% and 71.4%. Overall, it can be concluded that there is no difference in classification between the two algorithms. In addition, CART identified three variables that most influence stroke risk: age, BMI, and average blood glucose level. C5.0 identified only two: age and average blood glucose level.
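Accuracy, precision, and recall measure different things, which is why one algorithm can lead on two of them and trail on the third. The sketch below computes all three from binary confusion counts; the counts are hypothetical and do not reproduce the study's results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from binary confusion counts.

    tp/fp = true/false positives, fn/tn = false/true negatives.
    Assumes at least one predicted positive and one actual positive.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all cases classified correctly
    precision = tp / (tp + fp)                  # share of predicted positives that are real
    recall = tp / (tp + fn)                     # share of real positives that are found
    return accuracy, precision, recall

# Hypothetical counts: precision is higher than recall because the model
# misses many real positives (fn) but raises few false alarms (fp).
acc, prec, rec = binary_metrics(tp=60, fp=20, fn=40, tn=80)
print(round(acc, 2), round(prec, 2), round(rec, 2))  # 0.7 0.75 0.6
```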
Sentiment Analysis about Anti-LGBT Campaign using the Naïve Bayes Classifier rios; Syafriandi Syafriandi; Dony Permana; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 2 No. 1 (2024): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol2-iss1/146

Abstract

Social media is growing, so news under discussion spreads very quickly to everyone. One topic being discussed on social media is the anti-LGBT campaign. The conversation about the campaign is expressed as opinions that contain positive and negative feelings, conveyed through Twitter. Twitter is a microblogging social media site that allows users to create short messages and share them easily and quickly. Opinions on Twitter are used to see whether users reject or support the anti-LGBT campaign, and sentiment analysis helps to make that distinction. The algorithm used to perform sentiment analysis is the Naïve Bayes Classifier. The purpose of this study is to determine the sentiment of tweets about the anti-LGBT campaign on Twitter. The study uses Python as its tool. The data set consists of 3103 tweets, with 80% used as training data and 20% as test data. The sentiment analysis results show that Twitter users in Indonesia hold more positive opinions. The Naïve Bayes Classifier algorithm produces an accuracy of 68.75%, a precision of 99.6%, and a recall of 92.8%.
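A Naïve Bayes text classifier scores each class by its prior probability multiplied by the probability of each word given that class. The following is a minimal multinomial sketch with add-one (Laplace) smoothing. It is not the authors' pipeline: the toy English sentences, the whitespace tokenizer, and the function names are assumptions, and the preprocessing and 80/20 split used in the study are omitted.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        words = doc.lower().split()  # naive whitespace tokenizer (assumption)
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(model, doc):
    """Return the class with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen word/class pairs from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data (not tweets from the study).
docs = ["good great support", "support good", "bad reject", "reject bad awful"]
labels = ["positive", "positive", "negative", "negative"]
model = train_naive_bayes(docs, labels)
print(predict(model, "good support"))  # positive
print(predict(model, "bad awful"))     # negative
```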