Articles

Comparison of Error Rate Prediction Methods in Classification Modeling with the CHAID Method for Imbalanced Data Seif Adil El-Muslih; Dodi Vionanda; Nonong Amalita; Admi Salma
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/81

Abstract

CHAID (Chi-Square Automatic Interaction Detection) is a decision tree classification algorithm whose results are displayed as a tree diagram. After the model is built, its accuracy must be calculated in order to assess its performance. This can be done by estimating the model's prediction error rate. Three methods are considered: leave-one-out cross-validation (LOOCV), hold-out, and k-fold cross-validation. These methods divide the data into training and testing data in different ways, so each has advantages and disadvantages. Imbalanced data is data in which the classes have different numbers of observations. In the CHAID method, imbalanced data affects the prediction results: as the data become more imbalanced, the prediction result approaches the number of minority classes. Therefore, the three error rate prediction methods were compared to determine which is appropriate for the CHAID method on imbalanced data. This is an experimental study that uses simulated data generated in RStudio. The comparison considers several factors: the marginal opportunity matrix, different correlations, and several observation ratios. The results are compared using boxplots, looking at the median error rate and the lowest variance. The study finds that k-fold cross-validation is the most suitable error rate prediction method for the CHAID method on imbalanced data.
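The three ways of dividing data that this abstract compares can be sketched as index splitters. This is a minimal illustration in Python rather than the study's RStudio code; the function names, the 30% test fraction, and the `fit`/`predict` callables are assumptions made for the example.

```python
import random

def hold_out_split(n, test_fraction=0.3, seed=0):
    """One random split into (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def k_fold_splits(n, k=5, seed=0):
    """Yield k (train, test) pairs; every index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def loocv_splits(n):
    """Leave-one-out: n splits, each testing a single observation."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def error_rate(splits, fit, predict, X, y):
    """Mean misclassification rate over the test part of each split.

    fit(X_train, y_train) returns a model and predict(model, x) returns a
    label; both are placeholders for any classifier.
    """
    rates = []
    for train, test in splits:
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in test)
        rates.append(wrong / len(test))
    return sum(rates) / len(rates)
```

Hold-out produces a single estimate that depends on one random split, while k-fold and LOOCV average over several splits at a higher computational cost.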
Implementation of the Self Organizing Maps (SOM) Method for Grouping Provinces in Indonesia Based on the Earthquake Disaster Impact Ihsan Dermawan; Admi Salma; Yenni Kurniawati; Tessy Octavia Mukhti
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/83

Abstract

Indonesia's geological location causes it to be frequently hit by earthquakes, events that disturb and threaten the safety of life and cause material and non-material losses. Earthquakes in Indonesia cause casualties, both fatalities and injuries, and destroy the surrounding area and infrastructure, causing property losses. It is therefore important to cluster the impact of earthquake disasters in Indonesia as a disaster mitigation effort, in order to determine the characteristics of each province. The clustering method used is Kohonen Self Organizing Maps (SOM), a technique for visualizing high-dimensional data on a low-dimensional map. The study obtained 3 clusters, each with its own characteristics. The first cluster, with low earthquake impact, consists of 32 provinces. The second cluster, with moderate impact, consists of 1 province characterized by the highest number of missing victims and the highest number of injured victims. The third cluster, with high impact, consists of 1 province whose most prominent characteristics are the highest values among the clusters for the number of earthquake events, deaths, injured, displaced people, damaged houses, damaged educational facilities, damaged health facilities, and damaged worship facilities.
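To make the mapping from high-dimensional data to a low-dimensional map concrete, here is a minimal sketch of a single SOM training step. It is a generic illustration, not the study's implementation; the learning rate `lr`, the neighbourhood width `sigma`, and the Gaussian neighbourhood function are assumptions.

```python
import math

def best_matching_unit(weights, x):
    """Index of the map node whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - xi) ** 2 for w, xi in zip(weights[i], x)),
    )

def som_step(weights, positions, x, lr=0.5, sigma=1.0):
    """Pull every node toward x, scaled by its map distance to the winner.

    weights[i] is node i's vector in data space and positions[i] is its
    coordinate on the low-dimensional map. Updates weights in place and
    returns the index of the best matching unit.
    """
    bmu = best_matching_unit(weights, x)
    for i, w in enumerate(weights):
        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[bmu]))
        h = math.exp(-d2 / (2 * sigma ** 2))  # Gaussian neighbourhood weight
        weights[i] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return bmu
```

After training, each observation (here, a province's vector of impact indicators) is assigned to its best matching unit, and nearby units can be grouped into clusters.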
Comparison of Error Rate Prediction Methods in Binary Logistic Regression Modeling for Imbalanced Data Bahri Annur Sinaga; Dodi Vionanda; Dony Permana; Admi Salma
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/86

Abstract

Binary logistic regression is a regression analysis used in classification modeling. Its performance can be seen from the accuracy of the fitted model, which can be measured by predicting the error rate. One commonly used method for predicting the error rate is cross-validation, which has three algorithms: leave-one-out, hold-out, and k-fold. Leave-one-out divides the data by the number of observations, so each observation has the opportunity to become testing data, but the analysis takes a long time when the number of observations is large. Hold-out is the simplest algorithm: it randomly divides the data into two parts, so important data may not end up in the training data. K-fold divides the data into several groups, but it is not suitable for data with a small number of observations. In practice, real data are often imbalanced. In logistic regression, as the data become more imbalanced, the prediction results approach the number of minority classes. This research compares error rate prediction methods in binary logistic regression modeling with imbalanced data. The study uses three types of data, namely univariate, bivariate, and multivariate, generated with different population means and correlations between independent variables. The results show that k-fold is the most suitable error rate prediction algorithm for binary logistic regression.
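One way to see why imbalance matters when reading an error rate is a trivial baseline: a rule that always predicts the majority class is wrong exactly as often as the minority class occurs. The sketch below is an illustrative baseline under that reading, not the study's logistic regression model, and the class counts are hypothetical.

```python
from collections import Counter

def majority_baseline_error(labels):
    """Error rate of always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(labels)

# Hypothetical samples of 100 observations with a shrinking minority class.
for minority in (50, 30, 10, 1):
    labels = [1] * minority + [0] * (100 - minority)
    print(minority, majority_baseline_error(labels))
```

A fitted model is only informative if its estimated error rate is clearly below this baseline, which becomes harder to beat as the minority share shrinks.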
Application of the Fuzzy Time Series-Markov Chain Method to the Rupiah Exchange Rate Against the US Dollar (USD) Rahmad Revi Fadillah; Dony Permana; Yenni Kurniawati; Admi Salma
UNP Journal of Statistics and Data Science Vol. 1 No. 4 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss4/91

Abstract

The exchange rate plays an important role in evaluating the Indonesian economy because of how strongly it affects the nation's overall financial situation. Future exchange rates can be projected based on their dynamic characteristics. The purpose of this study is to predict the exchange rate of the Indonesian Rupiah (IDR) against the United States Dollar (USD) using the Fuzzy Time Series Markov Chain (FTS-MC) method. The FTS-MC approach analyzes the relationship between each historical data point and the direction in which it moved in order to forecast future movements. The forecast of the rupiah exchange rate against the USD for January 2 to January 31, 2023 has a MAPE of 2.41% and a forecast accuracy of 97.58%. For up to 8 forecast periods, the values produced by the FTS-MC approach are close to the actual values, and the forecast for the next period is higher than the current value. The graph of the forecasting results further shows that the FTS-MC forecast values fluctuate around IDR 15,800.
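The MAPE figure quoted above can be computed as follows. This is a minimal sketch: the exchange-rate values are hypothetical, and taking accuracy as 100 minus MAPE is a common convention assumed here, which is consistent up to rounding with the reported pair of 2.41% and 97.58%.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100 * total / len(actual)

# Hypothetical IDR/USD values for illustration only (not the study's data).
actual = [15600.0, 15650.0, 15700.0]
forecast = [15500.0, 15700.0, 15800.0]

error = mape(actual, forecast)
accuracy = 100 - error  # assumed convention: accuracy = 100% - MAPE
print(f"MAPE = {error:.2f}%, accuracy = {accuracy:.2f}%")
```

Because each error is divided by the actual value, MAPE is undefined when an actual value is zero, which is not a concern for exchange rates.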
Sentiment Analysis of TikTok Application on Twitter Using the Naïve Bayes Classifier Algorithm Denia Putri Fajrina; Syafriandi Syafriandi; Nonong Amalita; Admi Salma
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/103

Abstract

TikTok is a popular social media platform that has gained a lot of attention lately. People of all ages use the application to share short videos with their friends and followers. The content on TikTok is diverse and can be tailored to individual preferences, but there have been concerns about vulgar content that can be viewed by minors because there are no age restrictions. This has led to public scrutiny of the application on social media platforms such as Twitter. To address this issue, sentiment analysis was conducted on reviews of the TikTok application to help users make informed decisions about its use. The aim was to determine whether people's opinions about TikTok were positive or negative. To achieve this, the researchers used the hashtag "TikTok Application". The results were classified into two categories, positive and negative, using the Naïve Bayes Classifier method. The analysis used 80% of the data for training and 20% for testing, and achieved an accuracy of 80.32%, a recall of 97%, and a precision of 78%. In general, positive feedback from Indonesians is related to invitations to download the TikTok application, while negative feedback indicates that the application contains content that is inappropriate for TikTok users to download. This information can help users make informed decisions about using the TikTok application.
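Accuracy, recall, and precision such as those reported above are all derived from the four cells of a binary confusion matrix. A minimal sketch, assuming the positive sentiment is the "positive" class; the counts are hypothetical and are not the study's confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives for the class treated as positive.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=6, fp=2, fn=3, tn=9)
print(metrics)
```

A recall much higher than precision, as in the abstract, means the classifier finds nearly all positive items but also labels a noticeable share of negative items as positive.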
Prediction of Bogor City Rainfall Parameters Using Long Short Term Memory (LSTM) Sherly Amora Jofipasi; Admi Salma; Dodi Vionanda; Dina Fitria
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/110

Abstract

Bogor is a city with high-intensity and erratic rainfall, so its rainfall needs to be predicted. Rainfall prediction can be done using the LSTM algorithm, whose parameters include the number of hidden-layer neurons and the number of epochs. Both greatly affect the prediction results, so it is necessary to determine the best values of each to produce good rainfall predictions for Bogor. The LSTM worked well with an optimal number of hidden neurons of 256 and an optimal number of epochs of 150, giving a MAPE of 1.64%, and the predicted data follow the same pattern as the actual data.
Bitcoin Price Prediction Using Support Vector Regression Wulan Septya Zulmawati; Nonong Amalita; Syafriandi Syafriandi; Admi Salma
UNP Journal of Statistics and Data Science Vol. 1 No. 5 (2023): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol1-iss5/121

Abstract

Cryptocurrency provides the highest returns compared to other investment instruments, which attracts many novice traders to crypto as a way to make significant profits in the short term. One of the most widely used cryptocurrencies is Bitcoin. Trading is closely related to technical analysis, and the variety of techniques in technical analysis makes it difficult for beginner traders to choose the right one. Predictive machine learning methods can be an alternative for overcoming the barriers beginner traders face in the crypto market. One machine learning method for prediction is Support Vector Regression (SVR). Using the grid search algorithm, this method achieves a good predictive accuracy of 99.25% and a MAPE of 0.1206%.
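Grid search, mentioned above, simply evaluates every combination of candidate hyperparameter values and keeps the best one. The sketch below shows the search loop only; the parameter names `C`, `epsilon`, and `gamma` are the usual SVR hyperparameters, but the candidate values and the stand-in `toy_score` function are hypothetical. In a real run the score would come from fitting an SVR and measuring validation error such as MAPE.

```python
from itertools import product

def grid_search(param_grid, score):
    """Return (best_params, best_score) for the lowest score on the grid."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Hypothetical grid; the values are illustrative, not the study's settings.
grid = {"C": [0.1, 1, 10], "epsilon": [0.01, 0.1], "gamma": [0.01, 0.1, 1]}

def toy_score(p):
    """Stand-in for a validation error; lowest at C=1, epsilon=0.1, gamma=0.1."""
    return abs(p["C"] - 1) + abs(p["epsilon"] - 0.1) + abs(p["gamma"] - 0.1)

best, value = grid_search(grid, toy_score)
print(best, value)
```

The number of fits grows multiplicatively with each added parameter (here 3 × 2 × 3 = 18 combinations), which is the main cost of an exhaustive grid.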
Biplot and Procrustes Analysis of Poverty Indicators by Province in Indonesia in 2015 and 2019 Ade Eriyen Saputri; Admi Salma; Nonong Amalita; Dony Permana
UNP Journal of Statistics and Data Science Vol. 2 No. 1 (2024): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol2-iss1/124

Abstract

Poverty is one of the country's problems that the government should overcome, and it is influenced by several indicators. The success of a government can be seen from changes in poverty. This study compares the percentages of Indonesia's poverty indicators at the beginning (2015) and the end (2019) of one government term. The indicators that most affect the poverty rate in 2015 and 2019 are identified using biplot analysis, while the similarity and the magnitude of the percentage change in poverty from 2015 to 2019 are measured using Procrustes analysis. The biplot analysis shows that households with access to decent and sustainable sanitation services is the indicator with the highest diversity in 2015, while in 2019 the indicators with the highest diversity are the percentage of youth (aged 15-24 years) not in education, employment or training and households with access to decent and sustainable drinking water services. Kepulauan Riau, DKI Jakarta, DI Yogyakarta, and Bali are the provinces with the highest values in almost all poverty indicators, except the percentage of youth (aged 15-24 years) not in education, employment or training. The Procrustes analysis shows an increase of 9.7% in Indonesia's poverty indicators in 2019 compared to 2015, so the two configurations can be said to be very similar.
Classification of Stroke Disease at Dr. Drs. M. Hatta Brain Hospital Bukittinggi With Decision Tree Algorithm C4.5 Futiah Salsabila; Zamahsary Martha; Atus Amadi Putra; Admi Salma
UNP Journal of Statistics and Data Science Vol. 2 No. 1 (2024): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol2-iss1/135

Abstract

Stroke is a vascular disorder in which brain function is impaired by problems with the blood vessels that carry blood to the brain. Factors that can influence stroke include unhealthy eating habits, lack of physical activity, smoking, alcohol consumption, and obesity. Symptoms include headache, nausea, vomiting, blurred vision, and difficulty swallowing. The aim of this research is to determine the risk factors that affect the incidence of stroke hospitalization, based on stroke diagnoses at Rumah Sakit Otak Dr. Drs. M. Hatta in Bukittinggi, by classifying each variable using a decision tree. A decision tree is a flowchart that resembles a branching tree. This research uses the C4.5 algorithm, which can process numerical and categorical data, can handle missing attribute values, and produces rules that are easy to interpret. The analysis shows that the attribute that is a risk factor for stroke is the heart attribute. The model built with the C4.5 algorithm was tested using a confusion matrix, giving an accuracy of 64.54%, a precision of 53.34% for correctly classifying ischemic stroke patients, and a recall of 72.73% for correctly classifying hemorrhagic patients.
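C4.5 chooses the attribute to split on by gain ratio, that is, information gain divided by the split information of the attribute. The sketch below computes it for one categorical attribute; the attribute values and labels used in the example are hypothetical and are not drawn from the hospital data.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by the categorical `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalises attributes with many values
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical attribute and diagnosis labels for illustration only.
heart = ["yes", "yes", "no", "no"]
diagnosis = ["ischemic", "ischemic", "hemorrhagic", "hemorrhagic"]
print(gain_ratio(heart, diagnosis))
```

The attribute with the highest gain ratio becomes the root of the tree, which is how a single attribute can emerge as the dominant risk factor in the resulting rules.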
Comparison of the C5.0 Algorithm and the CART Algorithm in Stroke Classification Indah Lestari; Dina Fitria; Syafriandi Syafriandi; Admi Salma
UNP Journal of Statistics and Data Science Vol. 2 No. 1 (2024): UNP Journal of Statistics and Data Science
Publisher : Departemen Statistika Universitas Negeri Padang

DOI: 10.24036/ujsds/vol2-iss1/144

Abstract

The C5.0 and CART algorithms are similar in terms of speed and their handling of categorical and numeric data. However, they differ in that the CART algorithm is binary and handles categorical, numerical, and continuous response variables, producing classification and regression trees, whereas the C5.0 algorithm is non-binary and handles categorical response variables, producing a classification tree. This research aims to classify Kaggle's Stroke Prediction Dataset to find the variables that most influence the risk of stroke and to compare the classification accuracy of the two algorithms. The results show that the CART algorithm has higher accuracy and precision, but lower recall, than C5.0. For CART and C5.0 respectively, accuracy is 77.9% and 77.5%, precision is 89.5% and 83.2%, and recall is 67% and 71.4%. Overall, it can be concluded that there is no difference in classification between the two algorithms. In addition, CART identified 3 variables that most influence stroke risk: age, BMI, and average blood glucose level. C5.0 identified only 2 such variables: age and average blood glucose level.
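The binary splitting that distinguishes CART can be illustrated with the Gini impurity, the criterion commonly used by CART for classification. The sketch below searches for the best threshold on one numeric variable; it is a generic illustration, and the ages and outcomes are hypothetical rather than taken from the Kaggle dataset.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means pure)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(x, y):
    """Return (threshold, weighted_gini) minimising impurity.

    Observations with x <= threshold go left, the rest go right. Candidate
    thresholds are midpoints between consecutive distinct values of x.
    """
    best = (None, float("inf"))
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical ages and stroke outcomes for illustration only.
ages = [30, 40, 60, 70]
stroke = [0, 0, 1, 1]
print(best_binary_split(ages, stroke))
```

A non-binary learner such as C5.0 can instead create one branch per category of a categorical attribute, which is one reason the two trees can rank influential variables differently.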