JURNAL MATEMATIKA STATISTIKA DAN KOMPUTASI
Vol. 20 No. 3 (2024): May 2024

Performance Evaluation of Classification Methods on Big Data: Decision Trees, Naive Bayes, K-Nearest Neighbors, and Support Vector Machines

Justin Eduardo Simarmata (Faculty of Teacher Training & Education, University of Timor, East Nusa Tenggara, Indonesia)
Gerhard-Wilhelm Weber (Faculty of Engineering Management, Poznan University of Technology, PUT, Poznań, Poland)
Debora Chrisinta (Faculty of Agriculture, Science and Health, University of Timor, East Nusa Tenggara, Indonesia)



Article Info

Publish Date
15 May 2024

Abstract

Evaluating the performance of classification methods on big data is increasingly important for meeting the challenges of large-scale data analysis. This study compares four classification methods, namely Decision Trees (DT), Naive Bayes (NB), k-Nearest Neighbors (KNN), and Support Vector Machines (SVM), on big data, using both simulated data and real data from the R package ISLR, used in RStudio. The simulated data consist of two datasets. In each, the predictor variables are normally distributed with different means and variances, and the response variable is generated in classes matched to the characteristics of the predictors, with different class proportions. The real data are taken from two types of numeric variables and predictor variables available in the package. Each method is evaluated at sample sizes of n = 500, n = 1000, and n = 5000. For the real data, samples are drawn at random to keep them representative. Performance is measured by accuracy. For simulated Dataset 1, DT and KNN are the methods that influence the quality of the resulting classification when applied to big data. For Dataset 2, however, the result for DT changes, owing to the number of classes and the proportions of the class distribution in the data. The simulation results are confirmed on the real data, where the same methods again influence classification quality when applied to big data, whereas NB and SVM show no consistent influence. These observations indicate that DT and KNN have several advantages that make them suitable for application to big data.
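
The simulation design described in the abstract can be illustrated with a minimal sketch. It is written in Python with the standard library only, not in the R environment the study used, and it is not the authors' code. The class proportions, means, standard deviations, the 70/30 train/test split, and all function names are hypothetical, since the abstract does not report them. Only one of the four methods (a Gaussian Naive Bayes classifier) is shown, evaluated by accuracy at n = 500, n = 1000, and n = 5000.

```python
import math
import random

# Hypothetical simulation settings; the abstract does not report the actual
# class proportions, means, or variances used in the study.
CLASS_SETTINGS = {
    0: {"prop": 0.7, "means": (0.0, 0.0), "sds": (1.0, 1.0)},
    1: {"prop": 0.3, "means": (2.0, 1.5), "sds": (1.5, 0.8)},
}


def simulate(n, rng):
    """Return n (features, label) pairs with normally distributed predictors."""
    data = []
    for _ in range(n):
        label = 0 if rng.random() < CLASS_SETTINGS[0]["prop"] else 1
        cfg = CLASS_SETTINGS[label]
        x = tuple(rng.gauss(m, s) for m, s in zip(cfg["means"], cfg["sds"]))
        data.append((x, label))
    return data


def fit_gaussian_nb(train):
    """Estimate a prior plus per-feature mean and variance for each class."""
    model = {}
    for label in sorted({y for _, y in train}):
        rows = [x for x, y in train if y == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (len(rows) / len(train), means, variances)
    return model


def predict(model, x):
    """Pick the class with the highest log prior plus Gaussian log likelihood."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


def accuracy(model, test):
    """Fraction of test rows whose predicted class matches the true class."""
    correct = sum(1 for x, y in test if predict(model, x) == y)
    return correct / len(test)


def run_experiment(sizes=(500, 1000, 5000), seed=42, test_fraction=0.3):
    """Simulate one dataset per sample size and report hold-out accuracy."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        data = simulate(n, rng)
        rng.shuffle(data)  # random split (assumed 70/30; not from the abstract)
        cut = int(n * (1 - test_fraction))
        model = fit_gaussian_nb(data[:cut])
        results[n] = accuracy(model, data[cut:])
    return results


if __name__ == "__main__":
    for n, acc in run_experiment().items():
        print(f"n = {n}: accuracy = {acc:.3f}")
```

Comparing the accuracy across the three sample sizes, and repeating the run for each classifier, mirrors the kind of comparison the abstract reports. The numbers this sketch prints depend entirely on the assumed settings and say nothing about the study's results.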

Copyrights © 2024






Journal Info

Abbrev

jmsk

Publisher

Subject

Mathematics

Description

This journal publishes original papers reporting research results in the fields of Mathematics, Statistics, and Computation ...