Hoger Khayrolla Omar
University of Kirkuk

Published : 1 Documents Claim Missing Document
Claim Missing Document
Check
Articles

Found 1 Documents
Search

Distributed big data analysis using spark parallel data processing Hoger Khayrolla Omar; Alaa Khalil Jumaa
Bulletin of Electrical Engineering and Informatics Vol 11, No 3: June 2022
Publisher : Institute of Advanced Engineering and Science

Show Abstract | Download Original | Original Source | Check in Google Scholar | DOI: 10.11591/eei.v11i3.3187

Abstract

Nowadays, the big data marketplace is rising rapidly. The big challenge is finding a system that can store and handle a huge size of data and then processing that huge data for mining the hidden knowledge. This paper proposed a comprehensive system that is used for improving big data analysis performance. It contains a fast big data processing engine using Apache Spark and a big data storage environment using Apache Hadoop. The system tests about 11 Gigabytes of text data which are collected from multiple sources for sentiment analysis. Three different machine learning (ML) algorithms are used in this system which is already supported by the Spark ML package. The system programs were written in Java and Scala programming languages and the constructed model consists of the classification algorithms as well as the pre-processing steps in a figure of ML pipeline. The proposed system was implemented in both central and distributed data processing. Moreover, some datasets manipulation manners have been applied in the system tests to check which manner provides the best accuracy and time performance. The results showed that the system works efficiently for treating big data, it gains excellent accuracy with fast execution time especially in the distributed data nodes.