Articles

Found 2 Documents
Performance Comparison of k-Nearest Neighbor Algorithm with Various k Values and Distance Metrics for Malware Detection
Rafrastara, Fauzi Adi; Supriyanto, Catur; Amiral, Afinzaki; Amalia, Syafira Rosa; Al Fahreza, Muhammad Daffa; Ahmed, Foez
JURNAL MEDIA INFORMATIKA BUDIDARMA Vol 8, No 1 (2024): Januari 2024
Publisher : Universitas Budi Darma

DOI: 10.30865/mib.v8i1.6971

Abstract

Malware can evolve and spread very quickly, which makes it a threat to anyone who uses a computer, whether offline or online. Research on malware detection therefore remains a hot topic, driven by the need to protect devices and systems from the dangers malware poses, such as data loss or damage, data theft, account hijacking, and intrusions in which attackers take control of the entire system. Malware has evolved from traditional (monomorphic) into modern forms (polymorphic, metamorphic, and oligomorphic). Conventional antivirus systems cannot detect these modern variants effectively, because they change their fingerprints each time they replicate and propagate. This evolution calls for machine learning-based malware detection to replace purely signature-based systems. Machine learning-based antivirus and malware detection systems detect malware through dynamic analysis rather than the static analysis used by traditional tools. This research discusses malware detection using one of the classification algorithms in machine learning, namely k-Nearest Neighbor (kNN). To improve the performance of kNN, the number of features is reduced using the Information Gain feature selection method. The performance of kNN with Information Gain is then measured using the evaluation metrics Accuracy and F1-Score. To obtain the best score, the kNN algorithm is tuned by comparing three distance measurement methods alongside several k values. The distance metrics compared are Euclidean, Manhattan, and Chebyshev, while the k values compared are 3, 5, 7, and 9. The result is that kNN with the Manhattan distance metric, k = 3, and Information Gain feature selection (reducing the feature set to 32 features) achieves the highest Accuracy and F1-Score, at 97.0%.
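
A minimal sketch of the experimental setup this abstract describes, written in Python with scikit-learn. The paper's dataset and exact preprocessing are not given here, so a synthetic binary dataset, the SelectKBest/mutual_info_classif selector, and the train/test split are assumptions made purely for illustration, not the authors' actual pipeline.

# Hedged sketch: kNN with Information Gain feature selection, comparing the
# distance metrics and k values listed in the abstract. Synthetic data stands
# in for the malware/goodware feature vectors used in the paper.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# Placeholder data: 1000 samples, 100 features, 2 classes (malware vs. goodware).
X, y = make_classification(n_samples=1000, n_features=100, n_informative=40,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Information Gain feature selection (mutual information), keeping 32 features
# as reported in the abstract.
selector = SelectKBest(mutual_info_classif, k=32)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Compare the three distance metrics and four k values from the paper.
for metric in ("euclidean", "manhattan", "chebyshev"):
    for k in (3, 5, 7, 9):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        knn.fit(X_train_sel, y_train)
        y_pred = knn.predict(X_test_sel)
        print(f"{metric:>10s}, k={k}: "
              f"acc={accuracy_score(y_test, y_pred):.3f}, "
              f"f1={f1_score(y_test, y_pred):.3f}")

On real malware feature vectors, the same loop would reproduce the paper's comparison grid; only the data loading step would change.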
Performance Improvement of Random Forest Algorithm for Malware Detection on Imbalanced Dataset using Random Under-Sampling Method
Rafrastara, Fauzi Adi; Supriyanto, Catur; Paramita, Cinantya; Astuti, Yani Parti; Ahmed, Foez
Jurnal Informatika: Jurnal Pengembangan IT Vol 8, No 2 (2023)
Publisher : Politeknik Harapan Bersama

DOI: 10.30591/jpit.v8i2.5207

Abstract

Handling an imbalanced dataset poses its own challenges. An inappropriate step during the pre-processing phase of imbalanced data can negatively affect the prediction results: the accuracy score may look high even though recall and specificity suffer, because the predictions are dominated by the majority class. In malware detection, false negatives are particularly critical because they can be fatal, so prediction errors, especially false negatives, must be minimized. The first step in handling an imbalanced dataset under these conditions is to balance the data classes. One popular method for balancing the data is Random Under-Sampling (RUS). Random Forest is then implemented to classify each file as either goodware or malware. Next, three evaluation metrics are used to assess the model: classification accuracy, recall, and specificity. Lastly, the performance of Random Forest is compared with three other methods, namely kNN, Naïve Bayes, and Logistic Regression. The results show that Random Forest achieved the best performance among the evaluated methods, with scores of 98.1% for accuracy, 98.0% for recall, and 98.2% for specificity.
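
A minimal sketch of the comparison this abstract describes, in Python with scikit-learn and imbalanced-learn. The original dataset is not available here, so a synthetic imbalanced dataset, the specific model hyperparameters, and the specificity computation from the confusion matrix are assumptions for illustration only.

# Hedged sketch: balance the training data with Random Under-Sampling (RUS),
# then compare Random Forest against kNN, Naive Bayes, and Logistic Regression
# on accuracy, recall, and specificity.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
from imblearn.under_sampling import RandomUnderSampler

# Placeholder imbalanced data: roughly 10% positive ("malware") class.
X, y = make_classification(n_samples=5000, n_features=50, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Balance the training set with Random Under-Sampling (RUS).
X_bal, y_bal = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "kNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_bal, y_bal)
    y_pred = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    specificity = tn / (tn + fp)  # true negative rate
    print(f"{name:>20s}: acc={accuracy_score(y_test, y_pred):.3f}, "
          f"recall={recall_score(y_test, y_pred):.3f}, "
          f"specificity={specificity:.3f}")

Under-sampling is applied only to the training split so that the test set keeps the original class distribution, which is what makes recall and specificity meaningful measures of minority-class performance.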