Indonesian Journal of Artificial Intelligence and Data Mining
Vol 1, No 1 (2018): March 2018

Evaluation of F-Measure and Feature Analysis of C5.0 Implementation on Single Nucleotide Polymorphism Calling

Lailan Sahrina Hasibuan (Department of Computer Sciences, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University; Bioinformatics Working Group, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University)
Sita Nabila (Department of Computer Sciences, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University; Bioinformatics Working Group, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University)
Nurul Hudachair (Department of Computer Sciences, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University; Bioinformatics Working Group, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University)
Muhammad Abrar Istiadi (Department of Computer Sciences, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University; Bioinformatics Working Group, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University)



Article Info

Publish Date
01 Mar 2018

Abstract

The volume of molecular biology data has grown rapidly since the introduction in 2000 of Next-Generation Sequencing (NGS), the latest technology for high-throughput DNA sequencing. A Single Nucleotide Polymorphism (SNP) is a DNA-based marker that can be used to identify an organism specifically. SNPs are commonly exploited to optimize parent selection when producing high-quality seed in plant breeding. This paper discusses SNP calling from NGS data of cultivated soybean (Glycine max [L.] Merr.) using C5.0, an improved version of the rule-based algorithm C4.5. The evaluation shows that C5.0 outperforms CART, another rule-based algorithm, on f-measure: C5.0 and CART score 0.63 and 0.58, respectively. In addition, C5.0 is robust to imbalanced training data up to a ratio of 1:17, but its performance suffers on large training datasets. C5.0's performance may be improved by applying bagging or another ensemble technique, just as CART was improved by applying bagging to its final decision. It is also important to use appropriate features to represent SNP candidates. Based on the information gain computed by C5.0, this paper recommends error probability, homopolymer left, mismatch alt, and mean nearby qual as features for SNP calling.
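As a rough illustration of why f-measure rather than accuracy is the natural metric at a 1:17 class ratio, the sketch below computes both from confusion-matrix counts. The counts are hypothetical and are not taken from the paper; the function name `f_measure` and its `beta` parameter are likewise assumptions made for this example, with `beta = 1` giving the usual harmonic mean of precision and recall.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure for the positive (SNP) class from confusion-matrix counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts (NOT from the paper): 100 true SNPs vs 1700 non-SNPs, i.e. 1:17.
tp, fn = 70, 30      # true SNPs that were called / missed
fp, tn = 30, 1670    # non-SNPs wrongly called / correctly rejected

accuracy = (tp + tn) / (tp + tn + fp + fn)
score = f_measure(tp, fp, fn)
print(f"accuracy={accuracy:.3f}  f-measure={score:.3f}")
```

With these made-up counts, accuracy is dominated by the large non-SNP class, while the f-measure reflects only how well the SNP class itself is recovered.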

Copyright © 2018






Journal Info

Abbrev

IJAIDM

Subject

Computer Science & IT

Description

Indonesian Journal of Artificial Intelligence and Data Mining (IJAIDM) is an electronic periodical publication published by Puzzle Research Data Technology (Predatech) Faculty of Science and Technology UIN Sultan Syarif Kasim Riau, Indonesia. IJAIDM provides online media to publish scientific ...