Umu Sa'adah
Universitas Brawijaya

Published : 2 Documents Claim Missing Document
Claim Missing Document
Check
Articles

Found 2 Documents
Search

Finding Biomarkers from a High-Dimensional Imbalanced Dataset Using the Hybrid Method of Random Undersampling and Lasso Masithoh Yessi Rochayani; Umu Sa'adah; Ani Budi Astuti
ComTech: Computer, Mathematics and Engineering Applications Vol. 11 No. 2 (2020): ComTech
Publisher : Bina Nusantara University

Show Abstract | Download Original | Original Source | Check in Google Scholar | DOI: 10.21512/comtech.v11i2.6452

Abstract

The research conducted undersampling and gene selection as a starting point for cancer classification in gene expression datasets with a high-dimensional and imbalanced class. It investigated whether implementing undersampling before gene selection gave better results than without implementing undersampling. The used undersampling method was Random Undersampling (RUS), and for gene selection, it was Lasso. Then, the selected genes based on theory were validated. To explore the effectiveness of applying RUS before gene selection, the researchers used two gene expression datasets. Both of the datasets consisted of two classes, 1.545 observations and 10.935 genes, but had a different imbalance ratio. The results show that the proposed gene selection methods, namely Lasso and RUS + Lasso, can produce several important biomarkers, and the obtained model has high accuracy. However, the model is complicated since it involves too many genes. It also finds that undersampling is not affected when it is implemented in a less imbalanced class. Meanwhile, when the dataset is highly imbalanced, undersampling can remove a lot of information from the majority class. Nevertheless, the effectiveness of undersampling remains unclear. Simulation studies can be carried out in the next research to investigate when undersampling should be implemented.
Two-stage Gene Selection and Classification for a High-Dimensional Microarray Data Masithoh Yessi Rochayani; Umu Sa'adah; Ani Budi Astuti
JOIN (Jurnal Online Informatika) Vol 5 No 1 (2020)
Publisher : Department of Informatics, UIN Sunan Gunung Djati Bandung

Show Abstract | Download Original | Original Source | Check in Google Scholar | DOI: 10.15575/join.v5i1.569

Abstract

Microarray technology has provided benefits for cancer diagnosis and classification. However, classifying cancer using microarray data is confronted with difficulty since the dataset has high dimensions. One strategy for dealing with the dimensionality problem is to make a feature selection before modeling. Lasso is a common regularization method to reduce the number of features or predictors. However, Lasso remains too many features at the optimum regularization parameter. Therefore, feature selection can be continued to the second stage. We proposed Classification and Regression Tree (CART) for feature selection on the second stage which can also produce a classification model. We used a dataset which comparing gene expression in breast tumor tissues and other tumor tissues. This dataset has 10,936 predictor variables and 1,545 observations. The results of this study were the proposed method able to produce a few numbers of selected genes but gave high accuracy. The model also acquired in line with the Oncogenomics Theory by the obtained of GATA3 to split the root node of the decision tree model. GATA3 has become an important marker for breast tumors.