Finding Biomarkers from a High-Dimensional Imbalanced Dataset Using the Hybrid Method of Random Undersampling and Lasso
Masithoh Yessi Rochayani; Umu Sa'adah; Ani Budi Astuti
ComTech: Computer, Mathematics and Engineering Applications Vol. 11 No. 2 (2020): ComTech
Publisher : Bina Nusantara University

DOI: 10.21512/comtech.v11i2.6452

Abstract

This research applied undersampling and gene selection as a starting point for cancer classification in high-dimensional gene expression datasets with imbalanced classes. It investigated whether applying undersampling before gene selection gives better results than gene selection alone. The undersampling method was Random Undersampling (RUS), and the gene selection method was Lasso. The selected genes were then validated against theory. To explore the effectiveness of applying RUS before gene selection, the researchers used two gene expression datasets. Both datasets consisted of two classes, 1,545 observations, and 10,935 genes, but had different imbalance ratios. The results show that both proposed gene selection methods, Lasso and RUS + Lasso, can identify several important biomarkers, and that the resulting models have high accuracy. However, the models are complicated because they involve too many genes. The research also finds that undersampling has little effect when applied to a less imbalanced dataset, whereas on a highly imbalanced dataset it can remove a lot of information from the majority class. The effectiveness of undersampling therefore remains unclear. Simulation studies could be carried out in future research to investigate when undersampling should be applied.
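The random undersampling step described in this abstract can be illustrated with a minimal sketch. It assumes the simplest variant, in which every class is reduced to the size of the smallest class; the function name `random_undersample` and the toy data are hypothetical and are not taken from the paper. The Lasso step would then be fitted on the balanced subset and is not shown here.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    # Sample n_min rows without replacement from each class.
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data (hypothetical): 8 majority-class rows and 2 minority-class rows.
X = [[float(i), float(i) * 2.0] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_undersample(X, y, seed=42)
class_counts = Counter(y_bal)
print(class_counts)
```

In this toy case six of the eight majority-class rows are discarded, which mirrors the abstract's point that undersampling a highly imbalanced dataset removes much of the majority-class information.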
Knowledge discovery from gene expression dataset using bagging lasso decision tree
Umu Sa'adah; Masithoh Yessi Rochayani; Ani Budi Astuti
Indonesian Journal of Electrical Engineering and Computer Science Vol 21, No 2: February 2021
Publisher : Institute of Advanced Engineering and Science

DOI: 10.11591/ijeecs.v21.i2.pp1151-1159

Abstract

Classifying high-dimensional data is a challenging task in data mining. Gene expression data is a type of high-dimensional data with thousands of features. This study proposes a method to extract knowledge from high-dimensional gene expression data by selecting features and then classifying. Lasso was used to select features, and the classification and regression tree (CART) algorithm was used to construct the decision tree model. To examine the stability of the lasso decision tree, we performed bootstrap aggregating (bagging) with 50 replications. The gene expression data used was an ovarian tumor dataset with 1,545 observations, 10,935 gene features, and a binary class. The findings show that the lasso decision tree can produce an interpretable model that is theoretically correct and has an accuracy of 89.32%. The model obtained from the majority vote gave an accuracy of 90.29%, an increase of about 1 percentage point over the single lasso decision tree model. This small increase in accuracy suggests that the lasso decision tree classifier is stable.
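Bootstrap aggregating with a majority vote, as used above, can be sketched as follows. This is a simplified illustration under stated assumptions: the base learner `fit_stump` is a hypothetical one-feature threshold rule standing in for the paper's lasso decision tree, and the toy data are invented. Only the 50 replications are taken from the abstract.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Stand-in base learner: threshold feature 0 at the midpoint of the class means."""
    v0 = [x[0] for x, label in zip(X, y) if label == 0]
    v1 = [x[0] for x, label in zip(X, y) if label == 1]
    if not v0 or not v1:
        # Degenerate bootstrap sample with one class only: always predict it.
        only = Counter(y).most_common(1)[0][0]
        return lambda x: only
    mean0, mean1 = sum(v0) / len(v0), sum(v1) / len(v1)
    threshold = (mean0 + mean1) / 2
    high_label = 1 if mean1 > mean0 else 0
    return lambda x: high_label if x[0] > threshold else 1 - high_label

def bagging_fit(X, y, n_replications=50, seed=0):
    """Fit one base learner per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_replications):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Return the class predicted by the majority of the fitted models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): one feature that separates the two classes.
X = [[0.1], [0.3], [0.2], [0.4], [1.6], [1.8], [1.7], [1.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

models = bagging_fit(X, y, n_replications=50, seed=1)
pred_low = bagging_predict(models, [0.0])
pred_high = bagging_predict(models, [2.0])
print(pred_low, pred_high)
```

If the bagged vote and a single model give similar accuracy, as the abstract reports, the base learner changes little across bootstrap samples, which is the sense of "stable" used there.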
Two-stage Gene Selection and Classification for a High-Dimensional Microarray Data
Masithoh Yessi Rochayani; Umu Sa'adah; Ani Budi Astuti
JOIN (Jurnal Online Informatika) Vol 5 No 1 (2020)
Publisher : Department of Informatics, UIN Sunan Gunung Djati Bandung

DOI: 10.15575/join.v5i1.569

Abstract

Microarray technology has provided benefits for cancer diagnosis and classification. However, classifying cancer using microarray data is difficult because the dataset has high dimensions. One strategy for dealing with the dimensionality problem is to perform feature selection before modeling. Lasso is a common regularization method for reducing the number of features or predictors. However, Lasso retains too many features at the optimum regularization parameter, so feature selection can be continued in a second stage. We propose Classification and Regression Tree (CART) for feature selection in the second stage, which also produces a classification model. We used a dataset that compares gene expression in breast tumor tissues and other tumor tissues. This dataset has 10,936 predictor variables and 1,545 observations. The results show that the proposed method selects only a few genes while still giving high accuracy. The resulting model is also in line with oncogenomics theory, since GATA3 was selected to split the root node of the decision tree. GATA3 has become an important marker for breast tumors.
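How the Lasso penalty removes features can be illustrated with the soft-thresholding operator. The sketch below assumes the textbook special case of an orthonormal design, where each Lasso coefficient is the soft-thresholded least-squares coefficient; the coefficient values and penalty levels are hypothetical, and this is not the fitting procedure used in the paper.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, and set it to exactly zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def selected_features(ols_coefs, lam):
    """Indices of features whose coefficient stays nonzero after shrinkage."""
    return [j for j, b in enumerate(ols_coefs) if soft_threshold(b, lam) != 0.0]

# Hypothetical least-squares coefficients for five features.
ols_coefs = [2.5, -0.3, 0.05, -1.2, 0.6]

selected_small = selected_features(ols_coefs, 0.1)  # weak penalty keeps 4 features
selected_large = selected_features(ols_coefs, 1.0)  # strong penalty keeps 2 features
print(selected_small, selected_large)
```

A penalty tuned for predictive accuracy tends to sit at the weaker end, which is one way to read the abstract's observation that Lasso alone keeps too many features and a second selection stage is useful.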
The Effect of Enrichment Program on the Achievement of Vocational High School Gifted Students in Mathematics Competitions
Rochayati, Masithoh Yessi; Rochayani, Masithoh Yessi
Jurnal Ilmu Pendidikan (JIP) STKIP Kusuma Negara Vol 15 No 2 (2024): Innovative Educational Practices and Learning Strategies
Publisher : LPPM STKIP Kusuma Negara

DOI: 10.37640/jip.v15i2.1855

Abstract

Gifted students need special services in learning. SMK Negeri 1 Turen is one of the schools in Malang Regency that provides a platform for gifted students in the form of a mathematics olympiad team. Every year, beginner students join the team, so matriculation is needed to help them catch up on the material they have missed. This research was conducted to find out how many enrichment program meetings beginner students need to reach the same ability level as experienced students. The participants were 10 students: 6 beginners and 4 experienced students. Based on the Mann-Whitney test, the results indicate that a long-term enrichment program (17 meetings) has a significant effect on students' achievement in mathematics competitions.
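For groups as small as the 6 and 4 students described above, the Mann-Whitney test can be computed exactly by enumerating every way of assigning the pooled ranks to the two groups. The sketch below shows this; the scores are hypothetical and are not the study's data, and only the group sizes follow the abstract. Whether the study used an exact or an approximate p-value is not stated there.

```python
from itertools import combinations

def midranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def mann_whitney_exact(a, b):
    """Return the U statistic of sample a and the exact two-sided p-value."""
    ranks = midranks(list(a) + list(b))
    n1, n2 = len(a), len(b)

    def u_of(indices):
        return sum(ranks[i] for i in indices) - n1 * (n1 + 1) / 2

    u_obs = u_of(range(n1))
    center = n1 * n2 / 2  # expected U when both groups share one distribution
    extreme = total = 0
    for indices in combinations(range(n1 + n2), n1):
        total += 1
        if abs(u_of(indices) - center) >= abs(u_obs - center) - 1e-12:
            extreme += 1
    return u_obs, extreme / total

# Hypothetical competition scores: 6 beginners and 4 experienced students.
beginners = [55, 60, 62, 58, 65, 59]
experienced = [70, 75, 80, 85]

u_stat, p_value = mann_whitney_exact(beginners, experienced)
print(u_stat, p_value)  # U = 0.0, p = 2/210 (about 0.0095)
```

In this invented example every beginner scores below every experienced student, so U takes its minimum value and only 2 of the 210 possible group assignments are as extreme.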
THE ANALYSIS OF SOCIO-ECONOMIC EFFECT ON CRIMINALITY IN INDONESIA USING FUZZY CLUSTERWISE REGRESSION MODEL
Azzarah, Dian Fatimah; Mukid, Moch. Abdul; I Maruddani, Di Asih; Rochayani, Masithoh Yessi
MEDIA STATISTIKA Vol 17, No 2 (2024): Media Statistika
Publisher : Department of Statistics, Faculty of Science and Mathematics, Universitas Diponegoro

DOI: 10.14710/medstat.17.2.221-232

Abstract

Crime in Indonesia has shown a fluctuating trend and has increased significantly in recent years, with striking variations in crime rates between provinces. This phenomenon raises questions about the role of socio-economic factors such as education, poverty, and unemployment in influencing crime rates. Although many studies have examined the relationship between these variables and crime, the approaches used often assume that the relationship is homogeneous across regions. In fact, heterogeneity in characteristics between provinces can lead to different relationships, so an analysis approach that can accommodate this diversity is needed. This study proposes the Fuzzy Clusterwise Regression (FCR) method, which not only improves model accuracy compared to classical linear regression (with an increase in the coefficient of determination from 65.72% to more than 90%) but can also identify different patterns of relationships between regional groups (clusters). The FCR results show that the effect of socio-economic factors on crime varies between clusters and that the optimum number of clusters is 4. In clusters 1, 2, and 3, all variables have a significant influence on the amount of crime, whereas in cluster 4 the population poverty variable has no significant effect on the crime rate.
A Gaussian Mixture Model Approach to Profiling Stunting Risk Across Indonesian Provinces
Rochayani, Masithoh Yessi; Utami, Iut Tri
International Journal of Engineering and Computer Science Applications (IJECSA) Vol. 4 No. 2 (2025): September 2025
Publisher : Universitas Bumigora Mataram-Lombok

DOI: 10.30812/ijecsa.v4i2.5395

Abstract

Stunting is still a major health problem in Indonesia, with notable differences between provinces. Although the national rate has decreased over time, regional gaps continue, emphasizing the role of data in helping to explain what contributes to the issue. This study aims to segment 38 provinces in Indonesia based on maternal and child health indicators associated with stunting prevalence. The variables used include the percentage of low birth weight (LBW) infants, the percentage of infants born short, the percentage of pregnant women with chronic energy deficiency (CED), exclusive breastfeeding (EBF) coverage, prevalence of diarrhea in toddlers, and prevalence of acute respiratory infections (ARI) in toddlers. The clustering analysis was performed using the Gaussian Mixture Model (GMM) with the number of clusters varied from 2 to 7. Model selection was based on the Bayesian Information Criterion (BIC), where the lowest value indicated the optimal model. The results show that the model with two clusters was selected, with a BIC value of 1358.24, which indicates the best balance between model fit and complexity. This clustering reveals that provinces are grouped based on similarities in maternal and child health profiles, not on geographic proximity, meaning that the GMM method does not rely on spatial location to form clusters.
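BIC-based selection of the number of clusters, as described above, can be sketched as follows. The sketch uses the convention BIC = k ln(n) - 2 ln(L), under which the lowest value is best, matching the abstract. The log-likelihood values are hypothetical placeholders rather than results from the study, and the diagonal covariance structure is an assumption, since the abstract does not state which covariance structure was fitted. Only n = 38 provinces and d = 6 indicators come from the abstract.

```python
import math

def gmm_param_count(n_components, n_features, covariance="diag"):
    """Number of free parameters in a Gaussian mixture model."""
    weights = n_components - 1                 # mixing weights sum to 1
    means = n_components * n_features
    if covariance == "diag":
        covs = n_components * n_features       # one variance per feature
    elif covariance == "full":
        covs = n_components * n_features * (n_features + 1) // 2
    else:
        raise ValueError("covariance must be 'diag' or 'full'")
    return weights + means + covs

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower values indicate a better model."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

n_obs, n_features = 38, 6  # provinces and indicators, as stated in the abstract

# Hypothetical maximized log-likelihoods for K = 2, 3, 4 (not the study's values).
log_likelihoods = {2: -600.0, 3: -590.0, 4: -585.0}

bics = {
    k: bic(ll, gmm_param_count(k, n_features, "diag"), n_obs)
    for k, ll in log_likelihoods.items()
}
best_k = min(bics, key=bics.get)
print(bics, best_k)
```

The example shows the trade-off the abstract mentions: adding components raises the log-likelihood, but with only 38 observations the parameter penalty grows faster, so the smallest model wins in this invented case.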