Garuda - Garba Rujukan Digital

Media Statistika

Vol 18, No 1 (2025): Media Statistika

Mastika, Mastika
Siswantining, Titin
Bustamam, Alhadi

Publish Date
16 Oct 2025

Analysis of gene expression data, particularly cancer data, is often complicated by missing values. One way to address this is data imputation. This study evaluates three imputation methods on Tumor-Educated Platelets (TEP) gene expression data: mean imputation, K-Nearest Neighbors (KNN), and KNN with Bayesian optimization using Gaussian process modeling. Missing values were introduced under a Missing Completely at Random (MCAR) mechanism at gradually increasing levels of 5%, 10%, 15%, and so on up to 60%, and performance was evaluated with three metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Normalized Root Mean Squared Error (NRMSE). The results show that the three methods perform similarly, with MAE, MSE, and NRMSE values differing only at a small decimal scale. Although Bayesian optimization was expected to improve the accuracy of KNN, the improvement on this dataset is not significant. These findings indicate that simple approaches such as mean imputation and KNN-based methods remain competitive on TEP data, which is highly sparse: 14,020,496 of its 16,512,496 values are zeros, approximately 84.91% of the data.
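The evaluation protocol described in the abstract (hide known values under MCAR, impute, then score only the hidden cells) can be illustrated with a minimal sketch. This is not the authors' code: the helper names `mask_mcar`, `impute_column_mean`, and `imputation_errors` are hypothetical, the toy matrix only mimics the reported sparsity, and NRMSE is assumed here to be RMSE divided by the range of the original values, since the abstract does not state the normalization. Only mean imputation is shown; KNN and its Bayesian-optimized variant are omitted.

```python
import math
import random

def mask_mcar(matrix, rate, seed=0):
    """Copy `matrix` and hide a fraction `rate` of cells, chosen uniformly at random (MCAR)."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(matrix)) for j in range(len(matrix[0]))]
    hidden = rng.sample(cells, round(rate * len(cells)))
    masked = [row[:] for row in matrix]
    for i, j in hidden:
        masked[i][j] = None  # None marks a missing value
    return masked, hidden

def impute_column_mean(masked):
    """Fill each missing cell with the mean of the observed values in its column."""
    filled = [row[:] for row in masked]
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        col_mean = sum(observed) / len(observed) if observed else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = col_mean
    return filled

def imputation_errors(truth, filled, hidden):
    """Return (MAE, MSE, NRMSE) computed over the hidden cells only."""
    diffs = [filled[i][j] - truth[i][j] for i, j in hidden]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    values = [v for row in truth for v in row]
    value_range = max(values) - min(values)
    # Assumption: NRMSE = RMSE / (max - min) of the original data.
    nrmse = math.sqrt(mse) / value_range if value_range else float("nan")
    return mae, mse, nrmse

# Toy matrix (rows = samples, columns = genes) with roughly 85% zeros.
data_rng = random.Random(42)
truth = [[data_rng.uniform(1, 10) if data_rng.random() < 0.15 else 0.0
          for _ in range(20)] for _ in range(30)]

for rate in (0.05, 0.10, 0.15):
    masked, hidden = mask_mcar(truth, rate, seed=1)
    filled = impute_column_mean(masked)
    mae, mse, nrmse = imputation_errors(truth, filled, hidden)
    print(f"rate={rate:.2f}  MAE={mae:.3f}  MSE={mse:.3f}  NRMSE={nrmse:.3f}")
```

Because most true values are zero, most hidden cells are zeros too, so any imputer that predicts values near zero scores well on them. This may be one reason the methods look so similar on data this sparse.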
