Glycosylation classification of lysine protein sequences is the task of grouping glycosylated lysine proteins according to the type of glycosylation observed in their sequences. This work examines the sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) of the Random Forest approach for classifying glycosylation in lysine protein sequences. The lysine protein dataset, derived from benchmark data, contains 620 proteins (214 positive and 406 negative samples), each with a sequence length of 15. The dataset was split into 90% for training and 10% for testing. Features were extracted with the R package BioSeqClass version 1.44.0 using three protein descriptors, AA Index, CTD, and PseAAC, for a total of 60 features. The extracted features were then classified with the Random Forest algorithm using Mtry values of 4, 8, and 16 and a number of trees (ntree) of 250, 500, 750, and 1000. The best results were achieved with the 90% training and 10% test split, an Mtry of 42, and 1000 trees, yielding 89.97% sensitivity, 92.79% specificity, 80.76% MCC, and 90.42% accuracy. These results demonstrate that combining this feature extraction with the Random Forest algorithm is effective for classifying lysine proteins.
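The four evaluation measures named above follow directly from the counts in a binary confusion matrix. The sketch below shows one way to compute them; it is a minimal illustration in Python, not the R code used in the study. The function name `binary_metrics` and the example counts are hypothetical, chosen only to match the size of a 10% test fold (about 62 of the 620 proteins), and they are not results from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # MCC denominator is zero when any row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention, assumed here.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (21 positives, 41 negatives).
example = binary_metrics(tp=19, fn=2, tn=38, fp=3)
for name, value in example.items():
    # Multiply by 100 to express each measure as a percentage, as in the text.
    print(f"{name}: {100 * value:.2f}%")
```

MCC ranges from -1 to 1 and accounts for all four cells of the confusion matrix, which makes it a useful complement to accuracy when the classes are imbalanced, as with the 214 positive and 406 negative samples here.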
                                Copyrights © 2024