Viruses remain a major global public health concern because of their potential to cause outbreaks, epidemics, and pandemics. Rapid organization and analysis of virus-related data support computational virology, health informatics, and pandemic preparedness. This study proposes an unsupervised machine learning approach that clusters viruses by their taxonomic and genomic characteristics. The dataset consisted of 70 virus records with attributes including family, genus, genome type, strand type, and envelope status. Because the dataset contained no predefined epidemiological labels or risk categories, the analysis was designed as an exploratory clustering task rather than a supervised prediction task. Preprocessing consisted of removing duplicates, handling missing values, standardizing categorical attributes, and transforming selected features with One-Hot Encoding. Three clustering algorithms were evaluated: K-Means, Agglomerative Clustering, and DBSCAN. Clustering quality was assessed with the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Score, and Principal Component Analysis was applied for two-dimensional visualization. K-Means with 10 clusters achieved a Silhouette Score of 0.7725 and a Davies-Bouldin Index of 0.8186. Agglomerative Clustering obtained the highest Silhouette Score, 0.7754, while DBSCAN produced fewer clusters and lower overall performance. Several biologically meaningful groups were identified, including clusters representing Flaviviridae, Coronaviridae, Herpesviridae, Poxviridae, and enveloped RNA viruses. However, a large proportion of records contained unknown values, which influenced the formation of a dominant incomplete-data cluster. These findings indicate that taxonomic and genomic features can support machine learning-based virus grouping, although data completeness remains a critical factor.
This study provides an initial computational framework for AI-driven viral data exploration and may serve as a foundation for future viral risk stratification using enriched epidemiological and clinical features.
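The workflow summarized above (One-Hot Encoding, three clustering algorithms, three internal validity metrics, and a PCA projection) can be illustrated with a minimal sketch, assuming scikit-learn and NumPy are available. The twelve records, the choice of four clusters, and the DBSCAN settings (eps=2.1, min_samples=2) are hypothetical values chosen for illustration only; they are not the study's 70-record dataset or its tuned parameters, and the sketch is not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy records: (family, genus, genome type, strand type, envelope).
# "Unknown" rows mimic the incomplete records mentioned in the abstract.
records = np.array([
    ("Flaviviridae", "Flavivirus", "RNA", "single", "enveloped"),
    ("Flaviviridae", "Flavivirus", "RNA", "single", "enveloped"),
    ("Flaviviridae", "Hepacivirus", "RNA", "single", "enveloped"),
    ("Coronaviridae", "Betacoronavirus", "RNA", "single", "enveloped"),
    ("Coronaviridae", "Betacoronavirus", "RNA", "single", "enveloped"),
    ("Coronaviridae", "Alphacoronavirus", "RNA", "single", "enveloped"),
    ("Herpesviridae", "Simplexvirus", "DNA", "double", "enveloped"),
    ("Herpesviridae", "Varicellovirus", "DNA", "double", "enveloped"),
    ("Herpesviridae", "Cytomegalovirus", "DNA", "double", "enveloped"),
    ("Unknown", "Unknown", "Unknown", "Unknown", "Unknown"),
    ("Unknown", "Unknown", "Unknown", "Unknown", "Unknown"),
    ("Unknown", "Unknown", "Unknown", "Unknown", "Unknown"),
])

# One binary column per distinct category value in each attribute.
X = OneHotEncoder(handle_unknown="ignore").fit_transform(records).toarray()

# Illustrative settings, not the parameters reported in the study.
models = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=4),
    "DBSCAN": DBSCAN(eps=2.1, min_samples=2),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    n_labels = len(set(labels))
    # The metrics are only defined for 2 <= n_labels <= n_samples - 1.
    if n_labels < 2 or n_labels > len(X) - 1:
        continue
    scores[name] = {
        "n_labels": n_labels,
        "silhouette": silhouette_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
    }

# Project the encoded records onto two principal components for plotting.
coords = PCA(n_components=2).fit_transform(X)

for name, result in scores.items():
    print(name, result)
```

Higher Silhouette and Calinski-Harabasz values and lower Davies-Bouldin values indicate more compact, better-separated clusters. On real data, the number of clusters and the DBSCAN neighborhood radius would need to be selected empirically rather than fixed as they are here.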
Copyright © 2026