This study presents a structured methodology for constructing a custom dataset from patient visit records collected over a three-year period (January 1, 2019 – December 31, 2021) at a healthcare facility in Bandung Regency, Indonesia. The raw medical records were systematically transformed into a machine learning–ready dataset through feature extraction, labeling, and geospatial enrichment. Key transformations included removing personally identifiable information, standardizing clinical symptoms into structured variables, and assigning diagnostic and referral labels in accordance with the ICD-10 classification. The dataset was also enriched with spatial coordinates (longitude and latitude) to enable geospatial analyses such as transmission radius estimation, proximity clustering, and identification of regional case densities. This structure supports both supervised and unsupervised learning, including classification, referral prediction, and spatial cluster detection. The resulting dataset has been used in several experiments: disease classification, referral status prediction, feature importance interpretation using SHAP and LIME, geospatial clustering, and synthetic data generation to mitigate privacy constraints and limited data availability. The methodology is expected to support future research in healthcare analytics and to contribute to the development of decision support systems and public health policy planning tools.
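As a rough illustration of the proximity analysis that the coordinate enrichment makes possible, the following minimal sketch computes great-circle distances between visit records and counts how many other records fall within a given radius. It is not the study's implementation: the record layout, the field names (`icd10`, `lat`, `lon`), the function names, and the sample coordinates are all hypothetical and chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def neighbors_within(records, radius_km):
    """For each record, count the other records located within radius_km."""
    counts = []
    for i, a in enumerate(records):
        n = sum(
            1
            for j, b in enumerate(records)
            if i != j
            and haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= radius_km
        )
        counts.append(n)
    return counts


# Hypothetical, made-up records (not taken from the dataset):
# de-identified rows carrying only a diagnosis code and coordinates.
records = [
    {"icd10": "A90", "lat": -7.0250, "lon": 107.5200},
    {"icd10": "A90", "lat": -7.0260, "lon": 107.5210},
    {"icd10": "J06.9", "lat": -7.1500, "lon": 107.7000},
]

if __name__ == "__main__":
    print(neighbors_within(records, radius_km=1.0))
```

The pairwise loop is quadratic in the number of records, which is acceptable for a sketch; a larger dataset would more plausibly use a spatial index or a density-based clustering routine with a haversine metric.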
Copyright © 2025