Data mining is a key method in Big Data for extracting predictive insights from large datasets. A significant challenge in the current digital landscape is preserving individual privacy during data mining, particularly when safeguarding sensitive outliers that may contain personal information. Outliers are data points that diverge markedly from the overall trend and frequently carry highly specific or sensitive information. This paper compares the effectiveness of several clustering algorithms used for outlier detection: PAM (Partitioning Around Medoids), CLARA (Clustering Large Applications), CLARANS (Clustering Large Applications Based on Randomised Search), and ECLARANS (Enhanced CLARANS). The study aims to evaluate how well each algorithm identifies outliers and to assess the usefulness of the privacy protection strategy employed, namely the Gaussian Perturbation Random method. The experiment uses two health datasets: the Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases and the Wisconsin Breast Cancer Dataset. Both datasets were chosen because of their multivariate features, which exhibit adequate data variation for outlier detection. The results indicate that the CLARA algorithm identified more outliers than the other algorithms, with the highest count (65 outliers) found in the diabetes dataset. CLARA's advantage in identifying outliers within large datasets is attributed to its use of sampling. In contrast, the PAM, CLARANS, and ECLARANS algorithms identified the same number of outliers as one another in both datasets. ECLARANS showed the best time efficiency on the diabetes dataset, whereas CLARA was the most efficient on the breast cancer dataset.
The Gaussian Perturbation Random technique was applied to protect the identified sensitive outliers. The findings indicate that this strategy maintains privacy without compromising detection accuracy. The method therefore offers a dependable means of safeguarding individual privacy in health data mining, a domain characterised by significant privacy concerns.
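As an illustration of the general idea behind Gaussian perturbation of flagged records, the following is a minimal sketch and not the procedure used in this study. It assumes that outlier indices are already available from a clustering step and that zero-mean Gaussian noise, scaled by each feature's standard deviation, is added only to the flagged rows. The function name `perturb_outliers`, the `scale` parameter, and the sample values are hypothetical.

```python
import random
import statistics


def perturb_outliers(rows, outlier_idx, scale=0.1, seed=0):
    """Return a copy of rows with Gaussian noise added to flagged rows only.

    Assumption: noise is zero-mean with a standard deviation equal to
    `scale` times the population standard deviation of each column.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    # Per-column spread, so the noise is proportional to each feature's scale.
    col_std = [statistics.pstdev([r[j] for r in rows]) for j in range(n_cols)]
    flagged = set(outlier_idx)
    perturbed = []
    for i, row in enumerate(rows):
        if i in flagged:
            # Mask the sensitive record by adding noise to every feature.
            perturbed.append(
                [v + rng.gauss(0.0, scale * col_std[j]) for j, v in enumerate(row)]
            )
        else:
            # Non-outlier records are copied unchanged.
            perturbed.append(list(row))
    return perturbed


# Hypothetical two-feature records; the last row is treated as the outlier.
rows = [[1.0, 10.0], [2.0, 12.0], [1.5, 11.0], [9.0, 40.0]]
result = perturb_outliers(rows, outlier_idx=[3])
```

In this sketch a smaller `scale` keeps the perturbed records closer to the originals, which favours detection accuracy, while a larger value masks them more strongly.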
Copyrights © 2025