Bagus Satrio Wahyu Poetro
Universitas Islam Sultan Agung Semarang

Published: 1 document


Leakage-Aware and Explainable Machine Learning for Healthcare Claim Fraud Detection Using Imbalanced Medical Insurance Data
Dian Hafidh Zulfikar; Ery Setiyawan Jullev Atmadji; Bagus Satrio Wahyu Poetro
International Journal of Artificial Intelligence in Medical Issues, Vol. 4 No. 1 (2026)
Publisher: Yocto Brain

DOI: 10.56705/z3207345

Abstract

Healthcare insurance fraud is a critical challenge in health systems because fraudulent claims may cause financial losses, increase administrative burden, and reduce trust in healthcare services. This study proposes an explainable machine learning approach for detecting fraudulent healthcare insurance claims using imbalanced medical claim data. The dataset consisted of 10,000 healthcare insurance claim records with 20 attributes, including patient information, provider characteristics, claim-related financial variables, medical codes, temporal features, and fraud labels. Fraudulent claims represented only 8.29% of the dataset, indicating a clear class imbalance problem. Several machine learning models were evaluated, including Logistic Regression, Decision Tree, Random Forest, Extra Trees, and AdaBoost, under different imbalance handling strategies, namely baseline learning, class weighting, and SMOTE. In addition, two feature scenarios were compared: a full-feature scenario and a leakage-aware scenario that excluded potentially post-decision variables such as claim status and approved amount. The experimental results showed that the best full-feature model was Logistic Regression without additional imbalance handling, achieving an accuracy of 0.9900, precision of 0.9740, recall of 0.9036, F1-score of 0.9375, ROC-AUC of 0.9989, and PR-AUC of 0.9896. The model correctly detected 150 out of 166 fraudulent claims in the test set. However, the best leakage-aware model achieved a lower F1-score of 0.6983, indicating that potentially leaked variables may substantially affect model performance. Feature importance analysis showed that claim amount, approved amount, claim submission delay, claim status, and provider-related variables were among the most influential predictors. 
These findings demonstrate that explainable machine learning can support healthcare claim fraud detection, but careful attention must be given to class imbalance, data leakage, and the operational deployment context.
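
The metrics reported for the full-feature model can be cross-checked with simple confusion-matrix arithmetic. The sketch below is illustrative only and is not the authors' code. The abstract states that 150 of 166 fraudulent test claims were detected; the false-positive count (4) and the test-set size (2,000) are assumptions inferred from the reported precision and accuracy, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Stated in the abstract: 150 of 166 fraudulent test claims were detected.
TP = 150
FN = 166 - TP

# Assumptions, NOT stated in the abstract: values chosen because they are
# consistent with the reported precision (0.9740) and accuracy (0.9900).
FP = 4
TEST_SIZE = 2000
TN = TEST_SIZE - TP - FN - FP

metrics = classification_metrics(TP, FP, FN, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, the computed values agree with the figures in the abstract to four decimal places, which suggests the reported accuracy, precision, recall, and F1-score are mutually consistent.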