Found 2 Documents
De-identification of Protected Health Information in Clinical Document Images using Deep Learning and Pattern Matching Sriram, Ravichandra; Sathya S, Siva; Lourdumarie Sophie S
Journal of Electronics, Electromedical Engineering, and Medical Informatics Vol 7 No 1 (2025): January
Publisher : Department of Electromedical Engineering, POLTEKKES KEMENKES SURABAYA

DOI: 10.35882/jeeemi.v7i1.616

Abstract

Clinical documents such as lab results, discharge summaries, and radiology reports are routinely used by doctors for diagnosis and treatment. With the popularization of AI in healthcare, however, clinical documents are also widely used by researchers for disease diagnosis, prediction, and the development of schemes for quality healthcare delivery. Although huge volumes of clinical documents are produced in hospitals every day, they are not shared with researchers for study purposes due to the sensitive nature of health records. Before sharing, these documents must be de-identified; that is, the protected health information (PHI) must be removed to preserve patient privacy. If the documents are stored digitally as text, PHI can be easily identified and removed, but finding and extracting PHI from old clinical documents that are scanned and stored as images or in other formats is a daunting task, for which machine learning models must be trained on a large number of such images. This work introduces a novel combination of deep learning and pattern-matching algorithms for the efficient de-identification of scanned clinical documents, distinguishing it from previous methods, which primarily work only on text documents and not on scanned clinical document images. A comprehensive de-identification technique for automatically extracting PHI from scanned images of clinical documents is thus proposed. For experimental purposes, we created a synthetic dataset of 700 clinical document images derived from various patients across multiple hospitals. The de-identification framework comprises two phases: (1) training YoloV3-Document Layout Analysis (YoloV3-DLA), a deep learning model, to segment the various regions of a clinical document; and (2) identifying regions containing PHI through pattern-matching techniques and deleting or anonymizing the information in those regions.
The proposed method identifies regions based on content structure, facilitating the de-identification of PHI regions, and achieves an F1 score of 0.97. The system can be readily adapted to accommodate any form of clinical document.
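The pattern-matching phase described in the abstract can be illustrated with a minimal sketch. The regular expressions, field names, and sample text below are hypothetical stand-ins (the paper does not publish its actual patterns); the sketch only shows the general idea of matching PHI-like strings within a segmented region's text and masking them:

```python
import re

# Hypothetical PHI patterns for illustration only; the paper's actual
# pattern-matching rules are not specified in the abstract.
PHI_PATTERNS = {
    "date": re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b"),
    "phone": re.compile(r"\b\d{10}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d+\b"),
}

def anonymize_region(text: str) -> str:
    """Replace every PHI match in a region's text with a [REDACTED-<type>] tag."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label.upper()}]", text)
    return text

region_text = "Admitted 12/05/2023, MRN: 443210, contact 9876543210."
print(anonymize_region(region_text))
# → Admitted [REDACTED-DATE], [REDACTED-MRN], contact [REDACTED-PHONE].
```

In the paper's actual pipeline this step would operate only on the regions that the YoloV3-DLA segmentation model flags, rather than on the whole document text.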
Experimental evaluation of bidirectional encoder representations from transformers models for de-identification of clinical document images Sriram, Ravichandra; Sundaram, Siva Sathya; Sophie, S. LourduMarie
IAES International Journal of Robotics and Automation (IJRA) Vol 14, No 2: June 2025
Publisher : Institute of Advanced Engineering and Science

DOI: 10.11591/ijra.v14i2.pp273-280

Abstract

Many health institutes maintain patients’ diagnosis and treatment reports as scanned images. Healthcare analytics and research require access to large volumes of digitally stored patient information, but the privacy requirements around protected health information (PHI) limit research opportunities. Particularly in this artificial intelligence (AI) era, deep learning models require large datasets for training, which hospitals cannot share unless the PHI fields are de-identified. Manual de-identification is infeasible, with millions of patient records generated in hospitals every day. Hence, this work aims to automate the de-identification of clinical document images using AI models, particularly pre-trained bidirectional encoder representations from transformers (BERT) models. For experimentation, a synthetic dataset of 550 clinical document images was generated, encompassing data from diverse patients across multiple hospitals. This work presents a two-stage transfer learning approach: it first employs Tesseract optical character recognition (OCR) to convert clinical document images into text, and then extracts PHI fields from the text for de-identification. For the extraction step, six pre-trained BERT models were compared to examine their effectiveness, achieving an F1 score of 92.45% and thus demonstrating strong potential for de-identifying PHI in clinical documents.
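The F1 score reported above is the standard metric for extraction tasks like this. A minimal sketch of how a span-level F1 could be computed for PHI extraction follows; the field types and example values are invented for illustration and are not taken from the paper's dataset:

```python
def phi_f1(gold: set, predicted: set) -> float:
    """Span-level F1 over (field_type, text) pairs extracted from a document."""
    tp = len(gold & predicted)        # PHI fields extracted exactly right
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)   # fraction of extracted fields that are correct
    recall = tp / len(gold)           # fraction of true PHI fields that were found
    return 2 * precision * recall / (precision + recall)

# Illustrative example: the model finds the name and date but gets the ID wrong.
gold = {("NAME", "John Doe"), ("DATE", "12/05/2023"), ("ID", "443210")}
pred = {("NAME", "John Doe"), ("DATE", "12/05/2023"), ("ID", "999999")}
print(round(phi_f1(gold, pred), 2))  # → 0.67
```

In practice the sets would be aggregated across the whole test split of document images before computing the score.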