Muhammad Raihan Izharul Haq
Unknown Affiliation



Integration of OCR Technology with ETL Processes for Automating Data Pipeline of Financial Disbursement Documents at BPS Sukabumi Regency
Muhammad Raihan Izharul Haq; Gina Purnama Insany; Somantri
Jurnal Riset Informatika Vol. 7 No. 4 (2025): September 2025
Publisher : Kresnamedia Publisher

DOI: 10.34288/jri.v7i4.395

Abstract

In the digital era, managing archival data poses challenges for many institutions, including Badan Pusat Statistik (BPS) of Sukabumi Regency, especially when the archives consist of unstructured PDF documents. This study develops a data pipeline that integrates Optical Character Recognition (OCR) technology with Extract, Transform, Load (ETL) processes. Unstructured data from financial disbursement documents, such as SPM and SP2D, were extracted automatically with high accuracy: an average of 98.52% for SPM using a combination of OCR and PDFPlumber, and 100% for SP2D using PDFPlumber. Extraction results were stored in a data warehouse, then transformed using Apache Spark and loaded into data marts. The ETL process was automated using Apache Airflow, which ran its tasks reliably in dependency order. The processed data were presented in real time through an interactive Looker Studio dashboard, supporting efficient archive management and more informed decision-making. This study not only provides a solution to existing archive management problems but also opens opportunities for further development in applying big data technologies and business process automation in the public sector.
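
The stage ordering described in the abstract can be illustrated with a minimal sketch. This is not the authors' Airflow DAG: it uses only the Python standard library (`graphlib`, Python 3.9+) as a stand-in for a scheduler, and the stage names, the `PIPELINE` mapping, and the `run_pipeline` helper are hypothetical names chosen for illustration. It shows only the idea that each stage runs after the stages it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names (not from the paper); each stage maps to the
# set of stages that must finish before it may start.
PIPELINE = {
    "extract_spm": set(),                       # OCR + PDFPlumber step
    "extract_sp2d": set(),                      # PDFPlumber step
    "load_warehouse": {"extract_spm", "extract_sp2d"},
    "transform_spark": {"load_warehouse"},
    "load_data_marts": {"transform_spark"},
}


def run_pipeline(graph, handlers):
    """Run every stage once, only after all of its dependencies have run."""
    executed = []
    for stage in TopologicalSorter(graph).static_order():
        handlers[stage]()  # call the placeholder work for this stage
        executed.append(stage)
    return executed


# Placeholder handlers that only report which stage is running.
handlers = {
    name: (lambda name=name: print(f"running {name}")) for name in PIPELINE
}

order = run_pipeline(PIPELINE, handlers)
```

In a real deployment the handlers would wrap the actual extraction, Spark, and loading jobs, and a scheduler such as Apache Airflow would add retries, scheduling, and monitoring on top of the same dependency structure.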