Jurnal Riset Informatika
Vol. 7 No. 4 (2025): September 2025

Integration of OCR Technology with ETL Processes for Automating Data Pipeline of Financial Disbursement Documents at BPS Sukabumi Regency

Muhammad Raihan Izharul Haq
Gina Purnama Insany
Somantri



Article Info

Publish Date
12 Sep 2025

Abstract

In the digital era, managing archival data is a challenge for many institutions, including Badan Pusat Statistik (BPS) of Sukabumi Regency, especially when the archives consist of unstructured PDF documents. This study develops a data pipeline that integrates Optical Character Recognition (OCR) technology with Extract, Transform, Load (ETL) processes. Unstructured data from financial disbursement documents, namely SPM and SP2D, were extracted automatically with high accuracy: an average of 98.52% for SPM using a combination of OCR and PDFPlumber, and 100% for SP2D using PDFPlumber alone. The extraction results were stored in a data warehouse, then transformed using Apache Spark and loaded into data marts. The ETL process was automated using Apache Airflow, which ran the tasks reliably in dependency order. The processed data were presented in real time through an interactive Looker Studio dashboard, supporting efficient archive management and more informed decision-making. This study not only provides a solution to existing archival management problems but also opens opportunities for further development in applying big data technologies and business process automation in the public sector.
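The abstract states that the ETL tasks ran according to their dependencies. As a rough illustration of that idea only, the sketch below orders pipeline stages with Python's standard-library `graphlib` module. It is a stand-in for the scheduling concept, not Apache Airflow's API and not the authors' implementation; the task names and the `run_pipeline` helper are hypothetical, and the task bodies are empty placeholders.

```python
from graphlib import TopologicalSorter

# Hypothetical task names (assumed for illustration, not taken from the paper).
# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "extract_spm": set(),
    "extract_sp2d": set(),
    "load_warehouse": {"extract_spm", "extract_sp2d"},
    "transform": {"load_warehouse"},
    "load_data_marts": {"transform"},
}


def run_pipeline(dependencies, tasks):
    """Run each task only after all of its upstream tasks have run."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed


# Placeholder task bodies that do nothing; real tasks would call the
# extraction, transformation, and load steps described in the abstract.
TASKS = {name: (lambda: None) for name in DEPENDENCIES}

order = run_pipeline(DEPENDENCIES, TASKS)
print(order)
```

Under these assumptions, both extraction tasks always run before the warehouse load, and the data mart load always runs last. A cycle in the dependency map would raise `graphlib.CycleError` instead of running anything.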

Copyrights © 2025






Journal Info

Abbrev

jri

Publisher

Subject

Computer Science & IT

Description

Jurnal Riset Informatika is a journal published by Kresnamedia Publisher. It was originally intended to accommodate scientific papers written by researchers and lecturers from the study programs of Sistem Informasi (Information Systems) and Teknik ...