Flora biodiversity on Sumatra is under increasing pressure from environmental change, while the capacity to manage large-scale biodiversity data remains limited. Data-driven conservation therefore requires an approach that can integrate and analyze such data efficiently. This study develops a spatial analysis system built on a data lakehouse, using Hadoop and Apache Spark, to map the distribution of flora in Sumatra. Data from the Global Biodiversity Information Facility (GBIF) for 2019–2023 are processed through an Extract–Transform–Load (ETL) pipeline in Apache Spark, organized according to the Medallion architecture (Bronze, Silver, and Gold layers). The results show a substantial improvement in processing performance, up to 16 times faster, and a 28% gain in storage efficiency. These gains make large-scale data integration practical, so that flora distribution patterns can be identified more clearly and comprehensively. Analysis of 12,840 species shows that Near Threatened species dominate (58.4%), followed by Least Concern (40.8%) and Endangered (0.7%), with distributions concentrated in the western and central regions of Sumatra. These findings indicate that a majority of the analyzed flora is at or near risk, and they demonstrate the effectiveness of combining a data lakehouse with spatial analysis to support data-driven conservation decision-making.
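The Bronze, Silver, and Gold stages named in the abstract can be illustrated with a minimal sketch. It uses plain Python rather than Apache Spark so that it runs without a cluster. The field names are modeled on GBIF occurrence exports, but the sample rows, the placeholder species names, the function names `to_silver` and `to_gold`, and the cleaning rules are all illustrative assumptions, not the pipeline used in the study.

```python
from collections import Counter

# Bronze layer: raw rows as ingested, all values still strings.
# Hypothetical sample data; species names are placeholders.
BRONZE = [
    {"species": "Species A", "decimalLatitude": "0.5", "decimalLongitude": "100.4",
     "year": "2020", "iucnRedListCategory": "NT"},
    {"species": "Species A", "decimalLatitude": "0.5", "decimalLongitude": "100.4",
     "year": "2020", "iucnRedListCategory": "NT"},   # exact duplicate
    {"species": "Species B", "decimalLatitude": "-2.1", "decimalLongitude": "101.9",
     "year": "2018", "iucnRedListCategory": "LC"},   # outside 2019-2023
    {"species": "Species C", "decimalLatitude": "", "decimalLongitude": "99.0",
     "year": "2022", "iucnRedListCategory": "EN"},   # missing latitude
    {"species": "Species D", "decimalLatitude": "3.6", "decimalLongitude": "98.7",
     "year": "2023", "iucnRedListCategory": "LC"},
    {"species": "", "decimalLatitude": "1.0", "decimalLongitude": "100.0",
     "year": "2021", "iucnRedListCategory": "LC"},   # missing species name
    {"species": "Species E", "decimalLatitude": "-0.9", "decimalLongitude": "100.4",
     "year": "2019", "iucnRedListCategory": "NT"},
]


def to_silver(rows, start=2019, end=2023):
    """Silver layer: parse types, drop incomplete rows, filter years, deduplicate."""
    seen = set()
    silver = []
    for row in rows:
        try:
            lat = float(row["decimalLatitude"])
            lon = float(row["decimalLongitude"])
            year = int(row["year"])
        except (KeyError, TypeError, ValueError):
            continue  # unparseable or missing coordinate/year
        species = (row.get("species") or "").strip()
        if not species or not (start <= year <= end):
            continue
        key = (species, lat, lon, year)
        if key in seen:
            continue
        seen.add(key)
        silver.append({"species": species, "lat": lat, "lon": lon, "year": year,
                       "category": row.get("iucnRedListCategory") or "NE"})
    return silver


def to_gold(silver):
    """Gold layer: percentage of distinct species in each conservation category."""
    category_by_species = {r["species"]: r["category"] for r in silver}
    counts = Counter(category_by_species.values())
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}


if __name__ == "__main__":
    silver = to_silver(BRONZE)
    print(len(silver), "clean records")
    print(to_gold(silver))
```

In a Spark setting, each function would correspond to a DataFrame transformation whose result is written to its own storage layer; the sketch only shows how the responsibilities are divided between the layers.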
Copyright © 2026