The rapid growth of digital and social media platforms has produced massive streams of unstructured media data. Current big data approaches, however, are not tailored to the high volume and velocity of media data, which consists largely of lengthy, unstructured full-text messages. This study proposes a modular, stream-oriented big data architecture for media data. The architecture comprises data crawlers, a message broker, machine learning modules, persistent storage, and analytical dashboards, connected through a publish-subscribe communication pattern that enables asynchronous, decoupled data processing. The system integrates IndoBERT, a transformer-based model fine-tuned for the Indonesian language, to perform real-time semantic tagging within the streaming pipeline. A prototype has been implemented with open-source technologies on an on-premise cluster. The primary novelty is the integration and operationalization of a large transformer-based language model (IndoBERT) within a low-latency streaming pipeline. Architectural quality is evaluated quantitatively using Martin's Instability Metric and Coupling Between Objects (CBO), which confirm high modularity across components. The system achieves an end-to-end latency of 3.121 seconds and a deep learning latency of 2.333 seconds, and processes 32,102 messages per day. This reflects an explicit trade-off: the 2.333-second inference cost is accepted in exchange for deeper semantic analysis. The experimental results underscore the feasibility of deploying scalable, vendor-neutral media analytics platforms for institutions that are highly sensitive to privacy and cost. This study thus presents a reference architecture for scalable, intelligent, real-time media analytics systems suited to public sector and academic deployments that require data privacy and control over infrastructure.
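To illustrate how Martin's Instability Metric rewards a publish-subscribe design, the following minimal Python sketch computes I = Ce / (Ca + Ce) for each component, where Ce is the number of outgoing (efferent) dependencies and Ca the number of incoming (afferent) ones. The component names follow the abstract, but the dependency edges in `DEPENDENCIES` are assumptions made purely for illustration; they are not the dependency graph measured in the study, and the resulting scores are not the study's reported values.

```python
from collections import defaultdict

# Hypothetical dependency edges: component -> components it depends on.
# In a publish-subscribe layout, most components know only the broker.
DEPENDENCIES = {
    "crawler": {"broker"},
    "ml_module": {"broker"},
    "storage": {"broker"},
    "dashboard": {"storage"},
    "broker": set(),
}


def instability(deps):
    """Return Martin's instability I = Ce / (Ca + Ce) per component."""
    # Efferent coupling: how many components this one depends on.
    efferent = {name: len(targets) for name, targets in deps.items()}

    # Afferent coupling: how many components depend on this one.
    afferent = defaultdict(int)
    for targets in deps.values():
        for target in targets:
            afferent[target] += 1

    scores = {}
    for name in deps:
        ce, ca = efferent[name], afferent[name]
        total = ce + ca
        # An isolated component has no coupling; report 0.0 by convention.
        scores[name] = ce / total if total else 0.0
    return scores


scores = instability(DEPENDENCIES)
for name, value in sorted(scores.items()):
    print(f"{name}: I = {value:.2f}")
```

Under these assumed edges, the broker is maximally stable (many dependents, no outgoing dependencies), while the crawlers, machine learning module, and dashboard are maximally unstable, meaning they can be changed or replaced without affecting other components.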
Copyrights © 2025