Educational institutions increasingly depend on heterogeneous digital systems, yet many analytics initiatives remain fragmented across student information, registration, assessment, and learning platforms. This paper proposes a lakehouse-oriented big data infrastructure for educational analytics and validates it through a reproducible early-risk prediction study using the Open University Learning Analytics Dataset (OULAD). The study integrates five public OULAD tables (student information, course registration, assessment metadata, student assessment submissions, and course presentation metadata) into temporally valid feature tables aligned to the student–module–presentation level. We define a windowed feature engineering framework that constructs actionable indicators, such as submission rate, weighted completion score, average submission lag, and assessment coverage gap, at 30%, 50%, 70%, and 100% of the course timeline. Two supervised classifiers, logistic regression and random forest, are evaluated under a stratified 80/20 train/test protocol. The results show that administrative data alone provides weak discrimination (AUC 0.673), whereas integrating mid-course assessment evidence substantially improves performance. At the 50% course window, the random-forest model achieves an AUC of 0.947, an F1 of 0.879, and a recall of 0.829; even at the 30% window the model already reaches an AUC of 0.904. These findings demonstrate that the value of educational prediction depends not only on model choice but also on the data integration architecture. The paper contributes (i) a lakehouse-oriented reference architecture for higher-education analytics, (ii) a temporally constrained feature engineering strategy for early-warning systems, and (iii) an empirical ablation showing that multi-source integration yields large and operationally meaningful gains.
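To make the windowed feature construction and evaluation protocol concrete, the following is a minimal sketch (not the paper's actual code): it derives submission rate, weighted completion score, and average submission lag from a toy assessment-submission table, using only evidence observable before a window cutoff, and fits a random forest. All column names, the toy data, and the cutoff value are illustrative assumptions, not OULAD schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy submissions table: one row per assessment attempt (illustrative,
# not the OULAD schema). Days are relative to the presentation start.
subs = pd.DataFrame({
    "student": [1, 1, 2, 2, 3, 4, 4, 5],
    "due_day": [30, 60, 30, 60, 30, 30, 60, 60],
    "sub_day": [28, 65, 31, 59, 40, 29, 58, 70],
    "score":   [80, 70, 55, 60, 30, 90, 85, 20],
    "weight":  [25, 25, 25, 25, 25, 25, 25, 25],
})
# Binary at-risk labels per student (1 = fail/withdraw), also illustrative.
labels = pd.Series({1: 0, 2: 0, 3: 1, 4: 0, 5: 1}, name="at_risk")

def window_features(df, cutoff_day, n_due):
    """Features built only from submissions observable before cutoff_day,
    so no information from after the prediction window leaks in."""
    w = df[df["sub_day"] <= cutoff_day].copy()
    w["lag"] = w["sub_day"] - w["due_day"]          # late if positive
    w["wscore"] = w["score"] * w["weight"]
    g = w.groupby("student")
    feats = pd.DataFrame({
        "submission_rate": g.size() / n_due,        # submitted / expected
        "weighted_score": g["wscore"].sum() / g["weight"].sum(),
        "avg_lag": g["lag"].mean(),
    })
    # Students with no submissions in the window get zero-filled rows.
    return feats.reindex(labels.index).fillna(0)

# Mid-course window: assessments observable up to (an assumed) day 62.
X = window_features(subs, cutoff_day=62, n_due=2)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
```

In the paper's setting the same construction would be repeated at each cutoff (30%, 50%, 70%, 100% of the timeline) and scored with AUC/F1/recall on a stratified held-out split; the toy data here is far too small for a meaningful split, so the sketch stops at fitting.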
Copyright © 2025