Congenital heart disease (CHD) is the most common congenital defect and remains a major contributor to neonatal morbidity and mortality. Traditional diagnostic methods that rely on unimodal data, such as echocardiography or ECG alone, often cannot capture the complex, multifunctional, and multifactorial cardiac pathologies seen in neonates. This paper presents an explainable multimodal deep learning framework that integrates four diverse sources of clinical data: echocardiogram videos, ECG signals, other physiological waveforms, and structured electronic health record (EHR) data. We propose a late fusion transformer architecture in which self-attention mechanisms combine modality-specific representations. The model is trained and validated on transparently and reproducibly available benchmark datasets (EchoNet-Dynamic, MIMIC-IV, PhysioNet CapnoBase, and MIT-BIH). The proposed model improves on existing benchmarks, achieving 93% accuracy, 95% sensitivity, and a 0.96 area under the ROC curve. Interpretability modules show that the features driving the model's predictions correspond to diagnostic indicators already established in neonatal care. Moreover, the model maintains consistent performance across data sources and distribution shifts. This research demonstrates the use of explainable deep learning architectures for automated early-stage detection of heart defects in newborns. Future work includes validation through clinical studies and integration with multilingual electronic health records.
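To make the fusion design concrete: since the abstract does not specify implementation details, the following is a minimal PyTorch sketch of a self-attention-based late fusion head, assuming each modality encoder (e.g., a video network for echocardiograms, 1-D convolutional networks for ECG and other physiological signals, an MLP for structured EHR fields) has already produced a fixed-size embedding. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch of self-attention-based late fusion (PyTorch).
# Assumption: upstream, modality-specific encoders have already reduced
# each of the four modalities to one fixed-size embedding per patient.
import torch
import torch.nn as nn


class LateFusionTransformer(nn.Module):
    """Fuses per-modality embeddings with self-attention, then classifies."""

    def __init__(self, embed_dim: int = 256, num_modalities: int = 4,
                 num_heads: int = 4, num_layers: int = 2,
                 num_classes: int = 2):
        super().__init__()
        # Learned embedding tagging which modality each token came from.
        self.modality_embed = nn.Parameter(
            torch.zeros(num_modalities, embed_dim))
        # CLS-style token whose final state summarizes all modalities.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, modality_embeddings: torch.Tensor) -> torch.Tensor:
        # modality_embeddings: (batch, num_modalities, embed_dim),
        # one row per modality, produced by its dedicated encoder.
        tokens = modality_embeddings + self.modality_embed
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        # Self-attention lets every modality token attend to the others.
        fused = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.classifier(fused[:, 0])  # logits from the CLS token


if __name__ == "__main__":
    model = LateFusionTransformer()
    # Four pre-encoded modalities for a batch of 8 patients.
    x = torch.randn(8, 4, 256)
    print(model(x).shape)  # torch.Size([8, 2])
```

One practical note on this design: fusing at the embedding level (late fusion) keeps each modality encoder independent, so a modality missing at inference time can in principle be masked out of the attention computation rather than breaking the whole pipeline. The attention weights over modality tokens also offer a natural hook for the interpretability analysis the abstract describes.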