Purpose - The global rise in identity fraud and credential-based breaches, with average costs reaching USD 4.88 million, highlights the limitations of unimodal biometric authentication systems, which are vulnerable to spoofing attacks and environmental degradation. Existing multimodal approaches rely on static weighting strategies that cannot adapt to adversarial conditions or degraded data quality. This study proposes a Multimodal Deep Fusion Network (MDFN), an end-to-end deep learning architecture that integrates three biometric modalities: face (visual), voice (audio), and keystroke dynamics (behavioral). The MDFN uses three independent feature extraction streams (ResNet-18 for visual data, a 1D CNN for audio data, and a Bi-LSTM for behavioral data) whose outputs are fused by an Attention-based Adaptive Weighted Fusion mechanism that adjusts the modality weights dynamically according to data quality. Methods - The MDFN was evaluated on the VGGFace2, VoxCeleb1, and CMU Keystroke datasets under both normal and spoofing attack scenarios, using the equal error rate (EER), false acceptance rate (FAR), false rejection rate (FRR), area under the ROC curve (AUC), and attack presentation classification error rate (APCER). Findings - The MDFN achieves an EER of 1.12% and an FAR of 1.12%, outperforming the unimodal baselines (best EER: 4.15%) and static fusion models (EER: 1.95%). It is also robust to spoofing, with an APCER as low as 0.8%. Research Implications - The results indicate that the MDFN is an effective authentication solution for high-security environments. Originality - The key contribution is the attention-based adaptive fusion mechanism, which dynamically adjusts the modality weights based on a real-time quality assessment.
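The idea of quality-aware adaptive weighting can be illustrated with a minimal sketch. The abstract does not specify how the attention scores or the quality assessment are computed, so everything below is an assumption made for illustration: the function name `adaptive_fusion`, the scoring vector `score_vector` (standing in for learned attention parameters), the per-modality quality scores in (0, 1], and the choice to add the log of the quality score to each attention logit before a softmax. It is not the MDFN implementation.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fusion(
    embeddings: Dict[str, List[float]],
    quality: Dict[str, float],
    score_vector: List[float],
    eps: float = 1e-6,
) -> Tuple[Dict[str, float], List[float]]:
    """Fuse per-modality embeddings with quality-aware attention weights.

    Assumption (not from the paper): each logit is a linear relevance
    score of the embedding plus log(quality), so a low-quality modality
    is down-weighted multiplicatively after the softmax.
    """
    names = list(embeddings)
    logits = []
    for name in names:
        relevance = sum(w * x for w, x in zip(score_vector, embeddings[name]))
        logits.append(relevance + math.log(quality[name] + eps))
    weights = softmax(logits)

    # Weighted sum of the modality embeddings, dimension by dimension.
    dim = len(score_vector)
    fused = [
        sum(wt * embeddings[name][i] for wt, name in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Hypothetical 2-dimensional embeddings standing in for the outputs of
# the face, voice, and keystroke streams.
emb = {"face": [1.0, 0.0], "voice": [0.0, 1.0], "keystroke": [0.5, 0.5]}
clean = {"face": 0.9, "voice": 0.9, "keystroke": 0.9}
noisy_audio = {"face": 0.9, "voice": 0.1, "keystroke": 0.9}

w_clean, _ = adaptive_fusion(emb, clean, score_vector=[0.0, 0.0])
w_noisy, _ = adaptive_fusion(emb, noisy_audio, score_vector=[0.0, 0.0])
print("clean audio:", w_clean)
print("noisy audio:", w_noisy)
```

With equal quality scores the three modalities receive equal weight; lowering the voice quality score shifts weight toward face and keystroke. A static fusion scheme, by contrast, would keep the same weights in both cases.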