Biometric authentication systems still face fundamental limitations, particularly unimodal approaches, which are vulnerable to environmental variations and visual spoofing attacks. To address these challenges, multimodal biometrics that integrate physiological and behavioral traits have become increasingly relevant. This study presents an analytical review of recent research in visual multimodal biometrics, focusing on score-level fusion strategies and on the integration of static face recognition with dynamic lip movement analysis as a non-vocal authentication mechanism. The literature synthesis indicates that score-level fusion is the most flexible and stable approach for combining heterogeneous biometric modalities, especially when integrating static spatial features with dynamic temporal patterns. Furthermore, Transformer-based deep learning architectures are identified as having significant potential for modeling the temporal dependencies of lip movements. The study also highlights key security challenges, particularly presentation attacks and visual-only deepfakes, and emphasizes the importance of visual dynamics–based liveness detection as an integral component of biometric authentication systems. Based on these findings, the study formulates a conceptual framework for visual multimodal biometric authentication that integrates identity verification and liveness detection within a unified process. It also identifies future research opportunities, including self-supervised learning, model optimization for resource-constrained devices, and the design of more discriminative visual passphrases.
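To make the idea of score-level fusion with an integrated liveness check concrete, the following is a minimal illustrative sketch, not the framework proposed in the study. It assumes min-max normalization followed by a weighted sum, which is one common fusion rule among several. All function names, score ranges, weights, and the decision threshold are hypothetical values chosen only for illustration.

```python
def min_max_normalize(score, lo, hi):
    """Map a raw matcher score to [0, 1], clamping values outside [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (score - lo) / (hi - lo)))


def fuse_scores(face_score, lip_score, face_weight=0.5):
    """Weighted-sum fusion of two scores that are already normalized to [0, 1]."""
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    return face_weight * face_score + (1.0 - face_weight) * lip_score


def authenticate(face_raw, lip_raw, liveness_ok, threshold=0.6):
    """Accept only if the liveness check passes and the fused score meets the threshold.

    Assumed (hypothetical) raw ranges: the face matcher outputs [0, 1],
    the lip-movement matcher outputs [0, 100].
    """
    if not liveness_ok:
        # A failed liveness check rejects the attempt regardless of match scores.
        return False
    face = min_max_normalize(face_raw, lo=0.0, hi=1.0)
    lip = min_max_normalize(lip_raw, lo=0.0, hi=100.0)
    return fuse_scores(face, lip) >= threshold
```

Because each modality is reduced to a single normalized score before combination, the face and lip matchers can be developed or replaced independently, which is the flexibility usually attributed to score-level fusion.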
Copyright © 2025