Rexcharles Enyinna Donatus
Air Force Institute of Technology, Kaduna



A Comprehensive Survey of Audio-Visual Fusion with Attention Mechanisms: Trends, Challenges, and Future Directions
Rexcharles Enyinna Donatus
Computer Engineering and Applications Journal Vol. 15 No. 2 (2026)
Publisher: Universitas Sriwijaya

DOI: 10.18495/comengapp.v15i2.1332

Abstract

Advances in multimodal deep learning have driven growing interest in attention mechanisms that enhance audio and visual integration for tasks such as emotion recognition, event localization, and human-computer interaction. This comprehensive survey synthesizes recent progress in attention-based fusion methods and highlights the evolution from early fusion strategies to more advanced architectures, including self-attention, cross-modal attention, co-attention, and hierarchical attention. Transformer-based models, in particular, now play a central role in state-of-the-art audio-visual systems because they capture long-range temporal and semantic relationships across modalities. This survey examines how these mechanisms improve contextual understanding and task performance, while also identifying persistent challenges related to interpretability, robustness to noisy or missing modalities, modality imbalance, and computational efficiency. Limitations associated with dataset bias and the lack of standardized evaluation metrics are also discussed. Finally, the survey presents future research directions, including the development of cross-modal transformer architectures, hierarchical attention models, and comprehensive attention diagnostics frameworks to support trustworthy and effective multimodal artificial intelligence systems.