Advances in multimodal deep learning have driven growing interest in attention mechanisms that enhance audio-visual integration for tasks such as emotion recognition, event localization, and human-computer interaction. This survey synthesizes recent progress in attention-based fusion methods, tracing the evolution from early fusion strategies to more advanced architectures, including self-attention, cross-modal attention, co-attention, and hierarchical attention. Transformer-based models in particular now play a central role in state-of-the-art audio-visual systems because they capture long-range temporal and semantic relationships across modalities. The survey examines how these mechanisms improve contextual understanding and task performance, while also identifying persistent challenges related to interpretability, robustness to noisy or missing modalities, modality imbalance, and computational efficiency. Limitations arising from dataset bias and the lack of standardized evaluation metrics are also discussed. Finally, the survey outlines future research directions, including cross-modal transformer architectures, hierarchical attention models, and comprehensive attention-diagnostics frameworks, to support trustworthy and effective multimodal artificial intelligence systems.
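To make the cross-modal attention pattern discussed above concrete, the following is a minimal sketch, assuming PyTorch and its standard nn.MultiheadAttention module; the class name, feature dimensions, and sequence lengths are hypothetical choices for illustration, not a reference implementation from any surveyed system. Audio-stream queries attend over visual-stream keys and values, so each audio timestep can pool context from all video frames:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Hypothetical minimal sketch: audio features attend to visual features."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Queries come from the audio stream; keys and values come from the
        # visual stream, so attention weights express audio-to-visual relevance.
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        # Residual connection preserves the original audio signal.
        return self.norm(audio + fused)

# Example: batch of 2 clips, 50 audio frames, 16 video frames, 256-d features.
audio = torch.randn(2, 50, 256)
visual = torch.randn(2, 16, 256)
fused = CrossModalAttention()(audio, visual)
print(fused.shape)  # torch.Size([2, 50, 256])
```

Swapping the query and key/value roles yields the symmetric visual-to-audio direction; co-attention architectures typically run both directions in parallel.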