The integration of multimodal inputs, including text, voice, and visual data, into conversational artificial intelligence (AI) systems marks a significant shift toward more natural and effective human–computer interaction. This narrative synthesis review examines recent research on the technological foundations, applications, challenges, and future directions of multimodal conversational AI. Drawing on prior studies, the review analyzes key frameworks and models, including Situated Interactive MultiModal Conversations (SIMMC) and DialogueTRM, which employ multimodal fusion to support emotion recognition and context-aware interaction. The synthesis indicates that combining multiple modalities enhances system accuracy, strengthens user engagement, and enables richer contextual understanding in conversational settings. At the same time, the review identifies major challenges related to data synchronization, privacy protection, computational complexity, and bias mitigation. Based on these findings, the study highlights the need for future research on adaptive fusion techniques, cross-cultural usability, ethical AI development, and the incorporation of emerging modalities such as haptic and physiological data. This review contributes to the growing scholarship on conversational AI by providing an integrated understanding of the opportunities and limitations of multimodal systems and by outlining directions for the development of more responsive, inclusive, and ethically grounded AI interactions.
Copyrights © 2026