Multimodal Medical Imaging Fusion (MMIF) is the integration of complementary information from multiple imaging modalities, thereby addressing the limitations of evaluating a patient with a single imaging modality and increasing diagnostic accuracy. This review provides a dedicated synthesis of deep learning architectures in MMIF, examining CNN-based hybrids, attention-enhanced transformers, GAN-driven unsupervised fusion, and emerging diffusion models. The state of the art in MMIF can be classified into three levels of fusion: (1) pixel level, in which raw pixel intensity values are fused to preserve spatial detail; (2) feature level, in which features derived from textures, edges, and region-of-interest (ROI) descriptors are fused; and (3) decision level, in which the independent outputs for each source are combined using ensemble or rule-based methods to produce a single, integrated output, potentially improving its interpretability. The use of AI algorithms improves the quality of fusion results. However, clinicians' confidence in deep-learning-based models remains limited because these models often fail to generalise across scanners, protocols, and medical systems. This analysis shows that clinical AI systems must be developed with interpretability as a core attribute, so that they explain how each modality contributes to the final decision, and that they need a fusion policy that preserves the information required for accurate diagnostic determinations from fused images. Beyond more sophisticated algorithms, future progress in MMIF will require collaboration between developers and clinicians to turn fused images into reliable diagnostic tools for precision medicine.
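The three fusion levels named above can be illustrated with a minimal NumPy sketch. It is a toy illustration under stated assumptions, not a method taken from the reviewed literature: the two inputs are assumed to be co-registered, same-shape intensity images, and the function names (`pixel_level_fusion`, `edge_features`, `feature_level_fusion`, `decision_level_fusion`), the weighted-average rule, the gradient-based features, and the probability-averaging rule are all hypothetical choices made for illustration.

```python
import numpy as np


def pixel_level_fusion(img_a, img_b, weight_a=0.5):
    """Pixel level: weighted average of raw intensities (assumed rule)."""
    return weight_a * img_a + (1.0 - weight_a) * img_b


def edge_features(img):
    """Toy feature extractor: mean and std of the gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))
    magnitude = np.hypot(gx, gy)
    return np.array([magnitude.mean(), magnitude.std()])


def feature_level_fusion(img_a, img_b):
    """Feature level: concatenate per-modality feature vectors."""
    return np.concatenate([edge_features(img_a), edge_features(img_b)])


def decision_level_fusion(prob_a, prob_b):
    """Decision level: average per-modality class probabilities, take argmax."""
    fused = (np.asarray(prob_a, dtype=float) + np.asarray(prob_b, dtype=float)) / 2.0
    return int(np.argmax(fused)), fused


# Synthetic stand-ins for two co-registered modalities (not real scans).
rng = np.random.default_rng(0)
ct = rng.random((64, 64))
mri = rng.random((64, 64))

fused_image = pixel_level_fusion(ct, mri, weight_a=0.6)
fused_features = feature_level_fusion(ct, mri)
# Hypothetical per-modality classifier outputs for two classes.
label, probs = decision_level_fusion([0.7, 0.3], [0.4, 0.6])
```

In this sketch the pixel-level output keeps the full image grid, the feature-level output is a compact descriptor, and the decision-level output is a single label with its fused probabilities, which mirrors the trade-off between spatial detail and interpretability described above.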
Copyrights © 2026