Multimodal registration of 3D medical images (3D-MReg) plays a key role in several medical applications and remains challenging because it must handle multimodal images and volumetric objects at the same time. Recently, approaches based on convolutional neural networks (CNNs) have been proposed to solve 3D-MReg. However, these techniques cannot preserve the global spatial context required for accurate affine registration, since they rely on convolution and regional clustering operations. To address this limitation, we propose a supervised approach that combines a CNN with a vision transformer (ViT) to predict a dense displacement field (DDF). In a first step, our method uses the ViT to capture global voxel dependencies for an initial rigid alignment. We then exploit the strength of CNNs at focusing on local details: the pre-aligned 3D moving and fixed images are concatenated and passed to a CNN that estimates the DDF, which is then applied to the moving labels. Our method has been validated on a prostate magnetic resonance imaging/transrectal ultrasound (MRI/TRUS) dataset and achieved promising results compared with previous work based only on CNNs.
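The final step, applying a predicted DDF to the moving labels, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `warp_labels`, the `(D, H, W, 3)` layout of the field, the convention that the displacement is added to the fixed-image voxel coordinate to find the sampling location in the moving volume, the nearest-neighbour interpolation, and the clamping at the volume border are all assumptions made for illustration.

```python
import numpy as np


def warp_labels(moving_labels, ddf):
    """Warp a 3D label volume with a dense displacement field (DDF).

    moving_labels: integer array of shape (D, H, W).
    ddf: float array of shape (D, H, W, 3). Assumed convention: ddf[z, y, x]
         is the voxel offset added to the output coordinate (z, y, x) to
         obtain the location to sample in moving_labels.
    """
    if ddf.shape != moving_labels.shape + (3,):
        raise ValueError("ddf must have shape moving_labels.shape + (3,)")

    # Identity grid: grid[z, y, x] == (z, y, x).
    axes = [np.arange(size) for size in moving_labels.shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)

    # Nearest-neighbour sampling keeps label values discrete.
    coords = np.rint(grid + ddf).astype(np.intp)

    # Clamp sampling locations that fall outside the volume.
    upper = np.array(moving_labels.shape) - 1
    coords = np.clip(coords, 0, upper)

    return moving_labels[coords[..., 0], coords[..., 1], coords[..., 2]]


# Toy example: one labelled voxel, displaced by one voxel along the first axis.
labels = np.zeros((4, 4, 4), dtype=np.int64)
labels[1, 1, 1] = 1
shift = np.zeros((4, 4, 4, 3))
shift[..., 0] = -1.0  # each output voxel samples from the previous slice
warped = warp_labels(labels, shift)
```

Nearest-neighbour sampling is chosen here because label maps are categorical; a trainable pipeline would normally resample the images themselves with a differentiable (for example trilinear) interpolation instead.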
Copyrights © 2026