Detecting and treating cancer as early as possible helps patients achieve better outcomes. However, patients who require imaging or biopsy tests often find them difficult to access, because these procedures are constrained by high cost and limited availability in clinical settings. Recent AI methods, particularly deep learning, can address these limitations and make cancer detection more efficient and scalable. In this context, large language models (LLMs) and vision-language models (VLMs) are leading candidates for interpreting multimodal data within AI-driven healthcare systems. Although LLMs excel at processing unstructured clinical text, they have rarely been applied to patient assessment beyond descriptive or summarization tasks that combine images with text and integrate structured and unstructured data. VLMs, in contrast, allow clinicians and medical researchers to examine evidence of cancer from multiple complementary modalities. In this work, we study both LLMs and VLMs for cancer detection, analyzing their architectures, learning mechanisms, and performance on various datasets, and identifying directions for expanding multimodal AI in healthcare. Our results indicate that combining imaging and textual data improves diagnostic accuracy across different types of cancer. Experiments on the MIMIC-III, MIMIC-IV, TCGA, and CAMELYON16/17 datasets show that multimodal transformer models significantly improve the accuracy of biopsy-based diagnosis. In particular, BioViL achieves an AUC-ROC of 0.92 for lung cancer detection, while a fine-tuned CLIP model achieves a comparable 0.91 for colon cancer detection.