Identifying the most appropriate food dish based on available kitchen ingredients remains a practical yet challenging task in everyday life. To address this, this study develops an intelligent food classification system using a multimodal approach: we propose a method that performs early fusion, combining visual and textual features extracted with the Contrastive Language-Image Pre-training (CLIP) model. Features from food images and ingredient lists are fused and classified through a two-layer multilayer perceptron. The model is evaluated on the Recipes5k dataset with 4,826 samples across 101 food categories. Results show that the proposed multimodal model achieves 91.32% accuracy, outperforming the text-only (85.65%) and image-only (57.26%) baselines. The main contribution of this work lies in demonstrating the effectiveness of early fusion for combining cross-modal representations in food classification. Unlike prior methods, our model supports flexible inference with either text or image input, enabling practical real-world applications. These findings highlight the potential of multimodal learning for food recommendation systems, offering both accuracy and contextual relevance beyond unimodal approaches.
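For illustration, a minimal sketch of how such an early-fusion classifier might be assembled in PyTorch. The fusion operator (concatenation), embedding dimension (512, as in CLIP ViT-B/32), and hidden width are assumptions not stated in the abstract, and the mechanism that allows single-modality inference is not shown.

```python
import torch
import torch.nn as nn


class EarlyFusionFoodClassifier(nn.Module):
    """Fuses CLIP image and text embeddings and classifies with a two-layer MLP.

    Concatenation as the fusion operation and the hidden width of 512 are
    illustrative assumptions; the abstract does not specify these details.
    """

    def __init__(self, clip_dim: int = 512, hidden_dim: int = 512, num_classes: int = 101):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * clip_dim, hidden_dim),  # fused (image + text) feature vector
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),   # 101 food categories, as in Recipes5k
        )

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Early fusion: concatenate the two modality embeddings before classification.
        fused = torch.cat([image_emb, text_emb], dim=-1)
        return self.mlp(fused)


if __name__ == "__main__":
    # Dummy CLIP-sized embeddings (batch of 4) to show the expected shapes.
    model = EarlyFusionFoodClassifier()
    image_emb = torch.randn(4, 512)
    text_emb = torch.randn(4, 512)
    logits = model(image_emb, text_emb)
    print(logits.shape)  # torch.Size([4, 101])
```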