Regita Azzahra, Adela
Unknown Affiliation

Published: 1 document
Early Fusion of Visual and Ingredient Representations for Multimodal Food Classification
Rahma Salsabila, Navira; Regita Azzahra, Adela; Utaminingrum, Fitri; Henryranu Prasetio, Barlian
JOIN (Jurnal Online Informatika) Vol 11 No 1 (2026)
Publisher : Department of Informatics, UIN Sunan Gunung Djati Bandung

DOI: 10.15575/join.v11i1.1725

Abstract

Identifying the most appropriate food dish based on available kitchen ingredients remains a practical yet challenging everyday task. To address this, this study develops an intelligent food classification system using a multimodal approach. We propose a multimodal food classification method that performs early fusion, combining visual and textual features extracted with the Contrastive Language–Image Pretraining (CLIP) model. Features from food images and ingredient lists are fused and classified by a two-layer multilayer perceptron. The model is evaluated on the Recipes5k dataset, comprising 4,826 samples across 101 food categories. Results show that the proposed multimodal model achieves 91.32% accuracy, outperforming text-only (85.65%) and image-only (57.26%) baselines. The main contribution of this work lies in demonstrating the effectiveness of early fusion for combining cross-modal representations in food classification. Unlike prior methods, our model supports flexible inference with either text or image input, enabling practical real-world applications. These findings highlight the potential of multimodal learning for food recommendation systems, offering both accuracy and contextual relevance beyond unimodal approaches.
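The early-fusion pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: random vectors stand in for the CLIP image and ingredient-text embeddings (512-d is assumed, matching CLIP ViT-B/32), the hidden width of 256 is an assumed hyperparameter, and the MLP weights are untrained. Only the 101-way output matches the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 512      # assumed CLIP embedding size (ViT-B/32)
HIDDEN = 256       # assumed hidden width of the two-layer MLP head
NUM_CLASSES = 101  # food categories in Recipes5k, per the abstract

# Stand-ins for the CLIP encoder outputs; in the real pipeline these
# come from CLIP's image encoder and text encoder respectively.
img_emb = rng.standard_normal(EMB_DIM)
txt_emb = rng.standard_normal(EMB_DIM)

# Early fusion: concatenate the two modality vectors into one input.
fused = np.concatenate([img_emb, txt_emb])  # shape (1024,)

# Untrained two-layer MLP classifier head.
W1 = rng.standard_normal((HIDDEN, 2 * EMB_DIM)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((NUM_CLASSES, HIDDEN)) * 0.01
b2 = np.zeros(NUM_CLASSES)

h = np.maximum(0.0, W1 @ fused + b1)  # ReLU hidden layer
logits = W2 @ h + b2                  # one logit per food class
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax over the 101 categories

pred = int(np.argmax(probs))
print(fused.shape, probs.shape, pred)
```

The concatenation step is what makes this "early" fusion: both modalities are joined before any classification layers, so the MLP can learn cross-modal interactions; zeroing one modality's vector at inference time is one simple way such a head can accept image-only or text-only input.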