Background of study: Accurate fruit classification is vital for agricultural automation, yet traditional manual methods are often subjective and inefficient. Convolutional Neural Networks (CNNs) are effective image classifiers but can struggle to capture global context in fine-grained tasks. Vision Transformers (ViTs), inspired by NLP architectures, offer global self-attention mechanisms that may improve classification in complex scenarios.

Aims and scope of paper: This study compares the performance of EfficientNet-B0 (a CNN) and ViT-B/16 (a Vision Transformer) on a fruit classification task covering five fruit types. The goal is to evaluate their strengths and weaknesses under controlled experimental conditions on a moderately sized dataset.

Methods: A dataset of 10,000 fruit images was preprocessed with standard augmentation techniques and split into training and validation sets. Both models were fine-tuned from pretrained weights, and performance was evaluated using accuracy, precision, recall, F1-score, and confusion matrices.

Results: EfficientNet-B0 achieved higher overall accuracy (94%) than ViT-B/16 (92%). The CNN performed consistently across all classes, excelling in particular on bananas and strawberries; ViT-B/16 showed superior results for strawberries but struggled with apples. Confusion matrices revealed class-specific strengths and weaknesses for both models.

Conclusion: EfficientNet-B0 is better suited to general fruit classification owing to its balanced performance, while ViT-B/16 excels at capturing fine-grained visual features. A hybrid approach may leverage both models' strengths for enhanced performance in real-world applications.
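The abstract does not name an implementation framework, so the following is a minimal sketch of the described pipeline assuming PyTorch/torchvision for the models and scikit-learn for the metrics. The two backbones, the head replacement for five classes, and the reported metric set come from the abstract; the class names (beyond apple, banana, and strawberry), the augmentation choices, and all hyperparameters are illustrative assumptions.

    import torch
    import torch.nn as nn
    from torchvision import models, transforms
    from sklearn.metrics import classification_report, confusion_matrix

    NUM_CLASSES = 5
    # apple, banana, strawberry appear in the abstract; grape and orange are placeholders
    CLASS_NAMES = ["apple", "banana", "grape", "orange", "strawberry"]

    # "Standard augmentation techniques" (assumed: crop, flip, ImageNet normalization)
    train_tf = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # Load both backbones with pretrained ImageNet weights and swap in 5-way heads
    effnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
    effnet.classifier[1] = nn.Linear(effnet.classifier[1].in_features, NUM_CLASSES)

    vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
    vit.heads.head = nn.Linear(vit.heads.head.in_features, NUM_CLASSES)

    # Fine-tuning loop omitted; a typical setup (assumed, not stated in the paper)
    # would minimize cross-entropy over the 10,000-image train/validation split.

    @torch.no_grad()
    def evaluate(model, loader, device="cpu"):
        """Collect validation predictions and report the paper's metric set."""
        model.eval().to(device)
        y_true, y_pred = [], []
        for images, labels in loader:
            logits = model(images.to(device))
            y_pred.extend(logits.argmax(dim=1).cpu().tolist())
            y_true.extend(labels.tolist())
        print(classification_report(y_true, y_pred, target_names=CLASS_NAMES))
        print(confusion_matrix(y_true, y_pred))

Here classification_report prints per-class precision, recall, and F1-score alongside overall accuracy, matching the metrics the study reports, while the confusion matrix exposes the class-specific errors (e.g., ViT-B/16's difficulty with apples) discussed in the results.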