Detecting and classifying coffee cherries by maturity level is a significant challenge in agricultural product processing systems, primarily because cherries of different classes within a single bunch look highly similar. This study develops a multi-class detection and classification system for coffee cherries that integrates YOLOv8 with a Vision Transformer (ViT) as a classification enhancer. YOLOv8 first detects the coffee cherries in bunch images and automatically crops them. The cropped images are then re-classified by the ViT to improve prediction accuracy. Training used a learning rate of 0.0001, a batch size of 16, and runs of 50, 100, and 150 epochs. Evaluation results show that integrating YOLOv8 and ViT improves classification accuracy substantially over using YOLOv8 alone. At 100 epochs, the YOLOv8+ViT model achieved an accuracy of 89.52%, a precision of 90.43%, and a recall of 89.52%, whereas the standalone YOLOv8 model reached an accuracy of only 75.44%, a difference of 14.08 percentage points. These results indicate that the ViT effectively enhances classification performance, particularly for visually similar coffee cherry classes. Integrating the two methods offers a promising approach to image-based multi-class classification in agriculture and in other domains involving complex visual objects.
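
The two-stage flow described above (detect, crop, then re-classify each crop) can be sketched in plain Python. This is only an illustrative outline, not the study's implementation: the detector and classifier are passed in as generic callables standing in for YOLOv8 and the ViT, and the names `crop`, `detect_then_reclassify`, `stub_detector`, `stub_classifier`, the `min_conf` threshold, and the "ripe"/"unripe" labels are all hypothetical choices made for the example.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]             # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]     # (x1, y1, x2, y2) in pixels, x2/y2 exclusive
Detection = Tuple[Box, str, float]  # box, detector label, detector confidence


def crop(image: Image, box: Box) -> Image:
    """Cut the box region out of the image, clamping the box to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]


def detect_then_reclassify(
    image: Image,
    detector: Callable[[Image], List[Detection]],
    classifier: Callable[[Image], str],
    min_conf: float = 0.25,  # assumed threshold, not taken from the study
) -> List[Tuple[Box, str]]:
    """Stage 1 proposes boxes; stage 2 assigns the final label to each crop."""
    results = []
    for box, _detector_label, confidence in detector(image):
        if confidence < min_conf:
            continue  # drop low-confidence detections
        patch = crop(image, box)
        if not patch or not patch[0]:
            continue  # box fell entirely outside the image
        # The second-stage label replaces the detector's own label.
        results.append((box, classifier(patch)))
    return results


# Hypothetical stand-ins for the two models, used only to exercise the flow.
def stub_detector(image: Image) -> List[Detection]:
    return [
        ((0, 0, 2, 2), "ripe", 0.9),
        ((2, 0, 4, 2), "ripe", 0.8),
        ((0, 2, 2, 4), "ripe", 0.1),
    ]


def stub_classifier(patch: Image) -> str:
    pixels = [value for row in patch for value in row]
    return "ripe" if sum(pixels) / len(pixels) >= 128 else "unripe"


demo_image = [
    [200, 200, 50, 50],
    [200, 200, 50, 50],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(detect_then_reclassify(demo_image, stub_detector, stub_classifier))
# [((0, 0, 2, 2), 'ripe'), ((2, 0, 4, 2), 'unripe')]
```

In this toy run the stub detector labels every box "ripe", and the second stage overrides the label of the darker crop, which mirrors the role the abstract assigns to the ViT. In practice the two callables would wrap trained models, and the crops would typically be resized to the classifier's input size before inference.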
                        
                        
                        
                        
                            
Copyright © 2025