Coffee is one of the most popular beverage commodities consumed worldwide. Selecting high-quality coffee beans plays a vital role in ensuring that the resulting coffee has superior taste and aroma. Over the years, various deep learning models based on Convolutional Neural Networks (CNNs) have been developed to classify coffee bean images with impressive accuracy. However, recent advances in deep learning have introduced transformer-based architectures that show great promise for image classification tasks. By incorporating a self-attention module, transformer models excel at capturing global context within images, which can yield improved and more consistent performance compared to CNN-based models. This study trains and evaluates transformer-based deep learning models for the classification of coffee bean images. Experimental results demonstrate that transformer models such as the Vision Transformer (ViT) and the Swin Transformer outperform traditional CNN-based models. The Swin Transformer achieves excellent performance on the coffee bean image classification task, with 95.13% accuracy and a 90.21% F1-score, while ViT achieves 94.47% accuracy and an 88.93% F1-score. These results indicate a strong capability to accurately identify and classify different types of coffee beans, suggesting that transformer-based approaches could be a better alternative for coffee bean image classification in the future.
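To make the self-attention mechanism mentioned above concrete, the sketch below shows scaled dot-product self-attention over a sequence of image-patch embeddings in plain NumPy. This is an illustrative toy (no learned query/key/value projections, no multi-head structure), not the ViT or Swin Transformer implementation used in the study.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over patch embeddings.

    x: (num_patches, dim) array. For simplicity, queries, keys, and
    values are all the input itself (no learned projections).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # pairwise patch similarities
    # Row-wise softmax turns similarities into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output patch is a weighted mix of ALL patches: global context
    return weights @ x

patches = np.random.rand(16, 8)  # e.g. 16 image patches, 8-dim embeddings
out = self_attention(patches)
print(out.shape)  # (16, 8)
```

Because every output row aggregates information from all patches, a single attention layer relates distant image regions, whereas a convolution only mixes pixels within its local receptive field.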