This study applies the Vision Transformer (ViT) to soil-type classification and evaluates its accuracy on digital images. ViT is a deep learning architecture that uses self-attention to extract global features from images, enabling it to capture texture and color patterns more comprehensively than convolutional methods. The dataset comprises eight soil types, each with 77 images in “.jpg” format. Each image was processed and augmented, expanding the dataset to 700 images per soil type to mitigate overfitting. The model was trained with a 90%/10% split between the training and validation sets. The results show that ViT achieves an accuracy of 71.4%, with only a small gap between training and validation accuracy, indicating that the method generalizes well. It successfully distinguishes soil types such as alluvial, andosol, laterite, and limestone based on subtle visual differences. These results demonstrate that ViT is effective for soil-type classification and could serve as the basis for a digital image-based soil classification system supporting research and precision agriculture.
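The dataset sizes described above (8 soil types × 700 augmented images, split 90%/10%) can be sketched as follows. This is a minimal illustration, assuming a per-class (stratified) split; the study does not state whether its split was stratified, so that choice is an assumption here.

```python
import random

NUM_CLASSES = 8            # soil types in the study
AUGMENTED_PER_CLASS = 700  # images per class after augmentation
TRAIN_FRACTION = 0.9       # 90%/10% train/validation split

def stratified_split(num_classes, per_class, train_frac, seed=0):
    """Split image indices into train/validation lists per class,
    so every soil type keeps the same train/validation ratio.
    (Stratification is an assumption, not stated in the study.)"""
    rng = random.Random(seed)
    train, val = [], []
    for c in range(num_classes):
        indices = [(c, i) for i in range(per_class)]
        rng.shuffle(indices)
        cut = int(per_class * train_frac)
        train.extend(indices[:cut])
        val.extend(indices[cut:])
    return train, val

train_set, val_set = stratified_split(NUM_CLASSES, AUGMENTED_PER_CLASS, TRAIN_FRACTION)
print(len(train_set), len(val_set))  # 5040 560
```

With these figures, the full augmented dataset holds 5,600 images, giving 5,040 training and 560 validation samples.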
Copyright © 2026