Khairunnisa, Mutiarahmi
Unknown Affiliation

Published: 1 document
Articles

Multi-Head Voting based on Kernel Filtering for Fine-grained Visual Classification
Khairunnisa, Mutiarahmi; Wibowo, Suryo Adhi
JOIV : International Journal on Informatics Visualization Vol 9, No 2 (2025)
Publisher : Society of Visual Informatics

DOI: 10.62527/joiv.9.2.2920

Abstract

Fine-Grained Visual Classification (FGVC) must distinguish objects that differ only subtly, coping with both large intra-class variation and high inter-class similarity. To address this complexity, many methods have been proposed based on feature coding, part-based components, and attention mechanisms that support different phases of classification. Vision Transformers (ViT) have recently emerged as a promising alternative to these more complex methods for FGVC, since they can capture fine-grained details and subtle inter-class differences with higher accuracy. Despite these advances, existing methods still suffer from inconsistent learning performance across the heads and layers of the multi-head self-attention (MHSA) mechanism, which leads to suboptimal classification performance. To enhance the performance of ViT, we propose an approach that modifies the convolutional kernel. By using an array of kernels, our method considerably improves the model's capacity to identify and highlight the specific features that are crucial for classification. Experimental results show that kernel sharpening outperforms other state-of-the-art approaches in accuracy on several datasets, including Oxford-IIIT Pet, CUB-200-2011, and Stanford Dogs. Our findings show that the proposed approach improves overall classification performance by attending to discriminative image regions with greater focus and precision. By using kernel adjustments to improve the ability of Vision Transformers to differentiate fairly complex visual features, our strategy offers a strong response to the problem of fine-grained classification.
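
The abstract does not spell out how the heads vote or which kernels are applied, so the following is only a minimal sketch of the general idea suggested by the title, under stated assumptions: each attention head votes for the patches it attends to most, the vote counts are arranged on the patch grid, and a 2D kernel is applied before the highest-scoring patches are selected. The function names (`head_votes`, `filter_grid`, `select_patches`), the top-k voting rule, and the kernel values in the usage note are all hypothetical and are not taken from the paper.

```python
def head_votes(cls_attention, top_k):
    """Count votes per patch (assumed voting rule, not from the paper).

    cls_attention: one list of attention scores per head, each of length
    num_patches. Every head casts one vote for each of its top_k patches.
    """
    num_patches = len(cls_attention[0])
    votes = [0] * num_patches
    for head in cls_attention:
        ranked = sorted(range(num_patches), key=lambda i: head[i], reverse=True)
        for idx in ranked[:top_k]:
            votes[idx] += 1
    return votes


def filter_grid(votes, height, width, kernel):
    """Apply a 2D kernel to the vote map laid out as a height x width grid.

    Uses zero padding and cross-correlation, which equals convolution
    whenever the kernel is symmetric.
    """
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [0.0] * (height * width)
    for r in range(height):
        for c in range(width):
            acc = 0.0
            for dr in range(kh):
                for dc in range(kw):
                    rr, cc = r + dr - ph, c + dc - pw
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += kernel[dr][dc] * votes[rr * width + cc]
            out[r * width + c] = acc
    return out


def select_patches(scores, num_selected):
    """Return the indices of the num_selected highest-scoring patches, sorted."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_selected])
```

As a usage illustration, two heads over a 3 x 3 patch grid could be passed to `head_votes` with `top_k=2`, the resulting votes filtered with an illustrative kernel such as `[[1, 2, 1], [2, 4, 2], [1, 2, 1]]`, and the result passed to `select_patches`. In this sketch, a patch whose neighbours also received votes scores higher after filtering than an isolated one, which is one plausible reading of how kernel filtering might concentrate attention on discriminative regions.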