This paper presents a zero-shot learning framework for Remote Sensing Scene Classification (RSSC) based on Contrastive Language-Image Pretraining (CLIP). The proposed method addresses the challenge of imbalanced image quantities across categories, which is often encountered in practical applications. Traditional zero-shot learning methods in RSSC leverage pre-trained word embeddings to extract semantic features from category names or descriptions; these embeddings are then fixed during learning without adaptation to visual features, leaving a gap between visual and semantic representations. We integrate a Vision Transformer with CLIP to enhance the alignment between visual and semantic features. Extensive experiments on the WHU-RS19 dataset demonstrate the effectiveness of the proposed framework, showcasing improved classification performance and generalization.
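The CLIP-style zero-shot classification step described above can be sketched as follows: image and class-text features are L2-normalized so their dot products become cosine similarities, and the image is assigned to the class whose text embedding it is closest to. This is a minimal numpy illustration with toy feature vectors and hypothetical scene-class names, not the paper's actual model or dataset.

```python
import numpy as np

def zero_shot_classify(image_feat, text_feats, temperature=0.01):
    """Return a probability over classes for one image, CLIP-style."""
    # L2-normalize so dot products equal cosine similarities.
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    # Scaled similarities act as logits; softmax turns them into probabilities.
    logits = img @ txt.T / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy example: 4-dim features, 3 hypothetical scene classes (illustrative only).
classes = ["airport", "forest", "river"]
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(3, 4))          # stand-in text-encoder outputs
image_feat = text_feats[1] + 0.1 * rng.normal(size=4)  # image near "forest"
probs = zero_shot_classify(image_feat, text_feats)
print(classes[int(np.argmax(probs))])  # → forest
```

In the real framework, `image_feat` would come from the Vision Transformer image encoder and `text_feats` from CLIP's text encoder applied to prompts built from the category names.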
Copyright © 2025