Environmental Sound Classification (ESC) is hampered by data scarcity and by the high variability of unstructured acoustic signals. This study evaluates a visual transfer learning approach in which audio signals are converted to Mel-spectrogram representations and classified with computer vision architectures. A comparative study on the ESC-50 dataset benchmarks visual models (EfficientNet-B0, ResNet-50) against specialized audio models (Pre-trained Audio Neural Networks, PANNs). EfficientNet-B0 trained with MixUp augmentation achieved the best performance, with 83.33% accuracy and an F1-score of 83.50%, outperforming ResNet-50 (80.00%) and substantially surpassing the PANNs (Cnn14) model, which reached only 66.33%. The underperformance of PANNs suggests that the model is over-parameterized for a small-scale dataset. Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations further indicated that EfficientNet-B0 learned semantically relevant features, distinguishing active sound patterns from silence and background noise. These findings indicate that lightweight visual architectures transfer and generalize better than large audio models in data-constrained scenarios.
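For readers unfamiliar with MixUp, the following is a minimal sketch of the general technique, not the authors' training code. It blends two examples and their one-hot labels with a weight drawn from a Beta(alpha, alpha) distribution. The function name `mixup`, the flattened list representation of a spectrogram, and the value `alpha=0.2` are illustrative assumptions; the abstract does not state the settings actually used.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels (hypothetical helper).

    x_a, x_b: flattened spectrogram values (lists of floats, same length).
    y_a, y_b: one-hot label vectors (lists of floats, same length).
    alpha:    Beta distribution parameter; 0.2 is an assumed value.
    """
    # Mixing weight lam lies in [0, 1]; small alpha keeps it near 0 or 1.
    lam = rng.betavariate(alpha, alpha)
    # The same convex combination is applied to inputs and labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

Because the labels are mixed with the same weight as the inputs, the resulting soft label still sums to one, which is why MixUp pairs naturally with a cross-entropy loss over soft targets.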
Copyright © 2026