Classifying environmental sounds is challenging because such sounds are inherently unstructured. This research introduces a deep learning method for classifying urban audio with MobileViT, a lightweight, general-purpose architecture. The study uses the UrbanSound8k dataset, augmented with noise injection, time stretching, pitch shifting, and mixup. Augmentation is essential given the dataset's limited size and helps produce a model that is more robust in practical use. After augmentation, each audio clip is preprocessed to a standard length and converted into a mel spectrogram, which matches the image-like input that MobileViT expects. The model is trained with both default and tuned hyperparameters and reaches a peak accuracy above 80%. Combining augmented data with hyperparameter tuning yields an improvement of approximately 15% over the baseline MobileViT configuration while keeping inference time at roughly 7 milliseconds. The findings indicate that MobileViT is a promising solution for a range of environmental sound applications.
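As an illustration of the augmentation and length-standardization steps named above, the following is a minimal NumPy sketch and not the study's implementation. The function names, the 22,050 Hz sample rate, the 4-second target length, the 20 dB signal-to-noise ratio, and the mixup `alpha` of 0.2 are all assumptions chosen for the example. Time stretching, pitch shifting, and the mel spectrogram conversion are omitted because they normally rely on an audio library such as librosa.

```python
import numpy as np


def add_noise(wave, snr_db, rng):
    """Inject white Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def fix_length(wave, target_len):
    """Truncate or zero-pad a 1-D waveform to exactly target_len samples."""
    if len(wave) >= target_len:
        return wave[:target_len]
    return np.pad(wave, (0, target_len - len(wave)))


def mixup(x1, y1, x2, y2, alpha, rng):
    """Blend two examples and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2


rng = np.random.default_rng(0)
sample_rate = 22050             # assumed sample rate, not taken from the study
target_len = 4 * sample_rate    # assumed 4-second standard length
num_classes = 10                # UrbanSound8k has 10 sound classes

# Two synthetic stand-in clips of different durations (2 s and 5 s).
clip_a = np.sin(2 * np.pi * 440 * np.arange(2 * sample_rate) / sample_rate)
clip_b = np.sin(2 * np.pi * 220 * np.arange(5 * sample_rate) / sample_rate)
label_a = np.eye(num_classes)[3]   # hypothetical class indices
label_b = np.eye(num_classes)[7]

# Augment, standardize length, then mix the two examples.
noisy_a = fix_length(add_noise(clip_a, snr_db=20, rng=rng), target_len)
fixed_b = fix_length(clip_b, target_len)
mixed_wave, mixed_label = mixup(noisy_a, label_a, fixed_b, label_b, alpha=0.2, rng=rng)

print(mixed_wave.shape, mixed_label.sum())
```

In this sketch the short clip is zero-padded and the long clip is truncated, so every example reaches the spectrogram stage with the same number of samples. The mixed label stays a valid probability vector because the two blend weights sum to one.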
Copyright © 2025