Deploying deep learning models for speaker identification on devices with limited computational resources requires significant architectural optimization. This study evaluates the performance and noise robustness of a lightweight Audio Spectrogram Transformer (AST) architecture compressed to 570,536 parameters. The proposed method feeds low-resolution Mel-spectrogram representations (64×64 pixels) to a global self-attention mechanism. Testing was conducted with 5-fold cross-validation on a dataset injected with non-stationary environmental noise from the ESC-50 corpus at various signal-to-noise ratio (SNR) levels. Experimental results show that under ideal conditions the model achieves an average validation accuracy of 70.86% ± 2.69% with a macro-averaged F1-score of 0.68 ± 0.03. However, accuracy degrades sharply to 17.61% at an SNR of 5 dB and falls further to 9.21% at an SNR of 0 dB. These findings point to a critical trade-off in which aggressive parameter compression removes the spectral feature redundancy that otherwise acts as an implicit noise filter. The study concludes that although lightweight Transformer mechanisms are highly efficient for Edge AI, they need pre-processing modules or noise-robust training strategies to keep identification reliable in noisy real-world environments.
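To make the SNR levels above concrete, the following is a minimal sketch of mixing a noise clip into a speech signal at a target SNR, using the standard power-ratio definition SNR_dB = 10·log10(P_speech / P_noise). The abstract does not describe the study's mixing procedure, so the function names (`mix_at_snr`, `measured_snr_db`), the looping of short noise clips, the 16 kHz sample rate, and the synthetic sine and Gaussian signals are all assumptions made for illustration. Plain Python lists are used only to keep the sketch self-contained.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(noise) < len(speech):
        # Assumption: loop the noise clip when it is shorter than the utterance.
        reps = math.ceil(len(speech) / len(noise))
        noise = list(noise) * reps
    noise = noise[:len(speech)]

    p_speech = power(speech)
    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise clip is silent")

    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR of a mixture when the clean speech is known."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(power(speech) / power(residual))


# Hypothetical stand-ins: a 1 s, 220 Hz tone as "speech", Gaussian noise as "environment".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

for snr in (5, 0):
    noisy = mix_at_snr(speech, noise, snr)
    print(snr, round(measured_snr_db(speech, noisy), 2))
```

At 0 dB the scaled noise carries the same average power as the speech, which gives some intuition for why accuracy at that level is so much lower than under ideal conditions.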
Copyrights © 2026