Voice Activity Detection (VAD) is a crucial pre-processing step for speech technologies, yet standard Conformer architectures suffer from computational complexity that grows quadratically with input sequence length. This study introduces the Conformer-Performer, a novel architecture that replaces standard multi-head self-attention with the Fast Attention Via positive Orthogonal Random features (FAVOR+) mechanism to achieve linear complexity. The objective was to develop an efficient VAD model that maintains high accuracy and suits resource-constrained applications. The model was trained on the multilingual FLEURS dataset using a teacher-student approach and extensive data augmentation. Experimental results show that the Conformer-Performer achieves an F1-score of 98.29%, highly competitive with the standard Conformer's 98.41%, while reducing peak GPU memory usage 7.8-fold and accelerating CPU inference 3.46-fold. Furthermore, the proposed model significantly outperforms the SileroVAD baseline. These findings confirm that the Conformer-Performer offers a compelling balance of accuracy and efficiency, making it well suited to real-time, on-device speech processing.
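To illustrate how FAVOR+ attains linear complexity, the sketch below approximates softmax attention with positive random features: queries and keys are mapped through a randomized feature map φ so that φ(q)·φ(k) ≈ exp(q·k/√d), which lets the attention output be computed as φ(Q)(φ(K)ᵀV) in O(L·m·d) rather than O(L²·d). This is a minimal NumPy sketch, not the paper's implementation: the random projections here are plain Gaussian (the full FAVOR+ mechanism orthogonalizes them to reduce variance), and the function name and feature count are illustrative.

```python
import numpy as np

def favor_plus_attention(Q, K, V, num_features=64, seed=0):
    """Linear-time approximation of softmax attention via positive
    random features (FAVOR+-style sketch; Gaussian, non-orthogonal
    projections for brevity)."""
    L, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, d))  # random projection matrix

    def phi(X):
        # Positive feature map: exp(w.x - ||x||^2 / 2) / sqrt(m),
        # applied to inputs pre-scaled by d^{-1/4} so that
        # phi(q).phi(k) estimates exp(q.k / sqrt(d)).
        Xs = X / d**0.25
        proj = Xs @ W.T
        sq_norm = 0.5 * np.sum(Xs**2, axis=1, keepdims=True)
        return np.exp(proj - sq_norm) / num_features**0.5

    Qp, Kp = phi(Q), phi(K)        # (L, m) each
    KV = Kp.T @ V                  # (m, d): aggregated once, O(L*m*d)
    Z = Qp @ Kp.sum(axis=0)        # per-query normalizer
    return (Qp @ KV) / Z[:, None]  # (L, d), linear in sequence length L
```

Because φ(K)ᵀV is a fixed-size (m × d) summary of all keys and values, memory and compute no longer scale with L², which is the source of the reported GPU-memory and CPU-latency savings.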
Copyright © 2025