Manual monitoring of surveillance (CCTV) video is inefficient and prone to human oversight, which motivates automated violence detection that is both fast and accurate. Existing deep learning models are often too computationally heavy for real-time deployment, forcing a trade-off between accuracy and efficiency. This research proposes a lightweight two-stream ConvLSTM architecture to address that trade-off. The method models spatio-temporal relationships efficiently by combining a skeleton representation with change detection, and packs the resulting inputs using a frame grouping technique. A ConvLSTM layer serves as the main temporal model, supported by a SeparableConv2D backbone for efficient feature extraction. The model is trained on the RWF-2000 dataset and evaluated by cross-dataset validation on the Surveillance Camera Fight Dataset to test its generalization capability. The proposed model achieves an accuracy and F1-score of 74.00% and an inference speed of 518.45 FPS. These results indicate that a two-stream architecture combining skeleton representation, frame grouping, and ConvLSTM modeling can yield a robust and fast violence detection system, offering a practical option for real-world monitoring applications.
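The abstract does not specify how change detection or frame grouping is implemented, so the following is only a minimal sketch under stated assumptions: change detection is approximated by absolute differences between consecutive grayscale frames, and frame grouping is taken to mean stacking a fixed number of consecutive frames along the channel axis so that each group forms one time step for a ConvLSTM. The function names `change_maps` and `group_frames`, the group size, and the clip dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np


def change_maps(frames: np.ndarray) -> np.ndarray:
    """Absolute difference between consecutive grayscale frames.

    frames: array of shape (T, H, W). Returns shape (T - 1, H, W).
    Assumption: plain frame differencing stands in for the paper's
    change-detection step, which the abstract does not describe.
    """
    return np.abs(np.diff(frames, axis=0))


def group_frames(frames: np.ndarray, group_size: int) -> np.ndarray:
    """Stack consecutive frames into groups, one group per time step.

    frames: array of shape (T, H, W).
    Returns shape (T // group_size, H, W, group_size); trailing frames
    that do not fill a whole group are dropped.
    Assumption: this channel-stacking layout is illustrative only.
    """
    n_groups = frames.shape[0] // group_size
    usable = frames[: n_groups * group_size]
    grouped = usable.reshape(n_groups, group_size, *frames.shape[1:])
    # Move the within-group axis last so each group resembles a
    # multi-channel image, the per-step input a ConvLSTM expects.
    return np.moveaxis(grouped, 1, -1)


# Hypothetical 31-frame, 64x64 grayscale clip with values in [0, 1).
rng = np.random.default_rng(0)
clip = rng.random((31, 64, 64))

motion = change_maps(clip)                           # shape (30, 64, 64)
motion_stream = group_frames(motion, group_size=5)   # shape (6, 64, 64, 5)
```

In a two-stream setup of this kind, skeleton maps produced by a pose estimator (not shown here) could be grouped in the same way, giving each stream a sequence of shape (time steps, height, width, channels) before the SeparableConv2D and ConvLSTM layers.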
Copyrights © 2025