Human movement detection is the task of classifying and interpreting human behavior from sensor data, and it has numerous real-world applications. In residential surveillance, movement tracking can monitor the behavioral patterns of senior citizens and quickly identify risky events such as falls; it can also help an autonomous navigation system analyze and forecast pedestrian walking patterns. Notably, a sensor-fusion system of this kind remains robust under changing conditions such as weather or lighting, where camera-only approaches falter.

This study presents CrossTrans-Surv, an AI-based cross-attention transformer framework for multimodal sensor fusion in smart surveillance and human activity detection systems. Drawing inspiration from STAR-Transformer, CrossTrans-Surv integrates asynchronous visual (RGB), infrared/thermal, and LiDAR modalities via cross-attention layers that learn shared representations across the different data types. Paired multispectral inputs offer complementary information that increases the robustness and dependability of recognition in real-world applications. In contrast to earlier CNN-based studies, our network uses the Transformer architecture to integrate global contextual information and to learn long-range dependencies during the feature extraction step. The Transformer is then fed RGB frames and body-part heatmaps at multiple temporal and spatial resolutions. Because the skeleton heatmaps are already salient features compared with raw RGB frames, we employ fewer attention layers in the skeleton stream. Our methodology is well suited to real-world AI-powered surveillance applications because, in addition to its performance advantages, it provides interpretability through attention maps and scalability through its modular design.
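To make the fusion mechanism concrete, the following is a minimal PyTorch sketch of a cross-attention fusion block over per-modality token sequences. The class name, dimensions, and sequential pairwise fusion order are illustrative assumptions, not the paper's exact implementation; it assumes each modality has already been encoded into tokens of a common embedding dimension.

```python
# Minimal sketch of cross-attention fusion across modality token streams
# (hypothetical names and hyperparameters; the exact architecture may differ).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses a query modality with a context modality via cross-attention."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, query_tokens, context_tokens):
        # Queries come from one modality; keys/values from another,
        # so each stream attends to complementary information.
        q = self.norm_q(query_tokens)
        kv = self.norm_kv(context_tokens)
        fused, _ = self.attn(q, kv, kv)
        x = query_tokens + fused                 # residual connection
        return x + self.ffn(self.norm_ffn(x))    # position-wise feed-forward

# Toy usage: fuse RGB tokens with thermal tokens, then with LiDAR tokens.
rgb = torch.randn(2, 196, 256)      # (batch, tokens, dim) from an RGB encoder
thermal = torch.randn(2, 196, 256)  # tokens from a thermal encoder
lidar = torch.randn(2, 128, 256)    # tokens from a LiDAR/point encoder

fuse = CrossAttentionFusion()
out = fuse(fuse(rgb, thermal), lidar)  # sequential pairwise fusion
print(out.shape)                       # torch.Size([2, 196, 256])
```

Because the keys and values come from a separate stream, the modalities need not share a token count, which is why the asynchronous LiDAR tokens (128 here) fuse cleanly with the 196 RGB tokens; the same block can be stacked more shallowly on the skeleton-heatmap stream, consistent with using fewer attention layers there.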