International Journal of Advances in Intelligent Informatics
Vol 12, No 2 (2026): May 2026

Surgical-aware video masked autoencoders with phase-conditioned attention for laparoscopic action recognition

Hakim Nasaoui (LAROSERI Laboratory, Department of Computer Science, Faculty of Sciences, Chouaïb Doukkali University, El Jadida)
Hassan Silkan (LAROSERI Laboratory, Department of Computer Science, Faculty of Sciences, Chouaïb Doukkali University, El Jadida)
Insaf Bellamine (LSATE, Sidi Mohamed Ben Abdellah University ENSA, Fes)



Article Info

Publish Date
31 May 2026

Abstract

Fine-grained surgical action recognition in laparoscopic videos remains challenging despite recent progress in deep learning. Current VideoMAE approaches reach 89.11% accuracy on cholecystectomy tasks but have three limitations: random masking often misses surgical instruments, which occupy only 10% to 15% of each frame; context-independent models struggle with visually similar actions that occur in different phases; and symmetric two-stream architectures waste computational resources. To address these limitations, we propose SA-VideoMAE, a surgical-aware video masked autoencoder designed for laparoscopic action recognition. Its surgical-aware adaptive masking integrates YOLOv7x object detection to prioritize instrument patches, increasing instrument visibility during training from 10% to 60% so that the model focuses on action-relevant regions rather than static backgrounds. Phase-conditioned hierarchical attention injects learnable phase embeddings into the attention mechanisms, enabling the model to disambiguate visually similar actions based on surgical context. For efficiency, an asymmetric dual-stream architecture processes RGB with ViT-Base (86M parameters) and optical flow with ViT-Tiny (5.7M parameters), achieving a 47% parameter reduction compared to symmetric designs. Training balances reconstruction, classification, temporal consistency, and phase prediction through a novel multi-objective optimization strategy. On the Calot's Triangle Dissection phase of Cholec80, SA-VideoMAE reaches 93.5% accuracy, a 4.4 percentage point improvement over the verified baseline. Recall on challenging actions improves from 51% to 74%, while inference remains real-time at 62 ms per clip. These findings demonstrate that encoding surgical domain knowledge into video architectures significantly enhances action recognition performance.
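The instrument-prioritized masking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions not stated in the source: a single 224x224 frame split into 16x16 patches, detector boxes given as pixel coordinates (x1, y1, x2, y2), a 90% mask ratio, and a reading of "instrument visibility" as the share of visible (unmasked) patches that overlap an instrument. The function names `boxes_to_patch_mask` and `sample_visible_patches` are hypothetical.

```python
import numpy as np


def boxes_to_patch_mask(boxes, frame_size=224, patch_size=16):
    """Flag every patch that overlaps a detector box (x1, y1, x2, y2) in pixels.

    Assumption: boxes come from some instrument detector; no detector is run here.
    """
    grid = frame_size // patch_size
    mask = np.zeros((grid, grid), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        c1 = max(0, int(x1) // patch_size)
        r1 = max(0, int(y1) // patch_size)
        c2 = min(grid - 1, (int(np.ceil(x2)) - 1) // patch_size)
        r2 = min(grid - 1, (int(np.ceil(y2)) - 1) // patch_size)
        mask[r1:r2 + 1, c1:c2 + 1] = True
    return mask.ravel()


def sample_visible_patches(instrument, mask_ratio=0.9, instrument_share=0.6, rng=None):
    """Return a boolean array where True marks a visible (unmasked) patch.

    Assumption: roughly `instrument_share` of the visible patches are drawn
    from instrument patches; the rest are drawn from the background.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = instrument.size
    n_visible = max(1, round(n * (1 - mask_ratio)))
    inst_idx = np.flatnonzero(instrument)
    bg_idx = np.flatnonzero(~instrument)

    # Target number of instrument patches, limited by how many exist.
    n_inst = min(len(inst_idx), round(n_visible * instrument_share))
    n_bg = min(len(bg_idx), n_visible - n_inst)
    # If the background cannot fill the remainder, top up with instrument patches.
    n_inst = min(len(inst_idx), n_visible - n_bg)

    chosen = np.concatenate([
        rng.permutation(inst_idx)[:n_inst],
        rng.permutation(bg_idx)[:n_bg],
    ])
    visible = np.zeros(n, dtype=bool)
    visible[chosen] = True
    return visible


# Usage: one hypothetical box covering 25 of the 196 patches.
instrument = boxes_to_patch_mask([(48, 48, 128, 128)])
visible = sample_visible_patches(instrument, rng=np.random.default_rng(0))
share = (visible & instrument).sum() / visible.sum()
print(f"visible patches: {visible.sum()}, instrument share: {share:.2f}")
```

In this example the box covers about 13% of the frame, yet 12 of the 20 visible patches fall on the instrument. Uniform random masking would instead keep instrument patches visible only in proportion to their area. A video model would apply the same idea to space-time tubes rather than to a single frame.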

Copyrights © 2026






Journal Info

Abbrev

IJAIN

Subject

Computer Science & IT

Description

International Journal of Advances in Intelligent Informatics (IJAIN), e-ISSN 2442-6571, is a peer-reviewed open-access journal published three times a year in English. It provides scientists and engineers throughout the world with a venue for the exchange and dissemination of theoretical and ...