River waste accumulation has become a serious environmental problem in urban areas, particularly in highly polluted rivers such as the Angke River in Tangerang, where floating waste disrupts ecological balance and increases flood risk. Conventional computer vision-based detection methods often fail under dynamic river conditions because of water surface reflections, turbulence, occlusion, and visually ambiguous debris. This study aims to improve the accuracy and robustness of river waste detection by proposing a hybrid deep learning framework that combines convolutional feature extraction with graph-based spatial–contextual reasoning. The proposed method uses a ResNet50 backbone to extract features from CCTV imagery, then constructs a spatial graph that models adjacency relationships between image regions. A Graph Attention Network (GAT) is applied to this graph to capture contextual dependencies and refine the feature representations before classification. Unlike conventional CNN-only or YOLO-based detectors, which rely primarily on local visual cues and bounding-box representations, the proposed approach models the relationships between image regions explicitly. Experiments were conducted on 4,200 CCTV image frames collected from the Angke River under varying environmental conditions. The proposed model achieved an accuracy of 92.4%, precision of 91.1%, recall of 93.2%, F1-score of 91.9%, and a mean Average Precision (mAP) of 0.78, outperforming the CNN-only and YOLO-based baseline models. These findings highlight the contribution of graph-enhanced visual reasoning to computer vision and intelligent surveillance, particularly for real-time environmental monitoring systems operating in complex and dynamic visual environments.
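The pipeline summarized above (backbone features, a spatial adjacency graph over image regions, then graph attention) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the graph nodes are the cells of a 7 x 7 grid with 2048-dimensional features (the shape of ResNet50's final feature map for a 224 x 224 input), that adjacency means 4-neighbourhood plus self-loops, and that a single attention head in the standard GAT formulation is used. The names `grid_adjacency` and `gat_layer`, the output width of 64, and the random stand-in features and weights are hypothetical.

```python
import numpy as np

def grid_adjacency(rows, cols):
    """Boolean adjacency for a rows x cols grid of regions.

    Assumption: 4-neighbourhood connectivity plus self-loops.
    """
    n = rows * cols
    adj = np.eye(n, dtype=bool)  # self-loops
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:  # right neighbour
                adj[i, i + 1] = adj[i + 1, i] = True
            if r + 1 < rows:  # bottom neighbour
                adj[i, i + cols] = adj[i + cols, i] = True
    return adj

def gat_layer(h, adj, W, a, slope=0.2):
    """Single-head graph attention layer (standard GAT formulation).

    h: (N, F) node features, W: (F, F') projection, a: (2 * F',) attention vector.
    Returns the refined features (N, F') and the attention matrix (N, N).
    """
    z = h @ W
    f_out = z.shape[1]
    # e_ij = LeakyReLU(a^T [z_i || z_j]) splits into a source and a target term.
    src = z @ a[:f_out]
    dst = z @ a[f_out:]
    e = src[:, None] + dst[None, :]
    e = np.where(e > 0, e, slope * e)          # LeakyReLU
    e = np.where(adj, e, -np.inf)              # attend only to adjacent regions
    e = e - e.max(axis=1, keepdims=True)       # numerically stable softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha

# Random stand-ins for backbone features and learned weights (illustrative only).
rng = np.random.default_rng(0)
rows, cols, feat_dim, out_dim = 7, 7, 2048, 64
h = rng.standard_normal((rows * cols, feat_dim))
W = rng.standard_normal((feat_dim, out_dim)) * 0.01
a = rng.standard_normal(2 * out_dim)

adj = grid_adjacency(rows, cols)
refined, alpha = gat_layer(h, adj, W, a)
```

In this sketch each region's refined feature is a weighted average of its own projected feature and those of its grid neighbours, with the weights given by the attention coefficients. A classifier head would then operate on `refined`. In practice the projection and attention parameters would be trained jointly with the backbone rather than sampled at random.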