Artificial intelligence (AI) has recently enabled drones to support smart city applications by recognizing objects and events on the ground. Many pre-trained backbones are available for developing object recognition models, and some of them can improve a model's accuracy. This variety makes it difficult for practitioners to select a suitable backbone as the feature extractor during model development. This research therefore provides a benchmark that examines how three popular backbones perform in recognition models trained on images captured by drones. The study used the UAV-AUAIR dataset and compared three deep learning backbone architectures as feature extractors: YoloV8_s, EfficientNetv2_s, and CSP_DarkNet_l. The head of each selected backbone was replaced with the YoloV8Detector architecture, provided by Keras-CV, to perform the inference tasks. The models produced during training were evaluated on four criteria: loss function, intersection over union (IoU), across-scale mean average precision (mAP), and computational performance. The results showed that the model built on the EfficientNetv2_s backbone outperformed the others on most criteria. The exceptions were computational performance, where YoloV8_s performed best, and small-object detection, where CSP_DarkNet_l performed best. Thus, EfficientNetv2_s and CSP_DarkNet_l can be considered when accuracy is the main concern in application development. Meanwhile, YoloV8_s is far better when computational performance is essential, as its prediction time reached 0.8 seconds per image. This study serves as a reference for practitioners, particularly those who want to develop an object recognition model based on a pre-trained backbone.
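Intersection over union (IoU), one of the evaluation criteria named above, measures how well a predicted bounding box overlaps a ground-truth box. The following is a minimal sketch of the standard definition, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max); the function name `iou` and the box format are illustrative assumptions and are not taken from the study's code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is assumed to be (x_min, y_min, x_max, y_max); this format
    is an assumption for illustration, not the study's own convention.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when
    # the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the doubly counted overlap.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) boxes.
    return intersection / union if union > 0 else 0.0
```

An IoU of 1 means the two boxes coincide, while 0 means they do not overlap at all; detection metrics such as mAP typically count a prediction as correct only when its IoU with a ground-truth box exceeds a chosen threshold.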
Copyright © 2025