YoloV8, EfficientNetv2, and CSP Darknet Comparison as Recognition Model’s Backbone for Drone-Captured Images
Kridalukmana, Rinta; Eridani, Dania; Septiana, Risma; Windasari, Ike Pertiwi
JOIV : International Journal on Informatics Visualization Vol 9, No 2 (2025)
Publisher : Society of Visual Informatics

DOI: 10.62527/joiv.9.2.2880

Abstract

Artificial intelligence (AI) has recently empowered drones to support smart city applications and recognize on-the-ground objects or events. Various pre-trained backbones are available for developing object recognition models, and some of them can boost a model’s accuracy. With so many options, it is difficult for practitioners to select a suitable backbone as the feature extractor during recognition model development. Hence, this research provides a benchmark of three popular backbones for recognition models, using drone-captured images as the dataset. This research used the UAV-AUAIR dataset and compared three deep learning backbone architectures as the feature extractor: YoloV8_s, EfficientNetv2_s, and CSP_DarkNet_l. The head of each selected backbone was replaced with the YoloV8Detector architecture, provided by Keras-CV, to perform the inference tasks. The models generated during training were evaluated on four measures: loss function, intersection over union (IoU), across-scale mean average precision (mAP), and computational performance. The results showed that the model built on the EfficientNetv2_s backbone outperformed the others on most criteria, except computational performance and small-object detection, where YoloV8_s and CSP_DarkNet_l performed best, respectively. Thus, EfficientNetv2_s and CSP_DarkNet_l can be considered when accuracy is the main concern in application development. Meanwhile, YoloV8_s is far better when computational performance is essential, with a prediction time of 0.8 seconds per image. This study serves as a reference for practitioners, particularly those who want to develop an object recognition model based on a pre-trained backbone.
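
Intersection over union, one of the four measures named in the abstract, can be illustrated with a minimal sketch. The abstract does not state the box format or the evaluation code used in the study, so the corner format `(x_min, y_min, x_max, y_max)` and the function name `iou` below are assumptions made only for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are assumed (not taken from the paper) to be
    (x_min, y_min, x_max, y_max) tuples in pixel coordinates.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero
    # so that disjoint boxes contribute no intersection area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes yield an IoU of 0 rather than an error.
    return inter / union if union > 0 else 0.0
```

A predicted box that exactly matches the ground-truth box scores 1.0 and a box with no overlap scores 0.0. For small objects, which are common in drone imagery, a shift of a few pixels changes the score far more than it does for large objects.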