The core challenge in object detection for autonomous systems lies in maintaining accuracy across extreme variations in object scale, particularly for small, distant targets. This study presents a quantitative performance comparison of two distinct deep learning architectures: the CNN-based YOLOv8-m and the Vision Transformer (ViT)-based YOLOS. Both models were implemented and evaluated on a custom vehicle detection dataset. YOLOv8-m was trained from scratch, while YOLOS was evaluated with a proxy precision method on a pre-trained model to gauge its inherent contextual-reasoning capability. The results, analyzed using Mean Average Precision (mAP) categorized by object scale (mAP_S, mAP_M, mAP_L), reveal a significant architectural trade-off. YOLOv8-m demonstrated superior overall performance and excelled on mAP_L (large objects), affirming the strength of CNNs in local feature extraction. Conversely, YOLOS achieved higher precision on mAP_S (small objects), suggesting that the global attention mechanism of the ViT is more effective for long-range surveillance, where objects occupy only a few pixels. This research provides evidence-based guidance for selecting a detection architecture according to the target object scale and application scenario.
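For readers unfamiliar with scale-partitioned evaluation, the sketch below illustrates one common way to obtain mAP_S, mAP_M, and mAP_L with the standard COCO evaluation tooling; the file names are placeholders and this is not necessarily the exact pipeline used in the study.

```python
# Minimal sketch (assumed setup): scale-partitioned mAP via pycocotools.
# "annotations.json" and "detections.json" are hypothetical placeholder paths.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations.json")            # ground-truth boxes in COCO format
coco_dt = coco_gt.loadRes("detections.json")  # model detections to be scored

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0] = mAP@[.50:.95]; stats[3..5] = mAP_S, mAP_M, mAP_L
# (COCO area thresholds: small < 32^2 px, medium 32^2-96^2 px, large > 96^2 px)
map_all, map_s, map_m, map_l = (evaluator.stats[i] for i in (0, 3, 4, 5))
print(f"mAP={map_all:.3f}  mAP_S={map_s:.3f}  mAP_M={map_m:.3f}  mAP_L={map_l:.3f}")
```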