In deep learning-based computer vision, YOLO has been revolutionary. However, not every YOLO model is accompanied by an academic article or an architectural diagram, which makes it harder to understand how the model operates. Moreover, existing review papers do not examine each model comprehensively. This work provides a thorough comparative analysis of the architectures from YOLOv8 to YOLO11, allowing readers to quickly grasp the operational mechanisms of the models and the differences among them. We analyzed the architecture of each YOLO version by reviewing relevant scholarly articles and official documentation and by examining the source code. In particular, we found that YOLOv8 through YOLO11 each introduce different novelties while sharing an anchor-free design and the use of Non-Maximum Suppression (NMS), with the exception of YOLOv10, which is NMS-free. Each version also has drawbacks, with differing levels of complexity in the way features are connected (YOLOv8), in architectural structure and training (YOLOv9), in the training method with dual assignments (YOLOv10), and in inference and code implementation (YOLO11). Although each version refines the architecture, some blocks remain unchanged from one version to the next. This study helps readers understand the architectures of the different YOLO versions and suggests directions for improving their performance. It also provides a comprehensive architecture diagram and a detailed description of each block, serving as a reference for both academic and practical applications. In terms of performance, a benchmark on the Roboflow 100 dataset shows that YOLOv9 achieves the highest accuracy; however, it is eight times slower owing to its NMS mechanism. YOLOv10 is the fastest but the least accurate, whereas YOLOv8 and YOLO11 offer a balanced compromise between speed and accuracy.
Copyrights © 2026