Visual odometry estimation (VOE) plays an important role in navigation and pathfinding systems, enabling agents to localize themselves and estimate their trajectories in an environment. Most computer vision (CV)-based VOE models are evaluated and compared on the KITTI dataset. In the study by Jiang et al., the multi-layer fusion framework (MLF-VO-F) achieved good VOE results from red, green, and blue (RGB) image sequences, using DeepNet to extract low-level textures and edges as well as deeper high-level semantic features for estimating motion between consecutive frames. This paper proposes a model that combines MLF-VO-F as a backbone with loss functions (LFs) ($L_{MSE}$, $L_{MSE\text{-}L2}$, $L_{CE}$, and $L_{combi}$) to optimize and supervise the training of the VOE model. We evaluate and compare the effectiveness of these LFs for VOE on the KITTI and TQU-SLAM datasets against the original MLF-VO-F, and from these results select the most appropriate LF to combine with the backbone for VOE. The evaluation results on the KITTI dataset show that $L_{CE}$ (RTE of 0.075 m and 0.06 m on Seq. #9 and Seq. #10, respectively) and $L_{combi}$ ($t_{rel}$ of 2.21%, 2.67%, 3.59%, 1.01%, and 4.62% on Seq. #4, Seq. #5, Seq. #6, Seq. #7, and Seq. #10, respectively) yield the lowest errors, while $L_{MSE}$ yields the highest errors (ATE of 133.36 m on Seq. #9).
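As a minimal sketch of what such loss supervision for pose regression might look like: the abstract names the losses but does not define them, so every formulation, function name, and weighting below is an illustrative assumption in PyTorch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the four pose-supervision losses compared in the paper.
# Only the names (L_MSE, L_MSE-L2, L_CE, L_combi) come from the abstract; the
# forms below are assumptions. Poses are assumed to be 6-DoF vectors
# [tx, ty, tz, rx, ry, rz] of shape [B, 6].

mse = nn.MSELoss()
ce = nn.CrossEntropyLoss()

def l_mse(pred_pose, gt_pose):
    # Plain mean-squared error between predicted and ground-truth poses.
    return mse(pred_pose, gt_pose)

def l_mse_l2(pred_pose, gt_pose, model, weight_decay=1e-4):
    # Assumed form: MSE plus an L2 penalty on the model parameters.
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return mse(pred_pose, gt_pose) + weight_decay * l2

def l_ce(pred_logits, gt_bins):
    # Assumed form: cross-entropy over discretized pose bins
    # (pred_logits: [B, K] class scores, gt_bins: [B] bin indices).
    return ce(pred_logits, gt_bins)

def l_combi(pred_pose, gt_pose, alpha=0.5):
    # Assumed combination: weighted sum of translation and rotation errors.
    t_err = mse(pred_pose[..., :3], gt_pose[..., :3])
    r_err = mse(pred_pose[..., 3:], gt_pose[..., 3:])
    return alpha * t_err + (1.0 - alpha) * r_err
```

In such a setup, each loss would be swapped in as the training criterion for the pose head of the MLF-VO-F backbone, and the resulting trajectories compared via RTE, ATE, and $t_{rel}$ as reported above.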