Skeletal malocclusion, a common orthodontic condition, affects jaw function and dental health. It is often caused by genetic factors, abnormal growth, harmful oral habits, or trauma. Conventional diagnostic models often overfit and generalize poorly across diverse datasets, resulting in weak test performance. This study aimed to improve diagnostic accuracy by incorporating vision attention mechanisms into a custom Convolutional Neural Network (CNN), enabling the model to focus on critical regions in X-ray images. A total of 491 radiographic images depicting facial skeletal structures with three malocclusion types (Classes 1, 2, and 3) were used. The custom CNN was evaluated both without attention and with two attention mechanisms (Scaled Dot-Product Attention and Multi-Head Attention) to assess their impact on classification performance. The baseline CNN without attention achieved an accuracy of 0.68. Accuracy improved to 0.72 with Scaled Dot-Product Attention and reached its highest value, 0.76, with Multi-Head Attention. Weighted-average precision, recall, and F1-score showed that the attention mechanisms improved the model’s ability to differentiate between malocclusion classes. The Multi-Head Attention model performed best overall, reducing misclassification errors and improving generalization. Confusion matrix analysis showed that it made the fewest classification errors, particularly between the classes labeled 0 and 1 in the confusion matrix. These results suggest that incorporating attention mechanisms, particularly Multi-Head Attention, enhances CNN performance by improving feature extraction and classification accuracy. Future research should explore more diverse datasets and implement advanced augmentation techniques to improve clinical reliability.
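To make the attention mechanism named above more concrete, the following is a minimal sketch of scaled dot-product attention in plain Python. It is not the study's implementation: the function names, the toy feature vectors, and the choice to use the same vectors as queries, keys, and values (self-attention) are assumptions made for illustration. The abstract does not say where attention sits in the CNN or how many heads are used.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V.

    queries, keys, values are lists of equal-length float vectors;
    keys and values must contain the same number of vectors.
    """
    d_k = len(keys[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output channel at a time.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs, weights


# Hypothetical toy input: three spatial positions of a CNN feature map,
# each described by a 2-channel feature vector (illustrative values only).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn = scaled_dot_product_attention(features, features, features)
```

Each row of `attn` is a probability distribution over positions, which is what allows a model to weight some image regions more heavily than others. Multi-head attention, in its usual formulation, runs several such attention operations in parallel on different learned projections of the input and concatenates the results.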