Pneumonia remains one of the leading causes of death worldwide, especially among children and older adults. Early and accurate diagnosis is essential to reducing mortality. Chest X-ray (CXR) imaging is widely used for this purpose, but manual reading of CXR images can be time-consuming and is subject to inter-observer variability. To address this problem, this study presents a pneumonia classification model based on the Vision Transformer (ViT) architecture, combined with Gradient-weighted Class Activation Mapping (Grad-CAM) to make the model’s decisions more interpretable. The model was trained on a publicly available CXR dataset of 5,863 images labeled as Normal or Pneumonia, using a 70:15:15 split for training, validation, and testing. The ViT model achieved an accuracy of 96.41% on the test set and a high recall for pneumonia cases, while a class-weighted loss helped maintain more balanced predictions between the two classes. An Area Under the Curve (AUC) of 0.975 indicates strong discrimination between pneumonia-positive and normal samples. Grad-CAM visualizations, supported by a randomization test and occlusion analysis, provide an initial qualitative view of the lung regions that influence the model’s predictions; these regions often overlap with radiologically plausible areas. However, the heatmaps have not been formally evaluated by radiologists, and the correspondence between highlighted regions and pneumonia consolidation patterns has not yet been quantitatively validated. The proposed ViT and Grad-CAM framework should therefore be regarded as an exploratory step toward explainable pneumonia classification on chest X-rays rather than a system ready for clinical deployment.
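The 70:15:15 split and the class-weighted loss mentioned above can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: that the split is stratified by class, and that class weights follow the common inverse-frequency heuristic n_samples / (n_classes * class_count). The function names and the toy label counts are hypothetical and do not reflect the dataset's actual class distribution.

```python
import random
from collections import Counter


def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists, splitting each class by `ratios`.

    Assumption: the split is stratified; the abstract only gives the ratio.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: one common weighting scheme, not necessarily the one used
    in the study. Rarer classes receive larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    return {c: n_samples / (len(counts) * k) for c, k in counts.items()}


# Toy labels with hypothetical counts (NOT the real class distribution).
labels = ["NORMAL"] * 30 + ["PNEUMONIA"] * 70
train, val, test = stratified_split(labels)

# Weights are computed on the training portion only, to avoid leaking
# information from the validation and test sets.
weights = inverse_frequency_weights([labels[i] for i in train])
```

In a training loop, weights of this kind would typically be passed to the loss function so that errors on the minority class are penalized more heavily than errors on the majority class.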