Purbalingga is a regency in Central Java Province, Indonesia, that offers notable natural beauty and tourist destinations. Many tourists capture their visits in photos, which they then upload to social media. A single picture, however, can carry a great deal of information, and each viewer may interpret it differently; without captions, people may struggle to extract that information. Image captioning addresses this challenge by automatically generating textual descriptions for images. In addition, text-to-speech improves accessibility, helping visually impaired users understand image descriptions. This research develops an image captioning model for photographs of tourist attractions in Purbalingga using a Transformer architecture combined with ResNet50. The Transformer employs an attention mechanism to learn the context and relationships between inputs and outputs, while ResNet50 is a robust convolutional network used for image feature extraction. Model evaluation with BLEU metrics, which compare generated sentences against reference sentences, yields best scores of BLEU-{1, 2, 3, 4} = {0.672, 0.559, 0.489, 0.437}. Experiments show that increasing the embedding dimension and the number of layers lengthens training time and lowers BLEU scores, while varying the number of attention heads has minimal impact on the results. The best model is deployed in a web-based application built with the SDLC waterfall method, the Flask framework, and a MySQL database. The application lets users upload images of tourist attractions, receive automatically generated descriptions in Indonesian, and listen to the captions read aloud via a Web Speech API-based text-to-speech feature. Black-box testing produced valid outcomes for all test cases, indicating that the application operates as required and is ready for use.
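The BLEU-n scores cited above can be illustrated with a simplified, self-contained sketch: clipped n-gram precision against a single reference, a geometric mean over orders 1..n, and a brevity penalty. This is an assumption-laden illustration of the metric itself, not the paper's evaluation code, which may instead use a library implementation (e.g., NLTK).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU-max_n (uniform weights, one reference).

    Hypothetical sketch for illustration: real evaluations typically use
    multiple references and smoothing for short sentences.
    """
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0  # no smoothing: any zero precision zeroes the score
        log_prec += math.log(clipped / total) / max_n
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)
```

For example, an exact match scores 1.0, and a candidate sharing only some n-grams with the reference falls strictly between 0 and 1, mirroring how the BLEU-1 through BLEU-4 figures above decrease as longer n-grams become harder to match.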