Urban traffic congestion makes drivers a captive audience for roadside fast-food outlets and advertisements. This paper proposes a vision-driven impulsive purchase prediction system that emulates a driver's view from the vehicle, using a Vision Transformer (ViT) model to detect fast-food outlet visibility, crowd levels, and promotional banner exposure in real time. By integrating these visual cues, the system predicts the likelihood that a driver in heavy traffic will make an impulsive stop (the “impulse score”). We collected and analyzed visual data from congested thoroughfares in major Indonesian cities (Jakarta, Surabaya, Bandung) known for severe traffic jams. The ViT-based model was trained to identify key features such as recognizable outlet signage, drive-thru queue lengths, and promotional banners, mirroring the attention patterns of human drivers. Experimental results show that the model detects the relevant cues accurately and forecasts impulse stop rates with a mean absolute percentage error (MAPE) of approximately 12%. To our knowledge, this is the first work to apply a transformer-based computer vision approach to modeling consumer impulsivity in traffic, bridging automotive perception and marketing analytics. The findings suggest that smart vehicle systems and urban planners could use such technology to anticipate consumer behavior in traffic, optimize roadside advertising, and manage congestion-related demand surges at fast-food outlets.
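
For readers unfamiliar with the error metric quoted above, the following is a minimal sketch of how a mean absolute percentage error can be computed from observed and forecast impulse stop rates. The function name `mape` and the sample values are illustrative assumptions, not data or code from this study; the sketch uses the standard definition, the mean of |actual - predicted| / |actual| expressed in percent.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent.

    Standard definition: 100 * mean(|a - p| / |a|).
    Undefined when any actual value is zero, so that case raises.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")

    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            raise ValueError("MAPE is undefined when an actual value is zero")
        total += abs((a - p) / a)
    return 100.0 * total / len(actual)


# Hypothetical impulse stop rates (fraction of passing vehicles that stop),
# invented purely for illustration.
observed = [0.10, 0.20, 0.25]
forecast = [0.11, 0.18, 0.25]

print(f"MAPE: {mape(observed, forecast):.2f}%")
```

Because each error is divided by the observed value, MAPE weights errors on low stop rates more heavily than the same absolute error on high stop rates, which is worth keeping in mind when comparing road segments with very different baseline traffic.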