Smoking in public spaces remains a major challenge for the implementation of public health policies, particularly within designated smoke-free zones. This study examines whether architectural improvements and spatio-temporal modeling in object detection models can improve the accuracy of real-time smoking behavior detection. Specifically, the performance of YOLOv8 and an experimental version, YOLOv11, is compared using a vision-based approach. The dataset comprises 3,000 annotated images covering smoking and non-smoking activities (e.g., drinking or phone use), with variations in lighting, body posture, and camera angle. The dataset was divided into 80% for training, 20% for validation, and 20% for testing, with data augmentation applied to improve generalization. YOLOv11 incorporates spatio-temporal modules and attention mechanisms not present in YOLOv8. Evaluation results show that YOLOv11 outperforms YOLOv8, achieving a precision of 0.95, a recall of 0.91, and an F1-score of 0.93, compared with 0.89, 0.87, and 0.88, respectively, for YOLOv8. These findings indicate that YOLOv11 offers a more robust and adaptive solution for automatically recognizing smoking behavior in real-world environments and can support the development of intelligent surveillance systems for enforcing smoke-free policies.
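
As a consistency check on the reported metrics, the F1-score can be recomputed from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The short sketch below is illustrative only and is not part of the study's evaluation pipeline; the helper name `f1_score` and the dictionary layout are assumptions introduced here, and the only inputs are the precision and recall values stated in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract.
reported = {
    "YOLOv8": (0.89, 0.87),
    "YOLOv11": (0.95, 0.91),
}

# Recompute F1 for each model and round to two decimals, as in the abstract.
f1_by_model = {
    name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()
}

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.2f}")
```

Rounded to two decimals, the recomputed values agree with the reported F1-scores of 0.88 for YOLOv8 and 0.93 for YOLOv11.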