This study introduces a methodology for real-time object detection and interpretability using YOLOv8s trained on the Microsoft Common Objects in Context (MS COCO) dataset. The system captures live webcam footage, resizes each frame to 640×384, and applies YOLOv8s to detect objects, reporting a bounding box, class label, and confidence score for each detection. The YOLOv8s architecture, comprising a CSPDarknet53-based backbone, a neck, and a detection head, provides efficient feature extraction and accurate detection. To enhance model transparency, activation maps are generated by attaching forward hooks to intermediate convolutional layers. Feature maps captured during the forward pass are averaged, normalized, and resized to match the original image dimensions. The resulting visualization highlights the regions that influence the model's predictions, in line with explainable artificial intelligence (XAI) principles. Experimental results demonstrate high detection accuracy and effective interpretability in indoor environments, making the framework suitable for robotics applications that require both precision and transparency. The proposed method offers a practical and explainable solution for real-time scene understanding in intelligent systems.
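
The activation-map step described above (forward hook, averaging, normalization, resizing) can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation. The small convolutional network is a hypothetical stand-in for YOLOv8s, and the names `build_stand_in_model` and `activation_map` are invented for illustration. The sketch also assumes that averaging is done across channels, that normalization is min-max scaling to [0, 1], that resizing uses bilinear interpolation, and that 640×384 means width × height.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_stand_in_model() -> nn.Sequential:
    # Hypothetical stand-in for a detector backbone; NOT YOLOv8s.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
        nn.SiLU(),
        nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
        nn.SiLU(),
    )


def activation_map(model: nn.Module, layer: nn.Module, frame: torch.Tensor) -> torch.Tensor:
    """Return an (N, H, W) activation map for `layer`, scaled to [0, 1]."""
    captured = {}

    def hook(module, inputs, output):
        # Store the layer output produced during the forward pass.
        captured["features"] = output.detach()

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(frame)
    finally:
        handle.remove()  # always detach the hook, even if the forward pass fails

    # Assumption: average over the channel dimension -> (N, 1, h, w).
    fmap = captured["features"].mean(dim=1, keepdim=True)

    # Assumption: per-image min-max normalization to [0, 1].
    lo = fmap.amin(dim=(2, 3), keepdim=True)
    hi = fmap.amax(dim=(2, 3), keepdim=True)
    fmap = (fmap - lo) / (hi - lo + 1e-8)

    # Resize to the spatial size of the input frame (bilinear is an assumption).
    fmap = F.interpolate(fmap, size=frame.shape[-2:], mode="bilinear", align_corners=False)
    return fmap[:, 0]


torch.manual_seed(0)
model = build_stand_in_model().eval()
# Random tensor standing in for one webcam frame: (batch, channels, height=384, width=640).
frame = torch.rand(1, 3, 384, 640)
heatmap = activation_map(model, model[2], frame)  # hook the second convolutional layer
```

With a real detector, the same hook would be attached to one of its intermediate convolutional modules, and the resulting map could be blended over the frame as a heatmap. Removing the hook after each pass keeps hooks from accumulating across frames in a live loop.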