The extraction of building footprints from aerial photos and satellite imagery plays a crucial role in change detection, urban development, and the detection of encroachments on agricultural land. Deep neural networks are capable of feature extraction and provide accurate methods for detecting and extracting building footprints from satellite imagery. Image segmentation, the process of dividing an image into coherent parts, takes two forms: semantic segmentation and instance segmentation. Convolutional neural networks (CNNs) are commonly used for both tasks. In this paper, we propose a hybrid approach to extracting building footprints from low-resolution satellite imagery using instance segmentation techniques. Our analysis demonstrates that the Mask region-based CNN (Mask R-CNN) architecture, with a ResNet-34 backbone and a PointRend head that refines the bounding-box and mask predictions, achieves the highest performance across several metrics, including an average precision (AP) of 83.39% and an F1 score of 85.71%. This approach holds promise for developing automated tools to process satellite imagery, benefiting fields such as agriculture, land use monitoring, and disaster response.