Detecting urban housing and settlements plays a substantial role in decision-making problems such as monitoring housing and development, as well as in widely required urban mapping applications. One of the most important goals in the United Nations Sustainable Development Goals (SDGs) is to improve urban living conditions globally by 2030. We propose automatic detection of urban housing and settlements in remote sensing satellite imagery using object-detection-based deep learning with semantic segmentation, drawing on the availability of high-spatial-resolution remote sensing datasets, the OpenStreetMap (OSM) geolocation point-of-interest dataset, and Sentinel-2 optical satellite imagery. The detection model, based on Mask Region-based Convolutional Neural Networks (Mask R-CNN), is implemented in Depok City, Indonesia. This region was chosen because it is the second most populous suburb in Indonesia and the tenth most populous globally, which makes extracting building features from satellite imagery challenging. The model distinguishes dense, moderate, and sparse conditions and achieves promising results: an average precision of 100% and an F1-score of 67%, with evaluation metrics that consider only points associated with buildings, not building boundaries or the intersection over union (IoU). The model performance has been compared with ground-check results from field surveys, and it performs best in sparse conditions. Our findings suggest that the model could be implemented for fast and accurate monitoring of housing, settlement, and regional planning in urban areas.
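
The point-based evaluation described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the matching rule (a ground-truth building point counts as detected if it falls inside any predicted bounding box), the function names, and the toy coordinates are all assumptions made for illustration. It also shows that, by the definition of the F1-score, a precision of 100% combined with an F1-score of 67% implies a recall of roughly 50%.

```python
# Hypothetical point-based evaluation: boxes are (xmin, ymin, xmax, ymax),
# points are (x, y). The matching rule below is an assumption, not the
# procedure used in the study.

def point_in_box(point, box):
    x, y = point
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax


def point_based_scores(pred_boxes, gt_points):
    """Return (precision, recall, f1) without using boundaries or IoU."""
    # True positives: ground-truth points covered by at least one box.
    tp = sum(any(point_in_box(p, b) for b in pred_boxes) for p in gt_points)
    # False negatives: ground-truth points covered by no box.
    fn = len(gt_points) - tp
    # False positives: predicted boxes that contain no ground-truth point.
    fp = sum(not any(point_in_box(p, b) for p in gt_points) for b in pred_boxes)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def recall_from_f1(precision, f1):
    """Invert F1 = 2PR / (P + R) to recover R from P and F1."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall for this precision/F1 pair")
    return f1 * precision / denom


# Toy coordinates (not data from the study): every predicted box hits a
# building point, but half of the building points are missed.
boxes = [(0, 0, 2, 2), (5, 5, 7, 7)]
points = [(1, 1), (6, 6), (10, 10), (12, 12)]
precision, recall, f1 = point_based_scores(boxes, points)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"recall implied by P=1.00, F1=0.67: {recall_from_f1(1.0, 0.67):.2f}")
```

Under this assumed matching rule, a model can reach perfect precision while still missing many buildings, which is one way to read a high precision alongside a moderate F1-score.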