Speech processing with the help of lip detection and lip reading is an advancing field. For this, we need proper algorithms and techniques to detect lips and movements of lips perfectly. Lip detection and configuration are the most important parts of speech recognition. In this paper, we focus on detecting the lip segment properly. Mask R-CNN (Regional Convolutional Neural Network) performs object detection and instance segmentation per video frame to detect the lip segment. The process of mask R-CNN adds only a small overhead to Faster R-CNN and is quite simple to train, running at 5 frames per second. The Mask R-CNN involves keypoint detection which helps to extract the location of the lip landmarks pixel by pixel. Once the lip region is extracted and the landmarks are highlighted, we observe how the lip landmarks change as the object's lips move over time to each Bengali word. The keypoint changes that are observed during each millisecond are then the landmarks used to train the GLM (Generalized Linear Model). In addition, we compare the performance of GLM with Naive Bayes, Logistic Regression, and Decision Tree. The GLM has exhibited the highest 91.8% accuracy, whereas the Naive Bayes, Logistic Regression, and Decision Tree show the accuracy of 87.1%, 38.3%, and 82.2%, respectively.
Copyrights © 2024