Mathematical formulae are an important part of academic papers and scientific journals. However, they are often not recognized properly by Optical Character Recognition (OCR) processes, in part because formulae differ from ordinary text. Detecting the formulae in document pages might therefore help with this problem. Formula detection is done by converting digital document pages into images, performing text line segmentation and word segmentation, and classifying the resulting segments with a Convolutional Neural Network (CNN). The aim is to help OCR processes by identifying which parts of a document page contain formulae and which do not. The CNN architectures used for classification have 64 kernels in each convolutional layer. For displayed formulae (formulae that do not share their space with regular text), the model uses 10 groups of Convolutional-ReLU-Max Pooling layers. For inline formulae (formulae that share their text line with regular text), 12 groups of Convolutional-ReLU-Max Pooling layers are used. These architectures achieve an F1 score of 0.980 for displayed formulae classification in 1-column documents, 0.940 for 2-column documents, and 0.916 for inline formulae.
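The text line segmentation step that precedes classification can be illustrated with a minimal sketch. The abstract does not say which segmentation method is used, so the horizontal projection profile below is an assumption, chosen only because it is a common and simple approach. The function name `segment_text_lines`, the `min_ink` parameter, and the toy page are hypothetical.

```python
from typing import List, Tuple


def segment_text_lines(page: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Split a binarized page into text lines using a horizontal projection profile.

    Assumption: `page` is a 2D grid where 1 marks an ink pixel and 0 marks
    background. Returns (top, bottom) row ranges with `bottom` exclusive.
    """
    # Count the ink pixels in every row of the page.
    profile = [sum(row) for row in page]

    lines: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            # First inked row after a gap: a new text line begins here.
            start = y
        elif ink < min_ink and start is not None:
            # First blank row after a text line: close the current line.
            lines.append((start, y))
            start = None

    # Close a line that runs to the bottom edge of the page.
    if start is not None:
        lines.append((start, len(profile)))
    return lines


# Hypothetical 5x4 page with two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(segment_text_lines(page))  # [(1, 3), (4, 5)]
```

Under the same assumption, word segmentation could apply the same idea to the columns of each extracted line, and each resulting segment would then be passed to the CNN classifier.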
Copyrights © 2019