Gusti Lanang Putra Eka Prismana
Information System Department, Faculty of Engineering, Universitas Negeri Surabaya

Published: 1 document
Articles

Automatic Web News Content Extraction
Gusti Lanang Putra Eka Prismana
Journal Research of Social Science, Economics, and Management, Vol. 1 No. 7 (2022)
Publisher: Publikasi Indonesia

DOI: 10.59141/jrssem.v1i7.107

Abstract

Main-content extraction from web pages is widely used in search engines, but web pages also contain a great deal of irrelevant material, such as advertisements, navigation, and other junk information. This noise reduces the efficiency of content-based applications that process web content. This study aimed to extract web page content using a DOM Tree, addressing both the rationality of the segmentation results and efficiency, based on the information entropy of the DOM Tree nodes. The first step was to classify web page tags and process only the tags that affect the structure of the page. The second step was to consider the content features and structural features of each DOM Tree node together. The third step was to fuse nodes to obtain the segmentation results. Segmentation was tested on several web pages with different structures, and the results showed that the proposed method segmented pages and removed noise from web page content accurately and quickly. Once the DOM Tree was formed, it was matched against a database to eliminate information noise using the Firefly Optimization algorithm. The Firefly Optimization method was then tested and evaluated for its effectiveness in detecting and eliminating web page noise and producing clean documents.
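
The entropy-based filtering idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the set of structural tags, the word-level Shannon entropy score, and the threshold of 3.0 bits are all assumptions chosen for illustration, and neither the node fusion step nor the Firefly Optimization step is covered. The sketch assumes well-formed HTML and uses only the Python standard library.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Assumption: only these tags are treated as affecting page structure.
STRUCTURAL = {"html", "body", "div", "section", "article", "nav",
              "header", "footer", "aside", "ul", "ol", "table", "p"}
# Content of these tags is never treated as page text.
SKIPPED = {"script", "style"}


class Node:
    """One structural node of a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.words = []  # text directly in this node, inline tags included


class TreeBuilder(HTMLParser):
    """Builds a tree of structural nodes; inline text joins the nearest one."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in STRUCTURAL:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.current.tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.words.extend(data.split())


def entropy(words):
    """Shannon entropy (bits) of the case-insensitive word distribution."""
    if not words:
        return 0.0
    counts = Counter(w.lower() for w in words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def walk(node):
    """Yield nodes in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def extract_main_text(html, threshold=3.0):
    """Keep the text of nodes whose own-text entropy meets the threshold."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    kept = [" ".join(n.words) for n in walk(builder.root)
            if entropy(n.words) >= threshold]
    return "\n".join(kept)


# Hypothetical sample page: navigation, one article paragraph, footer, script.
SAMPLE = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a> <a href="/contact">Contact</a></nav>
<div><p>The city council approved a new budget on Monday after several weeks of public debate.</p></div>
<footer>Copyright 2022</footer>
<script>var tracking = 1;</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

Short blocks such as navigation bars and footers contain few distinct words, so their entropy falls below the threshold and they are dropped, while a paragraph of running text passes. A fixed threshold is the simplest possible choice; a real system would likely need to tune it per site or combine it with structural features.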