Building a Web Crawler for Text Data Indexing on Online Newspaper Web
Hakim, Jamaludin; Sah, Andrian; Nurhayati, Siti; Ciptaningrum, Wahyu; Suryo Sasono, Damar
International Journal of Engineering, Science and Information Technology Vol 4, No 4 (2024)
Publisher: Department of Information Technology, Universitas Malikussaleh, Aceh Utara, Indonesia

DOI: 10.52088/ijesty.v4i4.677

Abstract

The Internet has become a vast repository of information, but much of it is cluttered with distractions that degrade the user experience. News content, for example, is usually interspersed with advertisements that interrupt the flow of reading. The fast pace of news publication is a further challenge: potentially more than 50 new articles can appear within 20 minutes. This high-speed data flow is valuable for various applications, including Social Media Analytics Services. In this context, the speed and efficiency of data acquisition (crawling) and processing (scraping) are critical. Both processes must be optimized to collect data comprehensively, without gaps, while prioritizing the latest information. To meet this need, we propose an application capable of capturing news data in its entirety, minimizing the risk of missing important information. At the core of this solution is a web crawler, a program that automatically traverses the hyperlink structure of the web and systematically downloads linked pages to local storage. This crawling methodology is often the basis for web mining initiatives and search engine development. Since web information is distributed across billions of pages hosted on millions of servers worldwide, our application uses the PHP programming language to capture and process this data. The main goal is to present pure news content to users, free of irrelevant elements. We use a Data Flow Diagram (DFD) to model the system architecture and data flow. This approach provides a clear visualization of how web users can navigate through hyperlinks to efficiently access the desired news information. By implementing this system, we aim to improve the user experience of consuming news content, facilitate more effective data analysis, and contribute to the broader field of web information search and processing.
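
The crawling step described in the abstract (following hyperlinks and saving each linked page to local storage) can be illustrated with a minimal sketch. This is not the authors' implementation: their application is written in PHP, whereas the sketch below uses Python, and it assumes a breadth-first traversal restricted to a single host. The names `LinkExtractor`, `crawl`, `fetch`, `max_pages`, and the `news.example` URLs are all hypothetical, and an in-memory dictionary stands in for a real news site so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of same-host pages reachable from start_url.

    fetch is a hypothetical callable: url -> HTML string, or None on failure.
    Returns a dict mapping each visited URL to its HTML, standing in for
    the "local storage" mentioned in the abstract.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so no page is fetched twice
    pages = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            absolute, _ = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# In-memory stand-in for a news site (assumption, for illustration only).
SITE = {
    "https://news.example/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "https://news.example/a": '<a href="/b">B</a> <a href="https://ads.example/x">ad</a>',
    "https://news.example/b": '<a href="/">home</a>',
}

pages = crawl("https://news.example/", SITE.get)
print(sorted(pages))
```

In a real deployment, `fetch` would wrap an HTTP client, and the crawler would also need politeness controls (rate limiting, robots.txt handling) and a revisit schedule to keep up with newly published articles. None of these are specified in the abstract, so they are left out of the sketch.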