Garuda - Garba Rujukan Digital

Journal of ICT Research and Applications

Vol. 15 No. 3 (2021)

Gurjot Singh Mahi (Department of Computer Science, Punjabi University Patiala, Punjab, 147002 India)
Amandeep Verma (Department of Computer Science, Punjabi University Patiala, Punjab, 147002 India)

Publish Date
28 Dec 2021

Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines but are also widely utilized to build corpora in different domains and languages. This study developed a focused set of web crawlers for three Punjabi news websites. The web crawlers were developed to extract quality text articles and add them to a local repository to be used in further research. The crawlers were implemented using the Python programming language and were utilized to construct a corpus of more than 134,000 news articles in nine different news genres. The crawler code and extracted corpora were made publicly available to the scientific community for research purposes.
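The abstract does not describe the crawlers' internals, so the following is only a minimal sketch of what a focused crawler of this kind could look like, not the authors' published code. It assumes that article text sits in plain paragraph tags and that "focused" means staying on a single news host. The names ArticlePageParser, in_scope, crawl and the fetch callback are hypothetical and chosen for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ArticlePageParser(HTMLParser):
    """Hypothetical helper: collects links and <p> text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace inside the paragraph.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False


def in_scope(url, allowed_host):
    """Assumed focus rule: follow only links on the target news host."""
    return urlparse(url).netloc == allowed_host


def crawl(seed_url, fetch, allowed_host, max_pages=100):
    """Breadth-first crawl; `fetch` maps a URL to an HTML string."""
    queue, seen, corpus = deque([seed_url]), {seed_url}, {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        parser = ArticlePageParser(url)
        parser.feed(fetch(url))
        parser.close()
        if parser.paragraphs:
            # Pages without paragraph text (e.g. link-only index pages) are skipped.
            corpus[url] = "\n".join(parser.paragraphs)
        for link in parser.links:
            if in_scope(link, allowed_host) and link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus
```

Passing fetch in as a parameter keeps the sketch independent of any particular HTTP library and lets it run against stored pages. A real crawler would also need politeness delays, robots.txt handling, and site-specific rules for locating article text and genre labels.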

Journal Info

Journal of ICT Research and Applications

Abbrev

jictra

Publisher

Institut Teknologi Bandung

Subject

Computer Science & IT

Description

Journal of ICT Research and Applications welcomes full research articles in the area of Information and Communication Technology from the following subject areas: Information Theory, Signal Processing, Electronics, Computer Network, Telecommunication, Wireless & Mobile Computing, Internet ...

Article Info

Abstract

Development of Focused Crawlers for Building Large Punjabi News Corpus
