Developing an effective focused crawler to retrieve data of Indian-origin scientists and utilizing text classification for comparative analysis
Gautam, Shivani; Bhatia, Rajesh; Jain, Shaily
International Journal of Electrical and Computer Engineering (IJECE) Vol 14, No 5: October 2024
Publisher : Institute of Advanced Engineering and Science

DOI: 10.11591/ijece.v14i5.pp5468-5480

Abstract

This article presents a focused web crawler that retrieves data about scientists of Indian origin who work abroad, and it demonstrates the effectiveness of web scraping for obtaining large amounts of data from publicly available web pages. The objective is to build a dataset of Indian scientists currently employed in national laboratories overseas. Collecting data at this scale through manual search is impractical, so this study proposes a detailed design for a focused web crawler that gathers such data automatically. We then present a comprehensive evaluation of several classification models on the newly created dataset. Our evaluation indicates that the random forest model outperforms the other supervised models. On large datasets, random forest combined with the synthetic minority oversampling technique (SMOTE) and k-fold cross-validation performed better than K-nearest neighbors (KNN), support vector machine (SVM), and logistic regression (LR). On smaller datasets, by contrast, SMOTE with an 80-20 random split performed best. Overall, the random forest classifier achieved the most favorable results, attaining a micro-average area under the curve (AUC) of 90%. These results provide a solid foundation for further work on classifying text about Indian-origin scientists.
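
To make the oversampling step concrete, the following is a minimal sketch of the general SMOTE idea: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbors. It is not the authors' implementation. The function name `smote`, its parameters, and the toy two-dimensional data are assumptions chosen for illustration; the abstract does not state which features or library the study used.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbors.

    Hypothetical helper for illustration only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbors of `base` among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy imbalanced data (made-up values): 8 majority points, 3 minority points.
majority = [(float(x), float(y)) for x in range(4) for y in range(2)]
minority = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]

new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
```

When oversampling is combined with k-fold cross-validation, the usual practice is to apply it only to the training portion of each fold, so that synthetic points derived from held-out samples do not leak into evaluation. Whether the study followed this practice is not stated in the abstract.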