Mohamed, Rozlina
Unknown Affiliation

Published: 2 Documents
Articles

Automatic Topic-Based Web Page Classification Using Deep Learning
Apandi, Siti Hawa; Sallim, Jamaludin; Mohamed, Rozlina; Ahmad, Norkhairi
JOIV : International Journal on Informatics Visualization Vol 7, No 3-2 (2023): Empowering the Future: The Role of Information Technology in Building Resilien
Publisher : Society of Visual Informatics

DOI: 10.30630/joiv.7.3-2.1616

Abstract

People frequently browse the internet on smartphones, laptops, or computers to search for information on the web. As the amount of information on the web increases, the number of web pages grows day by day. Automatic topic-based web page classification is used to manage this excessive number of web pages by classifying them into categories based on their content. Various machine learning algorithms have been employed as web page classifiers. However, there is a lack of studies that review web page classification using deep learning. This study reviews the deep learning approaches to automatic topic-based web page classification that key researchers have proposed. The relevant research papers were selected from reputable research databases. The review examined the dataset, features, algorithm, and pre-processing used in web page classification, as well as the document representation technique and the performance of the classification model. The document representation technique used to represent web page features is an important aspect of web page classification because it affects the performance of the classification model. The integral web page feature is the textual content. The review found that image-based web page classification showed higher performance than text-based web page classification. Because there is a lack of matrix representations that can effectively handle long web page text content, a new document representation technique, the word cloud image, can be used to visualize the words extracted from the text content of a web page.
Data Pre-processing of Website Browsing Records: To Prepare Quality Dataset for Web Page Classification
Apandi, Siti Hawa; Sallim, Jamaludin; Mohamed, Rozlina; Ahmad, Norkhairi
JOIV : International Journal on Informatics Visualization Vol 8, No 1 (2024)
Publisher : Society of Visual Informatics

DOI: 10.62527/joiv.8.1.1618

Abstract

The increased usage of the internet worldwide has led to an abundance of web pages designed to supply information to internet users. Web page classification is becoming increasingly necessary to organize the growing number of web pages. A web page classification model can serve as a tool to restrict internet usage to specific categories of web pages. To develop the classification model, it is crucial to check the quality of the dataset, as it determines the performance of the model. Raw datasets are typically unreliable and subject to noise, which complicates data analysis. This is why data pre-processing is necessary to prepare the dataset properly. In this study, website browsing records serve as the dataset. The primary goal of this paper is to investigate data pre-processing techniques for website browsing records, focusing on Game and Online Video Streaming web pages. Data pre-processing involves two main steps: data cleaning and web content pre-processing. After the data cleaning process, the dataset is smaller than the original. This demonstrates that many records can be eliminated because they are inactive or unsuitable as data for Game and Online Video Streaming web pages. Meanwhile, web content pre-processing removes noise from an HTML document, retaining only the relevant words that can represent the web page, which are then used to create a word cloud image. Convolutional Neural Networks (CNN) will be used to construct a model that categorizes web pages as either Game or Online Video Streaming. The pre-processed data will be used as the input for this model.
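
The web content pre-processing step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `VisibleTextParser` class, the `word_frequencies` function, the tiny `STOP_WORDS` list, and the sample HTML are all hypothetical and chosen only for illustration. The sketch strips script and style content from an HTML document, drops stop words, and counts the remaining words, which is the kind of frequency table a word cloud renderer would take as input.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "on", "is", "are", "with"}


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_frequencies(html):
    """Return a Counter of lowercase words (3+ letters) minus stop words."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    words = re.findall(r"[a-z]{3,}", text)
    return Counter(w for w in words if w not in STOP_WORDS)


# Made-up sample page, used only to exercise the sketch.
sample = """<html><head><title>Game Zone</title>
<style>body { color: red; }</style>
<script>var tracker = 1;</script></head>
<body><h1>Play the game</h1><p>Free online game and puzzle game for all.</p></body></html>"""

freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

The resulting counts could then be passed to a word cloud renderer to produce the image that the abstract describes as input to the CNN. The noise filters actually used in the paper may differ from the ones assumed here.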