Indonesian Journal of Electrical Engineering and Computer Science
Vol 26, No 1: April 2022

Classification based topic extraction using domain-specific vocabulary: a supervised approach

Vandana Kalra (Manav Rachna International Institute of Research Studies)
Indu Kashyap (Manav Rachna International Institute of Research Studies)
Harmeet Kaur (Hansraj College, University of Delhi)



Article Info

Publish Date
01 Apr 2022

Abstract

Recently, latent Dirichlet allocation (LDA), a probabilistic topic modelling approach, has been applied extensively to document classification. However, classical LDA is an unsupervised algorithm that uses a fixed number of topics, incorporates no prior domain knowledge, and generates different outcomes when the order of the documents changes. This article presents a comprehensive framework that avoids both the order effect and the unsupervised probabilistic nature of LDA. First, the framework creates a category-specific vocabulary using a weight-dependent model that extracts distinctive features suitable for supervised classification. Then, it maps each classified cluster of documents from the domain corpus to its relevant topic, which makes the approach more robust to noise. The framework was tested on a comprehensive collection of benchmark news datasets that vary in sample size, class characteristics, and classification tasks. Compared with conventional classification methods, the proposed framework achieved 95.56% and 95.23% accuracy on two datasets, indicating that the proposed algorithm has a better classification capability. Furthermore, the topics extracted from the classified clusters are highly relevant to the domain categories.
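
The two steps described in the abstract (building a category-specific vocabulary from term weights, then classifying documents against it) can be illustrated with a minimal sketch. The abstract does not specify the weight-dependent model, so the weighting used here (within-category term frequency multiplied by an inverse category frequency) is an assumption chosen for illustration, and the function names `build_category_vocab` and `classify` as well as the toy documents are hypothetical. Because the weights are computed from aggregate counts, the result does not depend on the order in which the documents are supplied.

```python
import math
from collections import Counter, defaultdict


def build_category_vocab(docs, labels, top_k=5):
    """Return {category: {term: weight}} with at most top_k terms per category.

    ASSUMPTION: the weight is (term frequency within the category) times
    log(number of categories / number of categories containing the term).
    This stands in for the paper's weight-dependent model, which the
    abstract does not define.
    """
    # Count term occurrences per category (whitespace tokenisation only).
    tf = defaultdict(Counter)
    for text, label in zip(docs, labels):
        tf[label].update(text.lower().split())

    n_cats = len(tf)
    # For each term, the number of categories in which it occurs.
    cat_freq = Counter(term for counts in tf.values() for term in counts)

    vocab = {}
    for cat, counts in tf.items():
        total = sum(counts.values())
        weights = {
            term: (count / total) * math.log(n_cats / cat_freq[term])
            for term, count in counts.items()
        }
        # Highest weight first; ties broken alphabetically for determinism.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        # Terms shared by every category get weight 0 and are dropped.
        vocab[cat] = {term: w for term, w in ranked[:top_k] if w > 0}
    return vocab


def classify(text, vocab):
    """Assign the category whose vocabulary gives the highest summed weight."""
    tokens = text.lower().split()
    scores = {
        cat: sum(terms.get(tok, 0.0) for tok in tokens)
        for cat, terms in vocab.items()
    }
    # If no token matches any vocabulary, the first category is returned.
    return max(scores, key=scores.get)


# Hypothetical toy corpus, not taken from the paper's datasets.
docs = [
    "the team won the match",
    "the striker scored a goal in the match",
    "the market fell as stocks dropped",
    "the bank raised rates and stocks rallied",
]
labels = ["sport", "sport", "business", "business"]

vocab = build_category_vocab(docs, labels)
prediction = classify("the goal decided the match", vocab)
```

In this sketch the highest-weighted terms of each category double as a rough description of its topic; a fuller treatment would add stop-word removal and a proper tokeniser.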

Copyright © 2022