Journal of Computer Science and Engineering (JCSE)
Vol 5, No 2: August (2024)

Class-Oriented Text Vectorization for Text Classification: Case Study of Job Offer Classification

Wabo Tatchum, Ghislain
Nzekon Nzeko'o, Armel Jacques
Sosso Makembe, Fritz
Youh Djam, Xaviera



Article Info

Publish Date
30 Aug 2024

Abstract

Advances in data science have made it possible to solve many real-life problems through automatic text classification. This is the case in e-recruitment, where job offers are classified and recommended to jobseekers. In natural language processing, text classification involves a vectorization step in which each document is represented as a vector whose coordinates each correspond to a keyword. These keywords are usually obtained by vectorizing the entire corpus, so they serve to distinguish one document from another. For classification, however, it is preferable for each keyword to distinguish one class from another. To obtain such keywords, we take the class of each document into account during vectorization: we first create a class-oriented document for each class by merging all documents from the same class, and then apply a vectorization algorithm. Experiments are carried out using datasets from Minajobs, Nigham, and Monster with four classification models: Decision Tree (DT), Naive Bayes (NB), Support Vector Machine (SVM), and a deep neural network self-attention transformer (TFM). The vectorization methods applied to the class-oriented documents are Doc2Vec and TF-IDF, combined with our class-oriented vectorization strategies, including OC, ZIPF, and OWDC. We evaluate the experiments with the precision, MAP, and F1-score metrics. According to the results, compared to previous work and the traditional way of classifying text documents, the TFM methods can improve accuracy by 29%, 40%, and 33%; the NB methods by 19%, 22%, and 20%; the DT methods by 34%, 37%, and 34%; and the SVM methods by 33%, 34%, and 34% on the Monster, Nigham, and Minajobs datasets. In addition, we validate our contribution by comparing our approach with three other works from the literature on four datasets (RE'16, Wap, WebKB, and Kla), obtaining improvements in accuracy and F1-score of up to 55%.
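The class-oriented idea described in the abstract (merge all documents of a class into one document, then vectorize) can be illustrated with a minimal sketch. This is not the authors' implementation and does not reproduce their OC, ZIPF, or OWDC strategies. It assumes a plain TF-IDF computed over the merged class documents, a whitespace tokenizer, and a toy corpus. The function names (`build_class_documents`, `class_tfidf`, `top_keywords`) and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def build_class_documents(docs, labels):
    """Merge all documents sharing a label into one token list per class."""
    merged = defaultdict(list)
    for text, label in zip(docs, labels):
        merged[label].extend(tokenize(text))
    return dict(merged)


def class_tfidf(class_docs):
    """Plain TF-IDF where each 'document' is a merged class document."""
    n_classes = len(class_docs)
    # Count in how many class documents each term appears.
    class_freq = Counter()
    for tokens in class_docs.values():
        class_freq.update(set(tokens))
    weights = {}
    for label, tokens in class_docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[label] = {
            term: (count / total) * math.log(n_classes / class_freq[term])
            for term, count in counts.items()
        }
    return weights


def top_keywords(weights, k):
    """Keep the k highest-weighted terms per class (ties broken alphabetically)."""
    return {
        label: [t for t, _ in sorted(w.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for label, w in weights.items()
    }


# Hypothetical toy corpus: two job-offer classes.
docs = [
    "python developer backend api team",
    "java developer backend services",
    "nurse hospital patient care team",
    "doctor hospital patient surgery",
]
labels = ["it", "it", "health", "health"]

class_docs = build_class_documents(docs, labels)
weights = class_tfidf(class_docs)
keywords = top_keywords(weights, k=2)
print(keywords)
```

In this sketch, a term that occurs in every class document (here "team") receives a weight of zero, so it is never selected as a class keyword, whereas terms concentrated in a single class rank highest. That is the property the abstract asks of keywords: separating classes rather than individual documents.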

Copyrights © 2024






Journal Info

Abbrev

JCSE

Publisher

Subject

Computer Science & IT

Description

Computer architecture, processor design, operating systems, high-performance computing, parallel processing, computer networks, embedded systems, theory of computation, design and analysis of algorithms, data structures and database systems, ...