Sidek, Zuleaizal
Unknown Affiliation

Published: 2 documents

Unsupervised outlier detection in high-dimensional text data: a comparative analysis
Sidek, Zuleaizal; Ahmad, Sharifah Sakinah Syed; Teo, Noor Hasimah Ibrahim
Bulletin of Electrical Engineering and Informatics Vol 14, No 4: August 2025
Publisher : Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v14i4.9573

Abstract

Outlier detection in user reviews is a critical task for identifying anomalous and potentially valuable insights within large datasets. This study presents a comparative analysis of three algorithms for outlier detection in user reviews: isolation forest, local outlier factor (LOF), and latent Dirichlet allocation (LDA). Each algorithm was evaluated using accuracy for outlier detection and silhouette score for clustering quality. LDA performed best, with 0.98 accuracy and a silhouette score of 0.13. Isolation forest followed with 0.90 accuracy and a silhouette score of 0.11. LOF scored lowest, with 0.42 accuracy and a silhouette score of -0.05, a result attributed to its sensitivity to neighbors. The study contributes a systematic exploration of how parameter variations affect algorithm performance, providing insights for high-dimensional text data analysis. Despite the promising results, limitations include the dependence on preprocessing and on specific parameter settings. Future work will explore hybrid approaches and broader datasets to enhance scalability and adaptability.
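
Because the abstract reports silhouette scores close to zero (0.13, 0.11, -0.05), it may help to see how that metric is computed. The following is a minimal pure-Python sketch of the standard silhouette coefficient, s(i) = (b - a) / max(a, b), averaged over all points. It is not the authors' code; the function names, the Euclidean distance, and the toy 2-D points are assumptions chosen for illustration only.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s(i) = 0
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (hypothetical): two tight, well-separated groups.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

A score near 1 indicates compact, well-separated clusters, a score near 0 indicates overlapping clusters, and a negative score indicates that points sit, on average, closer to another cluster than to their own.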

Exploring word embeddings and clustering algorithms for user reviews
Sidek, Zuleaizal; Syed Ahmad, Sharifah Sakinah
Indonesian Journal of Electrical Engineering and Computer Science Vol 41, No 3: March 2026
Publisher : Institute of Advanced Engineering and Science

DOI: 10.11591/ijeecs.v41.i3.pp1017-1024

Abstract

The rapid advancement of information technology has led to a significant surge in the volume of unstructured textual data. This poses a major problem for analyzing, organizing, and automatically clustering text for research purposes, which is crucial for extracting valuable insights. Manually clustering unstructured data such as online customer reviews, which capture customers' opinions on products, services, and social events, requires significant financial resources, manpower, and time. Most existing studies focus on sentiment analysis of user reviews. To address these issues, automated text clustering can help categorize reviews into themes, thereby simplifying the analysis process. In this paper, we present and compare experimental results for combinations of five text clustering techniques, namely K-means, fuzzy C-means (FCM), non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), and latent semantic analysis (LSA), with three embedding techniques, namely term frequency–inverse document frequency (TF-IDF), Word2Vec, and global vectors (GloVe). The experiments revealed that LDA is a reliable algorithm, as it consistently produces good results across the three word embeddings. The highest silhouette score recorded in the experiments was 0.66, obtained using LDA with Word2Vec as the word embedding. LSA in conjunction with Word2Vec also yields strong results, with a silhouette score of 0.65.
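
To make one of the embedding and clustering pairings named in the abstract concrete, here is a small pure-Python sketch that builds TF-IDF vectors and groups them with a cosine-based (spherical) variant of k-means. It is an illustration under stated assumptions, not the authors' pipeline: the whitespace tokenizer, the smoothed IDF formula, the seeding with the first k documents, and the four sample reviews are all hypothetical choices.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors (as dicts) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1 (one common variant)
        vec = {
            t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def spherical_kmeans(vectors, k, iterations=10):
    """Cosine-based k-means; assumption: the first k documents seed the centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the normalised sum of its members.
        for c in range(k):
            summed = Counter()
            for v, lab in zip(vectors, labels):
                if lab == c:
                    for t, w in v.items():
                        summed[t] += w
            norm = math.sqrt(sum(w * w for w in summed.values()))
            if norm > 0:
                centroids[c] = {t: w / norm for t, w in summed.items()}
    return labels


# Hypothetical reviews covering two themes: battery and delivery.
reviews = [
    "battery life is great and battery charges fast",
    "delivery was late and the courier was rude",
    "great battery and fast charging",
    "late delivery and rude courier service",
]
labels = spherical_kmeans(tfidf_vectors(reviews), k=2)
print(labels)
```

On this toy input the two battery reviews should land in one cluster and the two delivery reviews in the other. Swapping in Word2Vec or GloVe vectors, or another clustering method such as NMF or LDA, would change only the vectorisation or grouping step.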