Topic modeling (TM) is an unsupervised technique for discovering hidden or abstract topics in large corpora by extracting meaningful patterns of words (semantics). This paper examines TM within data mining (DM), focusing on the challenges of and advances in extracting insights from datasets, especially those drawn from social media platforms (SMPs). Traditional techniques such as latent Dirichlet allocation (LDA) are examined alongside newer methods, including bidirectional encoder representations from transformers (BERT), generative pre-trained transformers (GPT), and XLNet. The paper highlights the limitations of LDA, which have prompted the adoption of embedding-based models such as BERT and GPT; rooted in the transformer architecture, these models offer stronger context awareness and semantic understanding. The paper emphasizes leveraging pre-trained transformer-based language models to generate document embeddings, refining TM and improving accuracy. Notably, integrating BERT with XLNet-generated summaries emerges as a promising approach. By synthesizing these insights, the paper aims to guide researchers in optimizing TM techniques, potentially shifting how insights are extracted from textual data.
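To make the embedding-based workflow described above concrete, the following is a minimal sketch of transformer-driven topic modeling: documents are encoded with a pre-trained sentence encoder, clustered, and each cluster is summarized by its most frequent words. The model name, cluster count, and toy corpus are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of embedding-based topic modeling (assumed setup, not the
# paper's exact pipeline): pre-trained transformer embeddings + clustering.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

docs = [
    "The new phone camera produces sharp low-light photos.",
    "Battery life on the latest handset is impressive.",
    "The election results sparked debate across social media.",
    "Candidates traded arguments during the televised debate.",
]

# 1. Encode each document into a dense, context-aware vector.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
embeddings = encoder.encode(docs)

# 2. Group documents with similar embeddings into topics.
n_topics = 2  # illustrative; in practice chosen via coherence or validation
labels = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(embeddings)

# 3. Describe each topic by the most frequent words within its cluster.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)
vocab = np.array(vectorizer.get_feature_names_out())
for topic in range(n_topics):
    cluster_counts = counts[labels == topic].sum(axis=0).A1
    top_words = vocab[cluster_counts.argsort()[::-1][:5]]
    print(f"Topic {topic}: {', '.join(top_words)}")
```

This mirrors the general idea the paper discusses: replacing LDA's bag-of-words topic inference with clusters formed in a transformer embedding space, where semantically related documents land near one another.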