Journal of Applied Data Sciences
Vol 7, No 1: January 2026

Self-consistency and Graph-based Filtering to Enhance Synthetic Arabic SMS Generation for Smishing Detection

Alotaibi, Amal
Almasre, Miada
Surougi, Hadeel
Alkhozae, Mona
Alghanmi, Nouf



Article Info

Publish Date
19 Dec 2025

Abstract

Smishing, or SMS phishing, is a growing threat to mobile security, and Arabic-speaking regions are particularly vulnerable because large labeled datasets are lacking. This study develops a scalable pipeline that generates and classifies Arabic SMS messages to address this data scarcity and improve detection performance. The contributions are threefold: (i) constructing a balanced dataset of 6,903 messages by combining 903 synthetic samples with 6,000 real Arabic SMS messages; (ii) introducing a hybrid generation framework that integrates a fine-tuned GPT-3.5-turbo language model with Conditional WGAN embeddings, refined using self-consistency sampling and graph-based redundancy filtering; and (iii) evaluating the dataset with machine learning models (Logistic Regression, Random Forest, SVM) and deep learning models (CNN, BERT). The pipeline unifies adversarial embedding generation, large language model fine-tuning, and cosine similarity filtering. All models performed strongly: Logistic Regression and Random Forest each achieved an accuracy of 0.9949 and an F1-score of 0.9950, while SVM performed best, with an accuracy of 0.9957 and an F1-score of 0.9957. Among the deep learning models, CNN reached an accuracy of 0.9942 and an F1-score of 0.9942, and BERT achieved 0.9900 across all metrics. These findings indicate that SVM is the most effective model on this dataset, while CNN and BERT add robustness by capturing semantic subtleties. Visual analyses, including confusion matrices and t-SNE projections, confirmed the overlap between real and synthetic embeddings, and comparative tables position the study within recent Arabic smishing research.
The novelty of this work lies in combining self-consistency and graph-based filtering within a hybrid generation-classification pipeline tailored to Arabic SMS, providing a reproducible framework that can be extended to low-resource, multilingual, and cross-platform settings such as WhatsApp and Telegram.
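
The abstract does not give the details of the graph-based redundancy filter, so the following is only an illustrative sketch of one common way such a filter can work, not the authors' implementation. It assumes each generated message has already been mapped to an embedding vector, links two messages whenever their cosine similarity reaches a threshold, and keeps one representative per connected component. The function names, the 0.9 threshold, and the keep-the-earliest-message rule are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_redundant(embeddings: Sequence[Sequence[float]],
                     threshold: float = 0.9) -> List[int]:
    """Return indices of messages to keep, one per near-duplicate group.

    Hypothetical helper: an edge joins messages i and j when their cosine
    similarity is >= threshold (an assumed value). Each connected component
    of the resulting graph is treated as one redundancy group, and the
    message with the smallest index in the group is kept.
    """
    n = len(embeddings)
    parent = list(range(n))  # union-find forest over message indices

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Attach the larger root to the smaller one, so every
            # component's root is its smallest index.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    # Pairwise comparison is O(n^2); acceptable for a small synthetic set.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                union(i, j)

    return sorted({find(i) for i in range(n)})
```

For example, given five toy embeddings in which messages 0 and 1 are near-duplicates and messages 2 and 3 are near-duplicates, `filter_redundant` keeps indices 0, 2, and 4. Using connected components makes the filter transitive (a chain of similar messages collapses to one); a stricter rule, such as dropping a message only when it is similar to an already kept one, would retain more samples.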

Copyrights © 2026






Journal Info

Abbrev

JADS

Subject

Computer Science & IT; Control & Systems Engineering; Decision Sciences, Operations Research & Management

Description

One of the current hot topics in science is data: how can datasets be used in scientific and scholarly research in a more reliable, citable and accountable way? Data is of paramount importance to scientific progress, yet most research data remains private. Enhancing the transparency of the processes ...