Journal of Applied Data Sciences

Vol 7, No 1: January 2026

Gonzalez Casanova, Lionel F.
Chen, Wen-Ju
Wei, Hsi-Sheng

Publish Date
14 Jan 2026

This study investigates automated gender classification in online dating profiles by comparing models that span traditional machine learning, deep learning, and transformer-based architectures. The dataset consists of self-introduction essays drawn from publicly accessible repositories and is enriched with psychological features from Linguistic Inquiry and Word Count (LIWC), lexical features (Bag-of-Words, BoW), and contextual representations (raw text). The primary objective is to evaluate predictive performance, robustness, and computational cost across these modeling strategies and to assess the trade-offs among them. A comprehensive preprocessing pipeline was implemented, including missing-value handling, text cleaning, LIWC feature extraction, BoW vectorization, one-hot encoding of categorical variables, and class-imbalance mitigation through random oversampling. Text augmentation using synonym replacement was subsequently applied to increase data diversity while maintaining realistic linguistic patterns. Stratified five-fold cross-validation (StratifiedKFold, k = 5) was used for the traditional models, the LIWC-only deep learning experiments, and the LIWC+BoW configurations to ensure class-balanced splits. DistilBERT was fine-tuned on raw essay data using an 80/20 train–test split under GPU memory and batch-size constraints. Across three runs, DistilBERT achieved an average testing accuracy of 91% ± 1%, with precision, recall, F1-score, and ROC-AUC indicating balanced performance. A GRU trained on LIWC+BoW features reached 88.62% ± 0.53% accuracy, offering competitive results at substantially lower computational cost. An MLP trained solely on LIWC features provided a stable and interpretable baseline. Confusion matrices showed balanced predictions between the male and female classes, highlighting the importance of feature representation and model selection. Overall, the findings demonstrate clear trade-offs between computational demand and semantic modeling capability. These results contribute to ongoing research on gender identification and guide future work on fairness, robustness, and explainability in AI-assisted user profiling. The study also underscores practical benefits for the automated analysis of unstructured text in social and psychological applications, while recognizing ethical considerations related to non-binary and gender-fluid individuals.
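
The evaluation protocol named in the abstract (stratified five-fold cross-validation combined with random oversampling) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the function names, the toy classifier `bow_fit_predict`, the synthetic tokens, and the class labels "A" and "B" are all hypothetical. The abstract also does not say at which stage oversampling was applied; the sketch assumes it is applied to each training fold only, so that duplicated essays cannot leak into the held-out fold.

```python
import random
from collections import Counter, defaultdict
from statistics import mean, stdev


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose class mix mirrors the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


def random_oversample(indices, labels, seed=0):
    """Duplicate minority-class indices at random until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def bow_fit_predict(train_texts, train_labels, test_texts):
    """Toy bag-of-words classifier (hypothetical stand-in for a real model)."""
    counts = defaultdict(Counter)
    for text, label in zip(train_texts, train_labels):
        counts[label].update(text.lower().split())
    classes = sorted(counts)
    # Predict the class whose training word counts best overlap the test text.
    return [
        max(classes, key=lambda c: sum(counts[c][w] for w in text.lower().split()))
        for text in test_texts
    ]


def cross_validate(texts, labels, fit_predict, k=5, seed=0):
    """Return one accuracy score per fold, oversampling the training fold only."""
    scores = []
    for train_idx, test_idx in stratified_kfold(labels, k, seed):
        train_idx = random_oversample(train_idx, labels, seed)
        preds = fit_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in test_idx],
        )
        hits = sum(pred == labels[i] for pred, i in zip(preds, test_idx))
        scores.append(hits / len(test_idx))
    return scores


# Synthetic, imbalanced toy data: 30 essays of class "A", 70 of class "B".
toy_labels = ["A"] * 30 + ["B"] * 70
toy_texts = ["alpha beta"] * 30 + ["gamma delta"] * 70
fold_scores = cross_validate(toy_texts, toy_labels, bow_fit_predict, k=5)
print(f"accuracy: {mean(fold_scores):.2%} ± {stdev(fold_scores):.2%}")
```

Reporting the mean and standard deviation over folds mirrors the "mean ± deviation" form in which the abstract states its accuracies. In practice a library implementation such as scikit-learn's `StratifiedKFold` would normally replace the hand-written splitter.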

Journal Info

Journal of Applied Data Sciences

Abbrev

JADS

Publisher

Bright Publisher

Subject

Computer Science & IT; Control & Systems Engineering; Decision Sciences, Operations Research & Management

Description

One of the current hot topics in science is data: how can datasets be used in scientific and scholarly research in a more reliable, citable and accountable way? Data is of paramount importance to scientific progress, yet most research data remains private. Enhancing the transparency of the processes ...

Predicting Gender from Online Dating Self-Introductions Using Machine Learning, Deep Learning, and DistilBERT
