Journal of Applied Data Sciences
Vol 7, No 1: January 2026

Predicting Gender from Online Dating Self-Introductions Using Machine Learning, Deep Learning, and DistilBERT

Gonzalez Casanova, Lionel F.
Chen, Wen-Ju
Wei, Hsi-Sheng



Article Info

Publish Date
14 Jan 2026

Abstract

This study investigates a novel approach to automated gender classification in online dating profiles by comparing models that span traditional machine learning, deep learning, and transformer-based architectures. The dataset consists of self-introduction essays drawn from publicly accessible repositories and is enriched with psychological features (LIWC), lexical features (Bag-of-Words, BoW), and contextual representations (raw text). The primary objective is to evaluate predictive performance, robustness, and computational cost across these modeling strategies and to assess their trade-offs. A comprehensive preprocessing pipeline was implemented, including missing-value handling, text cleaning, LIWC feature extraction, BoW vectorization, one-hot encoding of categorical variables, and class-imbalance mitigation through random oversampling. Text augmentation using synonym replacement was subsequently applied to increase data diversity while maintaining realistic linguistic patterns. Stratified five-fold cross-validation (StratifiedKFold, k = 5) was used for the traditional models, the LIWC-only deep learning experiments, and the LIWC+BoW configurations to ensure class-balanced splits. DistilBERT was fine-tuned on raw essay data using an 80/20 train–test split under GPU memory and batch-size constraints. Across three runs, DistilBERT achieved an average testing accuracy of 91% ± 1%, with precision, recall, F1-score, and ROC–AUC indicating balanced performance. A GRU trained on LIWC+BoW features reached 88.62% ± 0.53% accuracy, offering competitive results at substantially lower computational cost. An MLP trained solely on LIWC features provided a stable and interpretable baseline. Confusion matrices showed balanced predictions between the male and female classes, highlighting the importance of feature representation and model selection. Overall, the findings demonstrate clear trade-offs between computational demand and semantic modeling capability.
These results contribute to ongoing research on gender identification and guide future work on fairness, robustness, and explainability in AI-assisted user profiling. The study also underscores practical benefits for the automated analysis of unstructured text in social and psychological applications, while recognizing ethical considerations related to non-binary and gender-fluid individuals.
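The evaluation protocol described in the abstract (stratified five-fold splits, random oversampling, and accuracy reported as mean ± standard deviation) can be illustrated with a minimal standard-library sketch. It is not the authors' code: the helper names `stratified_kfold`, `random_oversample`, and `mean_and_std` are hypothetical, the labels are made-up toy data, and applying oversampling only inside each training fold is an assumption, since the abstract does not state the order of splitting and oversampling.

```python
import random
import statistics
from collections import Counter, defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


def random_oversample(indices, labels, seed=0):
    """Duplicate randomly chosen minority-class indices until classes are balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap up to the majority-class size.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def mean_and_std(scores):
    """Summarise per-fold or per-run scores as (mean, sample standard deviation)."""
    return statistics.mean(scores), statistics.stdev(scores)


# Toy, made-up labels: 30 "F" and 20 "M" profiles (not the study's data).
labels = ["F"] * 30 + ["M"] * 20
for train_idx, test_idx in stratified_kfold(labels, k=5):
    train_balanced = random_oversample(train_idx, labels)
    counts = Counter(labels[i] for i in train_balanced)
    # Only the training fold is oversampled; the test fold is left untouched.
    print(len(test_idx), dict(counts))
```

Oversampling inside the training fold, as assumed here, keeps duplicated essays out of the held-out fold, so reported accuracy is not inflated by test examples that were also seen during training.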

Copyrights © 2026






Journal Info

Abbrev

JADS

Subject

Computer Science & IT; Control & Systems Engineering; Decision Sciences, Operations Research & Management

Description

One of the current hot topics in science is data: how can datasets be used in scientific and scholarly research in a more reliable, citable and accountable way? Data is of paramount importance to scientific progress, yet most research data remains private. Enhancing the transparency of the processes ...