Predicting Gender from Online Dating Self-Introductions Using Machine Learning, Deep Learning, and DistilBERT
Gonzalez Casanova, Lionel F.; Chen, Wen-Ju; Wei, Hsi-Sheng
Journal of Applied Data Sciences Vol 7, No 1: January 2026
Publisher : Bright Publisher

DOI: 10.47738/jads.v7i1.979

Abstract

This study investigates a novel approach to automated gender classification in online dating profiles by comparing models that span traditional machine learning, deep learning, and transformer-based architectures. The dataset consists of self-introduction essays drawn from publicly accessible repositories and is enriched with psychological features (LIWC), lexical features (bag-of-words), and contextual representations (raw text). The primary objective is to evaluate predictive performance, robustness, and computational cost across these modeling strategies and to assess their trade-offs. A comprehensive preprocessing pipeline was implemented, including missing-value handling, text cleaning, LIWC feature extraction, bag-of-words (BoW) vectorization, one-hot encoding of categorical variables, and class-imbalance mitigation through random oversampling. Text augmentation using synonym replacement was subsequently applied to increase data diversity while maintaining realistic linguistic patterns. Stratified five-fold cross-validation was used for traditional models and LIWC-only deep learning experiments, and StratifiedKFold (k = 5) was applied to LIWC+BoW configurations to ensure balanced splits. DistilBERT was fine-tuned on raw essay data using an 80/20 train–test split under GPU memory and batch-size constraints. Across three runs, DistilBERT achieved an average testing accuracy of 91% ± 1%, with precision, recall, F1-score, and ROC–AUC indicating balanced performance. A GRU trained on LIWC+BoW features reached 88.62% ± 0.53% accuracy, offering competitive results at substantially lower computational cost. An MLP trained solely on LIWC features provided a stable and interpretable baseline. Confusion matrices showed balanced predictions between the male and female classes, highlighting the importance of feature representation and model selection. Overall, the findings demonstrate clear trade-offs between computational demand and semantic modeling capability.
These results contribute to ongoing research on gender identification and guide future work on fairness, robustness, and explainability in AI-assisted user profiling. The study also underscores practical benefits for automated analysis of unstructured text in social and psychological applications, while recognizing ethical considerations related to non-binary and gender-fluid individuals.
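
The preprocessing steps named in the abstract (bag-of-words vectorization, stratified five-fold splitting, and random oversampling) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the function names, the whitespace tokenizer, and the toy essays are hypothetical, and applying oversampling only to the training folds is an assumption made here to keep duplicated samples out of the held-out fold. The abstract does not state where in the pipeline oversampling was applied.

```python
import random
from collections import Counter, defaultdict


def bag_of_words(texts):
    """Map each text to token counts over a shared, alphabetically sorted vocabulary."""
    tokenized = [t.lower().split() for t in texts]  # naive whitespace tokenizer (assumption)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that each keep roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal each class round-robin so every fold gets its share.
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    return folds


def random_oversample(indices, labels, seed=0):
    """Duplicate randomly chosen minority-class indices until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for y in sorted(by_class):
        idxs = by_class[y]
        balanced.extend(idxs)
        balanced.extend(rng.choice(idxs) for _ in range(target - len(idxs)))
    return balanced


if __name__ == "__main__":
    # Toy, imbalanced label set (illustrative only; not data from the study).
    labels = ["f"] * 10 + ["m"] * 5
    folds = stratified_folds(labels, k=5, seed=1)
    for held_out, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        train_idx = random_oversample(train_idx, labels, seed=held_out)
        counts = Counter(labels[i] for i in train_idx)
        print(f"fold {held_out}: train class counts {dict(counts)}, test size {len(test_idx)}")
```

In practice, a library implementation such as scikit-learn's StratifiedKFold, which the abstract mentions by name, would normally replace the hand-written splitter; the version above only makes the mechanics visible.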