Demographic attribute recognition, particularly race and gender classification from facial images, plays a critical role in applications ranging from precision healthcare to digital identity systems. However, existing deep learning approaches often suffer from algorithmic bias and limited robustness, especially when trained on imbalanced or non-representative data. To address these challenges, this study proposes MD-ViT, a novel framework that leverages multi-domain Vision Transformer (ViT) fusion to enhance both accuracy and fairness in demographic classification. Specifically, we integrate embeddings from two task-specific pretrained ViTs: ViT-VGGFace (fine-tuned on VGGFace2 for structural identity features) and ViT-Face Age (trained on UTKFace and IMDB-WIKI for age-related morphological cues), followed by classification using XGBoost to model complex feature interactions while mitigating overfitting. Evaluated on the balanced DemogPairs dataset (10,800 images across six intersectional subgroups), our approach achieves 89.07% accuracy and an 89.06% F1-score, outperforming single-domain baselines (ViT-VGGFace: 88.61%; ViT-Age: 78.94%). Crucially, fairness analysis reveals minimal performance disparity across subgroups (F1-score range: 87.38%–91.03%; σ = 1.33), indicating effective mitigation of intersectional bias. These results demonstrate that cross-task feature fusion can yield representations that are not only more discriminative but also more equitable. We conclude that MD-ViT offers a principled, modular, and ethically grounded pathway toward fairer soft biometric systems, particularly in high-stakes domains such as digital health and inclusive access control.
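For concreteness, the sketch below illustrates the kind of pipeline the abstract describes: embeddings are extracted from two pretrained ViTs, fused, and passed to an XGBoost classifier. The checkpoint names, CLS-token pooling, concatenation as the fusion step, the XGBoost hyperparameters, and the train_images/y_train variables are all illustrative assumptions, not the paper's exact configuration.

# Minimal sketch of a multi-domain ViT fusion + XGBoost pipeline
# (assumed details: checkpoint names, CLS-token pooling, concatenation fusion,
#  XGBoost hyperparameters; train_images and y_train are placeholders).
import numpy as np
import torch
from transformers import ViTImageProcessor, ViTModel
from xgboost import XGBClassifier

def extract_embeddings(model_name, images):
    """Return one CLS-token embedding per image from a pretrained ViT."""
    processor = ViTImageProcessor.from_pretrained(model_name)
    model = ViTModel.from_pretrained(model_name).eval()
    with torch.no_grad():
        inputs = processor(images=images, return_tensors="pt")
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()  # CLS token

# Hypothetical checkpoint identifiers for the two task-specific backbones.
emb_identity = extract_embeddings("vit-vggface", train_images)
emb_age = extract_embeddings("vit-face-age", train_images)

# Cross-task fusion via feature concatenation, then XGBoost classification.
X_train = np.concatenate([emb_identity, emb_age], axis=1)
clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_train, y_train)  # y_train: intersectional subgroup labels

At inference, the same two-backbone extraction and concatenation would be applied to test images before calling clf.predict.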