Journal of Applied Data Sciences
Vol 7, No 1: January 2026

Assessing Large Language Models for Zero-Shot Dynamic Question Generation and Automated Leadership Competency Assessment

Gheartha, I Gusti Bagus Yogiswara
Adiwijaya, Adiwijaya
Romadhony, Ade
Ardiansyah, Yusfi



Article Info

Publish Date
19 Dec 2025

Abstract

Automated interview systems powered by artificial intelligence often rely on fine-tuned models and annotated datasets, limiting their adaptability to new leadership competency frameworks. Large language models have shown potential for generating questions and assessing answers, yet their zero-shot performance (operating without task-specific retraining) remains underexplored in leadership assessment. This study examines the zero-shot capability of two models, Qwen 32B and GPT-4o-mini, within a multi-turn self-interview framework. Both models dynamically generated questions, interpreted responses, and assigned scores across ten leadership competencies. Professionals in Digital Marketing and Account Manager roles participated, each completing two AI-led interview sessions. Model outputs were evaluated by certified experts using a structured rubric across three dimensions: quality of behavioral insights, relevance of follow-up questions, and fit of assigned scores. Results indicate that Qwen 32B generated richer insights than GPT-4o-mini (mean = 2.86 vs. 2.62; p < 0.01) and provided more differentiated assessments across competencies. GPT-4o-mini produced more consistent follow-up questions but lacked interpretive depth, often yielding generic outputs. Both models struggled to score candidate responses accurately, reflected in low answer-score ratings (Qwen mean = 2.35; GPT mean = 2.21). These findings suggest a trade-off between insight richness and scoring stability, with both models demonstrating limited ability to fully capture nuanced leadership behaviors. This study offers one of the first empirical benchmarks of zero-shot model performance in leadership interviews and underscores both the promise and the current limitations of deploying such systems for scalable assessment. Future research should explore competency-specific prompt strategies, fairness evaluation across demographic groups, and domain-adapted fine-tuning to improve accuracy, reliability, and ethical alignment in high-stakes recruitment contexts.
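
To make the abstract's pipeline concrete, the following is a minimal sketch of a zero-shot, multi-turn interview loop, assuming an OpenAI-compatible chat API. The competency list (two shown; the study used ten), the system prompt, the number of turns, the 1-5 rating scale, and the collect_answer() stub are illustrative assumptions, not the authors' implementation.

# Minimal sketch: zero-shot question generation and competency scoring.
# Assumes an OpenAI-compatible chat endpoint; no fine-tuning is involved,
# which is what "zero-shot" means in the study.
from openai import OpenAI

client = OpenAI()  # for Qwen 32B, point base_url at an OpenAI-compatible server

COMPETENCIES = ["Decision Making", "Team Leadership"]  # study assessed ten

SYSTEM = (
    "You are an interviewer assessing leadership competencies. "
    "Ask one behavioral question at a time, then probe with follow-ups."
)

def collect_answer(question: str) -> str:
    """Stub: in the study, a human professional answers each question."""
    return input(f"\n{question}\n> ")

def interview(model: str, turns_per_competency: int = 2) -> dict[str, str]:
    scores = {}
    for comp in COMPETENCIES:
        messages = [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Generate an opening question on: {comp}"},
        ]
        for _ in range(turns_per_competency):
            # Model dynamically generates the next question from the dialogue so far.
            reply = client.chat.completions.create(model=model, messages=messages)
            question = reply.choices[0].message.content
            messages.append({"role": "assistant", "content": question})
            messages.append({"role": "user", "content": collect_answer(question)})
        # Zero-shot scoring: the rubric lives entirely in the prompt.
        messages.append({
            "role": "user",
            "content": f"Rate the candidate on {comp} from 1 to 5 and justify briefly.",
        })
        verdict = client.chat.completions.create(model=model, messages=messages)
        scores[comp] = verdict.choices[0].message.content
    return scores

# e.g. interview("gpt-4o-mini")

In this setup the only difference between the two conditions is the model identifier: GPT-4o-mini is called directly, while an open-weight model such as Qwen 32B would be served behind an OpenAI-compatible endpoint (e.g., vLLM), keeping the prompts and loop identical across models.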

Copyright © 2026






Journal Info

Abbrev

JADS

Subject

Computer Science & IT; Control & Systems Engineering; Decision Sciences, Operations Research & Management

Description

One of the current hot topics in science is data: how can datasets be used in scientific and scholarly research in a more reliable, citable, and accountable way? Data is of paramount importance to scientific progress, yet most research data remains private. Enhancing the transparency of the processes ...