Journal of Applied Data Sciences
Vol 7, No 1: January 2026

Assessing Large Language Models for Zero-Shot Dynamic Question Generation and Automated Leadership Competency Assessment

Gheartha, I Gusti Bagus Yogiswara
Adiwijaya, Adiwijaya
Romadhony, Ade
Ardiansyah, Yusfi



Article Info

Publish Date
19 Dec 2025

Abstract

Automated interview systems powered by artificial intelligence often rely on fine-tuned models and annotated datasets, limiting their adaptability to new leadership competency frameworks. Large language models have shown potential for generating questions and assessing answers, yet their zero-shot performance (operating without task-specific retraining) remains underexplored in leadership assessment. This study examines the zero-shot capability of two models, Qwen 32B and GPT-4o-mini, within a multi-turn self-interview framework. Both models dynamically generated questions, interpreted responses, and assigned scores across ten leadership competencies. Professionals in Digital Marketing and Account Manager roles participated, each completing two AI-led interview sessions. Model outputs were evaluated by certified experts using a structured rubric across three dimensions: quality of behavioral insights, relevance of follow-up questions, and fit of assigned scores. Results indicate that Qwen 32B generated richer insights than GPT-4o-mini (mean = 2.86 vs. 2.62; p < 0.01) and provided more differentiated assessments across competencies. GPT-4o-mini produced more consistent follow-up questions but lacked interpretive depth, often yielding generic outputs. Both models struggled to score candidate responses accurately, reflected in low answer-score ratings (Qwen mean = 2.35; GPT mean = 2.21). These findings suggest a trade-off between insight richness and scoring stability, with both models demonstrating limited ability to fully capture nuanced leadership behaviors. This study offers one of the first empirical benchmarks of zero-shot model performance in leadership interviews and underscores both the promise and the current limitations of deploying such systems for scalable assessment. Future research should explore competency-specific prompt strategies, fairness evaluation across demographic groups, and domain-adapted fine-tuning to improve accuracy, reliability, and ethical alignment in high-stakes recruitment contexts.
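
To make the abstract's pipeline concrete, the following is a minimal sketch of a zero-shot, multi-turn interview loop, assuming an OpenAI-compatible chat API. The competency list (two shown; the study used ten), the system prompt, the number of turns, the 1-5 rating scale, and the collect_answer() stub are illustrative assumptions, not the authors' implementation.

# Minimal sketch: zero-shot question generation and competency scoring.
# Assumes an OpenAI-compatible chat endpoint; no fine-tuning is involved,
# which is what "zero-shot" means in the study.
from openai import OpenAI

client = OpenAI()  # for Qwen 32B, point base_url at an OpenAI-compatible server

COMPETENCIES = ["Decision Making", "Team Leadership"]  # study assessed ten

SYSTEM = (
    "You are an interviewer assessing leadership competencies. "
    "Ask one behavioral question at a time, then probe with follow-ups."
)

def collect_answer(question: str) -> str:
    """Stub: in the study, a human professional answers each question."""
    return input(f"\n{question}\n> ")

def interview(model: str, turns_per_competency: int = 2) -> dict[str, str]:
    scores = {}
    for comp in COMPETENCIES:
        messages = [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Generate an opening question on: {comp}"},
        ]
        for _ in range(turns_per_competency):
            # Model dynamically generates the next question from the dialogue so far.
            reply = client.chat.completions.create(model=model, messages=messages)
            question = reply.choices[0].message.content
            messages.append({"role": "assistant", "content": question})
            messages.append({"role": "user", "content": collect_answer(question)})
        # Zero-shot scoring: the rubric lives entirely in the prompt.
        messages.append({
            "role": "user",
            "content": f"Rate the candidate on {comp} from 1 to 5 and justify briefly.",
        })
        verdict = client.chat.completions.create(model=model, messages=messages)
        scores[comp] = verdict.choices[0].message.content
    return scores

# e.g. interview("gpt-4o-mini")

In this setup the only difference between the two conditions is the model identifier: GPT-4o-mini is called directly, while an open-weight model such as Qwen 32B would be served behind an OpenAI-compatible endpoint (e.g., vLLM), keeping the prompts and loop identical across models.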

Copyright © 2026






Journal Info

Abbrev

JADS

Subject

Computer Science & IT; Control & Systems Engineering; Decision Sciences, Operations Research & Management

Description

One of the current hot topics in science is data: how can datasets be used in scientific and scholarly research in a more reliable, citable, and accountable way? Data is of paramount importance to scientific progress, yet most research data remains private. Enhancing the transparency of the processes ...