This study investigates the implementation and effectiveness of an OpenAI Whisper-based automatic speech recognition (ASR) system for assessing and improving the speaking skills of Indonesian EFL students. Using a mixed-methods, one-group pretest-posttest design, the research involved 40 undergraduate students at Universitas Muhammadiyah Gresik. Quantitative data were collected through standardized speaking tests scored by both the Whisper system and expert human raters on fluency, pronunciation, and coherence. Qualitative insights came from classroom observations and in-depth interviews with students and lecturers, which explored user experiences and the contextual factors influencing system performance. The Whisper-based assessment achieved high inter-rater reliability with human experts (Cohen's kappa = 0.81; ICC = 0.87), and learners showed statistically significant pretest-to-posttest gains across all assessed dimensions, with the largest improvement in pronunciation. The system's immediate, actionable feedback fostered greater learner engagement and self-directed improvement. However, the study also identified critical contextual factors, such as technological infrastructure, digital literacy, and classroom environment, that shaped the system's effectiveness and reliability. These findings underscore the need for robust infrastructure, comprehensive teacher training, and equitable access to technology to maximize the benefits of AI-driven assessment. The research advances both theory and practice by validating a multidimensional, context-adaptive framework for AI-based speaking evaluation and by offering practical guidelines for integrating advanced ASR technology into EFL curricula. Its implications inform educators, policymakers, and technologists seeking scalable, objective, and equitable language-assessment solutions in Indonesia and comparable educational contexts.
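The reported Cohen's kappa (0.81) is a chance-corrected measure of agreement between the ASR system's ratings and the human raters'. As an illustration only (the data below are invented, not from the study), kappa for two raters' categorical scores can be computed as follows:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired ratings"
    n = len(rater_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical band scores (1-4) from an ASR system and a human rater
# for eight speech samples -- illustrative values only.
asr_scores   = [3, 2, 4, 3, 1, 2, 4, 3]
human_scores = [3, 2, 4, 2, 1, 2, 4, 3]
print(round(cohens_kappa(asr_scores, human_scores), 2))
```

Values above roughly 0.8, as in the study, are conventionally read as almost perfect agreement on the Landis and Koch scale; the ICC of 0.87 plays the analogous role for continuous or averaged scores.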