Generating effective user stories is essential yet time-consuming in software development, especially in large-scale Agile projects. This study evaluates the performance of three Large Language Models (LLMs) in automatically generating user stories: ChatGPT-4.0, DeepSeek, and Gemini 2.5. The objective is to compare their accuracy and precision to determine the most suitable model for automating requirements documentation. Each model generated user stories from seven test prompts spanning various industry domains, and the outputs were evaluated with the BLEU-4, ROUGE-L F1, and METEOR metrics. Results show that while all models produced structurally valid outputs, Gemini 2.5 achieved the highest average score (0.386), surpassing DeepSeek (0.355) and ChatGPT-4.0 (0.348). Gemini 2.5 demonstrated superior consistency, clarity, and semantic completeness. This research contributes a performance benchmark for LLMs in software requirement generation and highlights the practical benefits of LLM-based automation over manual methods, including speed, consistency, and adaptability. Gemini 2.5 is recommended as the optimal model for generating user stories in software engineering contexts.
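To illustrate the kind of scoring used above, the BLEU-4 metric can be sketched in pure Python. This is a minimal sketch, not the paper's actual evaluation pipeline: the `bleu4` function name, whitespace tokenization, and add-one smoothing are assumptions for illustration (standard toolkits such as NLTK or sacrebleu offer more refined smoothing options).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 sketch: geometric mean of modified
    1- to 4-gram precisions, times a brevity penalty.
    Uses naive whitespace tokenization and add-one smoothing
    (assumptions; not the paper's exact setup)."""
    cand = candidate.split()
    ref = reference.split()
    log_prec_sum = 0.0
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped overlap: each candidate n-gram counts at most as
        # often as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # add-one smoothing so one empty n-gram order does not zero the score
        log_prec_sum += math.log((overlap + 1) / (total + 1))
    # brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / 4)
```

A candidate identical to its reference scores 1.0, while unrelated text scores near 0; ROUGE-L F1 and METEOR complement this surface-overlap view with longest-common-subsequence and synonym-aware matching, respectively.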
Copyright © 2025