This study evaluates the performance of four artificial intelligence models—CodeT5, CodeBERT, StarCoder, and GPT-4 (simulated)—on the code summarization task, which involves generating summaries or documentation for simple Python code snippets. The dataset consists of Python comment–code pairs, processed into documentation–code format to support the summarization task. Evaluation was conducted using the BLEU and ROUGE-L metrics to measure the agreement between model-generated summaries and the original documentation. The results show that GPT-4 (simulated) performed best, with a BLEU score of 0.61 and a ROUGE-L score of 0.72, indicating superior context-understanding capability. Among the open-source models, CodeT5 achieved the highest performance (BLEU 0.42, ROUGE-L 0.55). CodeBERT produced intermediate scores, while StarCoder obtained the lowest scores because it is optimized for code completion rather than code summarization. This study concludes that model selection should be tailored to the application's requirements: CodeT5 is recommended for open-source automated documentation systems, offering a good balance between performance and accessibility, while GPT-4 can serve as a reference model for high-accuracy applications. This research contributes to the field of software engineering by highlighting the potential of AI models to improve the efficiency and automation of code documentation processes.
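To make the evaluation procedure concrete, the following is a minimal sketch of how the two metrics can be computed for a single reference–candidate pair. It uses only the Python standard library: ROUGE-L is the F1 score over the longest common subsequence of tokens, and BLEU is implemented here in a simplified sentence-level form (clipped n-gram precisions with a brevity penalty, no smoothing), which differs from the corpus-level BLEU typically reported in papers. The example strings are hypothetical, not from the study's dataset.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions, multiplied by a brevity penalty (no smoothing)."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:  # any zero precision -> BLEU of 0 without smoothing
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_avg)

def rouge_l(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence (LCS) of tokens."""
    ref, cand = reference.split(), candidate.split()
    m, k = len(ref), len(cand)
    dp = [[0] * (k + 1) for _ in range(m + 1)]  # LCS dynamic-programming table
    for i in range(m):
        for j in range(k):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == cand[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][k]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / k, lcs / m
    return 2 * prec * rec / (prec + rec)

# Hypothetical reference documentation vs. model-generated summary
ref = "return the sum of two integers"
hyp = "returns the sum of two numbers"
print(f"BLEU: {bleu(ref, hyp):.2f}, ROUGE-L: {rouge_l(ref, hyp):.2f}")
```

In practice, scores like those reported above would be averaged over all documentation–code pairs in the test set, and published results typically use standard implementations (e.g., sacrebleu or the rouge-score package) rather than a hand-rolled sketch like this one.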
Copyright © 2026