Abstract: Learning videos are increasingly used in the digital era, but they often contain irrelevant information that makes the content harder to comprehend. This study proposes an approach based on the Whisper and T5 models to generate text summaries from transcripts of YouTube educational videos. Whisper is used for speech-to-text transcription, and its model variants are compared to find one that offers both a low Word Error Rate (WER) and short transcription time. The T5 model is then fine-tuned to produce accurate summaries, with the transcript split into segments to work around the model's input length limit. Text preprocessing is omitted because the unprocessed transcripts yielded better evaluation scores. The results show that the combination of Whisper Turbo and the optimized T5 model performs best, with ROUGE F1-scores of 39.23 (ROUGE-1), 13.17 (ROUGE-2), and 23.84 (ROUGE-L). The approach generates more relevant and comprehensive text summaries, which can make video-based learning more effective. This research therefore contributes to the development of text summarization technology for learning videos.
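The two mechanisms named in the abstract, transcript segmentation and ROUGE F1 scoring, can be illustrated with a minimal sketch. It rests on assumptions that do not come from the study: the function names `segment_transcript` and `rouge1_f1` are hypothetical, the segment limit is counted in whitespace-separated words rather than T5 tokens, and the score is a simplified unigram overlap without stemming, so it will not reproduce the figures reported above.

```python
from collections import Counter


def segment_transcript(text: str, max_words: int = 300) -> list[str]:
    """Split a transcript into consecutive chunks of at most max_words words.

    Assumption: length is measured in whitespace-separated words. A real
    pipeline would more likely count tokens with the model's own tokenizer.
    """
    if max_words < 1:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, lowercased, no stemming."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

In this sketch each segment would be summarized separately and the partial summaries joined afterwards. How the study itself merges segment summaries is not stated in the abstract. The function returns a value between 0 and 1, whereas the scores in the abstract appear to be expressed on a 0 to 100 scale.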