Video is an increasingly common medium for information and education on online platforms. However, long durations and unstructured delivery often prevent audiences from grasping the core message, which poses challenges for automatic summarization of monologues, interviews, and podcasts. Extractive methods often yield summaries that lack coherence, while abstractive methods may omit important details. To address both weaknesses, this study proposes a hybrid approach that combines extractive and abstractive techniques. In the extractive stage, sentences are represented with BERT embeddings and clustered using two methods, K-Means and agglomerative hierarchical clustering. The abstractive stage then uses the BART model to generate summaries that are more coherent and informative. Experiments on 20 videos about Human Metapneumovirus (HMPV) show the strongest performance on monologues, with ROUGE-1 of 57%, ROUGE-2 of 30%, and ROUGE-L of 32%. Although performance was lower on interviews and podcasts because of dynamic interactions and frequent speaker shifts, the hybrid approach consistently outperformed the extractive-only and abstractive-only baselines. These results highlight the effectiveness of the hybrid approach and its potential for more adaptive video summarization.
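As an illustration of the extractive stage described above, the following minimal sketch clusters sentence vectors with K-Means and keeps the sentence closest to each centroid. It is not the study's implementation: the function names, the deterministic initialization, and the toy 2-D vectors (standing in for BERT embeddings) are assumptions made for this example, and the BART rewriting step is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign(vectors, centroids):
    """Label each vector with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
        for v in vectors
    ]


def kmeans(vectors, k, iterations=20):
    """Plain K-Means. Assumption: the first k vectors seed the centroids,
    which keeps the example deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = mean_vector(members)
    return assign(vectors, centroids), centroids


def select_representatives(sentences, vectors, k):
    """Keep, per cluster, the sentence nearest its centroid, in source order."""
    labels, centroids = kmeans(vectors, k)
    chosen = set()
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            chosen.add(min(members, key=lambda i: squared_distance(vectors[i], centroids[c])))
    return [sentences[i] for i in sorted(chosen)]


# Toy data: placeholder sentences and hand-made 2-D vectors (hypothetical,
# used here instead of real BERT embeddings of a transcript).
sentences = [
    "Topic A, opening remark.",
    "Topic A, side comment.",
    "Topic A, main point.",
    "Topic B, opening remark.",
    "Topic B, side comment.",
    "Topic B, main point.",
]
vectors = [
    [0.0, 0.0], [1.0, 0.0], [0.4, 0.0],
    [10.0, 10.0], [10.0, 11.0], [10.0, 10.4],
]

extracted = select_representatives(sentences, vectors, k=2)
print(extracted)
```

In a hybrid pipeline of the kind the abstract describes, the selected sentences would then be passed to an abstractive model such as BART; agglomerative clustering could be swapped in for `kmeans` by changing only how the labels are produced.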
Copyrights © 2025