The rapid increase in online meetings has produced massive amounts of undocumented spoken content, creating a practical need for automatic summarization. For Indonesian, this task is hindered by a dual-faceted resource scarcity and a lack of foundational benchmarks for pipeline components. This paper addresses that gap by creating a new synthetic conversational dataset for Indonesian and conducting two systematic, independent benchmarks to identify the optimal components for an end-to-end pipeline. First, we evaluated six Whisper ASR model variants (from tiny to turbo) and found a clear, non-obvious winner: the turbo (distil-large-v2) model was not only the most accurate (7.97% WER) but also among the fastest (1.25 s inference), breaking the expected cost-accuracy trade-off. Second, we benchmarked 13 zero-shot summarization models on gold-standard transcripts, which revealed a critical divergence between lexical and semantic performance: Indonesian-specific models excelled at lexical overlap (ROUGE-1: 17.09 for cahya/t5-base...), while the multilingual google/long-t5-tglobal-base model was the clear semantic winner (BERTScore F1: 67.09).
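The WER figures reported above follow the standard definition: the word-level edit distance (substitutions + deletions + insertions) between the ASR hypothesis and the reference transcript, divided by the reference length. As a minimal illustrative sketch (not the paper's evaluation code, which would typically use an established library such as jiwer or Hugging Face's evaluate):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over word tokens via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)


# One substitution and one deletion against a 4-word reference: WER = 2/4 = 0.5
print(wer("a b c d", "a x c"))  # → 0.5
```

A reported value such as 7.97% WER corresponds to roughly 8 word errors per 100 reference words averaged over the test set.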
Copyright © 2026