This study compares extractive and generative approaches to automatic summarization of Indonesian meeting minutes. Our main scientific contribution is empirical evidence that, under strict zero-shot conditions and without domain adaptation, simple extractive baselines are more reliable than off-the-shelf generative models at preserving both decision content and meeting-context cues (actors/roles). We evaluate three extractive baselines (Lead-3, Random-Extract, TextRank-Simple) against an Indonesian GPT-2 model tested under multiple decoding configurations and an mT5 sequence-to-sequence model in a zero-shot setting. Experiments use 30 manually curated meeting minutes. The dataset size is intentionally limited because meeting minutes are heterogeneous and require carefully constructed reference summaries to ensure evaluation validity; the study is therefore positioned as a controlled diagnostic comparison rather than a training or adaptation effort. Performance is measured using ROUGE-1/2/L, summary-to-reference length ratios, simple audits of gender and professional-role mentions, correlations between decoding parameters and ROUGE, and paired t-tests. Results show that extractive methods achieve higher and more stable ROUGE scores than zero-shot generative models: TextRank-Simple and Random-Extract perform best, all GPT-2 configurations remain substantially lower, and zero-shot mT5 fails to align with the references. Decoding parameters exhibit only weak correlations with generative performance, and paired t-tests confirm that the differences between the extractive and generative systems are statistically significant (p < 0.05). Overall, extractive approaches remain the most dependable choice in the absence of in-domain fine-tuning, while generative models are better suited to settings that allow adaptation or hybrid strategies.
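
For concreteness, the sketch below shows one plausible minimal implementation of the three extractive baselines named above. The paper does not publish code, so the sentence splitter, the word-overlap similarity, and all function names here are illustrative assumptions rather than the authors' actual implementation.

```python
# Minimal sketch of the three extractive baselines (Lead-3, Random-Extract,
# TextRank-Simple). Illustrative only: splitter, similarity, and names
# are assumptions, not the paper's published code.
import random
import re

import numpy as np


def split_sentences(text: str) -> list[str]:
    # Naive split on terminal punctuation; a real system would use an
    # Indonesian-aware sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def lead3(text: str) -> str:
    # Lead-3: take the first three sentences verbatim.
    return " ".join(split_sentences(text)[:3])


def random_extract(text: str, k: int = 3, seed: int = 0) -> str:
    # Random-Extract: sample k sentences, kept in document order.
    sents = split_sentences(text)
    idx = sorted(random.Random(seed).sample(range(len(sents)), min(k, len(sents))))
    return " ".join(sents[i] for i in idx)


def textrank_simple(text: str, k: int = 3, d: float = 0.85, iters: int = 50) -> str:
    # TextRank-Simple: rank sentences by PageRank over a word-overlap
    # similarity graph, then return the top-k in document order.
    sents = split_sentences(text)
    tokens = [set(s.lower().split()) for s in sents]
    n = len(sents)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and tokens[i] and tokens[j]:
                # Overlap normalized by log sentence lengths; the +1 guards
                # against log(1) = 0 for one-word sentences.
                denom = np.log(len(tokens[i]) + 1) + np.log(len(tokens[j]) + 1)
                sim[i, j] = len(tokens[i] & tokens[j]) / denom
    # Row-normalize the graph and run PageRank as a power iteration.
    row_sums = sim.sum(axis=1, keepdims=True)
    trans = np.divide(sim, row_sums, out=np.zeros_like(sim), where=row_sums > 0)
    scores = np.ones(n) / n
    for _ in range(iters):
        scores = (1 - d) / n + d * trans.T @ scores
    top = sorted(np.argsort(scores)[-k:])
    return " ".join(sents[i] for i in top)
```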
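The evaluation pipeline can likewise be sketched in a few lines. The abstract names the metrics (ROUGE-1/2/L, length ratios, paired t-tests) but not the tooling, so the use of the `rouge-score` and `scipy` packages here is an assumption about one reasonable way to reproduce the setup.

```python
# Minimal sketch of the evaluation described in the abstract: ROUGE-1/2/L,
# summary-to-reference length ratio, and a paired t-test. The choice of
# libraries is an assumption; the paper's exact tooling is not stated.
from rouge_score import rouge_scorer  # pip install rouge-score
from scipy.stats import ttest_rel

# Note: rouge_score's default tokenizer keeps alphanumeric tokens, which
# works reasonably for Indonesian but was designed with English in mind.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)


def evaluate(summary: str, reference: str) -> dict:
    scores = scorer.score(reference, summary)
    return {
        "rouge1": scores["rouge1"].fmeasure,
        "rouge2": scores["rouge2"].fmeasure,
        "rougeL": scores["rougeL"].fmeasure,
        # Length ratio: system-summary tokens over reference tokens.
        "len_ratio": len(summary.split()) / max(len(reference.split()), 1),
    }


def paired_test(scores_a: list[float], scores_b: list[float]) -> float:
    # Paired t-test over per-document ROUGE scores for two systems run
    # on the same 30 minutes; returns the two-sided p-value.
    return ttest_rel(scores_a, scores_b).pvalue
```

Pairing the test on per-document scores, rather than comparing corpus means, matches the abstract's claim that differences between systems evaluated on the same 30 documents are significant at p < 0.05.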