JOIN (Jurnal Online Informatika)
Vol 11 No 1 (2026)

Maleo-Short: An "In-the-Wild" Indonesian Dataset for Speaker Diarization

Mardiana, Ardi
Muslimah, Dinda Desmonda
Bastian, Ade
Irawan, Eka Tresna



Article Info

Publish Date
24 Apr 2026

Abstract

Speaker diarization (SD), the task of partitioning an audio stream into speaker-homogeneous segments, is fundamental for analyzing multi-speaker recordings. Applying it to “in-the-wild” data, such as content from YouTube, remains an active research area because of overlapped speech, ambient noise, and rapid speaker turns. While numerous SD datasets are available, they predominantly cover English and other high-resource languages. Publicly accessible SD datasets for Indonesian are scarce, as existing corpora are primarily designed for Automatic Speech Recognition (ASR). To address this gap, this research introduces Maleo-Short, a new Indonesian multi-speaker dataset derived from YouTube. The dataset comprises 110 short conversational clips with a total duration of 1 hour 32 minutes. Ground truth was established through manual annotation in ELAN, producing speaker segmentation and transcription files. To validate its utility and assess its complexity, the dataset was evaluated using pre-trained baseline models. The results confirm that it is a challenging benchmark: the best-performing models achieved a Diarization Error Rate (DER) of 32.64% and a Word Error Rate (WER) of 33.78%. Maleo-Short is presented as a publicly accessible resource intended to advance Indonesian speaker diarization research by supporting the development and rigorous evaluation of SD systems on acoustically complex, realistic conversational data. Maleo-Short is available at https://doi.org/10.57967/hf/7944.
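For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how WER and DER are conventionally defined. It is not the authors' evaluation code: the abstract does not state which scoring tool or settings (for example, a forgiveness collar or text normalization) were used. The function names and the toy inputs are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Computed as word-level Levenshtein distance over whitespace tokens.
    Assumption: no text normalization (casing, punctuation) is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in seconds. Assumption: the three error
    durations have already been measured by aligning system output
    against the reference annotation.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


if __name__ == "__main__":
    # Toy inputs, not taken from Maleo-Short.
    print(word_error_rate("saya pergi ke pasar", "saya pergi pasar"))
    print(diarization_error_rate(2.0, 1.0, 3.0, 20.0))
```

Because both metrics divide by a reference quantity rather than by the system output, either one can exceed 100% when a system produces many insertions or false alarms.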

Copyrights © 2026






Journal Info

Abbrev

join

Publisher

Subject

Computer Science & IT

Description

JOIN (Jurnal Online Informatika) is a scientific journal published by the Department of Informatics UIN Sunan Gunung Djati Bandung. This journal contains scientific papers from Academics, Researchers, and Practitioners about research on informatics. JOIN (Jurnal Online Informatika) is published ...