Journal : Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control

The Evolution of Image Captioning Models: Trends, Techniques, and Future Challenges Bastian, Ade; Wahid, Abrar; Hafsari, Zacky; Mardiana, Ardi
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control Vol. 10, No. 4, November 2025
Publisher : Universitas Muhammadiyah Malang

DOI: 10.22219/kinetik.v10i4.2305

Abstract

This study presents a systematic literature review (SLR) of the evolution of image captioning models from 2017 to 2025, with particular emphasis on open problems, methodological improvements, and significant architectural developments. Motivated by the growing demand for precise, context-aware image descriptions, the review follows the PRISMA methodology and selects 36 relevant papers from reputable scientific databases. The results indicate a marked shift from traditional CNN-RNN models to Transformer-based architectures, which has improved semantic coherence and contextual understanding. Recent methods such as prompt engineering and GAN-based augmentation have further improved generalization and diversity, while multimodal fusion approaches that incorporate attention mechanisms and knowledge integration have improved caption quality. Significant areas of concern include data bias, fairness in model evaluation, and support for low-resource languages. The study highlights that modern vision-language models, such as Flamingo, GIT, and LLaVA, offer robust domain generalization through cross-modal learning and joint embedding. In addition, advances in pretraining procedures and lightweight models have improved computational efficiency in resource-constrained environments. The study contributes by delineating research trends, analyzing technical trade-offs, and identifying future prospects, particularly in sectors such as healthcare, construction, and inclusive AI. The results suggest that, to be effective in real-world applications, future image captioning models must prioritize resource efficiency, fairness, and multilingual capabilities.

Maleo Emotion Audio Dataset Indonesia For Emotion Classification Mardiana, Ardi; Permana, Sri Mentari Widya Ningrum; Sopiandi, Ii; Bastian, Ade; Irawan, Eka Tresna
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control Vol. 11, No. 2, May 2026 (Article in Progress)
Publisher : Universitas Muhammadiyah Malang

DOI: 10.22219/kinetik.v11i2.2474

Abstract

The limited availability of Indonesian speech emotion datasets hinders the development of Speech Emotion Recognition (SER) systems, even though demand for such systems continues to grow in sectors such as customer service, education, and human-computer interaction. To address this gap, this study developed the Maleo Emotion Audio Dataset, a collection of three-second audio clips labeled with seven emotion categories: angry, neutral, disgusted, sad, happy, afraid, and surprised. The data was collected from YouTube, and the dataset is available at https://huggingface.co/datasets/maleo-ai/maleo-emotion. The clips were processed through preprocessing, feature extraction, and augmentation stages. Five main features were extracted: Zero Crossing Rate, energy, Mel-Frequency Cepstral Coefficients (MFCC), spectral roll-off, and spectral flux. To improve generalization, augmentation techniques such as pitch shifting, noise injection, and time stretching were applied. The classification model is a Convolutional Neural Network (CNN) implemented in TensorFlow. In evaluation, the model achieved 94.48% accuracy on the test data, with balanced performance across all emotion categories. These results indicate that the dataset and model architecture can effectively recognize emotions from Indonesian speech in a locally relevant context.
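
As a rough illustration of four of the features named in the abstract (Zero Crossing Rate, energy, spectral roll-off, spectral flux) and of noise injection, the following is a minimal standard-library sketch. It is not the authors' pipeline: the function names, the 0.85 roll-off fraction, the frame length, the synthetic test tone, and the 20 dB signal-to-noise ratio are all assumptions chosen for illustration. MFCC, pitch shifting, and time stretching are omitted because they would normally rely on an audio library.

```python
import cmath
import math
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def magnitude_spectrum(frame):
    """Naive O(n^2) DFT magnitudes for bins 0..n/2 (fine for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_rolloff(mags, sample_rate, n, fraction=0.85):
    """Lowest bin frequency below which `fraction` of total magnitude lies."""
    threshold = fraction * sum(mags)
    running = 0.0
    for k, m in enumerate(mags):
        running += m
        if running >= threshold:
            return k * sample_rate / n
    return (len(mags) - 1) * sample_rate / n


def spectral_flux(prev_mags, mags):
    """Euclidean distance between two consecutive magnitude spectra."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(prev_mags, mags)))


def add_noise(frame, snr_db, rng):
    """Noise injection: add white Gaussian noise at a target SNR in dB."""
    noise_power = energy(frame) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in frame]


# Hypothetical input: one 64-sample frame of a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE, FRAME_LEN = 8000, 64
tone = [
    math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE + math.pi / 8)
    for t in range(FRAME_LEN)
]
noisy = add_noise(tone, snr_db=20.0, rng=random.Random(0))
features = {
    "zcr": zero_crossing_rate(tone),
    "energy": energy(tone),
    "rolloff_hz": spectral_rolloff(
        magnitude_spectrum(tone), SAMPLE_RATE, FRAME_LEN
    ),
    "flux_clean_vs_noisy": spectral_flux(
        magnitude_spectrum(tone), magnitude_spectrum(noisy)
    ),
}
```

In a real pipeline these per-frame values would typically be computed over sliding windows of each three-second clip and stacked, together with MFCCs, into the input for the classifier. The naive DFT here is only practical for short frames; an FFT routine would replace it for full clips.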