Data leakage in machine learning classification often leads to overfitting, inflated performance estimates, and poor reproducibility, undermining the reliability of deployed models and causing costly failures in industrial practice. This paper addresses the leakage problem by proposing an integrated machine learning pipeline that strictly isolates training and evaluation across the preprocessing, feature transformation, and model optimization stages. Experiments are conducted on the Titanic passenger survival dataset: exploratory data analysis identifies data quality issues, followed by stratified train-test splitting to preserve the class distribution. All preprocessing steps, including missing-value imputation, categorical encoding, and feature scaling, are fitted exclusively on the training data using a ColumnTransformer embedded within a unified Pipeline. A K-Nearest Neighbors (KNN) classifier is employed, with hyperparameters optimized via GridSearchCV and 3-fold cross-validation. Experimental results show that a baseline model without leakage control achieves only 72.62% test accuracy and exhibits a substantial overfitting gap. In contrast, the proposed pipeline-based approach improves generalization, achieving 78.21% test accuracy with an optimal configuration of k = 29 and Manhattan distance while markedly reducing overfitting. The main contribution of this work is a reproducible, leakage-aware pipeline guideline that ensures unbiased evaluation and reliable generalization in classification tasks, offering practical methodological insights for both academic research and real-world machine learning applications.
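The leakage-aware design described above can be sketched in scikit-learn as follows. This is a minimal illustration, not the paper's exact implementation: the synthetic DataFrame, column names, and parameter grid are assumptions standing in for the Titanic features, while the structural elements (stratified split, ColumnTransformer inside a unified Pipeline, GridSearchCV with 3-fold cross-validation, Manhattan distance among the candidate metrics) follow the abstract.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for Titanic-style features (illustrative assumption,
# not the real dataset): numeric columns with missing values plus
# categorical columns.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": np.where(rng.random(n) < 0.2, np.nan, rng.normal(30, 10, n)),
    "fare": rng.exponential(30.0, n),
    "sex": rng.choice(["male", "female"], n),
    "embarked": rng.choice(["S", "C", "Q"], n),
})
y = (rng.random(n) < 0.4).astype(int)

# Stratified split preserves the class distribution in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

numeric = ["age", "fare"]
categorical = ["sex", "embarked"]

# Imputation, encoding, and scaling live inside the Pipeline, so they are
# fitted only on the training portion of each cross-validation fold --
# the key mechanism that prevents leakage into evaluation data.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

pipe = Pipeline([("prep", preprocess), ("knn", KNeighborsClassifier())])

# Hyperparameter search with 3-fold CV; the grid values are examples.
grid = GridSearchCV(
    pipe,
    param_grid={
        "knn__n_neighbors": [5, 15, 29],
        "knn__metric": ["manhattan", "euclidean"],
    },
    cv=3,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 4))
```

Because the ColumnTransformer is a step of the searched Pipeline, GridSearchCV refits all preprocessing on each training fold, so no statistic (medians, scaling parameters, category frequencies) is ever computed from held-out data.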
Copyright © 2026