Rapid advances in Artificial Intelligence and Machine Learning have made datasets essential, yet their growing use has raised complex ethical problems that are frequently overlooked. This study identifies and maps the ethical concerns associated with collecting primary data and reusing secondary datasets in computer science research. We conducted a Systematic Literature Review (SLR) in accordance with the PRISMA 2020 guidelines, examining 72 publications published between 2021 and 2025 and sourced from five academic databases (Scopus, Web of Science, IEEE Xplore, ACM Digital Library, and Google Scholar). The results indicate that ethical problems arise consistently in both primary and secondary datasets, although their nature differs. Primary datasets mainly face challenges related to privacy threats, anonymization, and informed consent, whereas secondary datasets are more susceptible to licensing violations, dataset repurposing, and insufficient transparency about data preparation. The three domains most affected by these challenges were Machine Learning, Computer Vision, and Natural Language Processing. Moreover, data manipulation practices, including cherry-picking and undisclosed data preparation, were identified as detrimental to scientific integrity. These findings underscore the need for stronger ethical standards for datasets and greater transparency in documenting data preparation to ensure the repeatability of data-driven research.
Copyrights © 2026