This study developed a data-cleaning system for master data that combines the Sorted Neighborhood Method (SNM) with N-gram matching to detect and eliminate duplicate records and to standardize name and address formats. The SNM stage performs precleaning: it removes specified characters and titles and forms the tokens used for comparison. The N-gram stage then computes the similarity between records using a user-defined N-gram value and threshold. Effectiveness was evaluated with recall, precision, and F-measure on small and large datasets. The highest F-measure scores were obtained with a threshold of 0.7, a token length of 5, and an N-gram value of 2. These results indicate that the system was implemented successfully and improved data quality. The identified parameter values provide a benchmark for future data-cleaning efforts, potentially streamlining the process and reducing the resources required.
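As an illustration only, the pipeline summarized above could be sketched in Python as follows. This is a minimal sketch and not the study's implementation. Several details are assumptions: the title list, the reading of "token length" as the number of leading characters taken from each field to build the sort key, the window size, and the use of the Dice coefficient over character N-gram sets as the similarity score. All function names are hypothetical.

```python
import re

# Assumed list of titles to strip; the study's actual list is not given.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}

def preclean(text):
    """Lowercase, replace non-alphanumeric characters, and drop titles."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in TITLES)

def make_key(record, token_length=5):
    """Sort key: first `token_length` characters of each cleaned field (assumed scheme)."""
    return "".join(preclean(f).replace(" ", "")[:token_length] for f in record)

def ngrams(text, n=2):
    """Set of character n-grams of the text with spaces removed."""
    s = text.replace(" ", "")
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Dice coefficient over n-gram sets (an assumed similarity measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def find_duplicates(records, window=3, n=2, threshold=0.7, token_length=5):
    """Sorted-neighborhood pass: compare each record only with its window neighbors."""
    cleaned = [" ".join(preclean(f) for f in rec) for rec in records]
    order = sorted(range(len(records)),
                   key=lambda i: make_key(records[i], token_length))
    pairs = set()
    for pos, i in enumerate(order):
        # Each record is compared with the next (window - 1) records in sort order.
        for j in order[pos + 1: pos + window]:
            if ngram_similarity(cleaned[i], cleaned[j], n) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs

def precision_recall_f(predicted, actual):
    """Precision, recall, and F-measure for sets of predicted and true duplicate pairs."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Made-up example records (name, address); not data from the study.
records = [
    ("Dr. John Smith", "12 Main Street"),
    ("Jane Doe", "45 Oak Avenue"),
    ("John Smith", "12 Main Street."),
    ("Peter Brown", "9 Hill Road"),
]
found = find_duplicates(records)
scores = precision_recall_f(found, {(0, 2)})
```

The defaults mirror the reported best-performing values (threshold 0.7, token length 5, N-gram value 2). Sorting by key before comparing keeps the number of comparisons roughly proportional to the number of records times the window size, rather than comparing every pair.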
Copyrights © 2025