Personal data leakage is an increasingly serious problem, especially when the leaked data has been partially modified to evade direct matching against the original source. This study develops a fuzzy matching approach based on an algorithmic mapping of each attribute (field-algorithm pairing) together with a relevance-based weighting scheme, to support many-to-one matching between leaked records and the original database. Four algorithms are used: Levenshtein, Jaro-Winkler, Token Sort Ratio, and Cosine Similarity, each selected according to the semantic characteristics of the attribute it scores. Experiments were conducted on 10,000 synthetic records under several modification scenarios: clean data, light modification, and heavy modification. Results show high performance on clean and lightly modified data (F1-score 0.90–1.00) but a significant drop under heavy modification (F1-score 0.10–0.45). The approach offers a lightweight yet effective solution for the early stages of identity verification in data leak investigations and opens opportunities for further development through algorithm combination and adaptive adjustment of matching thresholds.
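As a rough illustration of the field-algorithm pairing and relevance weighting described above, the sketch below scores a leaked record against one candidate original record. The specific field names, pairings, weights, and the 0.8 decision threshold are illustrative assumptions, not values taken from this study; the Levenshtein, Jaro-Winkler, and Token Sort Ratio functions come from the rapidfuzz library, and the cosine similarity is one simple word-token variant.

```python
# Illustrative sketch of per-field fuzzy matching with relevance weights.
# Field names, algorithm pairings, weights, and the threshold are assumed
# for demonstration; they are not this study's exact configuration.
from collections import Counter
import math

from rapidfuzz import fuzz
from rapidfuzz.distance import Levenshtein, JaroWinkler


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over word-token counts (one simple variant)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# Hypothetical field -> (similarity function, relevance weight) pairing,
# chosen by the semantic character of each attribute.
FIELD_CONFIG = {
    "email":   (Levenshtein.normalized_similarity,                0.35),
    "name":    (JaroWinkler.similarity,                           0.30),
    "address": (lambda a, b: fuzz.token_sort_ratio(a, b) / 100.0, 0.20),
    "bio":     (cosine_similarity,                                0.15),
}

MATCH_THRESHOLD = 0.8  # assumed cutoff for declaring a match


def record_score(leaked: dict, original: dict) -> float:
    """Weighted average of per-field similarities, in [0, 1]."""
    total, weight_sum = 0.0, 0.0
    for field, (sim_fn, weight) in FIELD_CONFIG.items():
        total += weight * sim_fn(leaked.get(field, ""), original.get(field, ""))
        weight_sum += weight
    return total / weight_sum


if __name__ == "__main__":
    leaked = {"email": "jdoe@exampel.com", "name": "Jon Doe",
              "address": "Oak St 12, Springfield", "bio": "software engineer"}
    original = {"email": "jdoe@example.com", "name": "John Doe",
                "address": "12 Oak St, Springfield", "bio": "software engineer"}
    score = record_score(leaked, original)
    print(f"score={score:.3f} match={score >= MATCH_THRESHOLD}")
```

In the many-to-one setting the abstract describes, this record score would be computed against every candidate in the original database and the best-scoring candidate above the threshold retained as the match.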