When your CRM is filled with “Michael Smith” and “Mike Smith,” it isn’t just a database mess—it is a drain on your marketing ROI and a risk to your customer experience. This challenge, known as record linkage, is the process of identifying when different records actually represent the same entity.
In our research on record linkage, we examine the data structures used to de-duplicate records entering a system, whether in real time (“online”) or in batch processes (“offline”). The core question is the same in both cases: do two records, say two credit card transactions or two CRM entries, represent the same entity?
A primary method for managing these relationships is the disjoint-set data structure, which supports three main operations: Make-Set, Union, and Find-Set. Disjoint sets can be implemented as linked lists or as a forest of rooted trees, where each list or tree represents one entity. With the union-by-rank and path-compression heuristics, these structures can be maintained in near-linear time, avoiding O(n^2) pairwise lookups or inserts.
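As a minimal sketch of the forest-based approach, here is a union-find structure with both heuristics. The class and record names are illustrative, not from our production system:

```python
class DisjointSet:
    """Union-Find with path compression and union by rank."""

    def __init__(self):
        self.parent = {}  # node -> parent node (roots point to themselves)
        self.rank = {}    # root -> upper bound on tree height

    def make_set(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0

    def find_set(self, x):
        # First pass: locate the root of x's tree.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): repoint nodes directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find_set(a), self.find_set(b)
        if ra == rb:
            return
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1


# Merging duplicate CRM records once a similarity function has matched them:
ds = DisjointSet()
for rec in ["Michael Smith", "Mike Smith", "M. Smith"]:
    ds.make_set(rec)
ds.union("Michael Smith", "Mike Smith")
ds.union("Mike Smith", "M. Smith")
```

After the unions, all three records share one representative, so a single Find-Set call answers “are these the same entity?” in effectively constant amortized time.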
Self-organizing lists (using a Move-to-Front or LRU-style policy) are another approach, and they work well when duplicates are co-located in the input stream: for example, when a donor or contact appears several times with different addresses and few other rows in between. Using the potential method of amortized analysis, one can show that Move-to-Front stays within a constant factor of an “optimal” oracle algorithm that knows the full data sequence in advance. However, our real-world transaction data shows that most entities appear only 1–3 times even in large datasets, which limits the effectiveness of LRU-based caching.
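A minimal sketch of Move-to-Front de-duplication follows. The `same_entity` predicate is a hypothetical stand-in for a real similarity function; here it is simply case-insensitive name equality:

```python
def dedupe_mtf(records, same_entity):
    """Self-organizing list with Move-to-Front: recently matched entities
    stay near the head, so co-located duplicates are found quickly."""
    canonical = []  # representative records, most recently seen first
    resolved = []   # each input mapped to its canonical representative
    for rec in records:
        match_idx = None
        for i, rep in enumerate(canonical):
            if same_entity(rec, rep):
                match_idx = i
                break
        if match_idx is None:
            canonical.insert(0, rec)          # new entity goes to the front
            resolved.append(rec)
        else:
            rep = canonical.pop(match_idx)    # move matched entity to front
            canonical.insert(0, rep)
            resolved.append(rep)
    return resolved


# Hypothetical similarity: case-insensitive exact match on the full name.
records = ["Mike Smith", "mike smith", "Ann Lee", "MIKE SMITH"]
out = dedupe_mtf(records, lambda a, b: a.lower() == b.lower())
```

When duplicates cluster together, matches are found after scanning only a few list heads; when most entities appear once, as in our transaction data, the scan degrades toward a full pass per record.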
For most enterprise-scale datasets, where the majority of records are unique, a lookup table built on hashing or prefix strategies is more efficient. These strategies limit the comparisons per record to a small candidate set of size m, for a total runtime of O(n·m) where m << n. Our testing indicates that as batch sizes increase, the amortized per-record processing time for these lookup methods decreases; the lookup table also grows over time, however, so memory management needs attention.
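The idea can be sketched as hash-based blocking. The blocking key below (first three letters of the surname) and the surname-plus-initial similarity predicate are illustrative assumptions, not our production rules:

```python
from collections import defaultdict


def blocking_key(record):
    """Hypothetical blocking key: first 3 letters of the surname, lowercased.
    Only records that share a key are ever compared."""
    surname = record.split()[-1]
    return surname[:3].lower()


def dedupe_with_lookup(records, same_entity):
    table = defaultdict(list)  # blocking key -> canonical records (size ~ m)
    resolved = []
    for rec in records:
        bucket = table[blocking_key(rec)]
        match = next((rep for rep in bucket if same_entity(rec, rep)), None)
        if match is None:
            bucket.append(rec)     # first sighting becomes canonical
            resolved.append(rec)
        else:
            resolved.append(match)
    return resolved


# Hypothetical similarity: same surname and same first initial.
records = ["Michael Smith", "Mike Smith", "Ann Lee"]
out = dedupe_with_lookup(
    records, lambda a, b: a.split()[-1] == b.split()[-1] and a[0] == b[0]
)
```

Each record is compared only against its own bucket, which is the m in O(n·m); the trade-off is that `table` retains every canonical record seen so far, which is the memory-growth concern noted above.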
Beyond these traditional data structures, newer approaches are also emerging.
Ultimately, while disjoint sets provide a precise mathematical representation of merged records, the choice of structure depends heavily on your specific data recurrence patterns and the computational cost of your similarity functions.
All Rights Reserved. Copyright © 2021 Well & Lighthouse