Course:CPSC522/Record Linkage and identity uncertainty

From UBC Wiki
Jump to navigation Jump to search

Record Linkage and Identity Uncertainty

Paper referred:

  • Jin, L., Li, C., & Mehrotra, S. (2003). Efficient record linkage in large data sets[1]. Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings.
  • Bhattacharya, I., & Getoor, L. (2004). Iterative record linkage for cleaning and integration[2]. Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery - DMKD '04.

Principal Author: Dandan Wang


Integrating data from multiple sources(e.g. websites, databases etc.) is a challenging task. Nowadays, there exist numerous different algorithms for record linkage or de-duplication, it is still impossible to achieve perfect accuracy and high efficiency - large time needs or restrictions on the number of datasets that they can integrate are still serious problems. In this page, I will introduce some basic concepts about Identity Uncertainty and compares the basic record linkage algorithm from Efficient Record Linkage in Large Data Sets[1] and a more advanced and efficient algorithms from Iterative Record Linkage for Cleaning and Integration [2] which handle any number of datasets and outperforms previous algorithms.

Builds on

Identity Uncertainty refers to the task of determining if two different descriptions refer to the same object(e.g. person, song etc.). It is a crucial and challenging problem that arises in areas like Citation Matching, Record Linkage and De-duplication in Databases.

Identity Uncertainty


Images from two surveillance cameras roughly two miles apart on Highway 99 in Sacramento, California. The left image is from the upstream camera, and the right image is from the downstream camera. Are the two boxed vehicles the same?

Identity Uncertainty is a pervasive problem in data management and analysis. It arises whenever and wherever objects are not labeled by unique identifiers or when those identifiers may not be perceived perfectly. Consequently, two observations may or may not correspond to the same object.

For example, images of vehicles that recorded by highway video cameras, how can the system figure out duplicate captures? The IP address of a personal computer is logged at multiple websites, how can the systems figure out duplicate visit records; and, a patient’s medical record might be recorded in different hospitals' databases, can all those duplicate records be merged?

It is important and necessary to do record linkage to reduce duplications. Too much duplicate data would affect the efficiency of the record processing progress and the accuracy of data usage and analysis. (e.g. Nobody would be happy that if one is charge twice for over speed limit on highway just because that it is captured twice, and the government would use more time to detect duplication when charge, which needs more time than the fine worth.)

Possible Solutions

A patient's medical records are differently recorded by different hospitals.

This problem plays a key role in Data integration, natural language understanding, information processing on the World-wide Web, and on the emerging Semantic Web. Taking the key attributes that may represent unique identities into consideration, and avoiding other noise factors that to some extend would affect the result or lead to unnecessary waste of resources(e.g. time, cost etc.) are the basic rules that we need to obey in this task.

Record Linkage

In our data driven society, there is an ever-increasing demand for the incorporation of new technologies to gather data on people for a variety of worthwhile endeavors. Concurrently, data collections have become commodities that can be shared, licensed, or sold for profit in many different communities. This is possible because both data subjects and data collectors consider these fragments innocuous. They often harbor the belief that their pieces of data are isolated and no one could systematically relate identity to them without a central registry to query.

How data distributed across databases came

An entity's information can be distributed across databases for a number of reasons.

  • collection at different time
  • collection on different sets of attributes
  • collection at different locations

In the first two data distribution types (i.e. different time and different attributes), linkage methods must be designed to account for incomplete, distorted, or possibly corrupt entity-specific information. The latter (i.e. different locations) can manifest in the form of typographical errors, such as when the name "John" is represented by "Jon". In addition, such methods need to be wary of ambiguous values, as well as variations of an entity's information. For example, the name "Jonathan Doe" and "Jon Doe" could refer to the same entity.

Methods of Record Linkage

Modern computer-based record linkage methods fall into two categories: deterministic linkage and probabilistic linkage.

  • Deterministic linkage identifies matches between records via comparing the unique identifiers or using an exact comparison between fields (i.e. shared fields). Widely, deterministic linkage may also employ almost-exact comparisons based on predetermined rules defined by users. Deterministic linkage is not the best option if your data has quality issues (e.g. typos, missing data ), as all but the most trivial datasets do.

  • Probabilistic linkage calculates a score for two records to predict how likely both refer to the same entity. The total score is the sum of scores generated by the comparison of individually weighted fields. Based on the score, record pairs will fall above or below predetermined thresholds and will be identified as matches, non-matches, or uncertain matches which should also get closer scrutiny before determining whether they should be considered as a match or not.

Paper 1: Distance Metrics of Attributes

Paper 2: Iterative Record Linkage


There are many different solutions of record linkage. But there is not a best solution, because different data structure requires different methodologies, it depends. For example, the first proposed theory is a very basic and widely applied one in any cases, but it can not be the only one, which is auxiliary.

The second proposed one is to based on the first theory but rely on a iterative or sequential relationship between each proper clusters, which is more powerful and practical that performs better than attribute-based clustering. And to most of the cases in reality, it is more applicable than the first proposed solution itself. From the test data we can see that the second proposed solution is able to discover duplicates that are not easy to identify with fixed threshold attribute similarity approaches. It should be noted that their algorithm is not a substitute for attribute similarity algorithms. It presents an iterative framework for combining attribute distances and group distances. So the second proposed algorithm here should produce better results with improved attribute and group distance measures in this case.

Not only the proposed two solutions, there are many other choices as mentioned previously. Sequential and parallel algorithms for record linkage to merge data across data sources not matter how many the sources are; Learning learning data identity via trail solves problem of online IP and unique user matching problem. There is not a perfect solution or algorithm that solves any problem, but after analyzing the issue and the data structure, you will find the proper one.

Annotated Bibliography

  1. 1.0 1.1 Liang Jin, Chen Li, and Sharad Mehrotra "Efficient Record Linkage in Large Data Sets"
  2. 2.0 2.1 Indrajit Bhattacharya, Lise Getoor "Iterative Record Linkage for Cleaning and Integration"

To Add

Put links and content here to be added. This does not need to be organized, and will not be graded as part of the page. If you find something that might be useful for a page, feel free to put it here.

Some rights reserved
Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0