Course:CPSC522/Identity Uncertainty in a restaurant dataset
Contents
Identity uncertainty in a restaurant dataset
In this page we will see if using dependencies among attributes in a restaurant dataset helps to solve the problem of "identity uncertainty".
Principal Author: Bahare Fatemi
Abstract
Deciding if two descriptions refer to the same object or not is the role of identity uncertainty. In this page, I have tried to solve the problem of identity uncertainty for a restaurant dataset. A probabilistic model has been proposed, which captures the dependencies among attributes. The probabilistic model consists of two independent Bayesian networks: one for hypothesis of descriptions refer to the same object and one for hypothesis of descriptions do not refer to the same object. The parameters of the networks have been set using prior knowledge or learned by using a part of the dataset as a train set. The accuracy of the proposed model has been computed and compared with a model with less dependencies. At the end, advantages and disadvantages of the proposed method are discussed.
Builds on
This page solves the problem of identity uncertainty using the dependencies among attributes (similar to this link) and models the structure of the problem as a Bayesian network.
Content
Background
Identity uncertainty is the problem of determining whether two descriptions refer to the same object or not. It is an important real world problem which can be applied to many different fields. As an example consider a dataset where may contain duplicate records. Deciding whether two records of this dataset correspond to the same object or not is the role of identity uncertainty. Identity uncertainty is a difficult task in real world data. Assume you have to decide whether two records about two customers having the same first and last name but different addresses refer to two different people. These two records may refer to the same person that has moved to a new address, or to two different people who happened to have the same first and last name.
Hypothesis
Each dataset has different attributes for each record. For example, in a dataset about people, attributes can be first name, last name, age, gender, and address of a person. These attributes can be dependent or independent of each other. Considering the assumption of these attributes being dependent or independent of each other changes our belief about the problem. Therefore, these assumptions changed the model of the problem. The hypothesis of this page is whether considering the dependencies among the attributes of records helps in solving the problem of identity uncertainty for a restaurant dataset.
Dataset
The hypothesis is applied to a restaurant dataset which is a collection of 864 restaurant records from the Fodor's and Zagat's restaurant guides that contains 112 duplicates. Attributes available for each record in this dataset are name, address, city, and type of the restaurants. This data has been provided by Dr. Sheila Tejada.
This dataset contains errors in different attributes. Errors may be typing errors, copying errors, and so on. Some of the restaurants may move to a new place. Moving causes two records to have the same name but different addresses. Even some types are subtypes of other types. For example, one restaurant can have the type of "Chinese" while the type is filled as "Asian". "Chinese" is a subtype of "Asian" so it is not an error, but it makes the problem of identity uncertainty more difficult.
Modeling the problem
In the identity uncertainty problem, any two descriptions X and Y may or may not refer to the same object. Suppose and denote the attribute values for restaurant X and restaurant Y. Let be the posterior probability that restaurant X and restaurant Y refer to the same object and be the posterior probability that descriptions and refer to different objects given their attribute values. The odds for hypothesis and is as below:
In the above equation is the prior odds and is the likelihood ratio. Therefore, for calculating the odds ratio, the prior ratio can be calculated based on the trainset or by using prior knowledge and only the likelihood ratio should be calculated. One approach for calculating the likelihood ratio is to model two separate Bayesian networks: one for hypothesis of and one for hypothesis .
Structure learning for the X = Y hypothesis
For the X = Y hypothesis we should focus on the records that are different but refer to the same restaurant. In order to model the world of this problem, different hidden variables have been considered. One important hidden variable in the world of this problem is the errors. Errors can cause the same restaurant have two different descriptions. Consider these two records:
1) philippe's the original, 1001 n. alameda st., los angeles, american.
2) philippe the original, 1001 n. alameda st., chinatown, cafeterias.
These two records are referring to the same restaurant, but there is a typo in one of these restaurants names. So it is necessary to model the possible errors in this dataset.
As it was mentioned above, there is a probability of moving that causes the same restaurant to have two different addresses in two different descriptions.
In this Bayesian network we can see the dependencies among attributes.
In the Bayesian network above, nameX, addressX, cityX, and typeX are attributes of the record X and nameY, addressY, cityY, and typeY are attributes of record Y. sloppyX and sloppyY are two random variables indicating the probability of the typist of record X and record Y being sloppy respectively. EnameX and EnameY represent the probability of name attributes of records X and Y having errors. EaddressX, EaddressY, EcityX, EcityY, EtypeX, and EtypeY denote to different attributes of records X and Y having errors. The typist being sloppy is a parent of all different kinds of errors. Aname represent the actual first name with domain of all possible names of restaurant. nameX and nameY is dependent to what the actual name of the restaurant is and whether there is error in name fields or not. Aaddress, Aaddress, Acity, and Atype represent the actual address, actual city, and actual type respectively.
By inferencing in this Bayesian network, can be calculated.
Structure learning for the hypothesis
For the hypothesis, we should focus on the records which seemingly refer to the same restaurant, but indicate two different restaurants. Consider two different restaurants: they rarely have the same name, and the assumption of them being different decreases the probability of them having the same type or city. Assuming , each attribute of record X depends on the same attribute in record Y. The network below shows the Bayesian network for the hypothesis.
By inference in the above Bayesian network, can be calculated.
Parameter learning
After modeling the structure, the parameters should be learned. The parameters of these networks are the values in the conditional probability tables (CPTs). For learning the CPTs, the restaurant dataset has been split to 85% train set and 15% testset. Then, the CPTs were learned based on the train set and prior knowledge.
Inference in the model
Recursive Conditioning
Recursive conditioning (RC) ^{[1]} is a searchbased algorithm for computing the posterior probability of query variables given observation in a Bayesian network. It works by performing a case analysis on the variables of the Bayesian network, but recognizes disconnected components, caches the computations for future use, and evaluates and removes a CPT as soon as all its random variables are assigned a value. Algorithm 1 gives a highlevel description of the RC algorithm for Bayesian network from here.
1: Procedure rc(Con : context, Fs : set of factors):
2: if ∃v such that <<Con, Fs>, v> ∈ cache 3: return v 4: else if vars(Con) ⊆ vars(Fs) 5: return rc({X = v ∈ Con : X ∈ vars(Fs)}, Fs) 6: else if ∃F ∈ Fs such that vars(F) ⊆ vars(Con) 7: return eval(F, Con) × rc(Con, Fs \ {F}) 8: else if Fs = Fs1 Fs2 where vars(Fs1) ∩ vars(Fs2) ⊆ vars(Con) 9: return rc(Con, Fs1) × rc(Con, Fs2) 10: else select variable X ∈ vars(Fs) 11: sum ← 0 12: for each v ∈ domain(X) 13: sum ← sum + rc(Con ∪ {X = v}, Fs) 14: cache ← cache ∪ {<<Con, Fs>,sum>} 15: return sum
Why recursive conditioning?
CPTs of the learned model are big. Therefore, choosing the inference method is important to get the queried probabilities faster. In the model learned for this problem, we have 2 independent hypotheses and . The Bayesian network of hypothesis has a few dependencies. Therefore, using any approach for inference in this Bayesian network is applicable. But for the Bayesian network of hypothesis which has lots of random variables and dependencies, we should choose the inference approach wisely. As we can see in this Bayesian network, if we condition on sloppyX and sloppyY, we will have four independent subnetworks. One good approach is to use recursive conditioning in this Bayesian network. Using recursive conditioning, we can first branch on sloppyX and sloppyY and after that we have four separate subnetworks, for which we can find the probabilities independently. We can see the four different subnetworks in the figure below.
Therefore, recursive conditioning works efficient for the model of this problem.
Experimental results
Results of the proposed hypothesis have been compared with a similar approach that is as follows:
 This approach uses the two hypotheses of and for modelling.
 The same Bayesian network is used for the X = Y hypothesis.
 A Bayesian network with the same variables, but without the dependencies is used for the hypothesis.
We can see the results of the proposed method and the compared method in table 1. In the case of identity uncertainty problem, true positive rate denotes the percentage of detecting descriptions that refer to some object and false negative rate denotes percentage of detecting descriptions that refer to different objects.
Proposed method  Compared method  

True positive  35%  0% 
False negative  100%  100% 
According to the table 1, the compared method learned the mode of the data and did not detect duplicate records at all, but our proposed method detected 35% of the duplicate records.
Discussions
The results show that using the dependencies among the attributes helps in solving the problem of identity uncertainty. But there are a few issues to take into account:
 In the learned model for this problem, two independent hypotheses have been considered and the problem has been modelled in two independent Bayesian networks, one for each hypothesis. But the two hypotheses might not be independent. Modelling the problem as one model, instead of splitting it into multiple independent models, may help us solve the problem more accurately. Furthermore, there are dependencies among multiple records which should be taken into account. As an example, if we are convinced that two descriptions X and Y refer to the same restaurant, and Y and Z also refer to the same restaurant, by transitivity we should increase the probability of X and Z referrnig to the same restaurant. These dependencies cannot be modelled using a Bayesian network. A Markov logic network (MLN) ^{[2]} might be one possible model that can potentially address both these issues. MLN applies the ideas of a Markov network to firstorder logic and enables uncertain inference. They allow for collective classification ^{[3]} which enables deciding about multiple records at once, and using the decisions made for the other pairs of records when deciding about a specific pair.
 Splitting the dataset into the train set and test set usually does not work as good as using cross validation. But in this case I am learning the parameters by hand and doing that multiple times for multiple folds takes too much time. One good approach for learning the parameters is Expectation Maximization (EM). Using cross validation and EM might be a more reliable way of testing this hypothesis.
 In the model learned for the hypothesis, there might be hidden variables that I did not consider in designing the model. For example one possible hidden variable is error. Considering other possible hidden variables may improve the method.
Annotated Bibliography
 ↑ Darwiche, Adnan. "Recursive conditioning." Artificial Intelligence 126.1 (2001): 541."
 ↑ Richardson, Matthew, and Pedro Domingos"Markov logic networks." Machine learning 62.12 (2006): 107136.
 ↑ Sen, Prithviraj, et al. "Collective classification in network data." AI magazine 29.3 (2008): 93.APA
To Add
Put links and content here to be added. This does not need to be organized, and will not be graded as part of the page. If you find something that might be useful for a page, feel free to put it here.
