Course:CPSC522/Identity Uncertainty in a restaurant data-set

Identity uncertainty in a restaurant data-set

In this page we will see if using dependencies among attributes in a restaurant data-set helps to solve the problem of "identity uncertainty".

Principal Author: Bahare Fatemi

Abstract

Deciding if two descriptions refer to the same object or not is the role of identity uncertainty. In this page, I have tried to solve the problem of identity uncertainty for a restaurant data-set. A probabilistic model has been proposed, which captures the dependencies among attributes. The probabilistic model consists of two independent Bayesian networks: one for hypothesis of descriptions refer to the same object and one for hypothesis of descriptions do not refer to the same object. The parameters of the networks have been set using prior knowledge or learned by using a part of the data-set as a train set. The accuracy of the proposed model has been computed and compared with a model with less dependencies. At the end, advantages and disadvantages of the proposed method are discussed.

Builds on

This page solves the problem of identity uncertainty using the dependencies among attributes (similar to this link) and models the structure of the problem as a Bayesian network.

Content

Background

Identity uncertainty is the problem of determining whether two descriptions refer to the same object or not. It is an important real world problem which can be applied to many different fields. As an example consider a data-set where may contain duplicate records. Deciding whether two records of this data-set correspond to the same object or not is the role of identity uncertainty. Identity uncertainty is a difficult task in real world data. Assume you have to decide whether two records about two customers having the same first and last name but different addresses refer to two different people. These two records may refer to the same person that has moved to a new address, or to two different people who happened to have the same first and last name.

Hypothesis

Each data-set has different attributes for each record. For example, in a data-set about people, attributes can be first name, last name, age, gender, and address of a person. These attributes can be dependent or independent of each other. Considering the assumption of these attributes being dependent or independent of each other changes our belief about the problem. Therefore, these assumptions changed the model of the problem. The hypothesis of this page is whether considering the dependencies among the attributes of records helps in solving the problem of identity uncertainty for a restaurant data-set.

Data-set

The hypothesis is applied to a restaurant data-set which is a collection of 864 restaurant records from the Fodor's and Zagat's restaurant guides that contains 112 duplicates. Attributes available for each record in this data-set are name, address, city, and type of the restaurants. This data has been provided by Dr. Sheila Tejada.

This data-set contains errors in different attributes. Errors may be typing errors, copying errors, and so on. Some of the restaurants may move to a new place. Moving causes two records to have the same name but different addresses. Even some types are sub-types of other types. For example, one restaurant can have the type of "Chinese" while the type is filled as "Asian". "Chinese" is a sub-type of "Asian" so it is not an error, but it makes the problem of identity uncertainty more difficult.

Modeling the problem

In the identity uncertainty problem, any two descriptions X and Y may or may not refer to the same object. Suppose ${\displaystyle Desc_{X}}$ and ${\displaystyle Desc_{Y}}$ denote the attribute values for restaurant X and restaurant Y. Let ${\displaystyle P_{same}(X,Y)}$ be the posterior probability that restaurant X and restaurant Y refer to the same object ${\displaystyle (X=Y)}$ and ${\displaystyle P_{notsame}(X,Y)}$ be the posterior probability that descriptions ${\displaystyle X}$ and ${\displaystyle Y}$ refer to different objects ${\displaystyle (X\neq Y)}$ given their attribute values. The odds for hypothesis ${\displaystyle X=Y}$ and ${\displaystyle X\neq Y}$ is as below:

${\displaystyle Odds={\frac {P_{same}}{P_{notsame}}}={\frac {P(X=Y)P(Desc_{X}\land Desc_{Y}|X=Y)}{P(X\neq Y)P(Desc_{X}\land Desc_{Y}|X\neq Y)}}}$

In the above equation ${\displaystyle {\frac {P(X=Y)}{P(X\neq Y)}}}$ is the prior odds and ${\displaystyle {\frac {P(Desc_{X}\land Desc_{Y}|X=Y)}{P(Desc_{X}\land Desc_{Y}|X\neq Y)}}}$ is the likelihood ratio. Therefore, for calculating the odds ratio, the prior ratio can be calculated based on the train-set or by using prior knowledge and only the likelihood ratio should be calculated. One approach for calculating the likelihood ratio is to model two separate Bayesian networks: one for hypothesis of ${\displaystyle X=Y}$ and one for hypothesis ${\displaystyle X\neq Y}$.

Structure learning for the X = Y hypothesis

For the X = Y hypothesis we should focus on the records that are different but refer to the same restaurant. In order to model the world of this problem, different hidden variables have been considered. One important hidden variable in the world of this problem is the errors. Errors can cause the same restaurant have two different descriptions. Consider these two records:

1) philippe's the original, 1001 n. alameda st., los angeles, american.

2) philippe the original, 1001 n. alameda st., chinatown, cafeterias.

These two records are referring to the same restaurant, but there is a typo in one of these restaurants names. So it is necessary to model the possible errors in this data-set.

As it was mentioned above, there is a probability of moving that causes the same restaurant to have two different addresses in two different descriptions.

In this Bayesian network we can see the dependencies among attributes.

This is the a Bayesian network modeling the restaurant data-set for hypothesis X = Y

In the Bayesian network above, nameX, addressX, cityX, and typeX are attributes of the record X and nameY, addressY, cityY, and typeY are attributes of record Y. sloppyX and sloppyY are two random variables indicating the probability of the typist of record X and record Y being sloppy respectively. EnameX and EnameY represent the probability of name attributes of records X and Y having errors. EaddressX, EaddressY, EcityX, EcityY, EtypeX, and EtypeY denote to different attributes of records X and Y having errors. The typist being sloppy is a parent of all different kinds of errors. Aname represent the actual first name with domain of all possible names of restaurant. nameX and nameY is dependent to what the actual name of the restaurant is and whether there is error in name fields or not. Aaddress, Aaddress, Acity, and Atype represent the actual address, actual city, and actual type respectively.

By inferencing in this Bayesian network, ${\displaystyle P(Desc_{X}\land Desc_{Y}|X=Y)}$ can be calculated.

Structure learning for the ${\displaystyle X\neq Y}$ hypothesis

For the ${\displaystyle X\neq Y}$ hypothesis, we should focus on the records which seemingly refer to the same restaurant, but indicate two different restaurants. Consider two different restaurants: they rarely have the same name, and the assumption of them being different decreases the probability of them having the same type or city. Assuming ${\displaystyle X\neq Y}$, each attribute of record X depends on the same attribute in record Y. The network below shows the Bayesian network for the ${\displaystyle X\neq Y}$ hypothesis.

This Bayesian network represents the dependencies among attributes of records with ${\displaystyle X\neq Y}$ hypothesis.

By inference in the above Bayesian network, ${\displaystyle P(Desc_{X}\land Desc_{Y}|X\neq Y)}$ can be calculated.

Parameter learning

After modeling the structure, the parameters should be learned. The parameters of these networks are the values in the conditional probability tables (CPTs). For learning the CPTs, the restaurant data-set has been split to 85% train set and 15% test-set. Then, the CPTs were learned based on the train set and prior knowledge.

Inference in the model

Recursive Conditioning

Recursive conditioning (RC) [1] is a search-based algorithm for computing the posterior probability of query variables given observation in a Bayesian network. It works by performing a case analysis on the variables of the Bayesian network, but recognizes disconnected components, caches the computations for future use, and evaluates and removes a CPT as soon as all its random variables are assigned a value. Algorithm 1 gives a high-level description of the RC algorithm for Bayesian network from here.

1: Procedure rc(Con : context, Fs : set of factors):
2:            if ∃v such that <<Con, Fs>, v> ∈ cache
3:                     return v
4:            else if vars(Con) ⊆ vars(Fs)
5:                     return rc({X = v ∈ Con : X ∈ vars(Fs)}, Fs)
6:            else if ∃F ∈ Fs such that vars(F) ⊆ vars(Con)
7:                     return eval(F, Con) × rc(Con, Fs \ {F})
8:            else if Fs = Fs1 ${\displaystyle \cup }$ Fs2 where vars(Fs1) ∩ vars(Fs2) ⊆ vars(Con)
9:                     return rc(Con, Fs1) × rc(Con, Fs2)
10:         else select variable X ∈ vars(Fs)
11:                   sum ← 0
12:                   for each v ∈ domain(X)
13:                           sum ← sum + rc(Con ∪ {X = v}, Fs)
14:                   cache ← cache ∪ {<<Con, Fs>,sum>}
15:                   return sum


Why recursive conditioning?

CPTs of the learned model are big. Therefore, choosing the inference method is important to get the queried probabilities faster. In the model learned for this problem, we have 2 independent hypotheses ${\displaystyle X=Y}$ and ${\displaystyle X\neq Y}$. The Bayesian network of hypothesis ${\displaystyle X\neq Y}$ has a few dependencies. Therefore, using any approach for inference in this Bayesian network is applicable. But for the Bayesian network of hypothesis ${\displaystyle X=Y}$ which has lots of random variables and dependencies, we should choose the inference approach wisely. As we can see in this Bayesian network, if we condition on sloppyX and sloppyY, we will have four independent sub-networks. One good approach is to use recursive conditioning in this Bayesian network. Using recursive conditioning, we can first branch on sloppyX and sloppyY and after that we have four separate sub-networks, for which we can find the probabilities independently. We can see the four different sub-networks in the figure below.

In this figure, we can see four independent sub-networks for attributes name, address, city, and type.

Therefore, recursive conditioning works efficient for the model of this problem.

Experimental results

Results of the proposed hypothesis have been compared with a similar approach that is as follows:

• This approach uses the two hypotheses of ${\displaystyle X=Y}$ and ${\displaystyle X\neq Y}$ for modelling.
• The same Bayesian network is used for the X = Y hypothesis.
• A Bayesian network with the same variables, but without the dependencies is used for the ${\displaystyle X\neq Y}$ hypothesis.

We can see the results of the proposed method and the compared method in table 1. In the case of identity uncertainty problem, true positive rate denotes the percentage of detecting descriptions that refer to some object and false negative rate denotes percentage of detecting descriptions that refer to different objects.

Table 1: experimental results
Proposed method Compared method
True positive 35% 0%
False negative 100% 100%

According to the table 1, the compared method learned the mode of the data and did not detect duplicate records at all, but our proposed method detected 35% of the duplicate records.

Discussions

The results show that using the dependencies among the attributes helps in solving the problem of identity uncertainty. But there are a few issues to take into account:

• In the learned model for this problem, two independent hypotheses have been considered and the problem has been modelled in two independent Bayesian networks, one for each hypothesis. But the two hypotheses might not be independent. Modelling the problem as one model, instead of splitting it into multiple independent models, may help us solve the problem more accurately. Furthermore, there are dependencies among multiple records which should be taken into account. As an example, if we are convinced that two descriptions X and Y refer to the same restaurant, and Y and Z also refer to the same restaurant, by transitivity we should increase the probability of X and Z referrnig to the same restaurant. These dependencies cannot be modelled using a Bayesian network. A Markov logic network (MLN) [2] might be one possible model that can potentially address both these issues. MLN applies the ideas of a Markov network to first-order logic and enables uncertain inference. They allow for collective classification [3] which enables deciding about multiple records at once, and using the decisions made for the other pairs of records when deciding about a specific pair.
• Splitting the data-set into the train set and test set usually does not work as good as using cross validation. But in this case I am learning the parameters by hand and doing that multiple times for multiple folds takes too much time. One good approach for learning the parameters is Expectation Maximization (EM). Using cross validation and EM might be a more reliable way of testing this hypothesis.
• In the model learned for the ${\displaystyle X\neq Y}$ hypothesis, there might be hidden variables that I did not consider in designing the model. For example one possible hidden variable is error. Considering other possible hidden variables may improve the method.