Course:CPSC522/Linking Sentences in Asynchronous Conversations
Author: Jordon Johnson
Linking Sentences in Asynchronous Conversations
In asynchronous conversations, linking sentences is the process of determining, given a sentence, the earlier sentence(s) in the conversation to which it refers. Of interest to researchers is finding either a single metric or a set of features that can be used to predict which sentences in a given conversation are linked.
Principal Author: Jordon Johnson
Abstract
This page gives a brief introduction to asynchronous conversations; in addition, the concept of linking sentences is introduced, and some potential uses of linking sentences are discussed. A brief introduction to similarity measures of vector representations of text is also given. The hypothesis considered is that the cosine similarity performs better than random label assignment in classifying linked sentences, same-topic sentences, and random pairings of sentences. It is found that the cosine similarity does outperform random assignment in distinguishing between linked and random sentence pairs, as well as between same-topic and random sentence pairs; but it performs comparably to random assignment when distinguishing between linked and same-topic sentence pairs.
Builds on
Linking sentences is a task in the field of Natural Language Processing (NLP), which field often makes use of Machine Learning techniques. This page makes use of similarity measures such as cosine similarity and the vector representations of words known as word embeddings; and it makes references to subfields of NLP such as sentiment analysis. In addition, the page makes use of the Receiver Operating Characteristic (ROC) curve in the evaluation of the hypothesis.
Content
Asynchronous Conversations
If you frequent websites such as Reddit or Slashdot, you probably already have a fairly good idea what an asynchronous conversation is, even if you're unfamiliar with the term. In synchronous conversations, such as one would have face-to-face or in meetings, speakers take turns; and the conversation follows a single thread that may jump between topics but remains unbroken in time. Asynchronous conversations, however, can fracture into tree structures, each branch of which is its own conversation; a participant (or poster) can reply to any existing post (article or comment) at any time. Thus the structure of asynchronous conversations is determined by reply-to relations, and any temporal ordering of posts is constrained to fit that structure. A simple example is given in Figure 1. Because of the structural differences between synchronous and asynchronous conversations, techniques used to analyse synchronous conversations aren't always as effective when applied to asynchronous conversations.
Due to their continuing proliferation online, the exploration and analysis of asynchronous conversations are of significant interest to researchers. The NLP group at UBC has made a number of contributions to this field; one example is ConVis, a visualization system to facilitate the exploration of topics, sentiment, and poster activity in asynchronous conversations.
Linked Sentences
Consider some sentence in a comment (let's call it com_sen) in an asynchronous conversation. This comment is either a reply to the root post of the conversation (such as an article), or it is in reply to another ancestor comment in its branch of the conversation tree structure. Suppose there is a particular sentence in one of these ancestors (let's call the sentence art_sen) to which com_sen specifically refers. We could then say that art_sen and com_sen are linked[1].
Figure 2a takes the simple asynchronous conversation from Figure 1 and shows the links between its sentences. Four aspects of these links are of particular interest:
- The links are directional; com_sen always links to art_sen, because a sentence cannot refer to something that hasn't been written yet.
- Links should be from descendent posts to ancestor posts. Due to human error, there may be exceptions (such as links between sentences in sibling posts).
- A given com_sen may link to multiple sentences in different ancestor posts, though we can require that at most one sentence per post may be linked to that com_sen.
- Asynchronous conversations can cover multiple topics; so one would expect that, if two sentences are linked, they are likely about the same topic.
Once a link between two sentences is found, additional analysis can be performed on the linked sentences[1]. For example, the sentiment of art_sen could be determined and applied to the link; in other words, we could determine how the poster feels about art_sen. Figure 2b shows an alternate representation of links with such additional information.
Finding links between sentences is of particular use in the summarization of asynchronous conversations[1]. If a particular sentence in an article is linked to a relatively large number of comment sentences, then that sentence would likely be included in a summary of the article; and an aggregation of additional information associated with the links (such as sentiment or agreement/disagreement) could also be included. For example, a summary of the simple conversation in Figures 1 and 2 might be something like
- Not everyone agrees that dogs are smarter than cats.
A second use of links in summarization is related to the concept of link chains, where the art_sen of one link is the com_sen of another. If link chains can be categorized by patterns in sentiment and argument values, and if most of the relevant sentences also link to the same sentence or group of sentences elsewhere in the conversation, then it may be possible to "collapse" entire branches into a single sentence in the summary of the conversation.
Similarity Measures and Vector Representations of Text
A commonly-used measure of similarity is cosine similarity, where the inner product and magnitude of two vectors are exploited. Cosine similarity is used in NLP as well, often as a simple baseline for how similar two text samples are to each other. Since cosine similarity requires vector inputs, the question becomes how to extract vectors from text.
A simple technique is as follows: consider a collection of documents , where the set of distinct words found in the collection is . Assign a -dimensional vector to each document in , where the th element in the vector is the number of times word appears in the document.
Unfortunately, the simple technique described doesn't always work. Consider the following two sentences:
- I love pie.
- Cake is my favourite food.
We would say that these sentences are quite similar, since they both talk about preferred foods, and pie and cake are quite similar as well. However, since these two sentences have no words in common, under the method described above their cosine similarity would be zero.
As we have seen, this technique would be less effective for small text samples such as sentences (due to an increased likelihood of no common words between two samples), and so we need other techniques. The technique used here is based on word embeddings, which are vectors representing individual words. Word embeddings can be learned in various ways, and the dimensions of the vectors represent semantic features. A vector representation of a sentence can then be obtained by summing the vectors of its words.
Hypothesis
In distinguishing between the following "classes" of sentence pairs:
- linked sentences and random sentence pairs
- pairs of sentences about the same topic (same-topic sentences) and random sentence pairs
- linked sentences and same-topic sentences
classification based on cosine similarity is more accurate than assigning labels randomly.
Data
Three datasets were used:
- Ten articles from The Guardian with associated comments. Each article was annotated with links between sentences as part of the OnForumS (Online Forum Summarization)[1] task. It is not guaranteed that the set of annotated links is complete; and so I choose not to make the assumption that all other sentence pairs must not be linked.
- Nineteen blog conversations from Slashdot, forming part of UBC's BC3 corpus. Each conversation was annotated with topic information by three different annotators. I used versions of the data that had been reformatted earlier to make the relevant topic information more easily accessible.
- The Brown corpus, available for download via the NLTK package in Python. The first computer-readable corpus, it contains a variety of texts from both fictional and non-fictional genres.
Procedure
For each of the three datasets, a set of sentence pairs corresponding to one of the classes was extracted:
- Pairs of linked sentences were extracted from the OnForumS dataset, following the annotations in the data. ~2300 pairs of linked sentences were extracted in total.
- Pairs of sentences in the same topic were extracted from the BC3 dataset. Since the sentences were annotated by three different annotators, a pair of sentences was only considered to be in the same topic if they were so annotated by all three annotators. Of these same-topic sentences, ~2300 were randomly selected in order to keep sample sizes similar.
- ~2300 pairs of sentences were randomly selected from the Brown corpus.
For each sentence pair thus extracted, the cosine similarity was calculated. The Python package gensim includes an implementation of functions related to Word2Vec word embedding models; and the included n_similarity function returns the cosine similarity of two input sets of words (in this case, the words of the sentences in each pair). A pre-trained model provided by Google and containing 300-dimensional vectors of ~3M words was used. Since the n_similarity function fails if it is supplied words that are not found in the model's vocabulary, any such words were removed from the sentences before calculating the cosine similarity.
The output of the above procedure is three sets of cosine similarity values. By imposing a threshold value as a classifier and varying that threshold, the true positive rate and false positive rate can be calculated at each threshold value, and an ROC curve can be constructed. In this case, the cosine similarity threshold was increased by increments of 0.01 over the range [0,1]. The ROC curves were constructed and the maximum classifier accuracy estimated over three runs for each class comparison (see Figure 4 and Table 1).
As another means of visualizing the comparisons between classes (and as a sanity check), each set of cosine similarities was plotted as a histogram. The distributions were then compared, as shown in Figure 3. Based on the results, we expect that the cosine similarity will likely be more useful for distinguishing between linked or same-topic sentences from random sentence pairs than it would for distinguishing between linked and same-topic sentences.
Results
Classification | Threshold | Accuracy |
linked vs random | 0.563 +/- 0.009 | 0.644 +/- 0.003 |
same-topic vs random | 0.620 +/- 0.014 | 0.646 +/- 0.004 |
Based on the results seen in Figure 4, I conclude that classification using cosine similarity does indeed outperform random label assignment when classifying either linked or same-topic sentences from random sentence pairs; but that using cosine similarity is only comparable to random assignment when trying to distinguish between linked sentences and same-topic sentence pairs. This result is as expected based on the "sanity check" results in Figure 3.
Discussion
You may note that the classes being compared are not necessarily disjoint; for example, a random pair of sentences may also be linked, and we expect most (if not all) linked sentences to be about the same topic. While it would be ideal to work with strictly disjoint classes, the data currently available to me (as discussed earlier) does not permit it. In any case, the listed comparisons are still useful; for example, since some of the random sentence pairs may be linked, the accuracy of the classification of linked vs. strictly not-linked random sentence pairs should be at least as good as the results obtained. In other words, these comparisons should give us a lower bound on the accuracy of the cosine similarity as a classifier.
While the performance of the cosine similarity as a classifier is better than random assignment in two of the three hypothesized cases, there is definitely room for improvement in distinguishing linked sentences from all other sentence pairs. A natural conclusion to draw would be that the cosine similarity could be a significant feature in some larger set of features applied to some learned classifier.
References
To Add
Put links and content here to be added. This does not need to be organized, and will not be graded as part of the page. If you find something that might be useful for a page, feel free to put it here.
|