Course:LIBR557/2020WT2/edit distance

From UBC Wiki

Edit Distance

Edit distance is a metric used to measure how similar two strings are based on the number of operations necessary to transform one string into another (Jurafsky & Martin, 2009). There are different types of edit distance, but the most common operations used are insertion, deletion, and substitution (Ingersoll et al., 2012; Jurafsky & Martin, 2009; Manning et al., 2009).

Edit distance is primarily used in information retrieval to correct spelling errors in user queries (Manning et al., 2008). Users may misspell queries for a variety of reasons (e.g., typing errors, spelling errors, cognitive errors), and so, it is useful for the search system to allow for user error by retrieving the given word and words that are similar to the user query (Navarro, 2011). Ideally, when looking at alternative correct spellings for a misspelled query, the system should select the nearest or most similar one (Manning et al., 2008). In text retrieval, edit distance is the generally accepted model used to measure the similarity between words (Navarro, 2011).

Edit distance is also used in computational biology to measure the similarity of DNA and protein sequences (Navarro, 2011).1

Edit Distance Operations

  • Insertion involves inserting a new character into a string. For example, transforming “fmily” into t “family” requires the insertion of an “a”.
  • Deletion involves removing a character from a string. For example, transforming “Vancjouver” into “Vancouver” requires the deletion of the “j”.
  • Substitution involves replacing one character in a string with another. For example, transforming “chocken” into “chicken” requires substituting the “o” with an “i”.

Levenshtein Distance

The simplest and most common form of edit distance allows only insertion, deletion, and substitution and is known as the Levenshtein distance (Manning et al., 2008). The Levenshtein distance is sum of the operations required to transform one string into another, with each operation given a weight of one (Ingersoll et al., 2012).

Example: Levenshtein distance between “intention” and “execution”.
  1. intention → ntention (delete “i”)
  2. ntention → etention (substitutue “n” with “e”)
  3. etention → exention (substitute “t” with “x”)
  4. exention → exection (substitute “n” with “c”)
  5. exection → execution (insert “u”)

The Levenshtein distance between “intention” and “execution” is five (Jurafsky & Martin, 2009).

Other Types of Edit Distance  

Damerau-Levenshtein Distance

Similar to the Levenshtein distance, the Damerau-Levenshtein distance allows insertion, deletion, substitution, but also adds transposition (Ingersoll et al., 2012). Transposition involves swapping adjacent characters in a source string to make it more similar the target string. For example, transforming “suop” into “soup” requires transposing “uo” into “ou”. Each operation is given a weight of one, and transposition can be used on adjacent letters with a weight of one edit instead of requiring two operations (Ingersoll et al., 2012).

Hamming Distance

The Hamming distance is only used with strings of equal length and allows only substitution, with each operation given a weight of one (Navarro, 2001). The Hamming distance is the minimal number of substitutions required to transform one string into another of equal length (Bookstein et al., 2002). Within information retrieval, the Hamming distance can be used to quantify the differences in the occurrence patterns of terms in documents (Bookstein et al., 2002).

Challenges

Minimum Edit Distance

When transforming a source string into the target string, there are many possible sequences of edits (Jurafsky & Martin, 2009). Generally, it is best to identify the path requiring the fewest operations to transform one string into another, the minimum edit distance (Jurafsky & Martin, 2009). If all possible paths between two strings were computed to find the minimum edit distance, this task would be incredibly computationally expensive (Jurafsky & Martin, 2009). So, instead of computing all possible paths each time, it is possible to recall the shortest path between two strings using dynamic programming (Jurafsky & Martin, 2009). Dynamic programming allows for large problems to be solved by combining solutions to sub-problems (Jurafsky & Martin, 2009).

Weighting Edits

In the simplest form of edit distance, all operations are assigned a weight of one. However, there may be times when it is useful to assign different weights to different operations (Manning et al., 2008). In spelling correction, the weight of an operation may depend on the characters involved (Manning et al., 2008). For example, when typing, it is more likely that “r” will be mistyped as “t” than as “p”, because “t” is closer to “r” on the keyboard (Manning et al., 2008). Given this likelihood, a lower weight would be assigned to replacing “t” with “r” than with “p” (Manning et al., 2008).

Areas for Future Work

Phonetic Correction

Phonetic correction involves identifying misspellings that occur as a result of the user imputing a query that sounds like the target term (Manning et al., 2008). These misspellings are generally identified using a language model to measure misspellings independent of acoustics, and an acoustic model to measure the likelihood of a given term’s expected pronunciation (Droppo & Acero, 2010). Research by Droppo and Acero (2010) proposes the addition of a phonetic string edit distance to the language and acoustic model scores in phonetic correction.

Query Suggestion

Query suggestion aims to improve the efficiency of a user’s search by providing suggestions to complete the user’s query (Qin et al., 2020). Research by Qin et al. (2020) explores using edit distance constraints to improve user error tolerance in query autocompletion. Their method performed significantly better than existing solutions in terms of query response time (Qin et al., 2020).

Regular Expressions

A regular expression is a pattern made up of characters used to specify text search strings (Jurafsky & Martin, 2009). Regular expressions are used to search a document for all strings that match the pattern (Jurafsky & Martin, 2009). Research by Belazzougui & Raffinot (2013) explores the use of edit distance in regular expression matching to allow for errors when searching for a given regular expression in a document.

Bibliography

Belazzougui, D., & Raffinot, M. (2013). Approximate regular expression matching with multi-strings. Journal of Discrete Algorithms, 18, 14–21. https://doi.org/10.1016/j.jda.2012.07.008
Bookstein, A., Kulyukin, V. A., & Raita, T. (2002). Generalized Hamming distance. Information Retrieval, 5, 353-375. https://doi.org/10.1023/A:1020499411651
Droppo, J., & Acero, A. (2010). Context dependent phonetic string edit distance for automatic speech recognition. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 4358-4361. https://doi.org/10.1109/ICASSP.2010.5495652
Ingersoll, G., Morton, T., & Farris, A. (2012). Taming text: How to find, organize, and manipulate it. Manning Publications Company.
Jurafsky, D. & Martin, J. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition (2nd ed.). Pearson Prentice Hall.
Manning, C., Raghavan, P. & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1), 31-88. https://doi.org/10.1145/375360.375365
Navarro, G. (2011). Queries: Language & properties. In R. Baeza-Yates & B. Ribeiro-Neto (Eds.), Modern information retrieval: The concepts and technology behind search (pp. 255-280). Addison Wesley.
Qin, J., Xiao, C., Hu, S., Zhang, J., Wang, W., Ishikawa, Y., Tsuda, K., & Sadakane, K. (2020). Efficient query autocompletion with edit distance-based error tolerance. The VLDB Journal, 29(4), 919–943. https://doi.org/10.1007/s00778-019-00595-4

Footnotes

  1. Further details of applications in computational biology are outside of the scope of this entry that focuses on informational retrieval and natural language processing.