Course:LIBR557/2020WT2/phonetic correction (soundex)

From UBC Wiki

Phonetic Correction (Soundex)

Phonetic correction is a search feature that encodes English words to produce an index based on sound.[1] By representing like-sounding words with the same code, the intent is to return search results even if no records exactly match the query term(s) entered by the user. Soundex, the oldest and most widely known phonetic algorithm, was initially applied to assist in name retrieval and analysis of the US Census (U.S. National Archives and Records Administration [NARA], 2007). Though name retrieval remains the most relevant use, phonetic correction has also been tested as a way to address misspellings, typos, and variations in spelling across time (Rogers & Willett, 1991; Gadd, 1988). Phonetic correction is a vital component of query understanding. Without correction, errors at the input stage cause important information to go unretrieved because matching fails (Gadd, 1988). Phonetic correction allows retrieval of documents containing close alternatives and near misses.

Phonetic Algorithms

Phonetic correction works by applying an algorithm that encodes words into a “dictionary” in which like-sounding words share the same code. When a user inputs a misspelled query, the query term is encoded in the same way and matched against the codes in the dictionary. Matching on codes widens recall by retrieving documents that do not contain the exact spelling of the query.

Two main assumptions underlie phonetic correction. The first corresponds to the “correction” aspect: a word’s first letter and consonants are presumed to be its most essential components and the least susceptible to error (Rogers & Willett, 1991). Under this assumption, a word can be represented by a code that captures only its most essential components. The second assumption concerns the “phonetic” aspect. Individual letters generally produce distinct sounds, but letters can also be grouped by the type of sound they produce. For example, the letters b, p, f, and v are labial consonants (articulated via the lips). Phonetic coding schemes assume that one can improve information retrieval by mapping each group of letters to a particular number according to place of articulation (i.e. sound) (Rogers & Willett, 1991).
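The dictionary lookup described above can be sketched in a few lines of Python. This is only an illustration: `toy_key`, `build_index`, and `lookup` are hypothetical names, and `toy_key` is a deliberately simplified stand-in for a phonetic code (it keeps the first letter and the later consonants, dropping vowels and doubled letters), not Soundex itself.

```python
from collections import defaultdict

VOWELS = set("aeiouy")  # assumption: y is treated as a vowel in this toy key


def toy_key(word: str) -> str:
    """Hypothetical simplified code: first letter plus later consonants."""
    word = word.lower()
    if not word:
        return ""
    kept = [word[0]]
    for prev, ch in zip(word, word[1:]):
        # Drop vowels and immediately repeated letters.
        if ch in VOWELS or ch == prev:
            continue
        kept.append(ch)
    return "".join(kept)


def build_index(vocabulary):
    """Map each code to the set of vocabulary words that share it."""
    index = defaultdict(set)
    for word in vocabulary:
        index[toy_key(word)].add(word)
    return index


def lookup(index, query: str):
    """Encode the query the same way and return all words with that code."""
    return sorted(index.get(toy_key(query), set()))


index = build_index(["smith", "smyth", "jones", "johns"])
print(lookup(index, "smithe"))  # ['smith', 'smyth']
print(lookup(index, "jonse"))   # ['jones']
```

Any encoder could be substituted for `toy_key`; the index structure stays the same because both the stored words and the query pass through the same function.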

Soundex Algorithm

The Soundex code is a 4-character string that begins with the first letter of the term, followed by three digits ranging from 0 to 6. The digits 1 through 6 represent places of articulation and thus denote groups of similar sounds. Zero serves as a placeholder when a term is not long enough to yield three digits (NARA, 2007).

Term values are assigned as follows (NARA, 2007):

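  1. Retain the first letter; after it, disregard all vowels as well as all instances of w, h, and y (these letters receive no digit, although their positions still matter for rules 5 and 6)
  2. Substitute digits for consonants (not including the first letter) as follows: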
    • 1 → b, f, p, v
    • 2 → c, g, j, k, q, s, x, z
    • 3 → d, t
    • 4 → l
    • 5 → m, n
    • 6 → r
  3. If the word before coding has a double letter, condense this to one letter (e.g. in "Harris," the double r would be coded as one r).
  4. If a term has different letters side-by-side that also have the same code, treat these letters as one letter (e.g. in "Pickering," the k is ignored because it follows c which shares the same code).
  5. If a term has two same-code letters separated by a vowel, both consonants are coded (e.g. in "Harmon," the m and the n would both be coded).
  6. If a term has two same-code letters separated by an h or w, the second consonant is ignored (e.g. in "Ashcroft," the c is ignored).
  7. If there are fewer than three digits after coding, add zeros until the term's code has three digits; if there are more than three digits after coding, only retain the first three digits.
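The rules above can be sketched as a short Python function. This is one possible reading of the rules, not an official implementation: the names `CODES` and `soundex` are chosen for illustration, y is assumed to behave like a vowel for rule 5, and non-alphabetic characters are assumed to be ignored.

```python
# Digit classes from rule 2; letters not listed (vowels, h, w, y) get no digit.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(term: str) -> str:
    letters = [ch for ch in term.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Code of the most recent consonant; starts with the first letter's code
    # so that a same-code neighbour of the first letter is skipped (rule 4).
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != prev:      # rules 3, 4, 6: skip repeated codes
                digits.append(code)
            prev = code
        elif ch not in "hw":
            prev = ""             # rule 5: a vowel (or y) separates same-code letters
        # h and w leave prev unchanged, so rule 6 applies across them
    # Rule 7: pad with zeros or truncate to three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


for name in ["Harris", "Pickering", "Harmon", "Ashcroft"]:
    print(name, soundex(name))
# Harris H620
# Pickering P265
# Harmon H655
# Ashcroft A261
```

The examples reuse the names from the rules: the double r in "Harris" yields a single 6, the k in "Pickering" and the c in "Ashcroft" are skipped, and both m and n in "Harmon" are coded.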

Challenges

The Soundex algorithm has a number of limitations that can produce both false drops (irrelevant terms that share a code) and missed matches (like-sounding terms that receive different codes).

Silent Consonants and Interchangeable Phonemes

Soundex is unable to handle silent consonants or interchangeable phonemes (Rogers & Willett, 1991). “NEW” and “KNEW” are homophones, but Soundex codes them differently because the algorithm treats the “K” in “KNEW” as an essential phonetic component of the word. Similarly, strings of consonants such as “PH” and “GHT” are phonetically interchangeable with “F” and “T,” respectively, but are not recognized as such in the Soundex coding scheme.
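A small experiment makes these mismatches concrete. The compact `soundex` function below is an illustrative sketch of the coding rules (it assumes a non-empty, purely alphabetic input and treats y like a vowel); the respelled words "filip" and "nite" are made-up test inputs, not examples from the cited sources.

```python
GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
DIGIT = {ch: d for letters, d in GROUPS.items() for ch in letters}


def soundex(term: str) -> str:
    """Compact sketch; assumes a non-empty alphabetic term."""
    term = term.lower()
    out, prev = [], DIGIT.get(term[0], "")
    for ch in term[1:]:
        d = DIGIT.get(ch, "")
        if d and d != prev:
            out.append(d)
        if d or ch not in "hw":   # h and w do not reset the previous code
            prev = d
    return (term[0].upper() + "".join(out) + "000")[:4]


# Each pair sounds alike, yet the codes differ.
for a, b in [("new", "knew"), ("filip", "philip"), ("nite", "night")]:
    print(a, soundex(a), "|", b, soundex(b))
# new N000 | knew K500
# filip F410 | philip P410
# nite N300 | night N230
```

In the first two pairs the retained first letter alone prevents a match; in the third, the "G" of "GHT" contributes a digit even though it is silent.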

Diversity of Pronunciation

The English language borrows from many other languages.  As such, the same consonant might behave very differently in different words and contexts (Gadd, 1988).  “C” is a particularly difficult letter as it produces distinctly different sounds in “cat,” “cell,” and “cello.”

Short Words

Short words often present a large pool of potential matches that can impair search precision and cause frustration for the user. As Rogers and Willett (1991) note, "a mis-spelt four-letter word represents a much greater degree of disruption than a mis-spelt ten-letter word and the former is thus much more difficult to correct" (p. 350).

Variability of Language Over Time

Spelling, pronunciation, and word usage are constantly changing, but they do not change at the same rate. Alterations to pronunciation typically occur more quickly than changes in spelling. This variability poses problems for “defining an adequate set of substitutions for 'all occasions'” (Gadd, 1988, p. 234).

Further Research

Since the initial development of Soundex and phonetic correction, there have been many variations of phonetic algorithms.  A few of the variations are listed below.

  • New York State Identification and Intelligence System (NYSIIS)
  • Davidson Code
  • Daitch–Mokotoff Soundex (D–M Soundex)
  • Metaphone
  • Phonix

Levenshtein distance, introduced in 1965, is a related method of approximate matching in information retrieval that relies on edit distance rather than phonetics.
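For comparison with the phonetic approach, edit distance can be sketched as follows. This is the standard dynamic-programming formulation, counting single-character insertions, deletions, and substitutions; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("new", "knew"))        # 1
print(levenshtein("kitten", "sitting"))  # 3
```

Unlike a phonetic code, this measure treats "new" and "knew" as near neighbours (one insertion apart) without any knowledge of how either word sounds.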

Bibliography

Gadd, T. N. (1988). ‘Fisching fore werds’: Phonetic retrieval of written text in information systems. Program: Electronic Library and Information Systems, 22, 222–237.

Rogers, H. J., & Willett, P. (1991). Searching for historical word forms in text databases using spelling-correction methods: Reverse error and phonetic coding methods. Journal of Documentation, 47, 333–353.

U.S. National Archives and Records Administration. (2007, May 30). Soundex System. National Archives. https://www.archives.gov/research/census/soundex

Footnotes

  1. The scope of this project is limited to the English language; the author has no knowledge of phonetic correction as used for other languages.