Tokenizing

Tokenization is the process of dividing a text into words, sentences, or other semantic units called tokens (Mitkov, 2004). The tokens can be used for further text processing. For example, a text can be automatically indexed in three steps. First, the text content is segmented into tokens. Second, the tokens are further processed into normalized tokens. The normalization process is used to match similar tokens despite differences in character sequences (Manning, Raghavan, & Schütze, 2008; Jurafsky & Martin, 2009). For example, lemmatization can be used to convert the three tokens am, are, and is to one token: be.[1] Third, the tokens are added to an inverted index data structure as indexing terms.
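
As a concrete illustration, these three steps can be sketched in Python. The tokenizer, the tiny lemma table, and the function names below are illustrative stand-ins rather than a standard library or a complete implementation.

from collections import defaultdict

# Step 1: segment on whitespace and strip trailing punctuation (illustrative only).
def tokenize(text):
    return [chunk.rstrip(".,;:!?") for chunk in text.split()]

# Step 2: normalize tokens; a real system might use a full lemmatizer here.
LEMMAS = {"am": "be", "are": "be", "is": "be"}  # toy lemma table
def normalize(token):
    token = token.lower()
    return LEMMAS.get(token, token)

# Step 3: add the normalized tokens to an inverted index (term -> set of document ids).
def build_index(documents):
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            inverted[normalize(token)].add(doc_id)
    return inverted

docs = {1: "The bird is singing.", 2: "Birds are singing."}
print(build_index(docs)["be"])  # {1, 2} -- 'is' and 'are' both normalize to 'be'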

Word Segmentation

Tokenization usually divides text into words. In segmented languages, like English, words are delimited by punctuation and whitespace (Mitkov, 2004).[2] Thus, for these languages, the easiest tokenization method is to segment the text content on whitespace and trailing punctuation (Mitkov, 2004; Jurafsky & Martin, 2009).

Example 1: Input and output of a whitespace and trailing punctuation tokenizer.

Input:
Shoot all the bluejays you want, if you can hit ’em, but remember it’s a sin to kill a mockingbird.

Output:
Shoot all the bluejays you want if you can hit ’em but remember it’s a sin to kill a mockingbird
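
Such a whitespace and trailing punctuation tokenizer can be sketched in Python as follows; the punctuation characters stripped here are only a plausible subset.

def whitespace_tokenizer(text):
    # Split on whitespace, then strip trailing punctuation from each chunk.
    return [chunk.rstrip(".,;:!?") for chunk in text.split()]

sentence = ("Shoot all the bluejays you want, if you can hit ’em, "
            "but remember it’s a sin to kill a mockingbird.")
print(whitespace_tokenizer(sentence))
# ['Shoot', 'all', 'the', 'bluejays', 'you', 'want', 'if', 'you', 'can',
#  'hit', '’em', 'but', 'remember', 'it’s', 'a', 'sin', 'to', 'kill', 'a', 'mockingbird']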

Sentence Segmentation

Some types of text processing need text to be tokenized into sentences (Mitkov, 2004). The simplest method is to divide the text on exclamation marks, periods, and question marks, because these usually signal the end of a sentence. However, there are cases where a period does not end a sentence, such as abbreviations. One of the most popular sentence segmentation approaches divides text on an exclamation mark, period, or question mark that is followed by a whitespace character and a capital letter (Jurafsky & Martin, 2009). This tokenizer can be further improved with a list of abbreviations (Jurafsky & Martin, 2009).
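
The rule can be sketched in Python as a regular expression that splits after an exclamation mark, period, or question mark followed by whitespace and a capital letter, combined with a small abbreviation list; both the pattern and the list are illustrative.

import re

ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "etc.", "e.g.", "i.e."}  # illustrative list

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundaries: ., !, or ? followed by whitespace and a capital letter.
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        candidate = text[start:match.end()].strip()
        # Do not split if the word before the boundary is a known abbreviation.
        if candidate.split()[-1] in ABBREVIATIONS:
            continue
        sentences.append(candidate)
        start = match.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived late. The meeting had started. Was it over?"))
# ['Dr. Smith arrived late.', 'The meeting had started.', 'Was it over?']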

Challenges

Abbreviation

In segmented languages, a period trailing a word usually signals the end of the sentence. However, when a period follows an abbreviation, it is part of the word and should be tokenized with it. For example, the abbreviation A.B.C. should be tokenized as A.B.C.; however, a naive tokenizer that splits on punctuation will produce three tokens: A, B, and C. The most common approach to tokenizing abbreviations is to use a list of abbreviations during tokenization (Jurafsky & Martin, 2009). In this case, when the tokenizer finds a word with a trailing period, it searches for the word in the abbreviation list. If the word is listed, the word and period are tokenized into one token.
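
This lookup approach can be sketched as follows; the abbreviation list is an illustrative stand-in for a real one.

ABBREVIATIONS = {"A.B.C.", "etc.", "U.S.", "Dr."}  # illustrative list

def tokenize_with_abbreviations(text):
    tokens = []
    for chunk in text.split():
        # Keep the trailing period when the chunk is a known abbreviation.
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            tokens.append(chunk.rstrip(".,;:!?"))
    return tokens

print(tokenize_with_abbreviations("The A.B.C. network, etc."))
# ['The', 'A.B.C.', 'network', 'etc.']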

Apostrophes

Apostrophes are often used for contractions (e.g., “aren’t”). One strategy is to divide on non-alphanumeric characters, but this can generate poor tokenizations (see Example 2). Some tokenizers can expand contractions, for example, from “aren’t” to are and not (Manning, Raghavan, & Schütze, 2008). Tokenizing words with apostrophes is further complicated because apostrophes are also used to mark possession and quotation. Handling apostrophes is usually a language-specific task (Manning, Raghavan, & Schütze, 2008). Thus, it requires the language of the text to be known[3] and a set of language-specific tokenizing rules (Manning, Raghavan, & Schütze, 2008).

Example 2: Input and output of a non-alphanumeric tokenizer.

Input:
aren’t

Output:
aren t
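
One possible, strictly language-specific way to handle English contractions is a lookup table of expansions; the table below is a small illustrative sample rather than a complete list.

import re

# Illustrative English contraction table; a real tokenizer would need many more entries.
CONTRACTIONS = {"aren’t": ["are", "not"], "it’s": ["it", "is"], "can’t": ["can", "not"]}

def tokenize_contractions(text):
    tokens = []
    for chunk in re.findall(r"[\w’']+", text.lower()):
        tokens.extend(CONTRACTIONS.get(chunk, [chunk]))
    return tokens

print(tokenize_contractions("Aren’t they here? It’s late."))
# ['are', 'not', 'they', 'here', 'it', 'is', 'late']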

Hyphenation

Hyphenated words are difficult to tokenize: sometimes the hyphen is part of a token and sometimes it is not. The tokenizer must differentiate between end-of-line hyphens and true hyphens (Mitkov, 2004). End-of-line hyphens are used to break a word across lines and should be removed during tokenization. True hyphens are part of tokens and should not be removed. There are two types of true hyphens. The first type occurs with certain prefixes and suffixes that are often written with a hyphen (Mitkov, 2004; Manning, Raghavan, & Schütze, 2008). The second type occurs in multi-word tokens that are commonly written with a hyphen (Mitkov, 2004; Manning, Raghavan, & Schütze, 2008). It can be difficult to differentiate between an end-of-line hyphen and a true hyphen when a hyphen occurs at the end of a line (Mitkov, 2004).
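
One possible heuristic, sketched below under the assumption that a vocabulary of legitimately hyphenated words is available, is to rejoin a word broken across a line and keep the hyphen only when the hyphenated form is itself a known word.

# Illustrative vocabulary of words that legitimately contain a hyphen.
HYPHENATED_VOCAB = {"co-operate", "e-mail", "state-of-the-art"}

def fix_line_break_hyphen(first_part, second_part):
    # Rejoin a word split by a hyphen at the end of a line. Keep the hyphen
    # only if the hyphenated form is in the vocabulary; otherwise treat it
    # as an end-of-line hyphen and remove it. This is a heuristic sketch.
    candidate = first_part + "-" + second_part
    if candidate in HYPHENATED_VOCAB:
        return candidate
    return first_part + second_part

print(fix_line_break_hyphen("co", "operate"))     # 'co-operate' (true hyphen kept)
print(fix_line_break_hyphen("tokeni", "zation"))  # 'tokenization' (end-of-line hyphen removed)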

Multi-word, Numerical, and Special Expressions

Dividing on whitespace will incorrectly tokenize multi-word expressions. For example, “New York” should be tokenized as one token, but it will be tokenized as New and York if the text is divided on whitespace (Manning, Raghavan, & Schütze, 2008). Incorrectly splitting these expressions can cause poor search results; for example, a search for “York University” may return results for New York University. A multi-word dictionary is needed to correctly tokenize these multi-word expressions (Manning, Raghavan, & Schütze, 2008).
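
A simple greedy matcher against a multi-word dictionary illustrates the idea; the two-entry dictionary below is for demonstration only.

MULTIWORD = {("New", "York"), ("New", "York", "University")}  # illustrative entries
MAX_LEN = max(len(entry) for entry in MULTIWORD)

def tokenize_multiword(words):
    # Greedily merge the longest dictionary match starting at each position.
    tokens, i = [], 0
    while i < len(words):
        for length in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + length]) in MULTIWORD:
                tokens.append(" ".join(words[i:i + length]))
                i += length
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(tokenize_multiword(["She", "studied", "at", "New", "York", "University"]))
# ['She', 'studied', 'at', 'New York University']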

Numerical expressions that include punctuation, such as 1,234.56, should be tokenized as one token. However, if this example is divided on punctuation, it will be tokenized as 1, 234, and 56. Tokenizing numerical expressions is difficult because not all languages use the same numerical punctuation. For example, many European languages use commas and spaces where English uses periods and commas (e.g., 1 234,56).
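
A regular expression can keep such numbers together as single tokens; the pattern below handles only the English convention of commas and periods and is therefore a sketch rather than a general solution.

import re

# English-style numbers: optional thousands commas, optional decimal period.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

print(NUMBER.findall("The part costs 1,234.56 dollars, or about 1234 euros."))
# ['1,234.56', '1234']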

There are many other types of special expressions that are difficult to tokenize, such as dates (January 1, 2000 and 01/01/2000) and telephone numbers (+1 (123) 456 7890). Many of these expressions follow a regular pattern, and are usually tokenized with a preprocessor or a regular expression (Mitkov, 2004).
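
Such a regular-expression preprocessor can be sketched as follows, with deliberately narrow patterns that cover only the date and telephone formats mentioned above.

import re

# Illustrative patterns; real systems need many more formats.
PATTERNS = [
    r"\+?\d{1,3}\s*\(\d{3}\)\s*\d{3}\s*\d{4}",  # +1 (123) 456 7890
    r"\d{2}/\d{2}/\d{4}",                       # 01/01/2000
    r"[A-Z][a-z]+ \d{1,2}, \d{4}",              # January 1, 2000
]
SPECIAL = re.compile("|".join(PATTERNS))

text = "Call +1 (123) 456 7890 before January 1, 2000 or on 01/01/2000."
print(SPECIAL.findall(text))
# ['+1 (123) 456 7890', 'January 1, 2000', '01/01/2000']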

Language-specific Tokenization

While most European languages have similar tokenization rules, there are also language-specific rules that help identify tokens without delimiters. In Germanic languages, for example, compound nouns are written without whitespace between the component words but still need to be segmented during tokenization: Computerlinguistik (“computational linguistics” in German) needs to be tokenized into Computer and Linguistik (Manning, Raghavan, & Schütze, 2008). A compound divider can be used to divide compound words into their component words (Mitkov, 2004; Manning, Raghavan, & Schütze, 2008).
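
A compound divider can be sketched as a recursive vocabulary lookup; the tiny vocabulary below is illustrative only.

# Illustrative German vocabulary; a real compound divider would use a full lexicon.
VOCAB = {"computer", "linguistik", "auto", "bahn"}

def split_compound(word):
    # Return the component words of a compound, or the word itself if no split is found.
    lower = word.lower()
    if lower in VOCAB:
        return [lower]
    for i in range(1, len(lower)):
        head, tail = lower[:i], lower[i:]
        if head in VOCAB:
            rest = split_compound(tail)
            if all(part in VOCAB for part in rest):
                return [head] + rest
    return [word]

print(split_compound("Computerlinguistik"))  # ['computer', 'linguistik']
print(split_compound("Autobahn"))            # ['auto', 'bahn']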

Bibliography

  • Ingersoll, G., Morton, T. & Farris, A. (2012). Taming text: How to find, organize, and manipulate it. Shelter Island, NY: Manning Publications Company.
  • Jurafsky, D. & Martin, J. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, NJ: Pearson Prentice Hall.
  • Manning, C., Raghavan, P. & Schütze, H. (2008). Introduction to information retrieval. New York: Cambridge University Press.
  • Mitkov, R. (2004). The Oxford handbook of computational linguistics. Oxford: Oxford University Press.

Footnotes

  1. Other common techniques applied at the token level include case alterations, stopword removal, expansion, part-of-speech tagging, and stemming (Ingersoll, Morton, & Farris, 2012).

  2. The tokenization of non-segmented languages, like Chinese, is an open problem and is beyond the scope of this entry.

  3. Manning, Raghavan, and Schütze report, “Language identification based on classifiers that use short character subsequences as features is highly effective; most languages have distinctive signature patterns” (2008, p. 22).