Course:LIBR557/2020WT2/regular expressions
Regular Expressions
Regular expressions, sometimes also called "regex" or "regexp", are specially encoded text strings used as patterns for matching sets of strings (Fitzgerald, St. Laurent, & Demarest, 2012). They follow a simple syntax but deliver great expressive power. Stephen Cole Kleene, a mathematician, invented regular expressions in the mid-1950s (Oram & Wilson, 2007), and Kenneth Lane Thompson, a computer scientist, introduced Kleene’s invention to the computer science community in the mid-1960s (Thompson, 1968). Since then, regular expressions have served as the input language for many systems that process strings to facilitate searching (Hopcroft, Motwani, & Ullman, 2000; Sipser, 2006). Many applications such as text editors, and modern programming languages like Python and JavaScript, support regular expression as a means to enable advanced matching and querying (Nield, 2019). Compilers of conventional programming languages often perform regular expression matching for tokenization (lexical analysis) of a given source file (Aho, Lam, Sethi, & Ullman, 2007).
Basic Regular Expressions
Regular Expressions (REs) are the algebraic description of languages (Hopcroft et al., 2000; Sipser, 2006). The following are the fundamental regular expressions that can not be broken apart, and the advanced regular expressions are derived from the fundamentals (See fundamental operators and derived operators below).
Fundamental Operators on Regular Expressions
If r1 and r2 are regular expressions, here are three fundamental operators on them (Hopcroft et al., 2000; Sipser, 2006).
REs | Language of REs | Description | Precedence | Example |
---|---|---|---|---|
r1r2 | L(r1)L(r2) | Concatenation: the language generated by r1 is followed by the language generated by r2 | middle | r1r2 |
r1|r2 | L(r1) ∪ L(r2) | Combination/Union of r1 and r2: same function of Boolean operator – OR without the duplication | lowest | r1, r2 |
r1* | L(r1)* | Kleene closure: 0 or more occurrences of r1 | highest | ε, r1, r1r1, r1r1r1, … |
Derived Operators on Regular Expressions
The above three fundamental operators are important but may cause users inconvenience when writing some complex regular expressions. Derived operators (Aho, Lam, Sethi, & Ullman, 2007) are thus included to allow users to write certain regular repressions more efficiently. It is worth noting that the introduction of derived operators does not increase the same expressive power of regular expressions (see challenges below).
REs | Description | Example | ||
---|---|---|---|---|
Regular Expression with Fundamental Operator | Regular Expression with Derived Operator | Highlighting Possible Match | ||
[ ] | square expression matches a given character enclosed in the brackets | reali(x|y|z)e | reali[xyz]e | realize |
[ - ] | square expression matches a character in the given range | reali(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)e | reali[a-z]e | realise |
? | question mark matches 0 or 1 occurrence of the preceding RE | diarrh(o|)ea | diarrho?ea | diarrhea! |
+ | plus sign matches 1 or more occurrences of the preceding RE | (cccc*) | ccc+ | ccccd |
Further Enhancement: Programming-Language Specific Regular Expressions
To provide greater convenience for users to write REs in the search field, the programming languages tailor-make their own REs and shorthand character classes of the same expressiveness based on the fundamental REs. The following are part of the enhancements for the Python programming language (Stubblebine, 2007).
Meta-character | Description | Example | ||
---|---|---|---|---|
Regular Expression with Basic and Derived Operator | Regular Expression with Metacharacter | Highlighting Possible Match | ||
[^ ] | caret matches a character does not exist in the brackets | N.A. | [^abc] | abcdcba |
\ | \d matches a whole number | [0-9][0-9][0-9][0-9] | \d\d\d\d | 2021 year |
\w matches a word character, i.e., the set of alphanumeric characters and underscore | [a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_] | \w\w\w | Yo3? | |
\W matches a non-word character | [^a-zA-Z0-9_][^a-zA-Z0-9_][^a-zA-Z0-9_] | \W\W\W | ^-^ | |
. | period matches a character (except for line terminators \n) | pig[^a-zA-Z0-9_]|pig[a-zA-Z0-9_] | pig. | pig@s |
^ | caret matches a character at the beginning of the line or string | N.A. | ^ab | abc |
$ | dollar sign matches a character at the end of the line or string | N.A. | ig$ | pig |
{n1} | curly brackets specify the preceding item is matched exactly n1 time(s) | N.A. | z{3} | xyzzz |
{n1,} | curly brackets specify the preceding item is matched n1 or more times | N.A. | z{2,} | xyzzzz |
{n1,n2} | curly brackets specify the preceding item is matched from n1 to n2 times | N.A. | z{2,3} | xyzz |
More Complicated Examples of Programming-Language Specific Regular Expressions
Users can test the regular expressions on free online platforms such as Regular Expressions 101 and RegExr. The following REs are entered in the search box to find the matches in the content of “About the iSchool” (Sample).
REs | Highlighting Sample Match |
---|---|
\W@\d\W\d*\s\w* | @1:35 pm |
^U\w*|\d\d\d\d | UBC |
2020 | |
\w*o{2}\w.$ | iSchool. |
[A-Z]+[^\w][A-Z][a-z]* | UBC School |
[S-Z]\w+[s-z]s? | Studies |
University | |
Times | |
Scholarships |
Challenges
Non-Regular Expressions
Regular expressions are powerful but can not describe and express everything in the world. To increase the expressiveness, some systems and programming languages allow their so-called "regular expressions" in the querying field to be context-free (Lewis & Lacher, 2017; Norvell, 2002) or even context-sensitive (Stubblebine, 2007), blurring the definition of REs and causing confusion to users. Users are likely to understand the "regular expressions" by the first programming language they first learned or the system they first came across. Meanwhile, they may not notice that there is a trade-off between the computing power of the search engines and the expressiveness of these "Regular Expressions".
Non-Transferable Skills
Along the same line, programming languages such as Python and JavaScript have established their specific regular expressions. Due to the non-identical syntax and supporting features of the regular expressions among the programming languages and systems, users can not fully apply and transfer what they learned of REs in Python to JavaScript to retrieve search results (Stubblebine, 2007). They also do not understand why the regular expression flavors share the same basic but also different advanced features. Sometimes, they may even not be aware of the subtle differences across various flavors of "regular expressions".
To find any email addresses in a given text, here are the regular expressions for different implementations1,2. The highlighted differences show that both languages have different escape characters and sequences.
Python | JavaScript |
---|---|
((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t]*\r\n)?[\t ]+)?\"(((([\t ]*\r\n)?[\t ]+)?([]!#-[^-~]|(\\[\t -~])))+(([\t ]*\r\n)?[\t]+)?|(([\t ]*\r\n)?[\t ]+)?)\"(([\t ]*\r\n)?[\t ]+)?)@((([\t ]*\r\n)?[\t]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?\[((([\t]*\r\n)?[\t ]+)?[!-Z^-~])*(([\t ]*\r\n)?[\t ]+)?](([\t ]*\r\n)?[\t ]+)?) | ((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t]*\r\n)?[\t ]+)?"(((([\t ]*\r\n)?[\t ]+)?([]!#-[^-~]|(\\[\t -~])))+(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t]+)?)"(([\t ]*\r\n)?[\t ]+)?)@((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?\[((([\t ]*\r\n)?[\t ]+)?[!-Z^-~])*(([\t ]*\r\n)?[\t ]+)?](([\t]*\r\n)?[\t ]+)?) |
^[\w\.]+@([\w-]+\.)+[\w-]{2,4}$ | ^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$ |
Unicode Support
Unicode is very important as it allows users to perform computing tasks in their native languages. Unfortunately, most RE engines from common programming languages only support ASCII characters and offer limited/no support to Unicode standards (e.g., UTF-8).
Bibliography
- Aho, A. V., Lam, M. S., Sethi, R., & Ullman, J. D. (2007). Compilers: Principles, techniques, and tools (2nd ed.)
- Fitzgerald, M., St. Laurent, S., & Demarest, R. (2012). Introducing regular expressions (1st ed.). Sebastopol, CA: O'Reilly.
- Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2000). Introduction to automata theory, languages, and computation (2nd ed.) Addison Wesley.
- Lewis, M. C., & Lacher, L. (2017). Regular expressions and context-free parsers. (pp. 549-568) Chapman and Hall/CRC.
- Nield, T. (2019). Introduction to regular expressions (1st ed.) O'Reilly Media, Inc.
- Norvell, T. (2002). A short introduction to regular expressions and context free grammars. Project Report, Nov, 5
- Oram, A., & Wilson, G. (2007). Beautiful code (1st ed.). Beijing; Sebastapol, Calif: O'Reilly.
- Sipser, M. (2006). Introduction to the theory of computation (2nd ed.). Boston: Thomson Course Technology.
- Stubblebine, T. (2007). Regular expression pocket reference: Regular expressions for perl, ruby, PHP, python, C, java and .NET. Sebastopol: O'Reilly Media, Incorporated.
- Thompson, K. (1968). Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6), 419-422.
Footnotes
- Referencing RFC6854: https://tools.ietf.org/html/rfc6854
- Referencing Stackoverflow: https://stackoverflow.com