Course:LIBR557/2020WT2/regular expressions

From UBC Wiki

Regular Expressions

Regular expressions, sometimes also called "regex" or "regexp", are specially encoded text strings used as patterns for matching sets of strings (Fitzgerald, St. Laurent, & Demarest, 2012). They follow a simple syntax but deliver great expressive power. Stephen Cole Kleene, a mathematician, invented regular expressions in the mid-1950s (Oram & Wilson, 2007), and Kenneth Lane Thompson, a computer scientist, introduced Kleene’s invention to the computer science community in the mid-1960s (Thompson, 1968). Since then, regular expressions have served as the input language for many systems that process strings to facilitate searching (Hopcroft, Motwani, & Ullman, 2000; Sipser, 2006). Many applications such as text editors, and modern programming languages like Python and JavaScript, support regular expression as a means to enable advanced matching and querying (Nield, 2019). Compilers of conventional programming languages often perform regular expression matching for tokenization (lexical analysis) of a given source file (Aho, Lam, Sethi, & Ullman, 2007).

Basic Regular Expressions

Regular Expressions (REs) are the algebraic description of languages (Hopcroft et al., 2000; Sipser, 2006). The following are the fundamental regular expressions that can not be broken apart, and the advanced regular expressions are derived from the fundamentals (See fundamental operators and derived operators below).

REs Language of REs Description Diagram
a L(a) singleton set: a set contains character a only a Set
ε L(ε) an empty string contains no characters (the length of the string is 0) in a set Empty String
L( ) an empty set/language Empty Set

Fundamental Operators on Regular Expressions

If r1 and r2 are regular expressions, here are three fundamental operators on them (Hopcroft et al., 2000; Sipser, 2006).

REs Language of REs Description Precedence Example
r1r2 L(r1)L(r2) Concatenation: the language generated by r1 is followed by the language generated by r2 middle r1r2
r1|r2 L(r1) ∪ L(r2) Combination/Union of r1 and r2: same function of Boolean operator – OR without the duplication lowest r1, r2
r1* L(r1)* Kleene closure: 0 or more occurrences of r1 highest ε, r1, r1r1, r1r1r1, …

Derived Operators on Regular Expressions

The above three fundamental operators are important but may cause users inconvenience when writing some complex regular expressions. Derived operators (Aho, Lam, Sethi, & Ullman, 2007) are thus included to allow users to write certain regular repressions more efficiently. It is worth noting that the introduction of derived operators does not increase the same expressive power of regular expressions (see challenges below).

REs Description Example
Regular Expression with Fundamental Operator Regular Expression with Derived Operator Highlighting Possible Match
[ ] square expression matches a given character enclosed in the brackets reali(x|y|z)e reali[xyz]e realize
[ - ] square expression matches a character in the given range reali(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)e reali[a-z]e realise
? question mark matches 0 or 1 occurrence of the preceding RE diarrh(o|)ea diarrho?ea diarrhea!
+ plus sign matches 1 or more occurrences of the preceding RE (cccc*) ccc+ ccccd

Further Enhancement: Programming-Language Specific Regular Expressions

To provide greater convenience for users to write REs in the search field, the programming languages tailor-make their own REs and shorthand character classes of the same expressiveness based on the fundamental REs. The following are part of the enhancements for the Python programming language (Stubblebine, 2007).

Meta-character Description Example
Regular Expression with Basic and Derived Operator Regular Expression with Metacharacter Highlighting Possible Match
[^ ]caret matches a character does not exist in the bracketsN.A.[^abc]abcdcba
\ \d matches a whole number [0-9][0-9][0-9][0-9] \d\d\d\d 2021 year
\w matches a word character, i.e., the set of alphanumeric characters and underscore [a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_] \w\w\w Yo3?
\W matches a non-word character [^a-zA-Z0-9_][^a-zA-Z0-9_][^a-zA-Z0-9_] \W\W\W ^-^
. period matches a character (except for line terminators \n) pig[^a-zA-Z0-9_]|pig[a-zA-Z0-9_] pig. pig@s
^ caret matches a character at the beginning of the line or string N.A. ^ab abc
$ dollar sign matches a character at the end of the line or string N.A. ig$ pig
{n1} curly brackets specify the preceding item is matched exactly n1 time(s) N.A. z{3} xyzzz
{n1,} curly brackets specify the preceding item is matched n1 or more times N.A. z{2,} xyzzzz
{n1,n2} curly brackets specify the preceding item is matched from n1 to n2 times N.A. z{2,3} xyzz

More Complicated Examples of Programming-Language Specific Regular Expressions

Users can test the regular expressions on free online platforms such as Regular Expressions 101 and RegExr. The following REs are entered in the search box to find the matches in the content of “About the iSchool” (Sample).

REs Highlighting Sample Match
\W@\d\W\d*\s\w*  @1:35 pm
^U\w*|\d\d\d\d UBC
2020
\w*o{2}\w.$ iSchool.
[A-Z]+[^\w][A-Z][a-z]* UBC School
[S-Z]\w+[s-z]s? Studies
University
Times
Scholarships

Challenges

Non-Regular Expressions

Regular expressions are powerful but can not describe and express everything in the world. To increase the expressiveness, some systems and programming languages allow their so-called "regular expressions" in the querying field to be context-free (Lewis & Lacher, 2017; Norvell, 2002) or even context-sensitive (Stubblebine, 2007), blurring the definition of REs and causing confusion to users. Users are likely to understand the "regular expressions" by the first programming language they first learned or the system they first came across. Meanwhile, they may not notice that there is a trade-off between the computing power of the search engines and the expressiveness of these "Regular Expressions".

Non-Transferable Skills

Along the same line, programming languages such as Python and JavaScript have established their specific regular expressions. Due to the non-identical syntax and supporting features of the regular expressions among the programming languages and systems, users can not fully apply and transfer what they learned of REs in Python to JavaScript to retrieve search results (Stubblebine, 2007). They also do not understand why the regular expression flavors share the same basic but also different advanced features. Sometimes, they may even not be aware of the subtle differences across various flavors of "regular expressions".

To find any email addresses in a given text, here are the regular expressions for different implementations1,2. The highlighted differences show that both languages have different escape characters and sequences.

Python JavaScript
((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t]*\r\n)?[\t ]+)?\"(((([\t ]*\r\n)?[\t ]+)?([]!#-[^-~]|(\\[\t -~])))+(([\t ]*\r\n)?[\t]+)?|(([\t ]*\r\n)?[\t ]+)?)\"(([\t ]*\r\n)?[\t ]+)?)@((([\t ]*\r\n)?[\t]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?\[((([\t]*\r\n)?[\t ]+)?[!-Z^-~])*(([\t ]*\r\n)?[\t ]+)?](([\t ]*\r\n)?[\t ]+)?) ((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t]*\r\n)?[\t ]+)?"(((([\t ]*\r\n)?[\t ]+)?([]!#-[^-~]|(\\[\t -~])))+(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t]+)?)"(([\t ]*\r\n)?[\t ]+)?)@((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?\[((([\t ]*\r\n)?[\t ]+)?[!-Z^-~])*(([\t ]*\r\n)?[\t ]+)?](([\t]*\r\n)?[\t ]+)?)
^[\w\.]+@([\w-]+\.)+[\w-]{2,4}$ ^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$

Unicode Support

Unicode is very important as it allows users to perform computing tasks in their native languages. Unfortunately, most RE engines from common programming languages only support ASCII characters and offer limited/no support to Unicode standards (e.g., UTF-8).

Bibliography

  • Aho, A. V., Lam, M. S., Sethi, R., & Ullman, J. D. (2007). Compilers: Principles, techniques, and tools (2nd ed.)
  • Fitzgerald, M., St. Laurent, S., & Demarest, R. (2012). Introducing regular expressions (1st ed.). Sebastopol, CA: O'Reilly.
  • Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2000). Introduction to automata theory, languages, and computation (2nd ed.) Addison Wesley.
  • Lewis, M. C., & Lacher, L. (2017). Regular expressions and context-free parsers. (pp. 549-568) Chapman and Hall/CRC.
  • Nield, T. (2019). Introduction to regular expressions (1st ed.) O'Reilly Media, Inc.
  • Norvell, T. (2002). A short introduction to regular expressions and context free grammars. Project Report, Nov, 5
  • Oram, A., & Wilson, G. (2007). Beautiful code (1st ed.). Beijing; Sebastapol, Calif: O'Reilly.
  • Sipser, M. (2006). Introduction to the theory of computation (2nd ed.). Boston: Thomson Course Technology.
  • Stubblebine, T. (2007). Regular expression pocket reference: Regular expressions for perl, ruby, PHP, python, C, java and .NET. Sebastopol: O'Reilly Media, Incorporated.
  • Thompson, K. (1968). Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6), 419-422.

Footnotes

  1. Referencing RFC6854: https://tools.ietf.org/html/rfc6854
  2. Referencing Stackoverflow: https://stackoverflow.com