Regular Expressions

Regular expressions, sometimes also called "regex" or "regexp", are specially encoded text strings used as patterns for matching sets of strings (Fitzgerald, St. Laurent, & Demarest, 2012). They follow a simple syntax but deliver great expressive power. Stephen Cole Kleene, a mathematician, invented regular expressions in the mid-1950s (Oram & Wilson, 2007), and Kenneth Lane Thompson, a computer scientist, introduced Kleene’s invention to the computer science community in the mid-1960s (Thompson, 1968). Since then, regular expressions have served as the input language for many systems that process strings to facilitate searching (Hopcroft, Motwani, & Ullman, 2000; Sipser, 2006). Many applications such as text editors, and modern programming languages like Python and JavaScript, support regular expression as a means to enable advanced matching and querying (Nield, 2019). Compilers of conventional programming languages often perform regular expression matching for tokenization (lexical analysis) of a given source file (Aho, Lam, Sethi, & Ullman, 2007).

Basic Regular Expressions

Regular Expressions (REs) are the algebraic description of languages (Hopcroft et al., 2000; Sipser, 2006). The following are the fundamental regular expressions that can not be broken apart, and the advanced regular expressions are derived from the fundamentals (See fundamental operators and derived operators below).

REs	Language of REs	Description
a	L(a)	singleton set: a set contains character a only
ε	L(ε)	an empty string contains no characters (the length of the string is 0) in a set
∅	L( )	an empty set/language

Fundamental Operators on Regular Expressions

If r₁ and r₂ are regular expressions, here are three fundamental operators on them (Hopcroft et al., 2000; Sipser, 2006).

REs	Language of REs	Description	Precedence	Example
r₁r₂	L(r₁)L(r₂)	Concatenation: the language generated by r₁ is followed by the language generated by r₂	middle	r₁r₂
r₁\|r₂	L(r₁) ∪ L(r₂)	Combination/Union of r₁ and r₂: same function of Boolean operator – OR without the duplication	lowest	r₁, r₂
r₁*	L(r₁)*	Kleene closure: 0 or more occurrences of r₁	highest	ε, r₁, r₁r₁, r₁r₁r₁, …

Derived Operators on Regular Expressions

The above three fundamental operators are important but may cause users inconvenience when writing some complex regular expressions. Derived operators (Aho, Lam, Sethi, & Ullman, 2007) are thus included to allow users to write certain regular repressions more efficiently. It is worth noting that the introduction of derived operators does not increase the same expressive power of regular expressions (see challenges below).

REs	Description	Example
REs	Description	Regular Expression with Fundamental Operator	Regular Expression with Derived Operator	Highlighting Possible Match
[ ]	square expression matches a given character enclosed in the brackets	reali(x\|y\|z)e	reali[xyz]e	realize
[ - ]	square expression matches a character in the given range	reali(a\|b\|c\|d\|e\|f\|g\|h\|i\|j\|k\|l\|m\|n\|o\|p\|q\|r\|s\|t\|u\|v\|w\|x\|y\|z)e	reali[a-z]e	realise
?	question mark matches 0 or 1 occurrence of the preceding RE	diarrh(o\|)ea	diarrho?ea	diarrhea!
+	plus sign matches 1 or more occurrences of the preceding RE	(cccc*)	ccc+	ccccd

Further Enhancement: Programming-Language Specific Regular Expressions

To provide greater convenience for users to write REs in the search field, the programming languages tailor-make their own REs and shorthand character classes of the same expressiveness based on the fundamental REs. The following are part of the enhancements for the Python programming language (Stubblebine, 2007).

Meta-character	Description	Example
Meta-character	Description	Regular Expression with Basic and Derived Operator	Regular Expression with Metacharacter	Highlighting Possible Match
[^ ]	caret matches a character does not exist in the brackets	N.A.	[^abc]	abcdcba
\	\d matches a whole number	[0-9][0-9][0-9][0-9]	\d\d\d\d	2021 year
	\w matches a word character, i.e., the set of alphanumeric characters and underscore	[a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_]	\w\w\w	Yo3?
	\W matches a non-word character	[^a-zA-Z0-9_][^a-zA-Z0-9_][^a-zA-Z0-9_]	\W\W\W	^-^
.	period matches a character (except for line terminators \n)	pig[^a-zA-Z0-9_]\|pig[a-zA-Z0-9_]	pig.	pig@s
^	caret matches a character at the beginning of the line or string	N.A.	^ab	abc
$	dollar sign matches a character at the end of the line or string	N.A.	ig$	pig
{n₁}	curly brackets specify the preceding item is matched exactly n₁ time(s)	N.A.	z{3}	xyzzz
{n₁,}	curly brackets specify the preceding item is matched n₁ or more times	N.A.	z{2,}	xyzzzz
{n₁,n₂}	curly brackets specify the preceding item is matched from n₁ to n₂ times	N.A.	z{2,3}	xyzz

More Complicated Examples of Programming-Language Specific Regular Expressions

Users can test the regular expressions on free online platforms such as Regular Expressions 101 and RegExr. The following REs are entered in the search box to find the matches in the content of “About the iSchool” (Sample).

REs	Highlighting Sample Match
\W@\d\W\d\s\w	@1:35 pm
^U\w*\|\d\d\d\d	UBC
^U\w*\|\d\d\d\d	2020
\w*o{2}\w.$	iSchool.
[A-Z]+[^\w][A-Z][a-z]*	UBC School
[S-Z]\w+[s-z]s?	Studies
	University
	Times
	Scholarships

Challenges

Non-Regular Expressions

Regular expressions are powerful but can not describe and express everything in the world. To increase the expressiveness, some systems and programming languages allow their so-called "regular expressions" in the querying field to be context-free (Lewis & Lacher, 2017; Norvell, 2002) or even context-sensitive (Stubblebine, 2007), blurring the definition of REs and causing confusion to users. Users are likely to understand the "regular expressions" by the first programming language they first learned or the system they first came across. Meanwhile, they may not notice that there is a trade-off between the computing power of the search engines and the expressiveness of these "Regular Expressions".

Non-Transferable Skills

Along the same line, programming languages such as Python and JavaScript have established their specific regular expressions. Due to the non-identical syntax and supporting features of the regular expressions among the programming languages and systems, users can not fully apply and transfer what they learned of REs in Python to JavaScript to retrieve search results (Stubblebine, 2007). They also do not understand why the regular expression flavors share the same basic but also different advanced features. Sometimes, they may even not be aware of the subtle differences across various flavors of "regular expressions".

To find any email addresses in a given text, here are the regular expressions for different implementations^1,2. The highlighted differences show that both languages have different escape characters and sequences.

Python	JavaScript
((([\t ]\r\n)?[\t ]+)?[-!#-'+/-9=?A-Z^-~]+(\.[-!#-'+/-9=?A-Z^-~]+)(([\t ]\r\n)?[\t ]+)?\|(([\t]\r\n)?[\t ]+)?\"(((([\t ]\r\n)?[\t ]+)?([]!#-[^-~]\|(\\[\t -~])))+(([\t ]\r\n)?[\t]+)?\|(([\t ]\r\n)?[\t ]+)?)\"(([\t ]\r\n)?[\t ]+)?)@((([\t ]\r\n)?[\t]+)?[-!#-'+/-9=?A-Z^-~]+(\.[-!#-'+/-9=?A-Z^-~]+)(([\t ]\r\n)?[\t ]+)?\|(([\t ]\r\n)?[\t ]+)?\[((([\t]\r\n)?[\t ]+)?[!-Z^-~])(([\t ]\r\n)?[\t ]+)?](([\t ]\r\n)?[\t ]+)?)	((([\t ]\r\n)?[\t ]+)?[-!#-'+/-9=?A-Z^-~]+(\.[-!#-'+/-9=?A-Z^-~]+)(([\t ]\r\n)?[\t ]+)?\|(([\t]\r\n)?[\t ]+)?"(((([\t ]\r\n)?[\t ]+)?([]!#-[^-~]\|(\\[\t -~])))+(([\t ]\r\n)?[\t ]+)?\|(([\t ]\r\n)?[\t]+)?)"(([\t ]\r\n)?[\t ]+)?)@((([\t ]\r\n)?[\t ]+)?[-!#-'+/-9=?A-Z^-~]+(\.[-!#-'+/-9=?A-Z^-~]+)(([\t]\r\n)?[\t ]+)?\|(([\t ]\r\n)?[\t ]+)?\[((([\t ]\r\n)?[\t ]+)?[!-Z^-~])(([\t ]\r\n)?[\t ]+)?](([\t]\r\n)?[\t ]+)?)
^[\w\.]+@([\w-]+\.)+[\w-]{2,4}$	^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$

Unicode Support

Unicode is very important as it allows users to perform computing tasks in their native languages. Unfortunately, most RE engines from common programming languages only support ASCII characters and offer limited/no support to Unicode standards (e.g., UTF-8).

Bibliography

Aho, A. V., Lam, M. S., Sethi, R., & Ullman, J. D. (2007). Compilers: Principles, techniques, and tools (2nd ed.)
Fitzgerald, M., St. Laurent, S., & Demarest, R. (2012). Introducing regular expressions (1st ed.). Sebastopol, CA: O'Reilly.
Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2000). Introduction to automata theory, languages, and computation (2nd ed.) Addison Wesley.
Lewis, M. C., & Lacher, L. (2017). Regular expressions and context-free parsers. (pp. 549-568) Chapman and Hall/CRC.
Nield, T. (2019). Introduction to regular expressions (1st ed.) O'Reilly Media, Inc.
Norvell, T. (2002). A short introduction to regular expressions and context free grammars. Project Report, Nov, 5
Oram, A., & Wilson, G. (2007). Beautiful code (1st ed.). Beijing; Sebastapol, Calif: O'Reilly.
Sipser, M. (2006). Introduction to the theory of computation (2nd ed.). Boston: Thomson Course Technology.
Stubblebine, T. (2007). Regular expression pocket reference: Regular expressions for perl, ruby, PHP, python, C, java and .NET. Sebastopol: O'Reilly Media, Incorporated.
Thompson, K. (1968). Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6), 419-422.

Footnotes

Referencing RFC6854: https://tools.ietf.org/html/rfc6854
Referencing Stackoverflow: https://stackoverflow.com