Course:CPSC312-2023/RegexReader

From UBC Wiki


Authors: Michelle, Owen, Floria

What is the problem?

Regular Expressions, also known as Regex, are used to detect patterns in text and have useful applications ranging from string search algorithms to input validation. In this project, we use Prolog to implement a basic regular expression engine that accepts a regular expression and a text string as input, and returns true if the text matches the regular expression. Our utility first parses regular expressions into a syntax tree, enabling it to validate regular expressions against the supported syntax and differentiate between invalid syntax versus an actual lack of matches.

Real regular expression engines are complex and support a wide range of features. Within our limited scope, we support the following features:

Syntax Example Meaning
| (disjunction) a|b Matches either the pattern on the left side or the pattern on the right side.
. (dot) . Matches any character
\d (digit) \d Matches a digit character (i.e. 0 to 9)
\D (non-digit) \D Matches a non-digit character
\s (whitespace) \s Matches a whitespace character (space, tab, newline, etc)
\S (non-whitespace) \S Matches a non-whitespace character
\w (word character) \w Matches a word character (uppercase and lowercase letters, numbers, and underscore _)
\W (non-word character) \W Matches a non-word character
[X-Y] (character class) [A-Z] Matches a character in the range X to Y (inclusive)
[Chars] (character class) [abc] Matches one character in Chars
[^X-Y] (negated character class) [^A-Z] Matches a character outside the range X to Y
[^Chars] (chracter class) [^abc] Matches a character not in Chars
* (0 or more) (ab)* Matches 0 or more of the preceding pattern
+ (1 or more) (ab)+ Matches 1 or more of the preceding pattern
? (0 or 1) (ab)? Matches 0 or 1 of the preceding pattern

See this cheat sheet for further description of the features.

Note to check the matching feature run prove.pl in the repository.

What is the something extra?

For our extra feature, we have designed our engine to evaluate regular expressions by compiling them to Deterministic Finite Automatons (DFAs), against which strings can then be checked. This is the traditional approach to regular expression evaluation, and is considered to be more efficient than standard backtracking matchers in the worst case.

What did we learn from doing this?

(This should be written after you have done the work.) What is the bottom-line? Is functional programming suitable for (part-of) the task? Make sure you include the evidence for your claims.

Prolog was well-suited for some parts of the regular expression evaluation pipeline: notably parsing and syntax tree manipulation. After consulting the context-free grammar of regular expressions, parsing a regular expression to a syntax tree was efficient and clean using difference lists and function symbols. Annotating this tree with information needed to construct the DFA was also straightforward using Prolog's pattern matching and recursive rules.

We learned, however, that it can be challenging to use Prolog for applications where imperative algorithms are more common and well-optimized. Algorithms designed for an imperative programming style - such as the regex to DFA algorithm - often rely on constructs such as while loops, which are difficult and less intuitive to express in Prolog.

Since Prolog does not place emphasis on the definition of custom classes and data structures, we also encountered challenges when SWI-Prolog's built-in data structures (such as ugraph) did not provide all the functionality we needed. The ability to define custom classes and data structures, like in languages such as Java or C++, would have been helpful for many tasks. Instead, we used Prolog predicates to implement a dictionary and the DFA graph - for example, we represented DFAs using the following predicates:

  • state(X): X is a state
  • start(X): X is a start state
  • accepting(X): X is an accepting state
  • transition(X,C,Y): there a transition from state X to state Y on symbol C

Using this more Prolog-like representation, it was simple and intuitive to prove a string against a DFA (minus a couple manipulations to the input to allow the recursive elements to function properly). The code also runs quickly. However, we note that to construct a DFA dynamically, we had to dynamically assert these predicates into the database; this practice may make larger Prolog programs that rely on assert to define custom data structures difficult to debug.

Links to code etc.

https://github.com/zifgu/regex-reader/tree/master

References

For the standard grammar of regular expressions, we consulted the following source:

For the regex to DFA algorithm, we consulted the following sources: