Aho-Corasick algorithm

From UBC Wiki
Jump to navigation Jump to search

Authors: Andrew Sung, Nathan Hsu

What is the problem?

We aim to determine whether Haskell is suitable for pattern matching a list of patterns to a list of texts. We will test this by implementing a variation of the Aho-Corasick algorithm. In particular, we will store the patterns using a trie.

Given a list of patterns and a list of texts, we will output all occurrences of each pattern. We will also rank the texts by the number of unique patterns they contain. It will function as a very basic search engine.

For example, given patterns ["a", "b"] and texts ["apple", "banana"], we will output

banana: 3 "a" found at position 2, 4, 6; 1 "b" found at position 1

apple: 1 "a" found at position 1

What is the something extra?

We will read the list of patterns and the list of texts from files using I/O.

What did we learn from doing this?

In Haskell, it can be difficult to implement tree operations when it feels more natural to conceptualize the operation as a modification of some small part an existing tree. That is, whereas an imperative language would have likely supported the modification of objects via references (e.g. a reference to a node in the tree), the functional implementation always required the tree itself plus a way to find the node we wished to update within that tree. This convinced us that a tree would be better implemented using an indexed list-like structure rather than using a node which recursively contained its children (like a tree in JSON).

We also learnt that Haskell lists are linked lists, so index lookups take O(n). This was not ideal for A-C, as to calculate the next state we must jump between nodes (potentially) many times. Data.Map and Data.IntMap facilitated this need for faster lookups.

Link to code