CPSC312-2019-Crossword-Builder

Crossword Builder

What is the problem?

We want to investigate whether Haskell is an effective tool for building 'sparse' crossword puzzles. We will build a program that accepts a list of words as input and attempts to create and display a crossword puzzle built out of those words. Some general algorithms for tackling this problem exist (for example: here), but will require some creativity to properly represent the data, display the puzzle, and implement the algorithm in Haskell.

What is the something extra?

While we adopt a general algorithm for building a crossword puzzle from a list of words, we are writing our program in Haskell from scratch. Once we have implemented this basic algorithm for generating valid crosswords, we intend to introduce a number of optimizations to improve the algorithm's efficiency and its ability to create larger, more complex puzzles. These optimizations include:

1) Representing the state of our puzzle (the words we have placed and failed to place) as we build it, and repeatedly running our algorithm until it no longer improves the puzzle's state;

2) Making sure that our actual implementation of the algorithm is efficient, by using more efficient data structures (such as Haskell's Map structure) wherever possible;

3) Implementing a strategy based on randomization in order to search for an optimal crossword from a given list of words: users input a number of attempts to try, and the best crossword generated from all those attempts is displayed to the user. Randomness will be introduced into the part of the algorithm that selects from a number of valid locations for a particular word. This approach allows us to sample the solution space without searching exhaustively (which quickly becomes infeasible as puzzle size increases, regardless of other optimizations).

What did we learn from doing this?

The program we built demonstrates the technical feasibility of using Haskell to construct a crossword puzzle. We learned about Haskell IO, and used this to allow input in two different ways: by reading from a file and through an interactive interface. Although IO was needed for input, the construction of the actual crossword puzzle could happen in 'pure' functional code, and thus Haskell was well suited to the task.

Because the number of permutations of words on a board grow rapidly with the size of the word list and the size of the board, a brute force solution of trying all possible word combinations on a board quickly becomes prohibitively expensive. Instead an 'eager' approach in which words are added to the board once at a time, always maintaining a valid crossword state, is a promising approach to building a crossword. This eager approach, however, is not guaranteed to produce an optimal crossword (where an optimal crossword is one that places the most possible words on the board). A 'minimax' variant of this eager approach is possible, but will take a considerable amount of time to process since there are so many possible sequences of word placements in particular locations.

We instead opted for a 'randomized' approach, in which the placement of a word within the eager approach is always chosen randomly from the possible valid placements on the board. This means that repeated attempts to build a board can be made, which are highly likely to generate different valid crosswords. If a valid crossword is constructed that places all the given words, this algorithm exits immediately. When given ample board space, an optimal crossword will be generated quickly. Else, the best crossword generated so far will be tracked across all of the randomized eager builds and returned after a number of iterations set by the user. In practice, we found that this randomized approach worked well to identify optimal crosswords even in cases that have few (or even unique) optimal word arrangements.

Here are some more detailed thoughts on what we learned about optimizing algorithms for building a sparse crossword puzzle:

We set out to build 'sparse' crossword puzzles (in which there cannot be adjacent horizontal or vertical words), which require a different treatment than 'dense' crossword puzzles. Dense crossword puzzles are typically built by starting with a board with preset blocked spaces, a long list of words, and finding some subset of those words that fit into the given board. Sparse crossword puzzles, in contrast, can be built by taking a shorter list of words, and recursively adding new words to a blank board. The goal is to place all (or as many as possible) of the given words onto the given board.

Because it is possible that there is a unique order in which a given list of words can be placed, finding the correct order of words to place will take O(w^2) in the worst case, where w is the length of the list of words to be placed. To see this consider the list ("af", "bd", "cb", "fc"); if "af" is placed first, the full list must be scanned to find the next word that can be placed ("fc"), and then the list must be scanned again to find the next word to be placed, and so on.

For each word, we must then determine whether it fits on the board. This requires finding a letter on the board that is also a letter in the word, then checking both the board cells that the new word will occupy (to make sure there are no letter conflicts with existing words) and all adjacent cells (to ensure that the board stays sparse/that no new words are incidentally added to the board). This placement checking requires repeated looking in two directions: from characters to board cell coordinates (to get a list of all board cells containing a particular character), and from board cell coordinates to characters (to check the content of a particular cell). Our original 'naive' implementation maintained a list of placed letters for these lookups, which works but requires a large number of repeated scans of the list of placed letters.

To improve upon the naive implementation, we settled on using map from cell coordinates to characters instead of a list. This does not improve the big-O complexity of our solution, but represents a huge constant factor improvement as a large number of linear scans of the placed letters (proportional to the length of the word being placed) are replaced with logarithmic map lookups. With a large test case (200 words, n=30), this took our run time from ~80seconds to ~5 seconds. Big-O complexity does not improve because we still need linear time scans of the placed letters when searching for the list of cells containing a particular character.

We also explored a further algorithmic 'improvement' by maintaining two placed letter structures: a map from cell coordinates to characters and a map from characters to lists of coordinates. In the worst case (if most of the characters on the board are the same) big-O complexity would not improve, but given an even distribution of the characters 'a' to 'z' on the board this would reduce remaining linear time scans of placed letters by a factor of 26. In practice, we found that the speed improvement that this dual structure approach offered was minimal. This is due in part to the fact that letters on a crossword board are not generally an even distribution of characters 'a' through 'z', and also due in part to the added costs of maintaining two data structures that must be updated every time a new letter is placed on the board. In the end, the added code complexity and space requirements needed by this approach did not seem worthwhile.

Another improvement we considered and implemented was to replace randomization with something less computationally expensive. In our case we settled on rotating (rather than randomizing) a list, which generated similar variance in built crosswords but with a significant improvement in speed.