From UBC Wiki

Text Summarizer


Dan, Michael, Louise

What is the problem?

Although Prolog has historically been associated with NLP, and the logic programming paradigm offers many advantages for data processing and rule-based information extraction, it is not clear that Prolog is well suited to NLP using statistical methods. Specifically, we want to investigate how suitable Prolog is for text summarization.

We plan to build a tool that takes an English text as input and outputs a summary of the contents of the input text. Our tool will also output the main keywords of the text.

What is the something extra?

We used WordNet (a lexical database) to filter out words that were not very meaningful (such as adverbs and adjectives) when selecting keywords from the text. We also added functionality to detect plurals and replace plural keywords with their singular form. Furthermore, our scoring function gave low scores to words that are common in English (such as "the", "and", and "to"), and we normalized sentence scores so that long sentences did not always get the highest scores.
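The scoring ideas above can be sketched roughly as follows. This is an illustration in Python, not our Prolog code. The stop-word list, the plural rules, the choice of a zero score for common words, and normalization by sentence length are all assumptions made for the example, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list, for illustration only.
STOP_WORDS = {"the", "and", "to", "a", "of", "in", "is"}

def singularize(word):
    """Naive plural stripping (hypothetical rules; many English plurals are missed)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if (word.endswith("s") and len(word) > 3
            and not word.endswith(("ss", "us", "is"))):
        return word[:-1]
    return word

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and map plurals to singulars."""
    return [singularize(w) for w in re.findall(r"[a-z]+", text.lower())]

def word_scores(text):
    """Score each word by its frequency; stop words get 0.0 (an assumed 'low' score)."""
    counts = Counter(tokenize(text))
    return {w: (0.0 if w in STOP_WORDS else float(c)) for w, c in counts.items()}

def sentence_score(sentence, scores):
    """Average word score, so that long sentences are not favoured by length alone."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    return sum(scores.get(w, 0.0) for w in words) / len(words)
```

Dividing by the number of words is only one possible normalization. Any scheme that stops the total from growing with sentence length would serve the same purpose.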

What did we learn from doing this?

When running on large texts we hit the stack limit. Raising the limit avoids the error, but the program remains inefficient and takes a long time to run on large inputs. We also found Prolog difficult to use for some tasks that would be very simple in a language such as Python. For example, after calculating a score for each word, we wanted to store the scores so that we could access them in constant time when computing sentence scores. Because Prolog has no built-in arrays, this proved difficult. Despite this, the results we obtained are adequate, and our Prolog code is very concise and easy to extend. Given more time, we would make more use of WordNet, for example by using it when computing scores so that words that are very similar in meaning share a single score. Since there is a version of WordNet written in Prolog, using it is fairly simple.
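To illustrate the lookup problem mentioned above, here is a small Python sketch contrasting a scan over a list of word-score pairs (roughly what a plain list of pairs gives you) with a hash table lookup. The words, scores, and function names are made up for the example and are not taken from our program.

```python
# Hypothetical word scores, for illustration only.
score_pairs = [("prolog", 3.0), ("summary", 2.0), ("keyword", 1.0)]

def lookup_linear(word, pairs):
    """Scan a list of (word, score) pairs; the cost grows with the list length."""
    for w, s in pairs:
        if w == word:
            return s
    return 0.0  # unknown words score zero

# Build the table once; each later lookup takes constant time on average.
score_table = dict(score_pairs)

def lookup_hashed(word, table):
    """Look a word up in a dict, defaulting to zero for unknown words."""
    return table.get(word, 0.0)
```

Both functions return the same scores. The difference only matters when every word of every sentence has to be looked up in a large text.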

Links to code etc