Course:CPSC312-2023/Webscraper

From UBC Wiki

What is the problem?

There is an abundance of information available on the web that can at times be difficult to collect/access, especially on the UBC course website. To lighten the load on students during registration season, we will be creating a webscraper that scrapes data relevant to classes from the UBC website and converts it into a useful storage format.

Each class is stored with a list of mandatory components (i.e. lectures, labs, tutorials), and each component has its own list of viable sections. For each section, we store the section name, the day of the week it occurs, the term, and the start and end times.
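A representation along these lines can be sketched as follows (the term names here are illustrative, not necessarily the ones used in the project):

```prolog
% Hypothetical sketch of the data layout described above.
%
% class(Name, Components)       - a course and its mandatory components
% component(Type, Sections)     - Type is lecture, lab, or tutorial
% section(Name, Day, Term, Start, End)

example_class(
    class("CPSC 312",
          [ component(lecture,
                      [ section("101", mon, 1, 1100, 1230),
                        section("102", wed, 1, 1400, 1530) ]),
            component(tutorial,
                      [ section("T1A", fri, 1, 900, 1000) ]) ])).
```

Because these are just compound terms, no class definitions are needed; any predicate can destructure them directly via unification.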

What is the something extra?

We have also built an algorithm that creates a schedule for students out of the scraped data. Once the data is scraped, it is organized into a single list, and the algorithm parses through it to construct a conflict-free schedule the student can register with, including the course, section name, term, and duration of each class. Once the list of registrations is complete, it is written to a file for the user to refer to.
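One way such a scheduler could be sketched is as a backtracking search that picks, for each component, a section that does not clash with anything already chosen. This is an illustration of the idea, not the project's exact algorithm:

```prolog
% Hypothetical sketch of a backtracking scheduler. section/5 fields:
% name, day, term, start time, end time.

% Two sections clash if they share a term and day and their times overlap.
clashes(section(_, Day, Term, S1, E1), section(_, Day, Term, S2, E2)) :-
    S1 < E2,
    S2 < E1.

% pick_sections(+Components, -Chosen): choose one section per component
% such that no two chosen sections clash; backtracks on conflicts.
pick_sections([], []).
pick_sections([component(_, Sections)|Rest], [Sec|Chosen]) :-
    pick_sections(Rest, Chosen),
    member(Sec, Sections),
    \+ ( member(Other, Chosen), clashes(Sec, Other) ).
```

Querying `pick_sections/2` enumerates valid schedules on backtracking, which falls out of Prolog's search for free.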

What did we learn from doing this?

Prolog's natural pattern matching made algorithms like our scheduler much easier to write than in other languages. The web scraping and data saving were seamless: we could copy data from HTML pages using xpath() and place it directly into the data structures we created. We were able to build ad-hoc data structures for transferring information between predicates using pattern matching alone, where other languages would likely have required burdensome class definitions. This made moving data back and forth, and transcribing it into the forms our scheduling algorithm needed, quick and easy. In a strongly-typed language, or one without such powerful pattern matching, our web scraping would likely have been much more cumbersome.
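In SWI-Prolog this style of scraping looks roughly like the sketch below; the URL and XPath pattern are placeholders, not the project's actual ones:

```prolog
% Hypothetical sketch of HTML scraping with SWI-Prolog's xpath/3.
:- use_module(library(http/http_open)).
:- use_module(library(sgml)).
:- use_module(library(xpath)).

% Fetch a page and parse it into a DOM term.
page_dom(URL, DOM) :-
    setup_call_cleanup(
        http_open(URL, In, []),
        load_html(In, DOM, []),
        close(In)).

% Collect the text of every table cell in a DOM.
cell_texts(DOM, Texts) :-
    findall(Text, xpath(DOM, //td(text), Text), Texts).
```

The DOM comes back as ordinary `element/3` terms, so the same unification that drives the rest of the program works on scraped HTML too.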

Prolog's untyped nature can also be a hindrance: we had some problems with inconsistencies across our data definitions that took some time to resolve. Strong data contracts matter, because Prolog won't enforce type conformity the way other languages do, so it's on the programmer to keep multiple predicates and rules consistent with one another.
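A contrived example of the kind of mismatch Prolog silently accepts: nothing stops one part of the code from asserting a `section/4` term while another part matches `section/5`, and queries against the wrong shape simply fail rather than raising a type error:

```prolog
% Illustrative only: two facts with inconsistent section arities.
good_section(section("101", mon, 1, 1100, 1230)).
bad_section(section("102", tue, 1400, 1530)).   % term field forgotten

% A rule written against the five-argument shape.
start_time(section(_, _, _, Start, _), Start).
```

Here `start_time/2` succeeds on the first section but silently fails on the second, which is exactly the sort of quiet inconsistency we had to hunt down.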

We also learned the importance of detail-oriented testing in Prolog. PLUnit tests were essential for ensuring our code worked properly, given the relative paucity of useful debugging tools in Prolog. We wrote several such tests, and after some refactoring, we couldn't figure out why one wasn't passing. It turned out we had forgotten to update one of the test variables, trapping the test in an infinite loop of searching for a valid rule. Once again, a lesson in the importance of detailed thinking in Prolog, as the language does very minimal handholding.
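A minimal PLUnit suite in this style looks like the following (the predicate under test is an illustrative clash check, not the project's actual code):

```prolog
:- use_module(library(plunit)).

% Illustrative predicate under test: two sections clash if they share
% a term and day and their times overlap.
clashes(section(_, Day, Term, S1, E1), section(_, Day, Term, S2, E2)) :-
    S1 < E2,
    S2 < E1.

:- begin_tests(clashes).

test(overlap_same_day) :-
    clashes(section("101", mon, 1, 900, 1000),
            section("102", mon, 1, 930, 1030)).

test(no_overlap_different_day, [fail]) :-
    clashes(section("101", mon, 1, 900, 1000),
            section("102", tue, 1, 900, 1000)).

:- end_tests(clashes).
```

Running `run_tests.` at the top level executes the suite and reports any failures.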

Pattern matching is not infinitely powerful, either. We attempted to improve our algorithm further with a naive overlap checker that would remove all overlapping sections from the remaining components, but the pattern checking was so expensive that it was unviable. That code is still included in the project for completeness' sake, but we used an alternate algorithm in the end.
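The idea behind that naive filter can be sketched like this (illustrative code, not the version in the repository): given a chosen section, drop every section in the remaining components that clashes with it.

```prolog
:- use_module(library(apply)).

% Two sections clash if they share a term and day and their times overlap.
clashes(section(_, Day, Term, S1, E1), section(_, Day, Term, S2, E2)) :-
    S1 < E2,
    S2 < E1.

% Drop every section in a component that clashes with Chosen.
prune(Chosen, component(Type, Sections), component(Type, Kept)) :-
    exclude(clashes(Chosen), Sections, Kept).

% Apply the filter across all remaining components.
prune_all(Chosen, Components, Pruned) :-
    maplist(prune(Chosen), Components, Pruned).
```

Re-running this pruning pass after every choice is what made the approach too costly in practice, since each step rescans every remaining section.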

Links to code etc.

https://github.com/lamolivia/PrologWebscraper