NBA Analytics

From UBC Wiki

Augustine Kwong

Ben Daly-Grafstein


A rule-based model built over play-by-play statistics from the NBA 2018 season to predict the winner between two teams.

What is the problem?

The three most important stats that determine the winner of a given basketball game are each team's field goal percentage, total rebounds, and total turnovers. We were looking for a way to accurately predict the winners of NBA games by using the historical values of these statistics between the two playing teams.
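As one illustration of how such a comparison could work, the sketch below counts which team has the better historical value for each of the three statistics. The majority-vote rule, the `TeamStats` type, and its field names are assumptions made for illustration; this page does not specify the exact rule used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TeamStats:
    """Hypothetical head-to-head averages for one team (not from the project)."""
    name: str
    fg_pct: float      # field goal percentage, 0..1
    rebounds: float    # rebounds per game
    turnovers: float   # turnovers per game (lower is better)


def predict_winner(a: TeamStats, b: TeamStats) -> Optional[str]:
    """Return the team that wins more of the three stat comparisons.

    Returns None when the comparisons cancel out, which is the case
    where some tie-breaker would be needed.
    """
    score = 0  # positive favours a, negative favours b
    score += (a.fg_pct > b.fg_pct) - (a.fg_pct < b.fg_pct)
    score += (a.rebounds > b.rebounds) - (a.rebounds < b.rebounds)
    # Fewer turnovers is better, so this comparison is reversed.
    score += (a.turnovers < b.turnovers) - (a.turnovers > b.turnovers)
    if score > 0:
        return a.name
    if score < 0:
        return b.name
    return None
```

A `None` result marks an evenly matched pair under this assumed rule.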

What is something extra?

We wanted a way to differentiate between teams that had very similar historical field goal percentages, rebounds per game, and turnovers per game against one another. Often the team that wins between two evenly matched teams is the one that performs best at the end of the game, in so-called "clutch situations." We therefore defined a clutchness rating for each player based on the percentage of shots they made within the final minutes of games. We then gave each team an overall clutchness score, weighted so that a team with even one or two highly clutch players scores very highly. We used this score to differentiate between teams that were otherwise fairly evenly matched. Play-by-play data proved well suited to this.
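A minimal sketch of the clutchness idea follows. The five-minute window, the `Shot` record, and the sum-of-squares team aggregation are all assumptions chosen for illustration (squaring is simply one way to let one or two highly clutch players dominate the team score); this page does not give the exact definitions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

CLUTCH_WINDOW_SECONDS = 5 * 60  # assumed cutoff for the "final minutes"


@dataclass
class Shot:
    """Hypothetical shot event extracted from play-by-play data."""
    player: str
    seconds_left_in_game: int
    made: bool


def player_clutchness(shots: Iterable[Shot]) -> Dict[str, float]:
    """Fraction of late-game shots that each player made."""
    made = defaultdict(int)
    attempts = defaultdict(int)
    for shot in shots:
        if shot.seconds_left_in_game <= CLUTCH_WINDOW_SECONDS:
            attempts[shot.player] += 1
            made[shot.player] += int(shot.made)
    return {player: made[player] / attempts[player] for player in attempts}


def team_clutchness(ratings: Dict[str, float], roster: Iterable[str]) -> float:
    """Sum of squared ratings, so a few very clutch players outweigh
    many average ones (an assumed aggregation, not the project's)."""
    return sum(ratings.get(player, 0.0) ** 2 for player in roster)
```

Players with no late-game attempts are treated as having a rating of zero in the team score, which is another simplifying assumption.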


Using the parser provided by [[ Stallman's NBA RDF Parser]], we wrote a script to download all ESPN play-by-play data for the 2018 season games:


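FOLDER=data/retrieved_$(date +%s)
echo "Fetching play-by-play files into $FOLDER"
mkdir -p "$FOLDER/nba-2018-2019-season"

# The two ESPN URL prefixes are missing from this listing; supply them via
# these (hypothetically named) variables, each ending just before the game ID.
: "${PBP_URL:?set to the play-by-play page URL prefix}"
: "${GAMEINFO_URL:?set to the game-info page URL prefix}"

for ((i=401070213; i<=401070966; i++)); do
  curl "$PBP_URL$i" > "$FOLDER/nba-2018-2019-season/$i.html"
  curl "$GAMEINFO_URL$i" > "$FOLDER/nba-2018-2019-season/$i-gameinfo.html"
done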

Using the parser, we generated the Terse RDF Triple Language (TTL) knowledge base:

$ ./pbprdf nba-2018-2019-season/ nba-2018.ttl

This creates about 3 million triples. We designed our prediction algorithm around several metrics computed from the play-by-play data, using the RDF Prolog query library.

What did we learn?

The RDF Prolog query library proved to be a suitable system for querying and aggregating data in flexible ways. Since we were working with play-by-play data, which is the basic unit of NBA statistics, it was straightforward to query almost any stat we could imagine. One issue that arose was the length of time required to complete certain queries, which forced us to look for alternative ways of calculating the same statistic. For example, we originally calculated a team's turnovers per game against a specific opponent on a player-by-player basis, which took a significant amount of time. Eventually, we shortened the query time by ignoring the individual players to whom the turnovers were attributed and aggregating on a team-wide basis.
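The change in aggregation can be illustrated outside of Prolog. In the sketch below, the event tuples and both function names are hypothetical, and the events are assumed to be already restricted to games between the two teams. It only shows that counting turnovers at the team level gives the same per-game figure as summing per-player counts, while skipping the per-player grouping step. It does not reproduce the actual RDF queries or their timings.

```python
from collections import Counter
from typing import List, Tuple

# (game_id, team, player) for each turnover event; a hypothetical layout.
TurnoverEvent = Tuple[int, str, str]


def turnovers_per_game_by_player(
    events: List[TurnoverEvent], team: str, games: int
) -> float:
    """Player-by-player approach: count per player, then sum the counts."""
    per_player = Counter(player for _, t, player in events if t == team)
    return sum(per_player.values()) / games


def turnovers_per_game_team_wide(
    events: List[TurnoverEvent], team: str, games: int
) -> float:
    """Team-wide approach: ignore the player and count team events directly."""
    return sum(1 for _, t, _ in events if t == team) / games
```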

Prolog was also a very suitable language because its style matched the way in which we built up statistics. Just as Prolog uses simple true statements to build up to more and more complex ones, we built up complex statistics from simple ones. For instance, to calculate a team's rebounds per game versus a specific opponent, we began by finding events that were rebounds. We then looked among these for rebounds by players on the team in games against the specified opponent. Next, we used these player totals along with the number of games played between the two teams to come up with rebounds per game.
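The same layering can be sketched in Python, with each small function playing the role of a simple predicate that the next one builds on. The `Event` record and all function names are hypothetical stand-ins for the RDF data and Prolog rules, which are not shown on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """Hypothetical play-by-play event."""
    game_id: int
    kind: str      # e.g. "rebound", "shot", "turnover"
    team: str      # team of the player credited with the event
    opponent: str  # the other team in that game


def is_rebound(event: Event) -> bool:
    """Step 1: the event is a rebound."""
    return event.kind == "rebound"


def team_rebounds_vs(events: List[Event], team: str, opponent: str) -> List[Event]:
    """Step 2: rebounds by the team in games against the opponent."""
    return [
        e for e in events
        if is_rebound(e) and e.team == team and e.opponent == opponent
    ]


def games_between(events: List[Event], team: str, opponent: str) -> int:
    """Number of distinct games involving both teams."""
    return len({e.game_id for e in events if {e.team, e.opponent} == {team, opponent}})


def rebounds_per_game(events: List[Event], team: str, opponent: str) -> float:
    """Step 3: combine the rebound total with the game count."""
    games = games_between(events, team, opponent)
    if games == 0:
        return 0.0
    return len(team_rebounds_vs(events, team, opponent)) / games
```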

Link to Project