Data Access Instructions

The data folder contains our data scraped from BASKETBALL REFERENCE

  • game_stats_all.csv - The game results (scores) for all regular season games from season 2015-16 to 2018-19 and for regular season games already played in season 2019-20 and the upcoming schedule for the remaining games in season 2019-2020.
  • team_stats_all.csv - The team statistics of first 20 games from season 2015-16 to 2019-20.
  • pre_20_games.rds - the results for the first 20 games for each team during season 2019-20
  • post_62_games_pred.rds - predictions for the rest of season 2019-2020

You only need the first two csv files to run the rmarkdown code.

Motivation and Overview

The National Basketball Association (NBA) is a professional basketball league in North America that draws the attention of many basketball fans all over the world. During the regular season, each of the 30 teams (15 in the Eastern Conference and 15 in the Western Conference) plays 82 games. At the conclusion of the regular season, the top eight teams in each conference advance to the playoffs for a chance to compete for the championship title. Towards the end of the regular season, the race for the final few playoff spots is very intense, and usually it isn’t until the final game of the regular season is over when we know which teams have made the cut. As NBA fans ourselves, we are eager to find out which teams will make the playoffs this year (2019-20 NBA season), but rather than wait until next April, as data analysts, we thought it would be interesting if we could make such predictions now. We hope our analysis will lead to meaningful insights that will be of interest to NBA fans and NBA data analysts all over the world.

Project Objectives

The goal of this project is to build a statistical machine learning model that can predict the regular season winning percentages of the NBA teams in the Western Conference, and thus the playoff teams (the eight teams with the highest winning percentages).

We will undergo a comprehensive background study of the NBA to manually decide on what features to use. Exploratory data analysis will aid in our selection of features. We also plan to evaluate different models. A successful model will predict the 2019-2020 NBA playoff teams in the Western Conference. Ideally, the model could also be trained to predict the playoff teams in the Eastern Conference and finally round-by-round playoff predictions, though this is beyond the scope of our project.

Target Audience

The target audience we have in mind are NBA data analysts and NBA fans. We have prepared an analysis report and a project website that contain some of the more technical details of the analysis that may be of more interest to analysts, as well as a screencast that captures some of the major findings of the project, which NBA fans will hopefully enjoy learning about.

Data

Data Source

Most data were collected from the Basketball Reference website (https://www.basketball-reference.com/). The database contains player statistics including points per game, assists per game, rebounds per game, etc. of all players who have played in the NBA league since 1949-50 (the inaugural season of the NBA). The database also provides team statistics, including team 3-point percentages, winning percentages, etc. Player injury data were obtained from spotrac (https://www.spotrac.com/nba/injured-reserve/).

Data Scraping

We used rvest package to scrape the following data from Basketball Reference website:

  • team_stats: The team statistics of first 20 games from season 2015-16 to 2019-20. We would use these data to define the team performance during specific season in the feature engineering section.

  • game_stats: The game results (scores) for all regular season games from season 2015-16 to 2018-19 and for regular season games already played in season 2019-20 (as of December 11, 2019). We would use these data in the model building section.

team_stats = readr::read_csv("./data/team_stats_all.csv")
game_stats = readr::read_csv("./data/game_stats_all.csv")
head(team_stats)
## # A tibble: 6 x 50
##      Rk Tm    Season     G     W     L `W/L%`    MP TeamFG TeamFGA
##   <dbl> <chr> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>   <dbl>
## 1     1 MIL   2018-~    20    14     6   0.7   4825    894    1835
## 2     2 HOU   2019-~    20    13     7   0.65  4850    815    1823
## 3     3 MIL   2019-~    20    17     3   0.85  4825    878    1830
## 4     4 GSW   2016-~    20    17     3   0.85  4850    873    1738
## 5     5 WAS   2019-~    20     7    13   0.35  4800    898    1884
## 6     6 NOP   2018-~    20    10    10   0.5   4800    888    1858
## # ... with 40 more variables: `TeamFG%` <dbl>, Team2P <dbl>,
## #   Team2PA <dbl>, `Team2P%` <dbl>, Team3P <dbl>, Team3PA <dbl>,
## #   `Team3P%` <dbl>, TeamFT <dbl>, TeamFTA <dbl>, `TeamFT%` <dbl>,
## #   TeamPTS <dbl>, TeamORB <dbl>, TeamDRB <dbl>, TeamTRB <dbl>,
## #   TeamAST <dbl>, TeamSTL <dbl>, TeamBLK <dbl>, TeamTOV <dbl>,
## #   TeamPF <dbl>, OpponentFG <dbl>, OpponentFGA <dbl>,
## #   `OpponentFG%` <dbl>, Opponent2P <dbl>, Opponent2PA <dbl>,
## #   `Opponent2P%` <dbl>, Opponent3P <dbl>, Opponent3PA <dbl>,
## #   `Opponent3P%` <dbl>, OpponentFT <dbl>, OpponentFTA <dbl>,
## #   `OpponentFT%` <dbl>, OpponentPTS <dbl>, OpponentORB <dbl>,
## #   OpponentDRB <dbl>, OpponentTRB <dbl>, OpponentAST <dbl>,
## #   OpponentSTL <dbl>, OpponentBLK <dbl>, OpponentTOV <dbl>,
## #   OpponentPF <dbl>
head(game_stats)
## # A tibble: 6 x 11
##   Date  Start..ET. Visitor.Neutral   PTS Home.Neutral PTS.1 Var.7 Var.8
##   <chr> <chr>      <chr>           <dbl> <chr>        <dbl> <chr> <chr>
## 1 Tue,~ 8:00p      Detroit Pistons   106 Atlanta Haw~    94 Box ~ <NA> 
## 2 Tue,~ 8:00p      Cleveland Cava~    95 Chicago Bul~    97 Box ~ <NA> 
## 3 Tue,~ 10:30p     New Orleans Pe~    95 Golden Stat~   111 Box ~ <NA> 
## 4 Wed,~ 7:00p      Washington Wiz~    88 Orlando Mag~    87 Box ~ <NA> 
## 5 Wed,~ 7:30p      Indiana Pacers     99 Toronto Rap~   106 Box ~ <NA> 
## 6 Wed,~ 7:30p      Charlotte Horn~    94 Miami Heat     104 Box ~ <NA> 
## # ... with 3 more variables: Attend. <dbl>, Notes <chr>, Season <chr>

We do not include the code for web scraping in this rmarkdown file. Instead, please refer to the data_scraping.R file in the source folder for more details. Here we just show the first few rows of the data.

Data Wrangling

Data wrangling is performed throughout the whole data analysis process covering data tidying, exploratory data analysis, feature engineering, and model building. Here we demonstrate some necessary data wrangling procedures performed to prepare the dataset for further analysis.

We list all 30 teams in the league (both full franchise name and abbreviation), and add an indicator to differentiate the 15 teams in the Western Conference and the 15 in the Eastern Conference.

library(rvest)
library(tidyverse)

west_ind = data.frame(
  Tm = c("CHI","LAL","POR","WAS","CHO","MIL","BRK","DET","HOU","NOP","IND","MEM","DEN","LAC","ATL","NYK","MIA","ORL","DAL","PHO",
         "SAC","UTA","MIN","SAS","CLE","TOR","BOS","PHI","GSW","OKC"),
  western = c(0,1,1,0,0,0,0,0,1,1,0,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,1),
  TeamName = c("Chicago Bulls","Los Angeles Lakers","Portland Trail Blazers", "Washington Wizards","Charlotte Hornets",
               "Milwaukee Bucks","Brooklyn Nets" ,"Detroit Pistons","Houston Rockets" ,"New Orleans Pelicans",
               "Indiana Pacers","Memphis Grizzlies","Denver Nuggets","Los Angeles Clippers","Atlanta Hawks",
               "New York Knicks" ,"Miami Heat","Orlando Magic","Dallas Mavericks","Phoenix Suns" ,
               "Sacramento Kings","Utah Jazz", "Minnesota Timberwolves","San Antonio Spurs","Cleveland Cavaliers",
               "Toronto Raptors" ,"Boston Celtics","Philadelphia 76ers" ,"Golden State Warriors","Oklahoma City Thunder")
)

get_ws_data(year): We obtain the win shares stats (estimate of the number of wins contributed by a player) of the players ranked in the top 100 of the league in terms of win shares (WS) in a particular season.

get_ws_data = function(year){
  ws_url = paste0("https://www.basketball-reference.com/play-index/psl_finder.cgi?request=1&match=single&type=totals&per_minute_base=36&per_poss_base=100&season_start=1&season_end=-1&lg_id=NBA&age_min=0&age_max=99&is_playoffs=N&height_min=0&height_max=99&year_min=",
                  year,"&year_max=",year,"&birth_country_is=Y&as_comp=gt&as_val=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=ws")
  ws_stats = ws_url %>% read_html %>%
    html_table() %>%
    .[[1]] 
  ws_stats = ws_stats[,c(2,5,7)]
  colnames(ws_stats)= ws_stats[1,]  
  ws_stats = ws_stats[-c(1,which(ws_stats$Player=='Player')),]
  ws_stats$WS = as.numeric(ws_stats$WS )
  ws_stats_summary = ws_stats %>%
    group_by(Tm,Player) %>%
    summarise(WS_max = max(WS), WS_num = n()) 
  
  return(ws_stats_summary)
}

get_injury_data(year): With basketball being a high impact sport and the NBA being one of the most competitive sports leagues, it is not uncommon for NBA players to experience injuries, some of which will cause players to miss games. We obtain player injury data for players who have missed games due to injury from: https://www.spotrac.com/nba/injured-reserve/ (which sorts the injured players by cash earned while injured). The website provides a separate data table for each season, and we created the combine_injury_data() function to combine injury data from all relevant seasons.

get_injury_data = function(year){
  injury_url = paste0("https://www.spotrac.com/nba/injured-reserve/",
                      year-1)
  injury_stats = injury_url %>% 
    read_html %>%
    html_table() %>%
    .[[4]]
  
  colnames(injury_stats)[1] = "Player"
  
  injury_stats = injury_stats %>%
    separate(Player,sep="\n",into = c("Player","other")) %>%
    mutate(Games = as.numeric(Games),
           Salary = as.numeric(gsub('[$,]', '', injury_stats$`Cash EarnedWhile Injured`))) %>%
    select(Player,Team,Games,Salary) %>%
    filter(Games!='NA') %>%
    mutate(Season = paste0(year-1,'-',substr(year,3,4)))
  return(injury_stats)
}

injury_2020 = get_injury_data(2016)

combine_injury_data = function(){
  year_list = c(2017:2020)
  injury_stats = NULL
  for(i in 1:length(year_list)){
    injury_stats_tmp = get_injury_data(year_list[i])
    injury_stats = rbind(injury_stats, injury_stats_tmp)
  }
  return(injury_stats)
}


injury_stats = combine_injury_data()

Besides player injuries, team rosters of available players to play change also due to NBA trades and transactions where players change teams. We take both player injuries and team transfers into account, and define the adjusted highest-win-shares-on-team metric. This metric equals the win shares of the player with the highest win shares on each team (only considering players with win shares in the top 100 of the league).

For example, in the 2018-19 season, the Los Angeles Clippers (LAC) had four players ranked in the top 100 in terms of WS: Montrezl Harrell (WS 8.7), Danilo Gallinari (WS 8.2), Lou Williams (WS 5.1), and Patrick Beverley (WS 4.8). For the current 2019-20 season, Danilo Gallinari is now playing for the Oklahoma City Thunder (OKC) while LAC has welcomed new team members Paul George (2018-19 WS 11.9) and Kawhi Leonard (2018-19 WS 9.5), so the adjusted highest-win-shares on LAC for the 2019-20 season is 11.9 (WS_max=11.9). We also define the number-of-high-win-shares-on-team metric, which is the number of players on a team who have a win shares that is ranked in the top 100 in the league (from the previous season). Taking LAC for example, the team has five players with a top 100 win shares (WS_num=5).

In addition, players with injuries during a season are left out when defining the team win shares metrics. For example, during the 2018-19 season, Golden State Warriors (GSW) members Stephen Curry, Kevon Looney and Klay Thompson all ranked in the top 100 in terms of WS. However, during the current 2019-20 season, all three players are out due to injury, and are thus not considered to be active players and thus do not contribute to GSW’s team win shares metrics.

get_ws_injury_trans_data = function(year){
  Winshare_last_year = get_ws_data(year-1)
  Winshare_this_year = get_ws_data(year)
  injury_year = get_injury_data(year)
  
  winshare_injury_ind = Winshare_this_year %>%
    left_join(injury_year, by = c("Player","Tm"="Team")) %>%
    mutate(injury_ind = ifelse(is.na(Season),0,1))
  winshare_this_year_injury_trans = winshare_injury_ind %>%
    right_join(Winshare_last_year, by = c("Player")) %>%
    select(Tm.x,Player,Tm.y,WS_max.y,WS_num.y) %>% 
    ungroup() %>% na.omit() %>%
    group_by(Tm.x) %>%
    summarise(WS_max = max(WS_max.y), WS_num = n(),Player = Player[which.max(WS_max.y)]) 
  colnames(winshare_this_year_injury_trans)[1]='Tm'
  return(winshare_this_year_injury_trans)
  
  
}
winshare_feature = get_ws_injury_trans_data(2020)
head(winshare_feature)
## # A tibble: 6 x 4
##   Tm    WS_max WS_num Player          
##   <chr>  <dbl>  <int> <chr>           
## 1 BOS      7.4      4 Kemba Walker    
## 2 BRK      9.1      5 Kyrie Irving    
## 3 CHI      4.7      1 Tomáš Satoranský
## 4 DAL      7.5      2 Dwight Powell   
## 5 DEN     11.8      3 Nikola Jokic    
## 6 DET     10        1 Andre Drummond

get_team_year_feature(year): We create a function that organizes team data in the first 20 games for all teams in a particular season. This process includes extracting from a larger database relevant variables which include number of wins, number of losses, winning percentages, total points scored by team, total points scored by opponents, total team rebounds, total team assists, total team steals, total team blocks, total team turnovers, and total team personal fouls. We also append conference ranking of each team at the end of the first 20 games as well as the WS_max and WS_num metrics we computed above.

get_team_year_feature = function(year){
  
  #team_stats = get_team_stats() # get the team stats for all seasons
  team_stats_year = team_stats %>%
    filter(grepl(as.character(year-1),Season)) # filter the specific year 
  team_stats_year[,-c(2,3)] = as.numeric(as.matrix(team_stats_year[,-c(2,3)])) # turn characters into numeric
  
  
  ## select Win, Loss, Win-Loss per, Team points, Opponent points
  team_stats_year_feature = team_stats_year %>%
    select(Tm, W, L, `W/L%`, TeamPTS, OpponentPTS,TeamTRB,TeamAST,TeamSTL,TeamBLK, TeamTOV,TeamPF) %>%
    inner_join(west_ind, by = 'Tm')
  
  ## assign the team rank
  team_stats_year_rank = team_stats_year_feature %>%
    group_by(western) %>%
    mutate(ConfRank = order(order(W, TeamPTS, decreasing = T)))
  
  ## add the winshare feature from previous season
  Winshare  = get_ws_injury_trans_data(year)
  #print(head(Winshare))
  team_stats_year_rank = team_stats_year_rank %>% left_join(Winshare, by = 'Tm') %>%
    select(-Player)
  team_stats_year_rank$WS_num[is.na(team_stats_year_rank$WS_num)] = 0
  team_stats_year_rank$WS_max[is.na(team_stats_year_rank$WS_max)] = 0
  
  return(team_stats_year_rank)
  
}

team_stats_18to19 = get_team_year_feature(2019)
head(team_stats_18to19)
## # A tibble: 6 x 17
## # Groups:   western [2]
##   Tm        W     L `W/L%` TeamPTS OpponentPTS TeamTRB TeamAST TeamSTL
##   <chr> <dbl> <dbl>  <dbl>   <dbl>       <dbl>   <dbl>   <dbl>   <dbl>
## 1 MIL      14     6   0.7     2420        2199    1012     545     138
## 2 NOP      10    10   0.5     2375        2345     966     540     140
## 3 LAC      14     6   0.7     2335        2236     945     443     117
## 4 TOR      16     4   0.8     2327        2149     945     512     174
## 5 GSW      13     7   0.65    2318        2208     907     564     147
## 6 CHO      10    10   0.5     2304        2215     887     495     144
## # ... with 8 more variables: TeamBLK <dbl>, TeamTOV <dbl>, TeamPF <dbl>,
## #   western <dbl>, TeamName <fct>, ConfRank <int>, WS_max <dbl>,
## #   WS_num <dbl>

get_game_year_feature(year): We also create a function that organizes data of all regular season games in a particular season. This process includes extracting from a larger database relevant variables which include date of game, name of home team and visiting team, and points scored by home team and visiting team. We also append an indicator variable indicating whether the game was won by the home team.

get_game_year_feature = function(year){
  game_stats_year = game_stats %>% 
    filter(grepl(as.character(year-1),Season))
  ## game stats 
  ## the results of every game in the season 
  game_stats_feature = game_stats_year[,c(1,5,6,3,4)] 
  game_stats_feature[,c(3,5)] = as.numeric(as.matrix(game_stats_feature[,c(3,5)]))
  colnames(game_stats_feature) = c("Date","Home","HomePTS","Away","AwayPTS")
  game_stats_feature = game_stats_feature %>%
    mutate(HomeWin = ifelse(HomePTS > AwayPTS,1,0))
  return(game_stats_feature)
}
game_stats_18to19 = get_game_year_feature(2019)
head(game_stats_18to19)
## # A tibble: 6 x 6
##   Date           Home              HomePTS Away             AwayPTS HomeWin
##   <chr>          <chr>               <dbl> <chr>              <dbl>   <dbl>
## 1 Tue, Oct 16, ~ Boston Celtics        105 Philadelphia 76~      87       1
## 2 Tue, Oct 16, ~ Golden State War~     108 Oklahoma City T~     100       1
## 3 Wed, Oct 17, ~ Charlotte Hornets     112 Milwaukee Bucks      113       0
## 4 Wed, Oct 17, ~ Detroit Pistons       103 Brooklyn Nets        100       1
## 5 Wed, Oct 17, ~ Indiana Pacers        111 Memphis Grizzli~      83       1
## 6 Wed, Oct 17, ~ Orlando Magic         104 Miami Heat           101       1

Exploratory Data Analysis

Team-level

We compared team points per game and opponent points per game of all teams by conference, and explored how these stats were related to conference standings. In general, a team that is strong in offense will have a relatively high team score per game, and a team that is strong in defense will have a relatively low opponent score per game. From the plot below, we can see that some highly ranked teams are strong both offensively and defensively (e.g. TOR), while some teams make up for their weaker offense through strong defense (e.g. IND). Thus, team points or opponent points alone may not be sufficient to quantify the strength of a team, but the use of both stats or the net score (team points - opponent points) may be more appropriate.

team_def_off_plot = function(year){
  team_stats_year = get_team_year_feature(year)
  conference_names <- c(`1` = "Western Conference", `0` = "Eastern Conference")
  p = team_stats_year %>% 
    ggplot(aes(x = TeamPTS/20, y = OpponentPTS/20))+
    geom_point(aes(size = -ConfRank, col = factor(western)))+
    geom_text(aes(label=Tm),vjust = 0, nudge_y = 0.5)+
    geom_abline(slope = 1,intercept = 0,linetype="dashed", color = "red",size = 1)+
    facet_grid(~western, labeller = as_labeller(conference_names))+
    labs(title = paste0('Points Scored in First 20 games (Season ',year-1,'-',substr(year,3,4),")"))+
    theme(plot.title = element_text(hjust = 0.5))+
    coord_fixed()+
    theme(legend.position = "bottom") +
    guides(colour=FALSE) +
    scale_size("Conference standing after 20 games",breaks=c(-15,-10,-5),labels=c(15,10,5))+
    xlab("Team Points (average per game)")+
    ylab("Opponent Points (average per game)")
  print(p)
}
#png("./plots/team_def_off_plot.png",width = 960,height = 480)
team_def_off_plot(2019)

#dev.off()

We next explored how the standings and W-L records of teams after 20 regular games compared to the final standings and records at the end of the regular season. The plots of conference standings show that many teams that performed well in the beginning of the regular season continued to perform well throughout the season (e.g. GSW & DEN), while some teams that were not faring well earlier on finished the season strong (e.g. HOU & UTA), and still other teams that were in the top 8 after 20 regular season games did not make it to the playoffs in the end (e.g. MEM & LAL). In the team records plot, the line shows the expected number of regular season wins if teams were to maintain the same first-20-game-winning-percentage throughout the regular season. The bandwidth represents the 95% confidence interval for predictions of regular season wins. We can see that while some teams had a consistent performance throughout the season (e.g. MIN & OKC), other teams had better (e.g. HOU) or worse (e.g. MEM) performance in the remaining 62 games. Thus, we learn that even if a team starts the season poorly, there is still an oppurtunity for the team to improve and ultimately reach the playoffs.

WL_record_cal = function(game_schedule,WL_record){
  for(i in 1:nrow(game_schedule)){
    if (game_schedule$HomeWin[i]==1){
      WL_record$W[which(WL_record$TeamName==game_schedule$Home[i])] = 
        WL_record$W[which(WL_record$TeamName==game_schedule$Home[i])] + 1
      WL_record$L[which(WL_record$TeamName==game_schedule$Away[i])] = 
        WL_record$L[which(WL_record$TeamName==game_schedule$Away[i])] + 1
    }
    
    if (game_schedule$HomeWin[i]==0){
      WL_record$L[which(WL_record$TeamName==game_schedule$Home[i])] = 
        WL_record$L[which(WL_record$TeamName==game_schedule$Home[i])] + 1
      WL_record$W[which(WL_record$TeamName==game_schedule$Away[i])] = 
        WL_record$W[which(WL_record$TeamName==game_schedule$Away[i])] + 1
    }
  }
  return(WL_record)
}

team_rank_plot = function(year, Nfirst = 300){
  team_stats_year = get_team_year_feature(year)
  game_stats_year = get_game_year_feature(year)
  colnames(team_stats_year)[c(2,3)] = c("W.pre","L.pre")
  
  WL_record = cbind(west_ind,W = 0, L=0)
  WL_record  = WL_record_cal(game_stats_year,WL_record)
  colnames(WL_record)[c(4,5)] = c("W.all","L.all")
  WL_record = WL_record %>%
    group_by(western) %>%
    mutate(Season_Rank = order(order(W.all, decreasing = T))) %>%
    inner_join(team_stats_year[,c(1,2,3,14,15)],by=c("Tm","TeamName"))
  WL_record = WL_record %>%
    mutate(Playoff = Season_Rank <= 8)
  conference_names <- c(`1` = "Western Conference", `0` = "Eastern Conference")
  p = WL_record %>%
    ggplot(aes(x = ConfRank, y = Season_Rank))+
    geom_point(aes(col = factor(western), size = factor(Playoff)))+
    geom_text(aes(label=Tm),vjust = 0, nudge_y = 0.5)+
    geom_abline(slope = 1,intercept = 0,linetype="dashed", color = "red",size = 1)+
    geom_vline(xintercept=8, linetype="dotted", color = "blue",size = 1)+
    facet_grid(~western, labeller = as_labeller(conference_names))+
    labs(title = paste0('Conference Standings (Season ',year-1,'-',substr(year,3,4),")"))+
    theme(plot.title = element_text(hjust = 0.5))+
    coord_fixed()+
    theme(legend.position = "bottom")+
    xlab("Conference standing after 20 regular season games")+
    ylab("Final conference standing after all 82 regular season games")+
    guides(colour=FALSE, size=guide_legend(title="Playoff team"))+    
    scale_size_discrete(range = c(1.5, 4))
  print(p)
  
  p2 = WL_record %>%
    ggplot(aes(x = W.pre, y =W.all))+
    geom_point(aes(col = factor(western),size = factor(Playoff)))+
    geom_smooth(method = "lm")+ 
    geom_text(aes(label=Tm),vjust = 0, nudge_y = 0.5)+
    facet_grid(~western, labeller = as_labeller(conference_names))+
    labs(title = paste0('Team Records (Season ',year-1,'-',substr(year,3,4),")"))+
    theme(plot.title = element_text(hjust = 0.5))+
    theme(legend.position = "bottom")+
    xlab("Number of wins in first 20 regular season games")+
    ylab("Number of wins in all 82 regular season games")+
    guides(colour=FALSE, size=guide_legend(title="Playoff team"))+
    scale_size_discrete(range = c(1.5, 4))
  print(p2)
}
#png("./plots/team_rank_plot.png",width = 960,height = 480)
#team_rank_plot(2019)
#dev.off()

#png("./plots/team_win_plot.png",width = 960,height = 480)
team_rank_plot(2019)

#dev.off()

In recent years, teams are shooting more 3-pointers than ever before. According to records from Basketball Reference, the total number of 3-pointers attempted by the league has increased each of the past seven seasons from 36,395 in the 2011-12 season to 78,742 in the 2018-19 season (https://www.basketball-reference.com/leagues/NBA_stats_totals.html). We were wondering whether teams that were more skilled in shooting 3-pointers or were better at defending 3-point attempts from opponents were able to perform better in the league. From the exploratory plots below, we cannot see obvious patterns of how volume or accuracy of 3-point shot attempts by teams are related to team winning percentages. The same applies to 2-point shot attempts.

WL_record = cbind(west_ind,W = 0, L=0)
WL_record  = rbind(WL_record_cal(get_game_year_feature(2019),WL_record),WL_record_cal(get_game_year_feature(2018),WL_record),
                   WL_record_cal(get_game_year_feature(2017),WL_record),WL_record_cal(get_game_year_feature(2016),WL_record))
colnames(WL_record)[c(4,5)] = c("W.all","L.all")
WL_record = WL_record %>%
  group_by(western) %>%
  mutate(Season_Rank = order(order(W.all, decreasing = T)))
WL_record = WL_record %>%
  mutate(Playoff = Season_Rank <= 8)
conference_names <- c(`1` = "Western Conference", `0` = "Eastern Conference")

# plots to see if 2point, 3point are good indicators of W/L%
team_stats_new <- merge(team_stats,west_ind,by="Tm")
p_3_1plot <-  team_stats_new %>%
  ggplot(aes(x = `Team3PA`, y=`Opponent3PA`))+
  geom_point(aes(col = team_stats_new$`W/L%`))+
  scale_color_gradient(low="red", high="green",name = "W/L%")+
  facet_grid(~western, labeller = as_labeller(conference_names))+
  labs(title = "Team 3-point Shot Attempts and W/L% (Since Season 2015-16)")+
  xlab("Total 3-point shot attempts by team")+
  ylab("Total 3-point shot attempts by opponent")
p_3_1plot

p_3_2plot <-  team_stats_new %>%
  ggplot(aes(x = `Team3P%`, y=`Opponent3P%`))+
  geom_point(aes(col = team_stats_new$`W/L%`))+
  scale_color_gradient(low="red", high="green",name = "W/L%")+
  facet_grid(~western, labeller = as_labeller(conference_names))+
  labs(title = "Team 3-point Accuracy and W/L% (Since Season 2015-16)")+
  xlab("Percentage of attempted 3-point shots made by team")+
  ylab("Percentage of attempted 3`-point shots made by opponent")
p_3_2plot

team_stats_new <- merge(team_stats,west_ind,by="Tm")
p_2_1plot <-  team_stats_new %>%
  ggplot(aes(x = `Team2PA`, y=`Opponent2PA`))+
  geom_point(aes(col = team_stats_new$`W/L%`))+
  scale_color_gradient(low="red", high="green",name = "W/L%")+
  facet_grid(~western, labeller = as_labeller(conference_names))+
  labs(title = "Team 2-point Shot Attempts and W/L% (Since Season 2015-16)")+
  xlab("Total 2-point shot attempts by team")+
  ylab("Total 2-point shot attempts by opponent")
p_2_1plot

p_2_2plot <-  team_stats_new %>%
  ggplot(aes(x = `Team2P%`, y=`Opponent2P%`))+
  geom_point(aes(col = team_stats_new$`W/L%`))+
  scale_color_gradient(low="red", high="green",name = "W/L%")+
  facet_grid(~western, labeller = as_labeller(conference_names))+
  labs(title = "Team 2-point Accuracy and W/L% (Since Season 2015-16)")+
  xlab("Percentage of attempted 2-point shots made by team")+
  ylab("Percentage of attempted 2-point shots made by opponent")
p_2_2plot

Specific effect

Home-court advantage

The NBA regular season consists of each team playing 82 games, 41 at home and 41 away. We examined home and away records of teams to better understand home-court advantage in the NBA. We can see that most teams have a better record playing at home with a few exceptions (i.e. CHI & MIA). The home-court advantage is more prominent for some teams (e.g. DEN & SAS) while mattering less for other teams (e.g. GSW & OKC).

home_away_plot = function(year){
  game_stats_year = get_game_year_feature(year)
  standing_url = paste0("https://www.basketball-reference.com/leagues/NBA_",
                        year,"_standings.html")
  
  standing_year = standing_url %>% 
    read_html %>%
    html_nodes(xpath = '//comment()') %>%    # select comment nodes
    html_text() %>%    # extract comment text
    paste(collapse = '') %>%    # collapse to a single string
    read_html() %>%    # reparse to HTML
    html_table() %>%
    .[[1]]  
  
  colnames(standing_year) = paste0(colnames(standing_year),standing_year[1,])
  standing_year = standing_year[-1,]
  
  standing_table = standing_year %>%
    select(Team,Overall,PlaceHome,PlaceRoad,ConferenceE,ConferenceW) %>%
    separate(PlaceHome,into = c("HomeW","HomeL"),sep = "-") %>%
    separate(PlaceRoad,into = c("AwayW","AwayL"),sep = "-") %>%
    separate(ConferenceE,into = c("EastW","EastL"),sep = "-") %>%
    separate(ConferenceW,into = c("WestW","WestL"),sep = "-") %>%
    separate(Overall,into = c("Wtot","Ltot"),sep = "-") %>%
    inner_join(west_ind,by=c("Team" = "TeamName"))
  
  standing_table_gather = standing_table %>%
    gather(stats,win,-c("Team","Wtot","Ltot","Tm","western"))
  standing_table_gather[,c(2,3,7)] = as.numeric(as.matrix(standing_table_gather[,c(2,3,7)]))
  conference_names <- c(`1` = "Western Conference", `0` = "Eastern Conference")
  p = standing_table_gather %>% 
    filter(stats %in% c("HomeW","AwayW")) %>%
    ggplot(aes(x = reorder(Tm,Wtot), y = win,fill= stats))+
    geom_bar(stat = "identity",position = "dodge")+
    facet_grid(~western,scales="free_x", labeller = as_labeller(conference_names))+
    labs(title = paste0('Team Records (Season ',year-1,'-',substr(year,3,4),")"))+
    theme(plot.title = element_text(hjust = 0.5))+
    theme(legend.position = "bottom")+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
    xlab("Teams")+
    ylab("Number of wins")+
    scale_fill_manual(values = c("#F8766D", "#00BFC4"), labels = c("Away wins", "Home wins"))+
    theme(legend.title = element_blank())
  print(p)
  
  team_factor_levels= standing_table_gather %>% 
    filter(western==1 & stats %in% c('EastW') | western==0 & stats %in% c('WestW')) %>%
    arrange(win) %>%
    select(Tm) %>% as.matrix() %>% as.character()
  
  p2 = standing_table_gather %>%
    mutate(Tm = factor(Tm,levels = team_factor_levels)) %>%
    filter(western==1 & stats %in% c('EastW','EastL') | western==0 & stats %in% c('WestW','WestL')) %>%
    ggplot(aes(x =Tm, y = win,fill= stats))+
    geom_bar(stat = "identity")+
    labs(title = paste0('Team Records (Season ',year-1,'-',substr(year,3,4),")"))+
    theme(plot.title = element_text(hjust = 0.5))+
    theme(legend.position = "bottom")+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
    xlab("Teams")+
    ylab("Number of games")+
    scale_fill_manual(values = c("#F8766D","#7CAE00","#00BFC4","#C77CFF"), labels = c("Loss against East","Win against East","Loss against West","Win against West"))+
    theme(legend.title = element_blank())
  
  return(p2)
}

p2 <- home_away_plot(2019)

#png("./plots/home_away.png",width = 960,height = 480)
#p2
#dev.off()

Western-Eastern effect

The regular season schedule allows for each team to play the 15 teams from the opposing conference twice a year, so a total of 30 games are played against opposing conference teams. We often hear that the Western Conference is better than the Eastern Conference, but we wondered how true this is. We examined records of matches where teams from opposing conferences faced each other. From the plot, we can see that more teams from the Western Conference (teams with bars colored green and pink) lie to the right end of the plot, which indicates that these teams have high winning percentages against teams from the opposing conference. So the Western Conference does seem to be the stronger conference.

#png("./plots/west_east.png",width = 960,height = 480)
#p2
#dev.off()
p2

Winning or losing streak effect

We next considered whether a team’s performance in a game is dependent on the results of the team’s previous games. Could a winning streak boost a team’s confidence? Or does a winning streak mean a team has been playing extremely hard on the court and will eventually suffer from greater fatigue? To investigate the impact of previous match results, we looked at match records of teams in the game following win or loss streaks of different lengths. In the 2018-19 season, the longest winning streak was 11 games by the Golden State Warriors and the longest losing streak was 18 games by the New York Knicks. For each team, we applied the Chi-Square test of independence to assess the relationship between streaks and results of the following match. Test results indicated that a statistically significant relationship existed only for the Philadelphia 76ers. A boxplot is used to visualize how the overall performances of all teams in the league vary after different streaks (only winning or losing streaks of up to 3 are shown since other longer streaks are much less common).

library(data.table)
library(rowr)

team_multiple_game_data = function(year){
  teams <- unique(game_stats$Visitor.Neutral)
  game_after_streak <- list()
  game_consecutive <- list()
  
  for (team in teams) {
    ### wrangling team winning streak data
    team_game_stats <- game_stats %>% filter(grepl(as.character(year-1), Season))
    team_game_stats <- team_game_stats %>% filter(Visitor.Neutral == team | Home.Neutral == team)
    team_game_stats <- team_game_stats[complete.cases(team_game_stats[, "PTS"]),]
    # compute winning streaks
    team_game_stats$isWin <- 0
    team_game_stats[team_game_stats$Home.Neutral == team & (team_game_stats$PTS.1 > team_game_stats$PTS), "isWin"] = 1
    team_game_stats[team_game_stats$Visitor.Neutral == team & (team_game_stats$PTS.1 < team_game_stats$PTS), "isWin"] = 1
    team_game_stats$preGameWinStreak <- rowid(rleid(team_game_stats$isWin)) * team_game_stats$isWin
    na.omit(setDT(team_game_stats)[, preGameWinStreak:=c(NA, preGameWinStreak[-.N])])
    # compute losing streaks
    team_game_stats$isLose <- 1 - team_game_stats$isWin
    team_game_stats$preGameLoseStreak <- rowid(rleid(team_game_stats$isLose)) * team_game_stats$isLose
    na.omit(setDT(team_game_stats)[, preGameLoseStreak:=c(NA, preGameLoseStreak[-.N])])
    # overall streak
    team_game_stats$streak <- paste0("W", team_game_stats$preGameWinStreak)
    team_game_stats[team_game_stats$streak == "W0", "streak"] <- paste0("L", team_game_stats$preGameLoseStreak[team_game_stats$streak == "W0"])
    # game results after streaks
    game_after_streak[[team]] <- as.data.frame.matrix(table(team_game_stats$streak, team_game_stats$isWin))
    
    ### wrangling back-to-back game data
    team_game_stats$GameDate <- as.Date(substring(team_game_stats$Date, 6), "%b %d, %Y")
    team_game_stats$consecutive <- c(NA,diff(as.Date(team_game_stats$GameDate)) == 1)
    # game results after back-to-back game
    game_consecutive[[team]] <- as.data.frame.matrix(table(team_game_stats$consecutive, team_game_stats$isWin))
  }
  return(list("game_after_streak"=game_after_streak, "game_consecutive"=game_consecutive))
  
}

team_multiple_game_data_2019 <- team_multiple_game_data(2019)


# chi-square test for independence
tests<-lapply(team_multiple_game_data_2019$game_after_streak, chisq.test)
names(which(sapply(tests, function(f) f$p.value) < 0.05))
## [1] "Philadelphia 76ers"
# merge winning streak data of all teams
all_team_streaks <- bind_rows(team_multiple_game_data_2019$game_after_streak, .id = "team")
all_team_streaks$Streak <- unlist(lapply(team_multiple_game_data_2019$game_after_streak, rownames))
colnames(all_team_streaks)[2:3] <- c("lose","win")
all_team_streaks$winProb <- all_team_streaks$win/(all_team_streaks$win + all_team_streaks$lose)

# plot 
all_team_streaks_plot <- all_team_streaks  %>% filter(Streak %in% c("L3","L2","L1","W1","W2","W3"))
all_team_streaks_plot$Streak <- factor(all_team_streaks_plot$Streak,levels = c("L3","L2","L1","W1","W2","W3"))
ggplot(data=all_team_streaks_plot, aes(x=Streak, y=winProb)) +
  geom_boxplot(show.legend = F) +
  labs(title = "Team Performances After Winning or Losing Streak (Season 2018-19)") +
  ylab("Proportion of game following streak won")

Fatigue effect

The NBA regular season is packed into a tight schedule of 82 games for each team, so sometimes a team finds itself having to play on consecutive (back-to-back) nights, leaving less time for rest and recovery in between games, which may take a toll on the players. We were wondering whether as a result, teams perform worse if facing a back-to-back schedule. Using game records from the 2018-19 season, we see that the average proportion of games won by all teams in the league when the game is a back-to-back game is somewhat lower compared to the winning proportion if the teams had extra days to rest. Chi-Square tests of independence indicated that a statistically significant relationship between whether a game was played back-to-back and whether the game was won or lost existed only for the Washington Wizards.

# merge and plot back-to-back game data of all teams
all_team_consecutive <- bind_rows(team_multiple_game_data_2019$game_consecutive, .id = "team")
all_team_consecutive$consecutive <- unlist(lapply(team_multiple_game_data_2019$game_consecutive, rownames))
colnames(all_team_consecutive)[2:3] <- c("lose","win")
all_team_consecutive$winProb <- all_team_consecutive$win/(all_team_consecutive$win + all_team_consecutive$lose)
ggplot(data=all_team_consecutive, aes(x=consecutive, y=winProb)) +
  geom_boxplot(show.legend = F) +
  labs(title = "Team Performances In Back-to-back Games (Season 2018-19)") +
  ylab("Proportion of game won") +
  xlab("Back-to-back game or not")

# chi-square test for independence
tests<-lapply(team_multiple_game_data_2019$game_consecutive, chisq.test)
names(which(sapply(tests, function(f) f$p.value) < 0.05))
## [1] "Washington Wizards"

Player-level

Basketball is a team sport, but the value of individual talent is not to be overlooked. Win shares is a metric that is designed to estimate an individual player’s contribution to the team in terms of wins. We identify the win shares of the player with the most win shares on each team (only considering players with win shares in the top 100 of the league). We can see that teams that have a dominant player tend to perform better. Note that some teams are missing from the plot because none of the players on the team is ranked in the top 100.

team_ws_plot = function(year){
  team_stats_year = get_team_year_feature(year)
  game_stats_year = get_game_year_feature(year)
  colnames(team_stats_year)[c(2,3)] = c("W.pre","L.pre")
  
  WL_record = cbind(west_ind,W = 0, L=0)
  WL_record  = WL_record_cal(game_stats_year,WL_record)
  colnames(WL_record)[c(4,5)] = c("W.all","L.all")
  WL_record = WL_record %>%
    group_by(western) %>%
    mutate(Season_Rank = order(order(W.all, decreasing = T))) %>%
    inner_join(team_stats_year[,c(1,2,3,14,15)],by=c("Tm","TeamName"))
  
  
  Winshare_feature = get_ws_injury_trans_data(year)
  WL_record_ws = WL_record %>%
    inner_join(Winshare_feature, by='Tm')
  conference_names <- c(`1` = "Western Conference", `0` = "Eastern Conference")
  
  WL_record_ws <- WL_record_ws %>% 
    group_by(western) %>%
    ungroup() %>%
    arrange(western, desc(Season_Rank)) %>%
    mutate(order = row_number())
  p = WL_record_ws %>%
    ggplot(aes(WS_max, order))+
    geom_segment(aes(x = 0, 
                     y = order, 
                     xend = WS_max, 
                     yend = order), 
                 color = "black")+
    geom_point(aes(col = factor(western)),size = 7)+
    geom_text(aes(label=WS_max),color="white",size=2.5)+
    facet_wrap(~ factor(western), scales="free", labeller = as_labeller(conference_names))+
    labs(title = paste0('Win Shares on Teams (Season ',year-1,'-',substr(year,3,4),")"))+
    theme(plot.title = element_text(hjust = 0.5))+
    theme(legend.position = "bottom")+
    guides(colour=FALSE)+
    xlab("Win shares of player with most win shares on each team")+
    ylab("End of regular season conference standings")+
    scale_y_continuous(
      breaks = WL_record_ws$order,
      labels = paste0(WL_record_ws$Season_Rank,". ",WL_record_ws$Tm)
    ) 
  print(p)
  
}
#png("./plots/team_ws_plot.png",width = 960,height = 480)
team_ws_plot(2019)

#dev.off()

We next examine how players’ win shares vary over NBA seasons. We identify the 20 players in the league who have had among the highest single-season win shares during the 2014-15 to 2018-19 season timeframe, and examine how the win shares of these players varied over those five seasons. We see that the win shares of these top players are fairly consistent over seasons. Never does one of these player fail to place in the top 100 in the league in terms of win shares, and over half of these players have consistently made the top 20 list.

get_ws_raw_data = function(year){
  ws_url = paste0("https://www.basketball-reference.com/play-index/psl_finder.cgi?request=1&match=single&type=totals&per_minute_base=36&per_poss_base=100&season_start=1&season_end=-1&lg_id=NBA&age_min=0&age_max=99&is_playoffs=N&height_min=0&height_max=99&year_min=",
                  year,"&year_max=",year,"&birth_country_is=Y&as_comp=gt&as_val=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=ws")
  ws_stats = ws_url %>% read_html %>%
    html_table() %>%
    .[[1]] 
  ws_stats = ws_stats[,c(2,3,5,7)]
  colnames(ws_stats)= ws_stats[1,]  
  ws_stats = ws_stats[-c(1,which(ws_stats$Player=='Player')),]
  ws_stats$WS = as.numeric(ws_stats$WS )
  return(ws_stats)
}


player_ws_plot = function(){
  year_list = c(2015:2019)
  ws_stats  = NULL
  for(i in 1: length(year_list)){
    ws_tmp = get_ws_raw_data(year_list[i])
    ws_stats = rbind(ws_stats, ws_tmp)
  }
  ws_stats = ws_stats %>% arrange(-WS)
  player_rank_list = unique(ws_stats$Player)
  
  ws_stats_rank = ws_stats %>%
    group_by(Season) %>%
    mutate(WSRank = rank(-WS,ties.method = "first"))
  
  p = ws_stats_rank %>%
    filter(Player %in% player_rank_list[1:20]) %>%
    ggplot(aes(x = Player, y = WSRank,fill = Player))+
    geom_boxplot()+
    labs(title = "Consistency of Player Win Shares Over 2014-15 to 2018-19 Season")+
    theme(plot.title = element_text(hjust = 0.5))+
    theme(legend.position = "none")+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))+
    ylab("Win shares ranking in league")
  print(p)
}
#png("./plots/player_ws_plot.png",width = 960,height = 480)
player_ws_plot()

#dev.off()

Feature Engineering

The performance of a team could be captured by both team-level and player-level statistics.
The team-level features included in our current model are:

  • Number of wins (W) and losses (L), W/L percentage, conference rank
  • Team total/average points, opponent total/average points
  • Other secondary statistics: total rebounds, assists, blocks, turnovers, personal fouls

The player-level features that are incorporated in the team performance metrics include:

  • Number of league top 100 win share players on the team
  • Highest win share value on the team

PageRank Algorithm

library(igraph)
library(tcltk)
library(dplyr)
#dat<-read.csv("./data/game_stats.csv",stringsAsFactors = F)
dat <- game_stats %>%
  filter(Season=='2018-19')



dat$raw_points<-abs(dat$PTS.1-dat$PTS)
dat$win<-ifelse(dat$PTS.1-dat$PTS>0,dat$Home.Neutral,dat$Visitor.Neutral)
dat$lose<-ifelse(dat$PTS.1-dat$PTS>0,dat$Visitor.Neutral,dat$Home.Neutral)
dat_igraph<-data.frame(A=dat$lose,B=dat$win,weight=dat$raw_points)
dat_igraph<-dat_igraph%>%group_by(A,B)%>%summarise(weight=mean(weight))
dat_graph<-graph.data.frame(d=dat_igraph,directed = TRUE)
is_weighted(dat_graph)
# plot.igraph(dat_graph)
#tkplot(dat_graph)
adj_matrix<-get.adjacency(dat_graph,attr="weight")
rank <- sort(page.rank(dat_graph,directed=T)$vector,decreasing = T)

##first twenty
dat_igraph_20<-data.frame(A=dat$lose[1:300],B=dat$win[1:300],weight=dat$raw_points[1:300])
dat_igraph_20<-dat_igraph_20%>%group_by(A,B)%>%summarise(weight=mean(weight))
dat_graph_20<-graph.data.frame(d=dat_igraph_20,directed = TRUE)
is_weighted(dat_graph_20)
#tkplot(dat_graph)
adj_matrix<-get.adjacency(dat_graph_20,attr="weight")
rank <- sort(page.rank(dat_graph_20,directed=T)$vector,decreasing = T)

#plot
west_ind_sort<-west_ind[match(get.vertex.attribute(dat_graph_20,"name"), west_ind$TeamName),]
dat_graph_20<-set_vertex_attr(dat_graph_20,"western",value=west_ind_sort$western)
dat_graph_20<-set_vertex_attr(dat_graph_20,"Tm",value=as.character(west_ind_sort$Tm))
colrs<-c("gray50","tomato")
V(dat_graph_20)$color <- colrs[as.factor(V(dat_graph_20)$western)]
deg <- 2*degree(dat_graph_20, mode="in")
plot.igraph(dat_graph_20,
            edge.arrow.size=0.3,
            vertex.color=colrs[as.factor(V(dat_graph_20)$western)],
            vertex.label=V(dat_graph_20)$Tm,
            vertex.size=deg,
            edge.width=4*E(dat_graph_20)$weight/max(E(dat_graph_20)$weight))

Data Analysis

Statistical Model: Generalized Bradley-Terry Logistic Regression Model

We used a generalized Bradley-Terry logistic regression model as the basis to generate winning probabilities of teams. We chose the logit model over more complex machine learning models for the following reasons:

  • The logit model (regression coefficients) is easier to interpret than machine learning models.
  • The logit model can achieve similar performance compared to tree-based ML models.
  • The training of a logit model is fast.
  • Use of the logit model does not require parameter tuning or cross-validation.

Model Formulation

\[ logit(Pr(i \succ j \ \text{in game} \ k )) = \lambda_i - \lambda_j + \delta \cdot I(i \in adv)+ \gamma^{\top} z_{ij} \]

where \(\lambda_i\) and \(\lambda_j\) are the team performance vectors defined based on feature engineering. \(I(I \in adv)\) is the indicator for home-court advantage, and \(z_{ij}\) are some potential effects not included in the team performance, such as western-eastern effect.

Evaluation Metric

Cross-Entropy Loss: \[ \mathcal{L} = - \sum_{i=1}^N y_i log(\hat{p_i})+(1-y_i) log(1-\hat{p_i}) \]

Model Checking

Since there is overdispersion in the data, we use a quasibinomial model instead with dispersion parameter \(\approx\) 1.

Model Training and Testing

We used data from the first 20 games of each team (total of 300 games combined) of the 2015-16 to 2017-18 seasons as the training set and the first 20 games of each team of the 2018-19 season as the test set. A generalized Bradley-Terry quasibinomial model was fit using the data. The model takes two teams into account at a time. Features selected (refer to Feature Engineering section above) are stored as vectors \(\lambda_i\) and \(\lambda_j\) for each team. The Bradley-Terry model is parametrized by the differences in terms of PTS (points), TRB (total rebounds), AST (assists), STL (steals), BLK (blocks), TOV (turnovers) and PF (personal fouls) between each of the two teams, plus an indicator of home-court advantage, and other trivial effects such as western-eastern effect. Cross-entropy loss was used to validate the model. The model would generate the probability of winning and losing for each of the two teams. If a team has a probability of winning that is greater than 0.5, we would regard the team as the winner, and vice versa.

Model Interpretation

In the following example, the home-court indicator (which is the intercept term in model summary) is statistically significant(p-value<1e-8). Team total personal fouls (p-value = 0.0251) and number of high winshare players (p-value = 0.028) are two other factors that contribute to winning. To interpret how the number of high winshare players could affect a game, if we increase one high winshare player in a team and hold other covariates unchanged, the odds of winning would be \(exp(0.2314) = 1.26\) (95% CI: (1.03, 1.55)) times of the original odds.

Prediction Results

For illustration purposes, if we use the first 300 games of the 2016-17 season as the training set and the first 300 games of the 2017-18 season as the test set (each team having played approximately 20 games), then we get a prediction accuracy of 0.678 ([0.647,0.708]) with a sensitivity of 0.52 and specificity of 0.79.

library(caret)

get_train_test_data = function(year, firstN){
  team_stats_year = get_team_year_feature(year)
  game_stats_year = get_game_year_feature(year)
  
  train_set = game_stats_year[-c(1:firstN),]
  feature_idx = -c(1,3,4,14)
  team_per_diff = data.frame(team_stats_year[match(train_set$Home,team_stats_year$TeamName),feature_idx]
                             - team_stats_year[match(train_set$Away,team_stats_year$TeamName),feature_idx])
  train_set_teamper = data.frame(train_set,
                                 team_per_diff = I(as.matrix(team_per_diff)))
  
  return(list(data_logit = train_set_teamper, data_ML = data.frame(cbind(train_set, team_per_diff))))
}



build_logit_model_year = function(train_year = 2018, test_year = 2019, firstN = 300){
  
  train_set = get_train_test_data(train_year,firstN)$data_logit
  test_set = get_train_test_data(test_year,firstN)$data_logit
  
  print(colnames(train_set))
  
  ## logit model
  logit_model = glm(HomeWin~ team_per_diff ,family=quasibinomial(link='logit'),
                    data = train_set)
  print(summary(logit_model)) # model summary
  
  ## training accuracy
  
  print(confusionMatrix(factor(ifelse(logit_model$fitted.values>0.5,1,0)),
                        factor(train_set$HomeWin)))
  
  ## testing accuracy
  pred_win_prob = predict(logit_model,newdata = test_set, type = "response")
  pred_game = ifelse(pred_win_prob>0.5,1,0)
  
  test_set_pred = cbind(test_set,pred_win_prob,pred_game)
  print(confusionMatrix(factor(test_set_pred$pred_game),factor(test_set_pred$HomeWin)))
  return(test_set_pred)
}

test_set_pred = build_logit_model_year(2017,2018)
## [1] "Date"          "Home"          "HomePTS"       "Away"         
## [5] "AwayPTS"       "HomeWin"       "team_per_diff"
## 
## Call:
## glm(formula = HomeWin ~ team_per_diff, family = quasibinomial(link = "logit"), 
##     data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1058  -1.1285   0.6468   0.9682   1.9771  
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.4174283  0.0720854   5.791 9.62e-09 ***
## team_per_diffW           -0.0117562  0.0824854  -0.143   0.8867    
## team_per_diffTeamPTS     -0.0008002  0.0017669  -0.453   0.6507    
## team_per_diffOpponentPTS -0.0022329  0.0017428  -1.281   0.2005    
## team_per_diffTeamTRB     -0.0014532  0.0018226  -0.797   0.4255    
## team_per_diffTeamAST      0.0022276  0.0024742   0.900   0.3682    
## team_per_diffTeamSTL      0.0004728  0.0040842   0.116   0.9079    
## team_per_diffTeamBLK     -0.0012661  0.0052851  -0.240   0.8107    
## team_per_diffTeamTOV      0.0034292  0.0025849   1.327   0.1850    
## team_per_diffTeamPF       0.0046855  0.0020886   2.243   0.0251 *  
## team_per_diffwestern      0.1196188  0.1345914   0.889   0.3744    
## team_per_diffConfRank    -0.0167029  0.0563909  -0.296   0.7671    
## team_per_diffWS_max       0.0435573  0.0326032   1.336   0.1819    
## team_per_diffWS_num       0.2314029  0.1051208   2.201   0.0280 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 1.006972)
## 
##     Null deviance: 1258.0  on 929  degrees of freedom
## Residual deviance: 1139.6  on 916  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 178 110
##          1 202 440
##                                           
##                Accuracy : 0.6645          
##                  95% CI : (0.6331, 0.6948)
##     No Information Rate : 0.5914          
##     P-Value [Acc > NIR] : 2.668e-06       
##                                           
##                   Kappa : 0.2789          
##                                           
##  Mcnemar's Test P-Value : 2.579e-07       
##                                           
##             Sensitivity : 0.4684          
##             Specificity : 0.8000          
##          Pos Pred Value : 0.6181          
##          Neg Pred Value : 0.6854          
##              Prevalence : 0.4086          
##          Detection Rate : 0.1914          
##    Detection Prevalence : 0.3097          
##       Balanced Accuracy : 0.6342          
##                                           
##        'Positive' Class : 0               
##                                           
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 201 111
##          1 188 430
##                                           
##                Accuracy : 0.6785          
##                  95% CI : (0.6474, 0.7084)
##     No Information Rate : 0.5817          
##     P-Value [Acc > NIR] : 8.014e-10       
##                                           
##                   Kappa : 0.3204          
##                                           
##  Mcnemar's Test P-Value : 1.107e-05       
##                                           
##             Sensitivity : 0.5167          
##             Specificity : 0.7948          
##          Pos Pred Value : 0.6442          
##          Neg Pred Value : 0.6958          
##              Prevalence : 0.4183          
##          Detection Rate : 0.2161          
##    Detection Prevalence : 0.3355          
##       Balanced Accuracy : 0.6558          
##                                           
##        'Positive' Class : 0               
## 
head(test_set_pred)
##                Date               Home HomePTS                  Away
## 1 Tue, Nov 28, 2017   Sacramento Kings      87       Milwaukee Bucks
## 2 Tue, Nov 28, 2017          Utah Jazz     106        Denver Nuggets
## 3 Wed, Nov 29, 2017    Detroit Pistons     131          Phoenix Suns
## 4 Wed, Nov 29, 2017      Orlando Magic     121 Oklahoma City Thunder
## 5 Wed, Nov 29, 2017 Philadelphia 76ers     118    Washington Wizards
## 6 Wed, Nov 29, 2017    New York Knicks     115            Miami Heat
##   AwayPTS HomeWin team_per_diff.W team_per_diff.TeamPTS
## 1     112       0            -5.0                -156.0
## 2      77       1            -2.0                -100.0
## 3     107       1             7.0                 -24.0
## 4     108       1             0.0                  99.0
## 5     113       1             1.0                  33.0
## 6      86       1             0.0                  66.0
##   team_per_diff.OpponentPTS team_per_diff.TeamTRB team_per_diff.TeamAST
## 1                      25.0                  14.0                 -49.0
## 2                    -121.0                -101.0                 -44.0
## 3                    -267.0                 -86.0                  77.0
## 4                     225.0                  -4.0                  63.0
## 5                      85.0                 131.0                  70.0
## 6                      47.0                  28.0                  50.0
##   team_per_diff.TeamSTL team_per_diff.TeamBLK team_per_diff.TeamTOV
## 1                  -5.0                 -28.0                  15.0
## 2                  31.0                  24.0                 -23.0
## 3                  19.0                 -41.0                 -40.0
## 4                 -56.0                   0.0                  16.0
## 5                  16.0                  14.0                  57.0
## 6                  -9.0                  -3.0                   0.0
##   team_per_diff.TeamPF team_per_diff.western team_per_diff.ConfRank
## 1                -61.0                   1.0                    6.0
## 2                 64.0                   0.0                    1.0
## 3                -87.0                  -1.0                  -10.0
## 4                -21.0                  -1.0                    0.0
## 5                 60.0                   0.0                   -2.0
## 6                 -5.0                   0.0                   -1.0
##   team_per_diff.WS_max team_per_diff.WS_num pred_win_prob pred_game
## 1                -12.4                 -1.0     0.3687277         0
## 2                  4.6                  1.0     0.8115673         1
## 3                  6.7                  1.0     0.7914600         1
## 4                -13.1                 -3.0     0.1865677         0
## 5                 -4.6                 -2.0     0.4979266         0
## 6                 -3.7                 -2.0     0.4253650         0

Simulation Framework

We then proceeded with running simulations to predict the 2019-20 Western Conference playoff teams. We created the functions WL_record_cal( ) to calculate the number of wins and losses for each team given the game schedule, and game_simulation( ) to simulate the process 100 times. To realize this, we first generate 300 team encounters randomly as a game schedule. We simulate each game using rbinom with the probability parameters equal to our predicted winning probabilities obtained from the generalized BT model. Given the numbers of wins and losses in the simulation, we would then rank the teams and determine the top 8 teams. Such a strategy would allow us to obtain for each team average, median, and confidence intervals of regular season wins as well as playoff probabilities.

Using the 2018-19 season as an illustration, our simulations result in HOU winning on average 62.4 of the 82 regular season games, and making it to the playoffs in all simulations, so in our mind the chance of HOU making the playoffs is 100%. On the other hand, we predict other teams (i.e. SAC and PHO) to have a 0% chance of making the playoffs, which means we are certain that these teams have no hopes of making the playoffs.

## Remaining 62 games
set.seed(2019)
B = 100
game_simulation = function(pred_result, B, year, Nfirst = 300){
  game_stats_year = get_game_year_feature(year)
  pretrain_set = game_stats_year[1:Nfirst,]
  train_set = game_stats_year[-c(1:Nfirst),]
  WL_record = cbind(west_ind,W = 0, L=0)
  
  WL_record_pre  = WL_record_cal(pretrain_set,WL_record)
  if (year!=2020){
    WL_record_post = WL_record_cal(train_set,WL_record)
    WL_record_all  = WL_record_cal(train_set,WL_record_pre)
    WL_record_west = WL_record_all %>%
      filter(western==1) %>%
      arrange(-W)
  }
  if (year==2020){
    WL_record_west = WL_record_pre
  }
  
  
  WL_simu_rep = NULL
  for(k in 1:B){
    simu_game = rep(0,nrow(pred_result))
    
    for(i in 1:nrow(pred_result)){
      simu_game[i] = rbinom(1,1, pred_result$pred_win_prob[i])
    }
    simu_result = pred_result %>%
      select(Home,Away,HomeWin) %>%
      mutate(HomeWin = simu_game)
    
    # WL_record = cbind(west_ind,W = 0, L= 0)
    WL_simu = WL_record_cal(simu_result,WL_record_pre)
    WL_simu_west = WL_simu %>%
      filter(western==1) %>%
      mutate(TeamRank = rank(-W,ties.method = "random")) %>%
      mutate(playoff = ifelse(TeamRank<=8,1,0))
    WL_simu_rep = rbind(WL_simu_rep, WL_simu_west)
  }
  ans = list(simu = WL_simu_rep, result = WL_record_west)
  return(ans)
}
simu_result = game_simulation(test_set_pred, B, year = 2018)
train_set_simu = simu_result$simu
train_set_summary = train_set_simu %>%
  group_by(Tm) %>%
  summarise(W_avg = mean(W),
            L_avg = 82 - W_avg,
            W_med = median(W),
            L_med = 82 - W_med,
            playoff_prob = sum(playoff)/B) %>%
  arrange(-W_avg)
print(train_set_summary)
## # A tibble: 15 x 6
##    Tm    W_avg L_avg W_med L_med playoff_prob
##    <fct> <dbl> <dbl> <dbl> <dbl>        <dbl>
##  1 HOU    62.4  19.6  62    20          1    
##  2 GSW    58.2  23.8  59    23          1    
##  3 UTA    51.7  30.3  52    30          0.99 
##  4 OKC    50.6  31.4  50    32          0.99 
##  5 NOP    45.2  36.8  45    37          0.84 
##  6 SAS    44.8  37.2  45    37          0.78 
##  7 MIN    44.6  37.4  45    37          0.7  
##  8 POR    43.6  38.4  44    38          0.7  
##  9 LAC    42.3  39.7  42    40          0.580
## 10 DEN    39.4  42.6  40    42          0.33 
## 11 LAL    33.9  48.1  34    48          0.05 
## 12 MEM    33.8  48.2  33.5  48.5        0.02 
## 13 DAL    33.0  49.0  32.5  49.5        0.02 
## 14 SAC    27.8  54.2  28    54          0    
## 15 PHO    22.0  60.0  22    60          0
WL_record_west = simu_result$result
print(WL_record_west)
##     Tm western               TeamName  W  L
## 1  HOU       1        Houston Rockets 65 17
## 2  GSW       1  Golden State Warriors 58 24
## 3  POR       1 Portland Trail Blazers 49 33
## 4  NOP       1   New Orleans Pelicans 48 34
## 5  UTA       1              Utah Jazz 48 34
## 6  OKC       1  Oklahoma City Thunder 48 34
## 7  MIN       1 Minnesota Timberwolves 47 35
## 8  SAS       1      San Antonio Spurs 47 35
## 9  DEN       1         Denver Nuggets 46 36
## 10 LAC       1   Los Angeles Clippers 42 40
## 11 LAL       1     Los Angeles Lakers 35 47
## 12 SAC       1       Sacramento Kings 27 55
## 13 DAL       1       Dallas Mavericks 24 58
## 14 MEM       1      Memphis Grizzlies 22 60
## 15 PHO       1           Phoenix Suns 21 61

Extension: Integrative Prediction Using Multiple Seasons Data

Year-Effect Exploration

As an extension, we also considered the year effect, or in other words the effect of clustering of different years’ data. Intuitively, this is important because if data from the 2019-20 season resemble data from one of the past seasons but not another, then the data more dissimilar from that of the 2019-20 season would be less representative.

To explore whether clustering effects exist, we created the function team_pc_plot(), which first uses principal component analysis (PCA) to reduce our higher dimensional data with a larger set of variables to a smaller set of just two principal components (PCs) that still manages to retain most of the information in the original data. The new coordinate system and transformed data defined in terms of the two PCs are shown. Concentration ellipses are then used to encircle data of all teams from the same season. From the graph, we can see that data for separate years are closely packed together and there is no clear pattern of a year-effect.

We also visualized the distance between every pair of seasons with a season distance matrix. We can see from the matrix that the closer together two years are, the smaller the in-between distance is, which is in line with our intuition.

library(factoextra)
combine_season_data = function(){
  year_list = c(2016:2020)
  team_stats = NULL
  for(i in 1:length(year_list)){
    year = year_list[i]
    team_stats_year = get_team_year_feature(year) 
    team_stats_year = team_stats_year %>%
      mutate(Season = paste0(year-1,'-',substr(year,3,4)),
             year = year)
    team_stats = rbind(team_stats,team_stats_year)
  }
  return(team_stats)
}
team_stats_all = combine_season_data()

## team pc plot
team_pc_plot = function(){
  #team_stats_all = combine_season_data()
  team_stats_X = team_stats_all %>%
    select(W,L,`W/L%`,TeamPTS, OpponentPTS,western,ConfRank,WS_max,WS_num) 
  team_stats_pc = prcomp(team_stats_X,scale. = T)$x[,1:2] # select first two PCs
  
  rownames(team_stats_X) = paste(team_stats_all$Tm,team_stats_all$Season)
  
  
  team_stats_pc = data.frame(team_stats_pc,team_stats_all[,c('Tm', 'TeamName','Season','western','year')])
  
  league_summary = team_stats_pc %>% 
    group_by(Season,year) %>%
    summarise(PC1_med = median(PC1),
              PC2_med = median(PC2)) %>%
    as.data.frame()
  rownames(league_summary) = league_summary$Season
  
  
  
  
  p1 = # team_stats_pc %>%
    ggplot(data = team_stats_pc, aes(x= PC1, y= PC2,col = factor(year)))+
    geom_point()+
    geom_text(aes(label=paste(Tm,Season)),size = 2, vjust = 0, nudge_y = 0.1)+
    labs(title = 'PC plots for NBA teams from 2015-16 to 2019-20')+
    theme(plot.title = element_text(hjust = 0.5))+
    theme(legend.position = "bottom")+
    geom_point(aes(x=PC1_med,y=PC2_med),data = league_summary, size = 5, shape = 7)
  # print(p1)
  
  p1.pca = fviz_pca_ind( prcomp(team_stats_X,scale. = T),
                         #label = "none", # hide individual labels
                         habillage = team_stats_all$year, # color by groups
                         addEllipses = TRUE, # Concentration ellipses
                         labelsize = 3)+ 
    labs(title = "PC plot of NBA teams from 2015-16 to 2019-20")+
    theme(plot.title = element_text(hjust = 0.5))
  print(p1.pca)
  
  
  year_dist = dist(league_summary[,-1],diag = T,upper = T)
  p3 = fviz_dist(year_dist, 
                 order=F,gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))+
    labs(title = "Season Distance Matrix")+
    theme(plot.title = element_text(hjust = 0.5))
  
  print(p3)
  return(year_dist)
} 
year_dist = team_pc_plot()

We created another function team_cluster_plot() that uses K-means clustering to group teams from multiple seasons into 5 clusters. The choice of 5 was based on the fact that we have data from 5 NBA seasons. We see that the algorithm results in clusters where observations from the same season are not necessarily grouped together in the same cluster. Instead, it seems that teams are clustered together based on how strong they are. For example, teams in the olive cluster are among the most competitive teams (e.g. GSW 2015-16), while teams in the blue cluster are among the least competitive teams (e.g. PHI 2015-16). Thus, again, it seems that season/year patterns are not apparent.

## kmeans clustering plot
library(factoextra)
team_cluster_plot = function(){
  team_stats_X = team_stats_all %>%
    select(W,L,`W/L%`,TeamPTS, OpponentPTS,western,ConfRank,WS_max,WS_num)  %>% 
    as.data.frame()
  rownames(team_stats_X) = paste(team_stats_all$Tm,team_stats_all$Season)
  k5 = kmeans(scale(team_stats_X), centers = 5, nstart = 15)
  p = fviz_cluster(k5, data = team_stats_X,
                   labelsize = 5)+
    labs(title = "cluster plot of NBA teams from 2015-16 to 2019-20")+
    theme(plot.title = element_text(hjust = 0.5))
  print(p)
}
team_cluster_plot()

Integrative Prediction

Despite the year-effect not being highly evident in our data, to make our model more precise and flexible, we still would like to compare the difference between a model with and without the year-effect taken into account. Our original model presented above involves the training of a single model using data across multiple seasons. To consider a model which incorporates the year effect, we created the function weighted_game_prediction( ), which trains a separate model for each year, and then makes weighted predictions, with the weight proportional to the inverse of the distance between a year and its following year.

Since we do not observe year-effect in our data, the prediction accuracy of the extended model does not improve compared to the original model. However, this function is still of great use as it provides users with an additional option and users can always choose the model that better suits their data.

weighted_game_prediction = function(pred_result,year_dist){
  year = (ncol(pred_result))/2
  print(year)
  year_dist_mat = as.matrix(year_dist)
  year_weight = (1/year_dist_mat[year+1,1:year])/sum(1/year_dist_mat[year+1,1:year])
  print(year_weight)
  
  for(i in 1:nrow(pred_result)){
    pred_result$pred_win_weighted_prob[i] = sum(pred_result[i,seq(1,2*year,by = 2)] * year_weight)
    pred_result$pred_game_vote[i] = ifelse(pred_result$pred_win_weighted_prob[i]>=0.5,1,0)
  }
  
  return(pred_result)
}

win_prob_prediction = function(train_year = c(2016,2017,2018), test_year = 2019, firstN = 300,
                               year_effect = 0){
  ## ignore year effect
  if (year_effect == 0){
    train_set = NULL
    for(i in 1:length(train_year)){
      train_set_tmp = get_train_test_data(train_year[i],firstN)$data_logit
      train_set = rbind(train_set, train_set_tmp)
    }
    test_set = get_train_test_data(test_year,firstN)$data_logit
    
    ## logit model
    logit_model = glm(HomeWin~ team_per_diff,family=binomial(link='logit'),
                      data = train_set)
    print(summary(logit_model)) # model summary
    
    ## training accuracy
    print("training accuracy")
    print(confusionMatrix(factor(ifelse(logit_model$fitted.values>0.5,1,0)),
                          factor(train_set$HomeWin)))
    
    ## testing accuracy
    pred_win_prob = predict(logit_model,newdata = test_set, type = "response")
    pred_game = ifelse(pred_win_prob>0.5,1,0)
    
    test_set_pred = cbind(test_set,pred_win_prob,pred_game)
    if (test_year!=2020){
      print("testing accuracy")
      print(confusionMatrix(factor(test_set_pred$pred_game),factor(test_set_pred$HomeWin)))
      
    }
    return(test_set_pred)
  }
  
  # consider year effect
  if (year_effect == 1){
    test_set = get_train_test_data(test_year,firstN)$data_logit
    test_set_pred = test_set
    
    for(i in 1:length(train_year)){
      train_set= get_train_test_data(train_year[i],firstN)$data_logit
      
      logit_model = glm(HomeWin~ team_per_diff,family=binomial(link='logit'),
                        data = train_set)
      
      pred_win_prob = predict(logit_model,newdata = test_set, type = "response")
      pred_game = ifelse(pred_win_prob>0.5,1,0)
      test_set_pred = data.frame(test_set_pred,pred_win_prob,pred_game)
    }
    
    a = weighted_game_prediction(test_set_pred[,-c(1:7)],year_dist = year_dist)
    ans = cbind(test_set_pred[,c(1:7)],a)
    if (test_year!=2020){
      print(confusionMatrix(factor(a$pred_game_vote),factor(test_set_pred$HomeWin)))
    }
    return(ans)
  }
}

win_prob_pred20_all = win_prob_prediction(train_year = c(2016:2019),test_year = 2020)
## 
## Call:
## glm(formula = HomeWin ~ team_per_diff, family = binomial(link = "logit"), 
##     data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3765  -1.0816   0.6199   0.9596   2.2414  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.4186805  0.0363912  11.505  < 2e-16 ***
## team_per_diffW            0.0439647  0.0262862   1.673  0.09442 .  
## team_per_diffTeamPTS     -0.0004349  0.0007099  -0.613  0.54016    
## team_per_diffOpponentPTS -0.0013161  0.0006936  -1.897  0.05777 .  
## team_per_diffTeamTRB      0.0004538  0.0006851   0.662  0.50773    
## team_per_diffTeamAST      0.0022221  0.0007812   2.845  0.00445 ** 
## team_per_diffTeamSTL      0.0010245  0.0014531   0.705  0.48076    
## team_per_diffTeamBLK     -0.0021718  0.0014684  -1.479  0.13914    
## team_per_diffTeamTOV      0.0013060  0.0013544   0.964  0.33490    
## team_per_diffTeamPF       0.0018898  0.0008735   2.164  0.03050 *  
## team_per_diffwestern      0.1192149  0.0623308   1.913  0.05580 .  
## team_per_diffConfRank    -0.0130932  0.0166226  -0.788  0.43089    
## team_per_diffWS_max       0.0290981  0.0095124   3.059  0.00222 ** 
## team_per_diffWS_num       0.2158275  0.0342267   6.306 2.87e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5042.1  on 3719  degrees of freedom
## Residual deviance: 4475.1  on 3706  degrees of freedom
## AIC: 4503.1
## 
## Number of Fisher Scoring iterations: 4
## 
## [1] "training accuracy"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  760  432
##          1  774 1754
##                                           
##                Accuracy : 0.6758          
##                  95% CI : (0.6605, 0.6908)
##     No Information Rate : 0.5876          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3081          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4954          
##             Specificity : 0.8024          
##          Pos Pred Value : 0.6376          
##          Neg Pred Value : 0.6938          
##              Prevalence : 0.4124          
##          Detection Rate : 0.2043          
##    Detection Prevalence : 0.3204          
##       Balanced Accuracy : 0.6489          
##                                           
##        'Positive' Class : 0               
## 
win_prob_pred20_weight = win_prob_prediction(train_year = c(2016:2019),test_year = 2020,year_effect = 1)
## [1] 4
##   2015-16   2016-17   2017-18   2018-19 
## 0.1249718 0.1764922 0.2431241 0.4554119
win_prob_pred20_weight = win_prob_pred20_weight[,c(1:7,16,17)]



B = 100
simu_result = game_simulation(win_prob_pred20_all, B, year = 2020)
train_set_simu = simu_result$simu
train_set_summary = train_set_simu %>%
  group_by(Tm) %>%
  summarise(W_avg = mean(W),
            L_avg = 82 - W_avg,
            W_med = median(W),
            L_med = 82 - W_med,
            playoff_prob = sum(playoff)/B) %>%
  arrange(-W_avg)

WL_record_west = simu_result$result %>%
  filter(western==1)

print(train_set_summary)
## # A tibble: 15 x 6
##    Tm    W_avg L_avg W_med L_med playoff_prob
##    <fct> <dbl> <dbl> <dbl> <dbl>        <dbl>
##  1 LAL    61.7  20.3    61    21         1   
##  2 LAC    60.6  21.4    60    22         1   
##  3 DEN    57.8  24.2    58    24         1   
##  4 UTA    52.0  30.0    52    30         1   
##  5 HOU    51.4  30.6    52    30         1   
##  6 DAL    47.5  34.5    48    34         0.99
##  7 OKC    40.7  41.3    41    41         0.75
##  8 MIN    37.2  44.8    37    45         0.44
##  9 POR    36.6  45.4    37    45         0.43
## 10 SAC    34.9  47.1    35    47         0.2 
## 11 SAS    33.2  48.8    33    49         0.09
## 12 PHO    33.0  49.0    33    49         0.09
## 13 NOP    30.0  52.0    30    52         0.01
## 14 MEM    25.8  56.2    26    56         0   
## 15 GSW    19.6  62.4    19    63         0
saveRDS(WL_record_west,"./data/pre_20_games.rds")
saveRDS(win_prob_pred20_all,"./data/post_62_games_pred.rds")

Documentation of Wrap-up Functions

  • Data Scraping
    • get_schedule_data(year,month): obtain the game schedules and results in specific year and month
    • combine_game_data(): combine the game schedules data
    • get_team_stats(year): obtain the team-level data (30 teams in the league) of first 20 games in specific year
  • Data Wrangling and Feature Engineering
    • get_ws_data(year): obtain the win share data in specific year
    • get_team_year_feature(year): obtain the team performance feature vectors in specific year
    • get_game_year_feature(year): obtain the game results in tidy format in specific year
    • get_injury_data(year), combine_injury_data(): obtain and combine the injury data
  • EDA
    • team_def_off_plot(year): show each team’s defense vs offense in specific year
    • team_rank_plot(year,Nfirst): show each team’s season rank vs first N games’ rank in specific year
    • home_way_plot(year): show each team’s home and away records in specific year; show western vs eastern records
    • team_ws_plot(year): show each team’s highest winshare in specific year
    • player_ws_plot(): show the winshare of top players in the league in recent years
    • team_multiple_game_data(year): wrangle data for team winnnig/losing streaks as well as team back-to-back games
  • Model Building
    • get_train_test_data(year,firstN): obtain the training and test data of first N games in specific year.
    • build_logit_model_year(train_year, test_year, firstN): use the first N games of train_year to build the model, and test the model on test_year. return the prediction results (W/L) and winning probability of each game.
  • Integrative Prediction
    • combine_season_data(): combine the team performance feature data of multiple seasons
    • team_pc_plot(): create the PC plot for the team performances of all the 30 teams in multiple seasons; the distance matrix of season data
    • team_cluster_plot(): create the cluster plot for each team in multiple seasons
    • win_prob_prediction(train_year, test_year,firstN,year_effect): The winning probability prediction function allow for year_effect
  • Simulation
  • WL_record_cal(game_schedule,WL_record): calculate the W/L records for each team given the game schedule
  • game_simulation(pred_result,B,year,Nfirst): given the predict results in specific year, repeat the game simulation for B times

Summary

The National Basketball Association (NBA) is a professional basketball league in North America that draws the attention of many basketball fans all over the world. The 2019-20 NBA season is currently ongoing, with the regular season having started in October and scheduled to end in April 2020, at which point the playoffs will begin. Rather than wait until next April, as NBA fans well trained in data science, we decided to work on a project with the objective of predicting the Western Conference NBA playoff teams for the current season. Along the process, we also hope to explore what factors can contribute to teams winning more games and thus being able to make the playoffs.

The target audience we have in mind are NBA data analysts and NBA fans. We have prepared an analysis report and a project website that contain some of the more technical details of the analysis that may be of more interest to analysts, as well as a screencast that captures some of the major findings of the project, which NBA fans will hopefully enjoy learning about.

We used data scraped mainly from Basketball Reference (https://www.basketball-reference.com/) with additional injury data scraped from spotrac (https://www.spotrac.com/nba/injured-reserve/). There was the need to scrape the data because there are no well-formatted datasets readily available and there are multiple data tables within the websites that are of interest to us. After the scraping and tidying of the data, the dataset contains game-level, team-level and player-level statistics that we hope to further explore.

To select features that can best predict team performance, we first performed exploratory data analysis. In this process, we learned that on the team-level, the combination of team points and opponent points is needed to capture team performance. We also learned that while over half of the teams that rank in the top eight at the end of the first 20 games will later make the playoffs, there is still a long way to go (62 remaining games) until the end of the regular season, and conference standings are still subject to many changes. In addition, while 3-pointers are becoming a more common way of scoring in the league and many of the best teams in the league in the past few seasons are known to have some of the best 3-point shooters in NBA history, there are no obvious patterns of how volume or accuracy of 3-point shot attempts by teams are related to team winning percentages. Team-level statistics are important in defining team performance, but the value of individual talent on teams is not to be overlooked. Most of the teams that make the playoffs have one or multiple star players (as characterized by a high win shares stats), so it is important to consider their presence on the team and whether they are in good health to play. Asides from team-level and player-level statistics, there are other factors that affect the outcome of a game including where the game is played (i.e. whether there is a home-court advantage) and who the game is played against (e.g. an opponent from the Western Conference may be stronger and harder to beat).

Through the exploratory data analysis process, we decided that the performance of a team could be captured by team-level, player-level statistics and other specific effects. Team-level statistics include: team total/average points, opponent total/average points and other secondary statistics such as total rebounds, assists, blocks, turnovers, personal fouls. Player-level statistics include the highest win share value on the team and number of league top 100 win share players on the team, both taking into account severe player injuries and team transfers. Other specific effects include: home-court advantage and western-eastern effect.

To incorporate these features into our prediction, we built a generalized Bradley-Terry logistic regression model. The model is parametrized by the features selected above. We used a cross-entropy loss to validate the model. The model would generate the probability of wins and losses for each of the two teams. If a team has a probability of winning that is greater than 0.5, we would regard the team as the winner, and vice versa. We then used the winning probabilities given by the model in simulations. We first generated 300 team encounters randomly as a game schedule. We would then assign the probability of winning for each team based on our previous model prediction results. Given the numbers of wins and losses in the simulation, we would then rank the teams and determine the top eight teams. We also considered a model that accounts for year-effect, or in other words the effect of clustering of different years’ data. However, we do not observe an apparent year-effect in our data, so this extended model did not improve our prediction accuracy.

Based on our analysis above, the eight teams in the Western Conference we predict to make the playoffs this season are: Los Angeles Lakers Los Angeles Clippers Denver Nuggets Utah Jazz Houston Rockets Dallas Mavericks Oklahoma City Thunder Minnesota Timberwolves

Limitations

The major limitation of this project is that the team feature vector used in the prediction model may not be representative of a team at a particular point in time. A team’s ability is subject to change as the season progresses. For one thing, accidental injuries occur often during games. If a star player gets seriously injured like Kevin Durant did last year, the team’s performance will be heavily undermined. Our model accounts for current known player injuries, but cannot anticipate injuries that will occur later in the season. In addition, a team’s roster may change as players get traded.

Future works

To improve our current model, three more features could be further explored: streak effect, fatigue (i.e. back-to-back games), and coaching staff. Our analysis did confirm the independence between streak/fatigue and game results for most teams individually. However, the p-value for the t-test between the marginal winning probability and the conditional winning probabilities given fatigue is 0.06 when incorporating all the teams. If we could come up with other ways to characterize streak and fatigue, the two features may help increase the model accuracy. Also, a coach is key to a team and incorporating coaching effects may give better predictions. Besides predicting playoffs, data could be used to help teams improve performances. We are interested in using player statistics and tracking data to investigate decision making during a game. Some of the potential questions are: what is the optimal lineup for a team when facing a specific opponent and at a specific time; what is the best shooting location for a player; at a given time and place, should a player pass the ball or shoot, etc.