The data
folder contains our data scraped from BASKETBALL REFERENCE
game_stats_all.csv
- The game results (scores) for all regular season games from season 2015-16 to 2018-19 and for regular season games already played in season 2019-20 and the upcoming schedule for the remaining games in season 2019-2020.team_stats_all.csv
- The team statistics of first 20 games from season 2015-16 to 2019-20.pre_20_games.rds
- the results for the first 20 games for each team during season 2019-20post_62_games_pred.rds
- predictions for the rest of season 2019-2020You only need the first two csv files to run the rmarkdown code.
The National Basketball Association (NBA) is a professional basketball league in North America that draws the attention of many basketball fans all over the world. During the regular season, each of the 30 teams (15 in the Eastern Conference and 15 in the Western Conference) plays 82 games. At the conclusion of the regular season, the top eight teams in each conference advance to the playoffs for a chance to compete for the championship title. Towards the end of the regular season, the race for the final few playoff spots is very intense, and usually it isn’t until the final game of the regular season is over when we know which teams have made the cut. As NBA fans ourselves, we are eager to find out which teams will make the playoffs this year (2019-20 NBA season), but rather than wait until next April, as data analysts, we thought it would be interesting if we could make such predictions now. We hope our analysis will lead to meaningful insights that will be of interest to NBA fans and NBA data analysts all over the world.
The goal of this project is to build a statistical machine learning model that can predict the regular season winning percentages of the NBA teams in the Western Conference, and thus the playoff teams (the eight teams with the highest winning percentages).
We will undergo a comprehensive background study of the NBA to manually decide on what features to use. Exploratory data analysis will aid in our selection of features. We also plan to evaluate different models. A successful model will predict the 2019-2020 NBA playoff teams in the Western Conference. Ideally, the model could also be trained to predict the playoff teams in the Eastern Conference and finally round-by-round playoff predictions, though this is beyond the scope of our project.
The target audience we have in mind are NBA data analysts and NBA fans. We have prepared an analysis report and a project website that contain some of the more technical details of the analysis that may be of more interest to analysts, as well as a screencast that captures some of the major findings of the project, which NBA fans will hopefully enjoy learning about.
Most data were collected from the Basketball Reference website (https://www.basketball-reference.com/). The database contains player statistics including points per game, assists per game, rebounds per game, etc. of all players who have played in the NBA league since 1949-50 (the inaugural season of the NBA). The database also provides team statistics, including team 3-point percentages, winning percentages, etc. Player injury data were obtained from spotrac (https://www.spotrac.com/nba/injured-reserve/).
We used rvest
package to scrape the following data from Basketball Reference website:
team_stats
: The team statistics of first 20 games from season 2015-16 to 2019-20. We would use these data to define the team performance during specific season in the feature engineering section.
game_stats
: The game results (scores) for all regular season games from season 2015-16 to 2018-19 and for regular season games already played in season 2019-20 (as of December 11, 2019). We would use these data in the model building section.
team_stats = readr::read_csv("./data/team_stats_all.csv")
game_stats = readr::read_csv("./data/game_stats_all.csv")
head(team_stats)
## # A tibble: 6 x 50
## Rk Tm Season G W L `W/L%` MP TeamFG TeamFGA
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 MIL 2018-~ 20 14 6 0.7 4825 894 1835
## 2 2 HOU 2019-~ 20 13 7 0.65 4850 815 1823
## 3 3 MIL 2019-~ 20 17 3 0.85 4825 878 1830
## 4 4 GSW 2016-~ 20 17 3 0.85 4850 873 1738
## 5 5 WAS 2019-~ 20 7 13 0.35 4800 898 1884
## 6 6 NOP 2018-~ 20 10 10 0.5 4800 888 1858
## # ... with 40 more variables: `TeamFG%` <dbl>, Team2P <dbl>,
## # Team2PA <dbl>, `Team2P%` <dbl>, Team3P <dbl>, Team3PA <dbl>,
## # `Team3P%` <dbl>, TeamFT <dbl>, TeamFTA <dbl>, `TeamFT%` <dbl>,
## # TeamPTS <dbl>, TeamORB <dbl>, TeamDRB <dbl>, TeamTRB <dbl>,
## # TeamAST <dbl>, TeamSTL <dbl>, TeamBLK <dbl>, TeamTOV <dbl>,
## # TeamPF <dbl>, OpponentFG <dbl>, OpponentFGA <dbl>,
## # `OpponentFG%` <dbl>, Opponent2P <dbl>, Opponent2PA <dbl>,
## # `Opponent2P%` <dbl>, Opponent3P <dbl>, Opponent3PA <dbl>,
## # `Opponent3P%` <dbl>, OpponentFT <dbl>, OpponentFTA <dbl>,
## # `OpponentFT%` <dbl>, OpponentPTS <dbl>, OpponentORB <dbl>,
## # OpponentDRB <dbl>, OpponentTRB <dbl>, OpponentAST <dbl>,
## # OpponentSTL <dbl>, OpponentBLK <dbl>, OpponentTOV <dbl>,
## # OpponentPF <dbl>
head(game_stats)
## # A tibble: 6 x 11
## Date Start..ET. Visitor.Neutral PTS Home.Neutral PTS.1 Var.7 Var.8
## <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 Tue,~ 8:00p Detroit Pistons 106 Atlanta Haw~ 94 Box ~ <NA>
## 2 Tue,~ 8:00p Cleveland Cava~ 95 Chicago Bul~ 97 Box ~ <NA>
## 3 Tue,~ 10:30p New Orleans Pe~ 95 Golden Stat~ 111 Box ~ <NA>
## 4 Wed,~ 7:00p Washington Wiz~ 88 Orlando Mag~ 87 Box ~ <NA>
## 5 Wed,~ 7:30p Indiana Pacers 99 Toronto Rap~ 106 Box ~ <NA>
## 6 Wed,~ 7:30p Charlotte Horn~ 94 Miami Heat 104 Box ~ <NA>
## # ... with 3 more variables: Attend. <dbl>, Notes <chr>, Season <chr>
We do not include the code for web scraping in this rmarkdown file. Instead, please refer to the data_scraping.R file in the source folder for more details. Here we just show the first few rows of the data.
Data wrangling is performed throughout the whole data analysis process covering data tidying, exploratory data analysis, feature engineering, and model building. Here we demonstrate some necessary data wrangling procedures performed to prepare the dataset for further analysis.
We list all 30 teams in the league (both full franchise name and abbreviation), and add an indicator to differentiate the 15 teams in the Western Conference and the 15 in the Eastern Conference.
library(rvest)
library(tidyverse)
west_ind = data.frame(
Tm = c("CHI","LAL","POR","WAS","CHO","MIL","BRK","DET","HOU","NOP","IND","MEM","DEN","LAC","ATL","NYK","MIA","ORL","DAL","PHO",
"SAC","UTA","MIN","SAS","CLE","TOR","BOS","PHI","GSW","OKC"),
western = c(0,1,1,0,0,0,0,0,1,1,0,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,1),
TeamName = c("Chicago Bulls","Los Angeles Lakers","Portland Trail Blazers", "Washington Wizards","Charlotte Hornets",
"Milwaukee Bucks","Brooklyn Nets" ,"Detroit Pistons","Houston Rockets" ,"New Orleans Pelicans",
"Indiana Pacers","Memphis Grizzlies","Denver Nuggets","Los Angeles Clippers","Atlanta Hawks",
"New York Knicks" ,"Miami Heat","Orlando Magic","Dallas Mavericks","Phoenix Suns" ,
"Sacramento Kings","Utah Jazz", "Minnesota Timberwolves","San Antonio Spurs","Cleveland Cavaliers",
"Toronto Raptors" ,"Boston Celtics","Philadelphia 76ers" ,"Golden State Warriors","Oklahoma City Thunder")
)
get_ws_data(year)
: We obtain the win shares stats (estimate of the number of wins contributed by a player) of the players ranked in the top 100 of the league in terms of win shares (WS) in a particular season.
get_ws_data = function(year){
ws_url = paste0("https://www.basketball-reference.com/play-index/psl_finder.cgi?request=1&match=single&type=totals&per_minute_base=36&per_poss_base=100&season_start=1&season_end=-1&lg_id=NBA&age_min=0&age_max=99&is_playoffs=N&height_min=0&height_max=99&year_min=",
year,"&year_max=",year,"&birth_country_is=Y&as_comp=gt&as_val=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=ws")
ws_stats = ws_url %>% read_html %>%
html_table() %>%
.[[1]]
ws_stats = ws_stats[,c(2,5,7)]
colnames(ws_stats)= ws_stats[1,]
ws_stats = ws_stats[-c(1,which(ws_stats$Player=='Player')),]
ws_stats$WS = as.numeric(ws_stats$WS )
ws_stats_summary = ws_stats %>%
group_by(Tm,Player) %>%
summarise(WS_max = max(WS), WS_num = n())
return(ws_stats_summary)
}
get_injury_data(year)
: With basketball being a high impact sport and the NBA being one of the most competitive sports leagues, it is not uncommon for NBA players to experience injuries, some of which will cause players to miss games. We obtain player injury data for players who have missed games due to injury from: https://www.spotrac.com/nba/injured-reserve/ (which sorts the injured players by cash earned while injured). The website provides a separate data table for each season, and we created the combine_injury_data()
function to combine injury data from all relevant seasons.
get_injury_data = function(year){
injury_url = paste0("https://www.spotrac.com/nba/injured-reserve/",
year-1)
injury_stats = injury_url %>%
read_html %>%
html_table() %>%
.[[4]]
colnames(injury_stats)[1] = "Player"
injury_stats = injury_stats %>%
separate(Player,sep="\n",into = c("Player","other")) %>%
mutate(Games = as.numeric(Games),
Salary = as.numeric(gsub('[$,]', '', injury_stats$`Cash EarnedWhile Injured`))) %>%
select(Player,Team,Games,Salary) %>%
filter(Games!='NA') %>%
mutate(Season = paste0(year-1,'-',substr(year,3,4)))
return(injury_stats)
}
injury_2020 = get_injury_data(2016)
combine_injury_data = function(){
year_list = c(2017:2020)
injury_stats = NULL
for(i in 1:length(year_list)){
injury_stats_tmp = get_injury_data(year_list[i])
injury_stats = rbind(injury_stats, injury_stats_tmp)
}
return(injury_stats)
}
injury_stats = combine_injury_data()
Besides player injuries, team rosters of available players to play change also due to NBA trades and transactions where players change teams. We take both player injuries and team transfers into account, and define the adjusted highest-win-shares-on-team metric. This metric equals the win shares of the player with the highest win shares on each team (only considering players with win shares in the top 100 of the league).
For example, in the 2018-19 season, the Los Angeles Clippers (LAC) had four players ranked in the top 100 in terms of WS: Montrezl Harrell (WS 8.7), Danilo Gallinari (WS 8.2), Lou Williams (WS 5.1), and Patrick Beverley (WS 4.8). For the current 2019-20 season, Danilo Gallinari is now playing for the Oklahoma City Thunder (OKC) while LAC has welcomed new team members Paul George (2018-19 WS 11.9) and Kawhi Leonard (2018-19 WS 9.5), so the adjusted highest-win-shares on LAC for the 2019-20 season is 11.9 (WS_max
=11.9). We also define the number-of-high-win-shares-on-team metric, which is the number of players on a team who have a win shares that is ranked in the top 100 in the league (from the previous season). Taking LAC for example, the team has five players with a top 100 win shares (WS_num
=5).
In addition, players with injuries during a season are left out when defining the team win shares metrics. For example, during the 2018-19 season, Golden State Warriors (GSW) members Stephen Curry, Kevon Looney and Klay Thompson all ranked in the top 100 in terms of WS. However, during the current 2019-20 season, all three players are out due to injury, and are thus not considered to be active players and thus do not contribute to GSW’s team win shares metrics.
get_ws_injury_trans_data = function(year){
Winshare_last_year = get_ws_data(year-1)
Winshare_this_year = get_ws_data(year)
injury_year = get_injury_data(year)
winshare_injury_ind = Winshare_this_year %>%
left_join(injury_year, by = c("Player","Tm"="Team")) %>%
mutate(injury_ind = ifelse(is.na(Season),0,1))
winshare_this_year_injury_trans = winshare_injury_ind %>%
right_join(Winshare_last_year, by = c("Player")) %>%
select(Tm.x,Player,Tm.y,WS_max.y,WS_num.y) %>%
ungroup() %>% na.omit() %>%
group_by(Tm.x) %>%
summarise(WS_max = max(WS_max.y), WS_num = n(),Player = Player[which.max(WS_max.y)])
colnames(winshare_this_year_injury_trans)[1]='Tm'
return(winshare_this_year_injury_trans)
}
winshare_feature = get_ws_injury_trans_data(2020)
head(winshare_feature)
## # A tibble: 6 x 4
## Tm WS_max WS_num Player
## <chr> <dbl> <int> <chr>
## 1 BOS 7.4 4 Kemba Walker
## 2 BRK 9.1 5 Kyrie Irving
## 3 CHI 4.7 1 Tomáš Satoranský
## 4 DAL 7.5 2 Dwight Powell
## 5 DEN 11.8 3 Nikola Jokic
## 6 DET 10 1 Andre Drummond
get_team_year_feature(year)
: We create a function that organizes team data in the first 20 games for all teams in a particular season. This process includes extracting from a larger database relevant variables which include number of wins, number of losses, winning percentages, total points scored by team, total points scored by opponents, total team rebounds, total team assists, total team steals, total team blocks, total team turnovers, and total team personal fouls. We also append conference ranking of each team at the end of the first 20 games as well as the WS_max
and WS_num
metrics we computed above.
get_team_year_feature = function(year){
#team_stats = get_team_stats() # get the team stats for all seasons
team_stats_year = team_stats %>%
filter(grepl(as.character(year-1),Season)) # filter the specific year
team_stats_year[,-c(2,3)] = as.numeric(as.matrix(team_stats_year[,-c(2,3)])) # turn characters into numeric
## select Win, Loss, Win-Loss per, Team points, Opponent points
team_stats_year_feature = team_stats_year %>%
select(Tm, W, L, `W/L%`, TeamPTS, OpponentPTS,TeamTRB,TeamAST,TeamSTL,TeamBLK, TeamTOV,TeamPF) %>%
inner_join(west_ind, by = 'Tm')
## assign the team rank
team_stats_year_rank = team_stats_year_feature %>%
group_by(western) %>%
mutate(ConfRank = order(order(W, TeamPTS, decreasing = T)))
## add the winshare feature from previous season
Winshare = get_ws_injury_trans_data(year)
#print(head(Winshare))
team_stats_year_rank = team_stats_year_rank %>% left_join(Winshare, by = 'Tm') %>%
select(-Player)
team_stats_year_rank$WS_num[is.na(team_stats_year_rank$WS_num)] = 0
team_stats_year_rank$WS_max[is.na(team_stats_year_rank$WS_max)] = 0
return(team_stats_year_rank)
}
team_stats_18to19 = get_team_year_feature(2019)
head(team_stats_18to19)
## # A tibble: 6 x 17
## # Groups: western [2]
## Tm W L `W/L%` TeamPTS OpponentPTS TeamTRB TeamAST TeamSTL
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MIL 14 6 0.7 2420 2199 1012 545 138
## 2 NOP 10 10 0.5 2375 2345 966 540 140
## 3 LAC 14 6 0.7 2335 2236 945 443 117
## 4 TOR 16 4 0.8 2327 2149 945 512 174
## 5 GSW 13 7 0.65 2318 2208 907 564 147
## 6 CHO 10 10 0.5 2304 2215 887 495 144
## # ... with 8 more variables: TeamBLK <dbl>, TeamTOV <dbl>, TeamPF <dbl>,
## # western <dbl>, TeamName <fct>, ConfRank <int>, WS_max <dbl>,
## # WS_num <dbl>
get_game_year_feature(year)
: We also create a function that organizes data of all regular season games in a particular season. This process includes extracting from a larger database relevant variables which include date of game, name of home team and visiting team, and points scored by home team and visiting team. We also append an indicator variable indicating whether the game was won by the home team.
get_game_year_feature = function(year){
game_stats_year = game_stats %>%
filter(grepl(as.character(year-1),Season))
## game stats
## the results of every game in the season
game_stats_feature = game_stats_year[,c(1,5,6,3,4)]
game_stats_feature[,c(3,5)] = as.numeric(as.matrix(game_stats_feature[,c(3,5)]))
colnames(game_stats_feature) = c("Date","Home","HomePTS","Away","AwayPTS")
game_stats_feature = game_stats_feature %>%
mutate(HomeWin = ifelse(HomePTS > AwayPTS,1,0))
return(game_stats_feature)
}
game_stats_18to19 = get_game_year_feature(2019)
head(game_stats_18to19)
## # A tibble: 6 x 6
## Date Home HomePTS Away AwayPTS HomeWin
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Tue, Oct 16, ~ Boston Celtics 105 Philadelphia 76~ 87 1
## 2 Tue, Oct 16, ~ Golden State War~ 108 Oklahoma City T~ 100 1
## 3 Wed, Oct 17, ~ Charlotte Hornets 112 Milwaukee Bucks 113 0
## 4 Wed, Oct 17, ~ Detroit Pistons 103 Brooklyn Nets 100 1
## 5 Wed, Oct 17, ~ Indiana Pacers 111 Memphis Grizzli~ 83 1
## 6 Wed, Oct 17, ~ Orlando Magic 104 Miami Heat 101 1
We compared team points per game and opponent points per game of all teams by conference, and explored how these stats were related to conference standings. In general, a team that is strong in offense will have a relatively high team score per game, and a team that is strong in defense will have a relatively low opponent score per game. From the plot below, we can see that some highly ranked teams are strong both offensively and defensively (e.g. TOR), while some teams make up for their weaker offense through strong defense (e.g. IND). Thus, team points or opponent points alone may not be sufficient to quantify the strength of a team, but the use of both stats or the net score (team points - opponent points) may be more appropriate.
team_def_off_plot = function(year){
team_stats_year = get_team_year_feature(year)
conference_names <- c(`1` = "Western Conference", `0` = "Eastern Conference")
p = team_stats_year %>%
ggplot(aes(x = TeamPTS/20, y = OpponentPTS/20))+
geom_point(aes(size = -ConfRank, col = factor(western)))+
geom_text(aes(label=Tm),vjust = 0, nudge_y = 0.5)+
geom_abline(slope = 1,intercept = 0,linetype="dashed", color = "red",size = 1)+
facet_grid(~western, labeller = as_labeller(conference_names))+
labs(title = paste0('Points Scored in First 20 games (Season ',year-1,'-',substr(year,3,4),")"))+
theme(plot.title = element_text(hjust = 0.5))+
coord_fixed()+
theme(legend.position = "bottom") +
guides(colour=FALSE) +
scale_size("Conference standing after 20 games",breaks=c(-15,-10,-5),labels=c(15,10,5))+
xlab("Team Points (average per game)")+
ylab("Opponent Points (average per game)")
print(p)
}
#png("./plots/team_def_off_plot.png",width = 960,height = 480)
team_def_off_plot(2019)
#dev.off()
We next explored how the standings and W-L records of teams after 20 regular games compared to the final standings and records at the end of the regular season. The plots of conference standings show that many teams that performed well in the beginning of the regular season continued to perform well throughout the season (e.g. GSW & DEN), while some teams that were not faring well earlier on finished the season strong (e.g. HOU & UTA), and still other teams that were in the top 8 after 20 regular season games did not make it to the playoffs in the end (e.g. MEM & LAL). In the team records plot, the line shows the expected number of regular season wins if teams were to maintain the same first-20-game-winning-percentage throughout the regular season. The bandwidth represents the 95% confidence interval for predictions of regular season wins. We can see that while some teams had a consistent performance throughout the season (e.g. MIN & OKC), other teams had better (e.g. HOU) or worse (e.g. MEM) performance in the remaining 62 games. Thus, we learn that even if a team starts the season poorly, there is still an oppurtunity for the team to improve and ultimately reach the playoffs.
WL_record_cal = function(game_schedule,WL_record){
for(i in 1:nrow(game_schedule)){
if (game_schedule$HomeWin[i]==1){
WL_record$W[which(WL_record$TeamName==game_schedule$Home[i])] =
WL_record$W[which(WL_record$TeamName==game_schedule$Home[i])] + 1
WL_record$L[which(WL_record$TeamName==game_schedule$Away[i])] =
WL_record$L[which(WL_record$TeamName==game_schedule$Away[i])] + 1
}
if (game_schedule$HomeWin[i]==0){
WL_record$L[which(WL_record$TeamName==game_schedule$Home[i])] =
WL_record$L[which(WL_record$TeamName==game_schedule$Home[i])] + 1
WL_record$W[which(WL_record$TeamName==game_schedule$Away[i])] =
WL_record$W[which(WL_record$TeamName==game_schedule$Away[i])] + 1
}
}
return(WL_record)
}
team_rank_plot = function(year, Nfirst = 300){
team_stats_year = get_team_year_feature(year)
game_stats_year = get_game_year_feature(year)
colnames(team_stats_year)[c(2,3)] = c("W.pre","L.pre")
WL_record = cbind(west_ind,W = 0, L=0)
WL_record = WL_record_cal(game_stats_year,WL_record)
colnames(WL_record)[c(4,5)] = c("W.all","L.all")
WL_record = WL_record %>%
group_by(western) %>%
mutate(Season_Rank = order(order(W.all, decreasing = T))) %>%
inner_join(team_stats_year[,c(1,2,3,14,15)],by=c("Tm","TeamName"))
WL_record = WL_record %>%
mutate(Playoff = Season_Rank <= 8)
conference_names <- c(`1` = "Western Conference", `0` = "Eastern Conference")
p = WL_record %>%
ggplot(aes(x = ConfRank, y = Season_Rank))+
geom_point(aes(col = factor(western), size = factor(Playoff)))+
geom_text(aes(label=Tm),vjust = 0, nudge_y = 0.5)+
geom_abline(slope = 1,intercept = 0,linetype="dashed", color = "red",size = 1)+
geom_vline(xintercept=8, linetype="dotted", color = "blue",size = 1)+
facet_grid(~western, labeller = as_labeller(conference_names))+
labs(title = paste0('Conference Standings (Season ',year-1,'-',substr(year,3,4),")"))+
theme(plot.title = element_text(hjust = 0.5))+
coord_fixed()+
theme(legend.position = "bottom")+
xlab("Conference standing after 20 regular season games")+
ylab("Final conference standing after all 82 regular season games")+
guides(colour=FALSE, size=guide_legend(title="Playoff team"))+
scale_size_discrete(range = c(1.5, 4))
print(p)
p2 = WL_record %>%
ggplot(aes(x = W.pre, y =W.all))+
geom_point(aes(col = factor(western),size = factor(Playoff)))+
geom_smooth(method = "lm")+
geom_text(aes(label=Tm),vjust = 0, nudge_y = 0.5)+
facet_grid(~western, labeller = as_labeller(conference_names))+
labs(title = paste0('Team Records (Season ',year-1,'-',substr(year,3,4),")"))+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position = "bottom")+
xlab("Number of wins in first 20 regular season games")+
ylab("Number of wins in all 82 regular season games")+
guides(colour=FALSE, size=guide_legend(title="Playoff team"))+
scale_size_discrete(range = c(1.5, 4))
print(p2)
}
#png("./plots/team_rank_plot.png",width = 960,height = 480)
#team_rank_plot(2019)
#dev.off()
#png("./plots/team_win_plot.png",width = 960,height = 480)
team_rank_plot(2019)
#dev.off()
In recent years, teams are shooting more 3-pointers than ever before. According to records from Basketball Reference, the total number of 3-pointers attempted by the league has increased each of the past seven seasons from 36,395 in the 2011-12 season to 78,742 in the 2018-19 season (https://www.basketball-reference.com/leagues/NBA_stats_totals.html). We were wondering whether teams that were more skilled in shooting 3-pointers or were better at defending 3-point attempts from opponents were able to perform better in the league. From the exploratory plots below, we cannot see obvious patterns of how volume or accuracy of 3-point shot attempts by teams are related to team winning percentages. The same applies to 2-point shot attempts.
WL_record = cbind(west_ind,W = 0, L=0)
WL_record = rbind(WL_record_cal(get_game_year_feature(2019),WL_record),WL_record_cal(get_game_year_feature(2018),WL_record),
WL_record_cal(get_game_year_feature(2017),WL_record),WL_record_cal(get_game_year_feature(2016),WL_record))
colnames(WL_record)[c(4,5)] = c("W.all","L.all")
WL_record = WL_record %>%
group_by(western) %>%
mutate(Season_Rank = order(order(W.all, decreasing = T)))
WL_record = WL_record %>%
mutate(Playoff = Season_Rank <= 8)
conference_names <- c(`1` = "Western Conference", `0` = "Eastern Conference")
# plots to see if 2point, 3point are good indicators of W/L%
team_stats_new <- merge(team_stats,west_ind,by="Tm")
p_3_1plot <- team_stats_new %>%
ggplot(aes(x = `Team3PA`, y=`Opponent3PA`))+
geom_point(aes(col = team_stats_new$`W/L%`))+
scale_color_gradient(low="red", high="green",name = "W/L%")+
facet_grid(~western, labeller = as_labeller(conference_names))+
labs(title = "Team 3-point Shot Attempts and W/L% (Since Season 2015-16)")+
xlab("Total 3-point shot attempts by team")+
ylab("Total 3-point shot attempts by opponent")
p_3_1plot
p_3_2plot <- team_stats_new %>%
ggplot(aes(x = `Team3P%`, y=`Opponent3P%`))+
geom_point(aes(col = team_stats_new$`W/L%`))+
scale_color_gradient(low="red", high="green",name = "W/L%")+
facet_grid(~western, labeller = as_labeller(conference_names))+
labs(title = "Team 3-point Accuracy and W/L% (Since Season 2015-16)")+
xlab("Percentage of attempted 3-point shots made by team")+
ylab("Percentage of attempted 3`-point shots made by opponent")
p_3_2plot
team_stats_new <- merge(team_stats,west_ind,by="Tm")
p_2_1plot <- team_stats_new %>%
ggplot(aes(x = `Team2PA`, y=`Opponent2PA`))+
geom_point(aes(col = team_stats_new$`W/L%`))+
scale_color_gradient(low="red", high="green",name = "W/L%")+
facet_grid(~western, labeller = as_labeller(conference_names))+
labs(title = "Team 2-point Shot Attempts and W/L% (Since Season 2015-16)")+
xlab("Total 2-point shot attempts by team")+
ylab("Total 2-point shot attempts by opponent")
p_2_1plot
p_2_2plot <- team_stats_new %>%
ggplot(aes(x = `Team2P%`, y=`Opponent2P%`))+
geom_point(aes(col = team_stats_new$`W/L%`))+
scale_color_gradient(low="red", high="green",name = "W/L%")+
facet_grid(~western, labeller = as_labeller(conference_names))+
labs(title = "Team 2-point Accuracy and W/L% (Since Season 2015-16)")+
xlab("Percentage of attempted 2-point shots made by team")+
ylab("Percentage of attempted 2-point shots made by opponent")
p_2_2plot
The NBA regular season consists of each team playing 82 games, 41 at home and 41 away. We examined home and away records of teams to better understand home-court advantage in the NBA. We can see that most teams have a better record playing at home with a few exceptions (i.e. CHI & MIA). The home-court advantage is more prominent for some teams (e.g. DEN & SAS) while mattering less for other teams (e.g. GSW & OKC).
home_away_plot = function(year){
game_stats_year = get_game_year_feature(year)
standing_url = paste0("https://www.basketball-reference.com/leagues/NBA_",
year,"_standings.html")
standing_year = standing_url %>%
read_html %>%
html_nodes(xpath = '//comment()') %>% # select comment nodes
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to a single string
read_html() %>% # reparse to HTML
html_table() %>%
.[[1]]
colnames(standing_year) = paste0(colnames(standing_year),standing_year[1,])
standing_year = standing_year[-1,]
standing_table = standing_year %>%
select(Team,Overall,PlaceHome,PlaceRoad,ConferenceE,ConferenceW) %>%
separate(PlaceHome,into = c("HomeW","HomeL"),sep = "-") %>%
separate(PlaceRoad,into = c("AwayW","AwayL"),sep = "-") %>%
separate(ConferenceE,into = c("EastW","EastL"),sep = "-") %>%
separate(ConferenceW,into = c("WestW","WestL"),sep = "-") %>%
separate(Overall,into = c("Wtot","Ltot"),sep = "-") %>%
inner_join(west_ind,by=c("Team" = "TeamName"))
standing_table_gather = standing_table %>%
gather(stats,win,-c("Team","Wtot","Ltot","Tm","western"))
standing_table_gather[,c(2,3,7)] = as.numeric(as.matrix(standing_table_gather[,c(2,3,7)]))
conference_names <- c(`1` = "Western Conference", `0` = "Eastern Conference")
p = standing_table_gather %>%
filter(stats %in% c("HomeW","AwayW")) %>%
ggplot(aes(x = reorder(Tm,Wtot), y = win,fill= stats))+
geom_bar(stat = "identity",position = "dodge")+
facet_grid(~western,scales="free_x", labeller = as_labeller(conference_names))+
labs(title = paste0('Team Records (Season ',year-1,'-',substr(year,3,4),")"))+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position = "bottom")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
xlab("Teams")+
ylab("Number of wins")+
scale_fill_manual(values = c("#F8766D", "#00BFC4"), labels = c("Away wins", "Home wins"))+
theme(legend.title = element_blank())
print(p)
team_factor_levels= standing_table_gather %>%
filter(western==1 & stats %in% c('EastW') | western==0 & stats %in% c('WestW')) %>%
arrange(win) %>%
select(Tm) %>% as.matrix() %>% as.character()
p2 = standing_table_gather %>%
mutate(Tm = factor(Tm,levels = team_factor_levels)) %>%
filter(western==1 & stats %in% c('EastW','EastL') | western==0 & stats %in% c('WestW','WestL')) %>%
ggplot(aes(x =Tm, y = win,fill= stats))+
geom_bar(stat = "identity")+
labs(title = paste0('Team Records (Season ',year-1,'-',substr(year,3,4),")"))+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position = "bottom")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
xlab("Teams")+
ylab("Number of games")+
scale_fill_manual(values = c("#F8766D","#7CAE00","#00BFC4","#C77CFF"), labels = c("Loss against East","Win against East","Loss against West","Win against West"))+
theme(legend.title = element_blank())
return(p2)
}
p2 <- home_away_plot(2019)
#png("./plots/home_away.png",width = 960,height = 480)
#p2
#dev.off()
The regular season schedule allows for each team to play the 15 teams from the opposing conference twice a year, so a total of 30 games are played against opposing conference teams. We often hear that the Western Conference is better than the Eastern Conference, but we wondered how true this is. We examined records of matches where teams from opposing conferences faced each other. From the plot, we can see that more teams from the Western Conference (teams with bars colored green and pink) lie to the right end of the plot, which indicates that these teams have high winning percentages against teams from the opposing conference. So the Western Conference does seem to be the stronger conference.
#png("./plots/west_east.png",width = 960,height = 480)
#p2
#dev.off()
p2
We next considered whether a team’s performance in a game is dependent on the results of the team’s previous games. Could a winning streak boost a team’s confidence? Or does a winning streak mean a team has been playing extremely hard on the court and will eventually suffer from greater fatigue? To investigate the impact of previous match results, we looked at match records of teams in the game following win or loss streaks of different lengths. In the 2018-19 season, the longest winning streak was 11 games by the Golden State Warriors and the longest losing streak was 18 games by the New York Knicks. For each team, we applied the Chi-Square test of independence to assess the relationship between streaks and results of the following match. Test results indicated that a statistically significant relationship existed only for the Philadelphia 76ers. A boxplot is used to visualize how the overall performances of all teams in the league vary after different streaks (only winning or losing streaks of up to 3 are shown since other longer streaks are much less common).
library(data.table)
library(rowr)
team_multiple_game_data = function(year){
teams <- unique(game_stats$Visitor.Neutral)
game_after_streak <- list()
game_consecutive <- list()
for (team in teams) {
### wrangling team winning streak data
team_game_stats <- game_stats %>% filter(grepl(as.character(year-1), Season))
team_game_stats <- team_game_stats %>% filter(Visitor.Neutral == team | Home.Neutral == team)
team_game_stats <- team_game_stats[complete.cases(team_game_stats[, "PTS"]),]
# compute winning streaks
team_game_stats$isWin <- 0
team_game_stats[team_game_stats$Home.Neutral == team & (team_game_stats$PTS.1 > team_game_stats$PTS), "isWin"] = 1
team_game_stats[team_game_stats$Visitor.Neutral == team & (team_game_stats$PTS.1 < team_game_stats$PTS), "isWin"] = 1
team_game_stats$preGameWinStreak <- rowid(rleid(team_game_stats$isWin)) * team_game_stats$isWin
na.omit(setDT(team_game_stats)[, preGameWinStreak:=c(NA, preGameWinStreak[-.N])])
# compute losing streaks
team_game_stats$isLose <- 1 - team_game_stats$isWin
team_game_stats$preGameLoseStreak <- rowid(rleid(team_game_stats$isLose)) * team_game_stats$isLose
na.omit(setDT(team_game_stats)[, preGameLoseStreak:=c(NA, preGameLoseStreak[-.N])])
# overall streak
team_game_stats$streak <- paste0("W", team_game_stats$preGameWinStreak)
team_game_stats[team_game_stats$streak == "W0", "streak"] <- paste0("L", team_game_stats$preGameLoseStreak[team_game_stats$streak == "W0"])
# game results after streaks
game_after_streak[[team]] <- as.data.frame.matrix(table(team_game_stats$streak, team_game_stats$isWin))
### wrangling back-to-back game data
team_game_stats$GameDate <- as.Date(substring(team_game_stats$Date, 6), "%b %d, %Y")
team_game_stats$consecutive <- c(NA,diff(as.Date(team_game_stats$GameDate)) == 1)
# game results after back-to-back game
game_consecutive[[team]] <- as.data.frame.matrix(table(team_game_stats$consecutive, team_game_stats$isWin))
}
return(list("game_after_streak"=game_after_streak, "game_consecutive"=game_consecutive))
}
team_multiple_game_data_2019 <- team_multiple_game_data(2019)
# chi-square test for independence
tests<-lapply(team_multiple_game_data_2019$game_after_streak, chisq.test)
names(which(sapply(tests, function(f) f$p.value) < 0.05))
## [1] "Philadelphia 76ers"
# merge winning streak data of all teams
all_team_streaks <- bind_rows(team_multiple_game_data_2019$game_after_streak, .id = "team")
all_team_streaks$Streak <- unlist(lapply(team_multiple_game_data_2019$game_after_streak, rownames))
colnames(all_team_streaks)[2:3] <- c("lose","win")
all_team_streaks$winProb <- all_team_streaks$win/(all_team_streaks$win + all_team_streaks$lose)
# plot
all_team_streaks_plot <- all_team_streaks %>% filter(Streak %in% c("L3","L2","L1","W1","W2","W3"))
all_team_streaks_plot$Streak <- factor(all_team_streaks_plot$Streak,levels = c("L3","L2","L1","W1","W2","W3"))
ggplot(data=all_team_streaks_plot, aes(x=Streak, y=winProb)) +
geom_boxplot(show.legend = F) +
labs(title = "Team Performances After Winning or Losing Streak (Season 2018-19)") +
ylab("Proportion of game following streak won")
The NBA regular season is packed into a tight schedule of 82 games for each team, so sometimes a team finds itself having to play on consecutive (back-to-back) nights, leaving less time for rest and recovery in between games, which may take a toll on the players. We were wondering whether as a result, teams perform worse if facing a back-to-back schedule. Using game records from the 2018-19 season, we see that the average proportion of games won by all teams in the league when the game is a back-to-back game is somewhat lower compared to the winning proportion if the teams had extra days to rest. Chi-Square tests of independence indicated that a statistically significant relationship between whether a game was played back-to-back and whether the game was won or lost existed only for the Washington Wizards.
# merge and plot back-to-back game data of all teams
all_team_consecutive <- bind_rows(team_multiple_game_data_2019$game_consecutive, .id = "team")
all_team_consecutive$consecutive <- unlist(lapply(team_multiple_game_data_2019$game_consecutive, rownames))
colnames(all_team_consecutive)[2:3] <- c("lose","win")
all_team_consecutive$winProb <- all_team_consecutive$win/(all_team_consecutive$win + all_team_consecutive$lose)
ggplot(data=all_team_consecutive, aes(x=consecutive, y=winProb)) +
geom_boxplot(show.legend = F) +
labs(title = "Team Performances In Back-to-back Games (Season 2018-19)") +
ylab("Proportion of game won") +
xlab("Back-to-back game or not")
# chi-square test for independence
tests<-lapply(team_multiple_game_data_2019$game_consecutive, chisq.test)
names(which(sapply(tests, function(f) f$p.value) < 0.05))
## [1] "Washington Wizards"
Basketball is a team sport, but the value of individual talent is not to be overlooked. Win shares is a metric that is designed to estimate an individual player’s contribution to the team in terms of wins. We identify the win shares of the player with the most win shares on each team (only considering players with win shares in the top 100 of the league). We can see that teams that have a dominant player tend to perform better. Note that some teams are missing from the plot because none of the players on the team is ranked in the top 100.
team_ws_plot = function(year){
team_stats_year = get_team_year_feature(year)
game_stats_year = get_game_year_feature(year)
colnames(team_stats_year)[c(2,3)] = c("W.pre","L.pre")
WL_record = cbind(west_ind,W = 0, L=0)
WL_record = WL_record_cal(game_stats_year,WL_record)
colnames(WL_record)[c(4,5)] = c("W.all","L.all")
WL_record = WL_record %>%
group_by(western) %>%
mutate(Season_Rank = order(order(W.all, decreasing = T))) %>%
inner_join(team_stats_year[,c(1,2,3,14,15)],by=c("Tm","TeamName"))
Winshare_feature = get_ws_injury_trans_data(year)
WL_record_ws = WL_record %>%
inner_join(Winshare_feature, by='Tm')
conference_names <- c(`1` = "Western Conference", `0` = "Eastern Conference")
WL_record_ws <- WL_record_ws %>%
group_by(western) %>%
ungroup() %>%
arrange(western, desc(Season_Rank)) %>%
mutate(order = row_number())
p = WL_record_ws %>%
ggplot(aes(WS_max, order))+
geom_segment(aes(x = 0,
y = order,
xend = WS_max,
yend = order),
color = "black")+
geom_point(aes(col = factor(western)),size = 7)+
geom_text(aes(label=WS_max),color="white",size=2.5)+
facet_wrap(~ factor(western), scales="free", labeller = as_labeller(conference_names))+
labs(title = paste0('Win Shares on Teams (Season ',year-1,'-',substr(year,3,4),")"))+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position = "bottom")+
guides(colour=FALSE)+
xlab("Win shares of player with most win shares on each team")+
ylab("End of regular season conference standings")+
scale_y_continuous(
breaks = WL_record_ws$order,
labels = paste0(WL_record_ws$Season_Rank,". ",WL_record_ws$Tm)
)
print(p)
}
#png("./plots/team_ws_plot.png",width = 960,height = 480)
team_ws_plot(2019)
#dev.off()
We next examine how players’ win shares vary over NBA seasons. We identify the 20 players in the league who have had among the highest single-season win shares during the 2014-15 to 2018-19 season timeframe, and examine how the win shares of these players varied over those five seasons. We see that the win shares of these top players are fairly consistent over seasons. Never does one of these player fail to place in the top 100 in the league in terms of win shares, and over half of these players have consistently made the top 20 list.
get_ws_raw_data = function(year){
ws_url = paste0("https://www.basketball-reference.com/play-index/psl_finder.cgi?request=1&match=single&type=totals&per_minute_base=36&per_poss_base=100&season_start=1&season_end=-1&lg_id=NBA&age_min=0&age_max=99&is_playoffs=N&height_min=0&height_max=99&year_min=",
year,"&year_max=",year,"&birth_country_is=Y&as_comp=gt&as_val=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=ws")
ws_stats = ws_url %>% read_html %>%
html_table() %>%
.[[1]]
ws_stats = ws_stats[,c(2,3,5,7)]
colnames(ws_stats)= ws_stats[1,]
ws_stats = ws_stats[-c(1,which(ws_stats$Player=='Player')),]
ws_stats$WS = as.numeric(ws_stats$WS )
return(ws_stats)
}
player_ws_plot = function(){
year_list = c(2015:2019)
ws_stats = NULL
for(i in 1: length(year_list)){
ws_tmp = get_ws_raw_data(year_list[i])
ws_stats = rbind(ws_stats, ws_tmp)
}
ws_stats = ws_stats %>% arrange(-WS)
player_rank_list = unique(ws_stats$Player)
ws_stats_rank = ws_stats %>%
group_by(Season) %>%
mutate(WSRank = rank(-WS,ties.method = "first"))
p = ws_stats_rank %>%
filter(Player %in% player_rank_list[1:20]) %>%
ggplot(aes(x = Player, y = WSRank,fill = Player))+
geom_boxplot()+
labs(title = "Consistency of Player Win Shares Over 2014-15 to 2018-19 Season")+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position = "none")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ylab("Win shares ranking in league")
print(p)
}
#png("./plots/player_ws_plot.png",width = 960,height = 480)
player_ws_plot()
#dev.off()
The performance of a team could be captured by both team-level and player-level statistics.
The team-level features included in our current model are:
The player-level features that are incorporated in the team performance metrics include:
library(igraph)
library(tcltk)
library(dplyr)
#dat<-read.csv("./data/game_stats.csv",stringsAsFactors = F)
dat <- game_stats %>%
filter(Season=='2018-19')
dat$raw_points<-abs(dat$PTS.1-dat$PTS)
dat$win<-ifelse(dat$PTS.1-dat$PTS>0,dat$Home.Neutral,dat$Visitor.Neutral)
dat$lose<-ifelse(dat$PTS.1-dat$PTS>0,dat$Visitor.Neutral,dat$Home.Neutral)
dat_igraph<-data.frame(A=dat$lose,B=dat$win,weight=dat$raw_points)
dat_igraph<-dat_igraph%>%group_by(A,B)%>%summarise(weight=mean(weight))
dat_graph<-graph.data.frame(d=dat_igraph,directed = TRUE)
is_weighted(dat_graph)
# plot.igraph(dat_graph)
#tkplot(dat_graph)
adj_matrix<-get.adjacency(dat_graph,attr="weight")
rank <- sort(page.rank(dat_graph,directed=T)$vector,decreasing = T)
##first twenty
dat_igraph_20<-data.frame(A=dat$lose[1:300],B=dat$win[1:300],weight=dat$raw_points[1:300])
dat_igraph_20<-dat_igraph_20%>%group_by(A,B)%>%summarise(weight=mean(weight))
dat_graph_20<-graph.data.frame(d=dat_igraph_20,directed = TRUE)
is_weighted(dat_graph_20)
#tkplot(dat_graph)
adj_matrix<-get.adjacency(dat_graph_20,attr="weight")
rank <- sort(page.rank(dat_graph_20,directed=T)$vector,decreasing = T)
#plot
west_ind_sort<-west_ind[match(get.vertex.attribute(dat_graph_20,"name"), west_ind$TeamName),]
dat_graph_20<-set_vertex_attr(dat_graph_20,"western",value=west_ind_sort$western)
dat_graph_20<-set_vertex_attr(dat_graph_20,"Tm",value=as.character(west_ind_sort$Tm))
colrs<-c("gray50","tomato")
V(dat_graph_20)$color <- colrs[as.factor(V(dat_graph_20)$western)]
deg <- 2*degree(dat_graph_20, mode="in")
plot.igraph(dat_graph_20,
edge.arrow.size=0.3,
vertex.color=colrs[as.factor(V(dat_graph_20)$western)],
vertex.label=V(dat_graph_20)$Tm,
vertex.size=deg,
edge.width=4*E(dat_graph_20)$weight/max(E(dat_graph_20)$weight))
We used a generalized Bradley-Terry logistic regression model as the basis to generate winning probabilities of teams. We chose the logit model over more complex machine learning models for the following reasons:
\[ logit(Pr(i \succ j \ \text{in game} \ k )) = \lambda_i - \lambda_j + \delta \cdot I(i \in adv)+ \gamma^{\top} z_{ij} \]
where \(\lambda_i\) and \(\lambda_j\) are the team performance vectors defined based on feature engineering. \(I(I \in adv)\) is the indicator for home-court advantage, and \(z_{ij}\) are some potential effects not included in the team performance, such as western-eastern effect.
Cross-Entropy Loss: \[ \mathcal{L} = - \sum_{i=1}^N y_i log(\hat{p_i})+(1-y_i) log(1-\hat{p_i}) \]
Since there is overdispersion in the data, we use a quasibinomial model instead with dispersion parameter \(\approx\) 1.
We used data from the first 20 games of each team (total of 300 games combined) of the 2015-16 to 2017-18 seasons as the training set and the first 20 games of each team of the 2018-19 season as the test set. A generalized Bradley-Terry quasibinomial model was fit using the data. The model takes two teams into account at a time. Features selected (refer to Feature Engineering section above) are stored as vectors \(\lambda_i\) and \(\lambda_j\) for each team. The Bradley-Terry model is parametrized by the differences in terms of PTS (points), TRB (total rebounds), AST (assists), STL (steals), BLK (blocks), TOV (turnovers) and PF (personal fouls) between each of the two teams, plus an indicator of home-court advantage, and other trivial effects such as western-eastern effect. Cross-entropy loss was used to validate the model. The model would generate the probability of winning and losing for each of the two teams. If a team has a probability of winning that is greater than 0.5, we would regard the team as the winner, and vice versa.
In the following example, the home-court indicator (which is the intercept term in model summary) is statistically significant(p-value<1e-8). Team total personal fouls (p-value = 0.0251) and number of high winshare players (p-value = 0.028) are two other factors that contribute to winning. To interpret how the number of high winshare players could affect a game, if we increase one high winshare player in a team and hold other covariates unchanged, the odds of winning would be \(exp(0.2314) = 1.26\) (95% CI: (1.03, 1.55)) times of the original odds.
For illustration purposes, if we use the first 300 games of the 2016-17 season as the training set and the first 300 games of the 2017-18 season as the test set (each team having played approximately 20 games), then we get a prediction accuracy of 0.678 ([0.647,0.708]) with a sensitivity of 0.52 and specificity of 0.79.
library(caret)
get_train_test_data = function(year, firstN){
team_stats_year = get_team_year_feature(year)
game_stats_year = get_game_year_feature(year)
train_set = game_stats_year[-c(1:firstN),]
feature_idx = -c(1,3,4,14)
team_per_diff = data.frame(team_stats_year[match(train_set$Home,team_stats_year$TeamName),feature_idx]
- team_stats_year[match(train_set$Away,team_stats_year$TeamName),feature_idx])
train_set_teamper = data.frame(train_set,
team_per_diff = I(as.matrix(team_per_diff)))
return(list(data_logit = train_set_teamper, data_ML = data.frame(cbind(train_set, team_per_diff))))
}
build_logit_model_year = function(train_year = 2018, test_year = 2019, firstN = 300){
train_set = get_train_test_data(train_year,firstN)$data_logit
test_set = get_train_test_data(test_year,firstN)$data_logit
print(colnames(train_set))
## logit model
logit_model = glm(HomeWin~ team_per_diff ,family=quasibinomial(link='logit'),
data = train_set)
print(summary(logit_model)) # model summary
## training accuracy
print(confusionMatrix(factor(ifelse(logit_model$fitted.values>0.5,1,0)),
factor(train_set$HomeWin)))
## testing accuracy
pred_win_prob = predict(logit_model,newdata = test_set, type = "response")
pred_game = ifelse(pred_win_prob>0.5,1,0)
test_set_pred = cbind(test_set,pred_win_prob,pred_game)
print(confusionMatrix(factor(test_set_pred$pred_game),factor(test_set_pred$HomeWin)))
return(test_set_pred)
}
test_set_pred = build_logit_model_year(2017,2018)
## [1] "Date" "Home" "HomePTS" "Away"
## [5] "AwayPTS" "HomeWin" "team_per_diff"
##
## Call:
## glm(formula = HomeWin ~ team_per_diff, family = quasibinomial(link = "logit"),
## data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1058 -1.1285 0.6468 0.9682 1.9771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4174283 0.0720854 5.791 9.62e-09 ***
## team_per_diffW -0.0117562 0.0824854 -0.143 0.8867
## team_per_diffTeamPTS -0.0008002 0.0017669 -0.453 0.6507
## team_per_diffOpponentPTS -0.0022329 0.0017428 -1.281 0.2005
## team_per_diffTeamTRB -0.0014532 0.0018226 -0.797 0.4255
## team_per_diffTeamAST 0.0022276 0.0024742 0.900 0.3682
## team_per_diffTeamSTL 0.0004728 0.0040842 0.116 0.9079
## team_per_diffTeamBLK -0.0012661 0.0052851 -0.240 0.8107
## team_per_diffTeamTOV 0.0034292 0.0025849 1.327 0.1850
## team_per_diffTeamPF 0.0046855 0.0020886 2.243 0.0251 *
## team_per_diffwestern 0.1196188 0.1345914 0.889 0.3744
## team_per_diffConfRank -0.0167029 0.0563909 -0.296 0.7671
## team_per_diffWS_max 0.0435573 0.0326032 1.336 0.1819
## team_per_diffWS_num 0.2314029 0.1051208 2.201 0.0280 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 1.006972)
##
## Null deviance: 1258.0 on 929 degrees of freedom
## Residual deviance: 1139.6 on 916 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 178 110
## 1 202 440
##
## Accuracy : 0.6645
## 95% CI : (0.6331, 0.6948)
## No Information Rate : 0.5914
## P-Value [Acc > NIR] : 2.668e-06
##
## Kappa : 0.2789
##
## Mcnemar's Test P-Value : 2.579e-07
##
## Sensitivity : 0.4684
## Specificity : 0.8000
## Pos Pred Value : 0.6181
## Neg Pred Value : 0.6854
## Prevalence : 0.4086
## Detection Rate : 0.1914
## Detection Prevalence : 0.3097
## Balanced Accuracy : 0.6342
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 201 111
## 1 188 430
##
## Accuracy : 0.6785
## 95% CI : (0.6474, 0.7084)
## No Information Rate : 0.5817
## P-Value [Acc > NIR] : 8.014e-10
##
## Kappa : 0.3204
##
## Mcnemar's Test P-Value : 1.107e-05
##
## Sensitivity : 0.5167
## Specificity : 0.7948
## Pos Pred Value : 0.6442
## Neg Pred Value : 0.6958
## Prevalence : 0.4183
## Detection Rate : 0.2161
## Detection Prevalence : 0.3355
## Balanced Accuracy : 0.6558
##
## 'Positive' Class : 0
##
head(test_set_pred)
## Date Home HomePTS Away
## 1 Tue, Nov 28, 2017 Sacramento Kings 87 Milwaukee Bucks
## 2 Tue, Nov 28, 2017 Utah Jazz 106 Denver Nuggets
## 3 Wed, Nov 29, 2017 Detroit Pistons 131 Phoenix Suns
## 4 Wed, Nov 29, 2017 Orlando Magic 121 Oklahoma City Thunder
## 5 Wed, Nov 29, 2017 Philadelphia 76ers 118 Washington Wizards
## 6 Wed, Nov 29, 2017 New York Knicks 115 Miami Heat
## AwayPTS HomeWin team_per_diff.W team_per_diff.TeamPTS
## 1 112 0 -5.0 -156.0
## 2 77 1 -2.0 -100.0
## 3 107 1 7.0 -24.0
## 4 108 1 0.0 99.0
## 5 113 1 1.0 33.0
## 6 86 1 0.0 66.0
## team_per_diff.OpponentPTS team_per_diff.TeamTRB team_per_diff.TeamAST
## 1 25.0 14.0 -49.0
## 2 -121.0 -101.0 -44.0
## 3 -267.0 -86.0 77.0
## 4 225.0 -4.0 63.0
## 5 85.0 131.0 70.0
## 6 47.0 28.0 50.0
## team_per_diff.TeamSTL team_per_diff.TeamBLK team_per_diff.TeamTOV
## 1 -5.0 -28.0 15.0
## 2 31.0 24.0 -23.0
## 3 19.0 -41.0 -40.0
## 4 -56.0 0.0 16.0
## 5 16.0 14.0 57.0
## 6 -9.0 -3.0 0.0
## team_per_diff.TeamPF team_per_diff.western team_per_diff.ConfRank
## 1 -61.0 1.0 6.0
## 2 64.0 0.0 1.0
## 3 -87.0 -1.0 -10.0
## 4 -21.0 -1.0 0.0
## 5 60.0 0.0 -2.0
## 6 -5.0 0.0 -1.0
## team_per_diff.WS_max team_per_diff.WS_num pred_win_prob pred_game
## 1 -12.4 -1.0 0.3687277 0
## 2 4.6 1.0 0.8115673 1
## 3 6.7 1.0 0.7914600 1
## 4 -13.1 -3.0 0.1865677 0
## 5 -4.6 -2.0 0.4979266 0
## 6 -3.7 -2.0 0.4253650 0
We then proceeded with running simulations to predict the 2019-20 Western Conference playoff teams. We created the functions WL_record_cal( )
to calculate the number of wins and losses for each team given the game schedule, and game_simulation( )
to simulate the process 100 times. To realize this, we first generate 300 team encounters randomly as a game schedule. We simulate each game using rbinom
with the probability parameters equal to our predicted winning probabilities obtained from the generalized BT model. Given the numbers of wins and losses in the simulation, we would then rank the teams and determine the top 8 teams. Such a strategy would allow us to obtain for each team average, median, and confidence intervals of regular season wins as well as playoff probabilities.
Using the 2018-19 season as an illustration, our simulations result in HOU winning on average 62.4 of the 82 regular season games, and making it to the playoffs in all simulations, so in our mind the chance of HOU making the playoffs is 100%. On the other hand, we predict other teams (i.e. SAC and PHO) to have a 0% chance of making the playoffs, which means we are certain that these teams have no hopes of making the playoffs.
## Remaining 62 games
set.seed(2019)
B = 100
game_simulation = function(pred_result, B, year, Nfirst = 300){
game_stats_year = get_game_year_feature(year)
pretrain_set = game_stats_year[1:Nfirst,]
train_set = game_stats_year[-c(1:Nfirst),]
WL_record = cbind(west_ind,W = 0, L=0)
WL_record_pre = WL_record_cal(pretrain_set,WL_record)
if (year!=2020){
WL_record_post = WL_record_cal(train_set,WL_record)
WL_record_all = WL_record_cal(train_set,WL_record_pre)
WL_record_west = WL_record_all %>%
filter(western==1) %>%
arrange(-W)
}
if (year==2020){
WL_record_west = WL_record_pre
}
WL_simu_rep = NULL
for(k in 1:B){
simu_game = rep(0,nrow(pred_result))
for(i in 1:nrow(pred_result)){
simu_game[i] = rbinom(1,1, pred_result$pred_win_prob[i])
}
simu_result = pred_result %>%
select(Home,Away,HomeWin) %>%
mutate(HomeWin = simu_game)
# WL_record = cbind(west_ind,W = 0, L= 0)
WL_simu = WL_record_cal(simu_result,WL_record_pre)
WL_simu_west = WL_simu %>%
filter(western==1) %>%
mutate(TeamRank = rank(-W,ties.method = "random")) %>%
mutate(playoff = ifelse(TeamRank<=8,1,0))
WL_simu_rep = rbind(WL_simu_rep, WL_simu_west)
}
ans = list(simu = WL_simu_rep, result = WL_record_west)
return(ans)
}
simu_result = game_simulation(test_set_pred, B, year = 2018)
train_set_simu = simu_result$simu
train_set_summary = train_set_simu %>%
group_by(Tm) %>%
summarise(W_avg = mean(W),
L_avg = 82 - W_avg,
W_med = median(W),
L_med = 82 - W_med,
playoff_prob = sum(playoff)/B) %>%
arrange(-W_avg)
print(train_set_summary)
## # A tibble: 15 x 6
## Tm W_avg L_avg W_med L_med playoff_prob
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HOU 62.4 19.6 62 20 1
## 2 GSW 58.2 23.8 59 23 1
## 3 UTA 51.7 30.3 52 30 0.99
## 4 OKC 50.6 31.4 50 32 0.99
## 5 NOP 45.2 36.8 45 37 0.84
## 6 SAS 44.8 37.2 45 37 0.78
## 7 MIN 44.6 37.4 45 37 0.7
## 8 POR 43.6 38.4 44 38 0.7
## 9 LAC 42.3 39.7 42 40 0.580
## 10 DEN 39.4 42.6 40 42 0.33
## 11 LAL 33.9 48.1 34 48 0.05
## 12 MEM 33.8 48.2 33.5 48.5 0.02
## 13 DAL 33.0 49.0 32.5 49.5 0.02
## 14 SAC 27.8 54.2 28 54 0
## 15 PHO 22.0 60.0 22 60 0
WL_record_west = simu_result$result
print(WL_record_west)
## Tm western TeamName W L
## 1 HOU 1 Houston Rockets 65 17
## 2 GSW 1 Golden State Warriors 58 24
## 3 POR 1 Portland Trail Blazers 49 33
## 4 NOP 1 New Orleans Pelicans 48 34
## 5 UTA 1 Utah Jazz 48 34
## 6 OKC 1 Oklahoma City Thunder 48 34
## 7 MIN 1 Minnesota Timberwolves 47 35
## 8 SAS 1 San Antonio Spurs 47 35
## 9 DEN 1 Denver Nuggets 46 36
## 10 LAC 1 Los Angeles Clippers 42 40
## 11 LAL 1 Los Angeles Lakers 35 47
## 12 SAC 1 Sacramento Kings 27 55
## 13 DAL 1 Dallas Mavericks 24 58
## 14 MEM 1 Memphis Grizzlies 22 60
## 15 PHO 1 Phoenix Suns 21 61
As an extension, we also considered the year effect, or in other words the effect of clustering of different years’ data. Intuitively, this is important because if data from the 2019-20 season resemble data from one of the past seasons but not another, then the data more dissimilar from that of the 2019-20 season would be less representative.
To explore whether clustering effects exist, we created the function team_pc_plot()
, which first uses principal component analysis (PCA) to reduce our higher dimensional data with a larger set of variables to a smaller set of just two principal components (PCs) that still manages to retain most of the information in the original data. The new coordinate system and transformed data defined in terms of the two PCs are shown. Concentration ellipses are then used to encircle data of all teams from the same season. From the graph, we can see that data for separate years are closely packed together and there is no clear pattern of a year-effect.
We also visualized the distance between every pair of seasons with a season distance matrix. We can see from the matrix that the closer together two years are, the smaller the in-between distance is, which is in line with our intuition.
library(factoextra)
combine_season_data = function(){
year_list = c(2016:2020)
team_stats = NULL
for(i in 1:length(year_list)){
year = year_list[i]
team_stats_year = get_team_year_feature(year)
team_stats_year = team_stats_year %>%
mutate(Season = paste0(year-1,'-',substr(year,3,4)),
year = year)
team_stats = rbind(team_stats,team_stats_year)
}
return(team_stats)
}
team_stats_all = combine_season_data()
## team pc plot
team_pc_plot = function(){
#team_stats_all = combine_season_data()
team_stats_X = team_stats_all %>%
select(W,L,`W/L%`,TeamPTS, OpponentPTS,western,ConfRank,WS_max,WS_num)
team_stats_pc = prcomp(team_stats_X,scale. = T)$x[,1:2] # select first two PCs
rownames(team_stats_X) = paste(team_stats_all$Tm,team_stats_all$Season)
team_stats_pc = data.frame(team_stats_pc,team_stats_all[,c('Tm', 'TeamName','Season','western','year')])
league_summary = team_stats_pc %>%
group_by(Season,year) %>%
summarise(PC1_med = median(PC1),
PC2_med = median(PC2)) %>%
as.data.frame()
rownames(league_summary) = league_summary$Season
p1 = # team_stats_pc %>%
ggplot(data = team_stats_pc, aes(x= PC1, y= PC2,col = factor(year)))+
geom_point()+
geom_text(aes(label=paste(Tm,Season)),size = 2, vjust = 0, nudge_y = 0.1)+
labs(title = 'PC plots for NBA teams from 2015-16 to 2019-20')+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position = "bottom")+
geom_point(aes(x=PC1_med,y=PC2_med),data = league_summary, size = 5, shape = 7)
# print(p1)
p1.pca = fviz_pca_ind( prcomp(team_stats_X,scale. = T),
#label = "none", # hide individual labels
habillage = team_stats_all$year, # color by groups
addEllipses = TRUE, # Concentration ellipses
labelsize = 3)+
labs(title = "PC plot of NBA teams from 2015-16 to 2019-20")+
theme(plot.title = element_text(hjust = 0.5))
print(p1.pca)
year_dist = dist(league_summary[,-1],diag = T,upper = T)
p3 = fviz_dist(year_dist,
order=F,gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))+
labs(title = "Season Distance Matrix")+
theme(plot.title = element_text(hjust = 0.5))
print(p3)
return(year_dist)
}
year_dist = team_pc_plot()
We created another function team_cluster_plot()
that uses K-means clustering to group teams from multiple seasons into 5 clusters. The choice of 5 was based on the fact that we have data from 5 NBA seasons. We see that the algorithm results in clusters where observations from the same season are not necessarily grouped together in the same cluster. Instead, it seems that teams are clustered together based on how strong they are. For example, teams in the olive cluster are among the most competitive teams (e.g. GSW 2015-16), while teams in the blue cluster are among the least competitive teams (e.g. PHI 2015-16). Thus, again, it seems that season/year patterns are not apparent.
## kmeans clustering plot
library(factoextra)
team_cluster_plot = function(){
team_stats_X = team_stats_all %>%
select(W,L,`W/L%`,TeamPTS, OpponentPTS,western,ConfRank,WS_max,WS_num) %>%
as.data.frame()
rownames(team_stats_X) = paste(team_stats_all$Tm,team_stats_all$Season)
k5 = kmeans(scale(team_stats_X), centers = 5, nstart = 15)
p = fviz_cluster(k5, data = team_stats_X,
labelsize = 5)+
labs(title = "cluster plot of NBA teams from 2015-16 to 2019-20")+
theme(plot.title = element_text(hjust = 0.5))
print(p)
}
team_cluster_plot()
Despite the year-effect not being highly evident in our data, to make our model more precise and flexible, we still would like to compare the difference between a model with and without the year-effect taken into account. Our original model presented above involves the training of a single model using data across multiple seasons. To consider a model which incorporates the year effect, we created the function weighted_game_prediction( )
, which trains a separate model for each year, and then makes weighted predictions, with the weight proportional to the inverse of the distance between a year and its following year.
Since we do not observe year-effect in our data, the prediction accuracy of the extended model does not improve compared to the original model. However, this function is still of great use as it provides users with an additional option and users can always choose the model that better suits their data.
weighted_game_prediction = function(pred_result,year_dist){
year = (ncol(pred_result))/2
print(year)
year_dist_mat = as.matrix(year_dist)
year_weight = (1/year_dist_mat[year+1,1:year])/sum(1/year_dist_mat[year+1,1:year])
print(year_weight)
for(i in 1:nrow(pred_result)){
pred_result$pred_win_weighted_prob[i] = sum(pred_result[i,seq(1,2*year,by = 2)] * year_weight)
pred_result$pred_game_vote[i] = ifelse(pred_result$pred_win_weighted_prob[i]>=0.5,1,0)
}
return(pred_result)
}
win_prob_prediction = function(train_year = c(2016,2017,2018), test_year = 2019, firstN = 300,
year_effect = 0){
## ignore year effect
if (year_effect == 0){
train_set = NULL
for(i in 1:length(train_year)){
train_set_tmp = get_train_test_data(train_year[i],firstN)$data_logit
train_set = rbind(train_set, train_set_tmp)
}
test_set = get_train_test_data(test_year,firstN)$data_logit
## logit model
logit_model = glm(HomeWin~ team_per_diff,family=binomial(link='logit'),
data = train_set)
print(summary(logit_model)) # model summary
## training accuracy
print("training accuracy")
print(confusionMatrix(factor(ifelse(logit_model$fitted.values>0.5,1,0)),
factor(train_set$HomeWin)))
## testing accuracy
pred_win_prob = predict(logit_model,newdata = test_set, type = "response")
pred_game = ifelse(pred_win_prob>0.5,1,0)
test_set_pred = cbind(test_set,pred_win_prob,pred_game)
if (test_year!=2020){
print("testing accuracy")
print(confusionMatrix(factor(test_set_pred$pred_game),factor(test_set_pred$HomeWin)))
}
return(test_set_pred)
}
# consider year effect
if (year_effect == 1){
test_set = get_train_test_data(test_year,firstN)$data_logit
test_set_pred = test_set
for(i in 1:length(train_year)){
train_set= get_train_test_data(train_year[i],firstN)$data_logit
logit_model = glm(HomeWin~ team_per_diff,family=binomial(link='logit'),
data = train_set)
pred_win_prob = predict(logit_model,newdata = test_set, type = "response")
pred_game = ifelse(pred_win_prob>0.5,1,0)
test_set_pred = data.frame(test_set_pred,pred_win_prob,pred_game)
}
a = weighted_game_prediction(test_set_pred[,-c(1:7)],year_dist = year_dist)
ans = cbind(test_set_pred[,c(1:7)],a)
if (test_year!=2020){
print(confusionMatrix(factor(a$pred_game_vote),factor(test_set_pred$HomeWin)))
}
return(ans)
}
}
win_prob_pred20_all = win_prob_prediction(train_year = c(2016:2019),test_year = 2020)
##
## Call:
## glm(formula = HomeWin ~ team_per_diff, family = binomial(link = "logit"),
## data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3765 -1.0816 0.6199 0.9596 2.2414
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4186805 0.0363912 11.505 < 2e-16 ***
## team_per_diffW 0.0439647 0.0262862 1.673 0.09442 .
## team_per_diffTeamPTS -0.0004349 0.0007099 -0.613 0.54016
## team_per_diffOpponentPTS -0.0013161 0.0006936 -1.897 0.05777 .
## team_per_diffTeamTRB 0.0004538 0.0006851 0.662 0.50773
## team_per_diffTeamAST 0.0022221 0.0007812 2.845 0.00445 **
## team_per_diffTeamSTL 0.0010245 0.0014531 0.705 0.48076
## team_per_diffTeamBLK -0.0021718 0.0014684 -1.479 0.13914
## team_per_diffTeamTOV 0.0013060 0.0013544 0.964 0.33490
## team_per_diffTeamPF 0.0018898 0.0008735 2.164 0.03050 *
## team_per_diffwestern 0.1192149 0.0623308 1.913 0.05580 .
## team_per_diffConfRank -0.0130932 0.0166226 -0.788 0.43089
## team_per_diffWS_max 0.0290981 0.0095124 3.059 0.00222 **
## team_per_diffWS_num 0.2158275 0.0342267 6.306 2.87e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5042.1 on 3719 degrees of freedom
## Residual deviance: 4475.1 on 3706 degrees of freedom
## AIC: 4503.1
##
## Number of Fisher Scoring iterations: 4
##
## [1] "training accuracy"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 760 432
## 1 774 1754
##
## Accuracy : 0.6758
## 95% CI : (0.6605, 0.6908)
## No Information Rate : 0.5876
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3081
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4954
## Specificity : 0.8024
## Pos Pred Value : 0.6376
## Neg Pred Value : 0.6938
## Prevalence : 0.4124
## Detection Rate : 0.2043
## Detection Prevalence : 0.3204
## Balanced Accuracy : 0.6489
##
## 'Positive' Class : 0
##
win_prob_pred20_weight = win_prob_prediction(train_year = c(2016:2019),test_year = 2020,year_effect = 1)
## [1] 4
## 2015-16 2016-17 2017-18 2018-19
## 0.1249718 0.1764922 0.2431241 0.4554119
win_prob_pred20_weight = win_prob_pred20_weight[,c(1:7,16,17)]
B = 100
simu_result = game_simulation(win_prob_pred20_all, B, year = 2020)
train_set_simu = simu_result$simu
train_set_summary = train_set_simu %>%
group_by(Tm) %>%
summarise(W_avg = mean(W),
L_avg = 82 - W_avg,
W_med = median(W),
L_med = 82 - W_med,
playoff_prob = sum(playoff)/B) %>%
arrange(-W_avg)
WL_record_west = simu_result$result %>%
filter(western==1)
print(train_set_summary)
## # A tibble: 15 x 6
## Tm W_avg L_avg W_med L_med playoff_prob
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 LAL 61.7 20.3 61 21 1
## 2 LAC 60.6 21.4 60 22 1
## 3 DEN 57.8 24.2 58 24 1
## 4 UTA 52.0 30.0 52 30 1
## 5 HOU 51.4 30.6 52 30 1
## 6 DAL 47.5 34.5 48 34 0.99
## 7 OKC 40.7 41.3 41 41 0.75
## 8 MIN 37.2 44.8 37 45 0.44
## 9 POR 36.6 45.4 37 45 0.43
## 10 SAC 34.9 47.1 35 47 0.2
## 11 SAS 33.2 48.8 33 49 0.09
## 12 PHO 33.0 49.0 33 49 0.09
## 13 NOP 30.0 52.0 30 52 0.01
## 14 MEM 25.8 56.2 26 56 0
## 15 GSW 19.6 62.4 19 63 0
saveRDS(WL_record_west,"./data/pre_20_games.rds")
saveRDS(win_prob_pred20_all,"./data/post_62_games_pred.rds")
get_schedule_data(year,month)
: obtain the game schedules and results in specific year and monthcombine_game_data()
: combine the game schedules dataget_team_stats(year)
: obtain the team-level data (30 teams in the league) of first 20 games in specific yearget_ws_data(year)
: obtain the win share data in specific yearget_team_year_feature(year)
: obtain the team performance feature vectors in specific yearget_game_year_feature(year)
: obtain the game results in tidy format in specific yearget_injury_data(year)
, combine_injury_data()
: obtain and combine the injury datateam_def_off_plot(year)
: show each team’s defense vs offense in specific yearteam_rank_plot(year,Nfirst)
: show each team’s season rank vs first N games’ rank in specific yearhome_way_plot(year)
: show each team’s home and away records in specific year; show western vs eastern recordsteam_ws_plot(year)
: show each team’s highest winshare in specific yearplayer_ws_plot()
: show the winshare of top players in the league in recent yearsteam_multiple_game_data(year)
: wrangle data for team winnnig/losing streaks as well as team back-to-back gamesget_train_test_data(year,firstN)
: obtain the training and test data of first N games in specific year.build_logit_model_year(train_year, test_year, firstN)
: use the first N games of train_year to build the model, and test the model on test_year. return the prediction results (W/L) and winning probability of each game.combine_season_data()
: combine the team performance feature data of multiple seasonsteam_pc_plot()
: create the PC plot for the team performances of all the 30 teams in multiple seasons; the distance matrix of season datateam_cluster_plot()
: create the cluster plot for each team in multiple seasonswin_prob_prediction(train_year, test_year,firstN,year_effect)
: The winning probability prediction function allow for year_effectWL_record_cal(game_schedule,WL_record)
: calculate the W/L records for each team given the game schedulegame_simulation(pred_result,B,year,Nfirst)
: given the predict results in specific year, repeat the game simulation for B timesThe National Basketball Association (NBA) is a professional basketball league in North America that draws the attention of many basketball fans all over the world. The 2019-20 NBA season is currently ongoing, with the regular season having started in October and scheduled to end in April 2020, at which point the playoffs will begin. Rather than wait until next April, as NBA fans well trained in data science, we decided to work on a project with the objective of predicting the Western Conference NBA playoff teams for the current season. Along the process, we also hope to explore what factors can contribute to teams winning more games and thus being able to make the playoffs.
The target audience we have in mind are NBA data analysts and NBA fans. We have prepared an analysis report and a project website that contain some of the more technical details of the analysis that may be of more interest to analysts, as well as a screencast that captures some of the major findings of the project, which NBA fans will hopefully enjoy learning about.
We used data scraped mainly from Basketball Reference (https://www.basketball-reference.com/) with additional injury data scraped from spotrac (https://www.spotrac.com/nba/injured-reserve/). There was the need to scrape the data because there are no well-formatted datasets readily available and there are multiple data tables within the websites that are of interest to us. After the scraping and tidying of the data, the dataset contains game-level, team-level and player-level statistics that we hope to further explore.
To select features that can best predict team performance, we first performed exploratory data analysis. In this process, we learned that on the team-level, the combination of team points and opponent points is needed to capture team performance. We also learned that while over half of the teams that rank in the top eight at the end of the first 20 games will later make the playoffs, there is still a long way to go (62 remaining games) until the end of the regular season, and conference standings are still subject to many changes. In addition, while 3-pointers are becoming a more common way of scoring in the league and many of the best teams in the league in the past few seasons are known to have some of the best 3-point shooters in NBA history, there are no obvious patterns of how volume or accuracy of 3-point shot attempts by teams are related to team winning percentages. Team-level statistics are important in defining team performance, but the value of individual talent on teams is not to be overlooked. Most of the teams that make the playoffs have one or multiple star players (as characterized by a high win shares stats), so it is important to consider their presence on the team and whether they are in good health to play. Asides from team-level and player-level statistics, there are other factors that affect the outcome of a game including where the game is played (i.e. whether there is a home-court advantage) and who the game is played against (e.g. an opponent from the Western Conference may be stronger and harder to beat).
Through the exploratory data analysis process, we decided that the performance of a team could be captured by team-level, player-level statistics and other specific effects. Team-level statistics include: team total/average points, opponent total/average points and other secondary statistics such as total rebounds, assists, blocks, turnovers, personal fouls. Player-level statistics include the highest win share value on the team and number of league top 100 win share players on the team, both taking into account severe player injuries and team transfers. Other specific effects include: home-court advantage and western-eastern effect.
To incorporate these features into our prediction, we built a generalized Bradley-Terry logistic regression model. The model is parametrized by the features selected above. We used a cross-entropy loss to validate the model. The model would generate the probability of wins and losses for each of the two teams. If a team has a probability of winning that is greater than 0.5, we would regard the team as the winner, and vice versa. We then used the winning probabilities given by the model in simulations. We first generated 300 team encounters randomly as a game schedule. We would then assign the probability of winning for each team based on our previous model prediction results. Given the numbers of wins and losses in the simulation, we would then rank the teams and determine the top eight teams. We also considered a model that accounts for year-effect, or in other words the effect of clustering of different years’ data. However, we do not observe an apparent year-effect in our data, so this extended model did not improve our prediction accuracy.
Based on our analysis above, the eight teams in the Western Conference we predict to make the playoffs this season are: Los Angeles Lakers Los Angeles Clippers Denver Nuggets Utah Jazz Houston Rockets Dallas Mavericks Oklahoma City Thunder Minnesota Timberwolves
The major limitation of this project is that the team feature vector used in the prediction model may not be representative of a team at a particular point in time. A team’s ability is subject to change as the season progresses. For one thing, accidental injuries occur often during games. If a star player gets seriously injured like Kevin Durant did last year, the team’s performance will be heavily undermined. Our model accounts for current known player injuries, but cannot anticipate injuries that will occur later in the season. In addition, a team’s roster may change as players get traded.
To improve our current model, three more features could be further explored: streak effect, fatigue (i.e. back-to-back games), and coaching staff. Our analysis did confirm the independence between streak/fatigue and game results for most teams individually. However, the p-value for the t-test between the marginal winning probability and the conditional winning probabilities given fatigue is 0.06 when incorporating all the teams. If we could come up with other ways to characterize streak and fatigue, the two features may help increase the model accuracy. Also, a coach is key to a team and incorporating coaching effects may give better predictions. Besides predicting playoffs, data could be used to help teams improve performances. We are interested in using player statistics and tracking data to investigate decision making during a game. Some of the potential questions are: what is the optimal lineup for a team when facing a specific opponent and at a specific time; what is the best shooting location for a player; at a given time and place, should a player pass the ball or shoot, etc.