
All-NBA Team Prediction Using Linear Models

By Guillem Miralles

8 min read, Jan 6, 2021

Everything explained in this article can be found in my GitHub: https://github.com/GuillemMiralles/All_NBA_Team

Additional information:
After writing this article, we developed a Shiny app that presents the information from this model, together with NBA statistics, in an interactive way. You can see it at the following link: https://guillemmiralles.shinyapps.io/5_42/

1. ABSTRACT

The All-NBA Team is an annual NBA award given to the best players of the season. Voting is conducted by a panel of sports journalists and broadcasters from the United States and Canada. The team has been chosen every NBA season since its inauguration in 1946. The award consists of three teams of five players each (first team, second team and third team). It originally consisted of two teams, but in 1988 it was expanded to three.

Journalists and broadcasters vote for 5 players for the first team, 5 for the second and 5 for the third. A player scores five points for each first-team vote, three points for each second-team vote, and one point for each third-team vote. The five players with the highest point totals form the first team, the next five form the second team, and the same applies to the third team. But there is one restriction: position. Each five-player ballot (composing one team) must contain 2 guards (in our data frame, "PG" and "SG"), 2 forwards ("SF" and "PF") and 1 center ("C"), and the same applies to the other two teams.
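The 5-3-1 scoring rule above can be sketched as follows. This is a minimal Python illustration with made-up ballots and player names (the project itself is written in R); position constraints are applied on top of the point ranking later on.

```python
from collections import Counter

# Hypothetical ballots: each voter names 5 players per team.
ballots = [
    {"first": ["A", "B", "C", "D", "E"],
     "second": ["F", "G", "H", "I", "J"],
     "third": ["K", "L", "M", "N", "O"]},
    {"first": ["A", "B", "C", "D", "F"],
     "second": ["E", "G", "H", "I", "J"],
     "third": ["K", "L", "M", "N", "P"]},
]

# Points awarded per vote, by team.
POINTS = {"first": 5, "second": 3, "third": 1}

totals = Counter()
for ballot in ballots:
    for team, players in ballot.items():
        for player in players:
            totals[player] += POINTS[team]

# Rank players by total points; the real award then fills each team
# under the 2-guards / 2-forwards / 1-center restriction.
ranking = totals.most_common()
print(ranking[:5])
```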

These are, in essence, the 15 best players of the season. We will look at the statistics posted by these 15 players in that season and, with this information, try to predict whether a player makes the All-NBA Team or not.

2. DATA

The data for this model come from the file "Season_Stats.csv", which you can find at https://www.kaggle.com/drgilermo/nba-players-stats. This database contains every player's statistics for each season from 1980 to 2017. It has many variables, a total of 53. Many of them may be correlated and therefore not useful for our model, as we will see later. First, however, we need a small modification: we have to introduce a variable indicating whether or not the player made the All-NBA Team in that season.

NBA Players

In the data we can find many statistics (all explained in the PDF "NBA_English.pdf" in my GitHub repository: https://github.com/GuillemMiralles/All_NBA_Team). Some are simple, like three-point shots or minutes played, and others are more advanced, like a player's defensive contribution per 100 possessions. As we will see, some are more relevant to our model than others.

3. PROCESS

The steps we have followed are as follows:

  • Reading and cleaning the dataframe (reading the CSV, setting variable types, changing null values to 0, deleting rows with duplicate players…)
  • Introducing a new variable into the dataframe that tells us whether or not the player made the All-NBA Team that season.
  • Setting up a training set and a test set. The training set contains data from 1980 to 2011, and the test set from 2011 to 2017 (roughly an 80/20 split).
  • Visualizing the data, we observe that many variables correlate with each other or do not provide relevant information. With so many predictive variables, we therefore want to regularize. For example:

As we can see, the total number of three-pointers made (X3P), the total number of three-pointers attempted (X3PA) and the three-point percentage, made over attempted, (X3P.) are highly correlated variables (and the same goes for two-point shots and other variables).
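This kind of correlation is easy to reproduce. The sketch below uses synthetic numbers standing in for the Kaggle season stats (the project's own analysis is in R): since makes are driven by attempts, X3P and X3PA come out strongly correlated by construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for player seasons: attempts vary widely across
# players, percentages vary narrowly around the league average.
x3pa = rng.uniform(0, 600, size=500)          # 3-point attempts
pct = rng.normal(0.35, 0.05, size=500)        # 3-point percentage
x3p = x3pa * pct                              # 3-pointers made

df = pd.DataFrame({"X3P": x3p, "X3PA": x3pa, "X3P.": pct})

# X3P and X3PA are nearly redundant; X3P. is roughly independent of X3PA.
print(df.corr().round(2))
```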

  • We can observe that many variables carry information that is not entirely relevant. To decide which ones to use, we apply selection techniques that help us find the best predictors for our model: LASSO regularization shrinks the coefficients of uninformative variables to zero, effectively selecting the best ones.
  • We fit a Multiple Logistic Regression (GLM) with these variables and then run stepwise selection, which compares candidate GLMs using the Akaike Information Criterion (AIC).
  • The set of variables we are interested in is greatly reduced. With the variables obtained in the previous step, we fit three models using three different methods, which we then compare: Multiple Logistic Regression (GLM), Quadratic Discriminant Analysis (QDA) and Linear Discriminant Analysis (LDA). We do not use the KNN method because neighboring observations are not informative for predicting this outcome.

GLM:

We make predictions on the test data (2011–2017) and build a confusion matrix showing the true and false positives and the true and false negatives. We also plot a ROC curve to compare the models.
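A minimal sketch of this evaluation step, using scikit-learn on synthetic data in place of the project's R code and real player statistics:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic stand-in for player-season stats and All-NBA labels
# (positives are rare, as in the real data).
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

glm = LogisticRegression().fit(X_tr, y_tr)
prob = glm.predict_proba(X_te)[:, 1]
pred = (prob > 0.5).astype(int)

# Rows: actual class; columns: predicted class.
print(confusion_matrix(y_te, pred))
print("AUC:", roc_auc_score(y_te, prob))
```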

QDA:

We do the same with the QDA method.

LDA:

And finally, with the LDA method.
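The three-way comparison can be sketched like this (again scikit-learn on synthetic data, standing in for the R models used in the project): all three classifiers are fit on the same training set and compared by AUC on the same test set.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Same kind of synthetic stand-in as before.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1.8).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "GLM": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

# Fit each model and compare test-set AUCs.
for name, model in models.items():
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_te, prob):.3f}")
```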

Which one do we choose?

We choose the GLM method, as it is the one that best predicts true positives and negatives. It is the model that most reduces false positives (the errors that matter most to us), although its false negatives are higher than in the other models. All three models are good, but for these reasons we stick with the GLM.

  • With the model's predictions, we create a dataframe containing each player's probability of making the All-NBA Team, along with other variables such as whether he actually made it, his name and his position. But we have a problem: these teams have a position restriction. For this reason, we create a function that selects players to form a team (2 guards, 2 forwards and 1 center). The function first chooses the 2 guards, 2 forwards and 1 center with the highest probabilities in the dataframe (for the first team), then repeats the process for the second and third teams. Finally, we look at the results.
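The team-building function can be sketched as a greedy selection under the position constraint. This is a Python illustration with made-up names and probabilities (the project's actual function is in R):

```python
# Position groups: guards, forwards and centers.
GUARDS, FORWARDS, CENTERS = {"PG", "SG"}, {"SF", "PF"}, {"C"}

def pick_teams(players, n_teams=3):
    """players: list of (name, position, probability) tuples.

    For each team in turn, pick the 2 guards, 2 forwards and 1 center
    with the highest predicted probability among players not yet taken.
    """
    remaining = sorted(players, key=lambda p: p[2], reverse=True)
    teams = []
    for _ in range(n_teams):
        team, quota = [], {"G": 2, "F": 2, "C": 1}
        for player in remaining:
            slot = ("G" if player[1] in GUARDS else
                    "F" if player[1] in FORWARDS else "C")
            if quota[slot] > 0:
                team.append(player)
                quota[slot] -= 1
        teams.append(team)
        remaining = [p for p in remaining if p not in team]
    return teams

# Hypothetical example: 6 guards, 6 forwards and 3 centers.
guards = [("G1", "PG", .99), ("G2", "SG", .95), ("G3", "PG", .90),
          ("G4", "SG", .85), ("G5", "PG", .80), ("G6", "SG", .75)]
forwards = [("F1", "SF", .98), ("F2", "PF", .94), ("F3", "SF", .89),
            ("F4", "PF", .84), ("F5", "SF", .79), ("F6", "PF", .74)]
centers = [("C1", "C", .97), ("C2", "C", .88), ("C3", "C", .78)]

teams = pick_teams(guards + forwards + centers)
print([p[0] for p in teams[0]])   # first-team selection
```

Note that, exactly as described for Kyrie Irving in the 2015 results, a guard with a high probability can still be left off a team if the two guard slots are already filled by players with even higher probabilities.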

4. RESULTS:

- FOR YEAR 2015:

Checking the model on 2015, the results look very accurate. With a database containing many players per season (in this case 650), the model correctly predicts 13 of the 15 players on the All-NBA Team.

Knowing that voting is subjective, depending on a player's game rather than his statistics, our model explains these votes with a very high success rate.

The substitutions table shows the players who should be on the All-NBA Team (Kyrie Irving and Klay Thompson) in place of those our model failed to predict correctly (John Wall and Damian Lillard), together with the probabilities our model assigns them. We can see that our model would have picked Kyrie Irving were it not for the position restriction.

- FOR YEAR 2016:

This is the year with the most mistakes, especially the case of James Harden, to whom the model assigns a 97.8% probability of belonging to the All-NBA Team. Researching the player a bit, we find that he made the All-NBA teams every year from 2013 to 2019 (with the exception of this one), and since 2014 he has always appeared on the first team.

We note that this is the year in which his team got the fewest victories (a difference of 14 compared with other years); the mistake arises because our model does not consider wins. We think this lack of victories influenced the voting, even though his individual statistics were outstanding.

The table shows which players, and with what probability, would be on the All-NBA Team in place of the model's errors.

- FOR YEAR 2017:

This year there are many players with a very high probability of belonging to the All-NBA Team, and very few errors. There are two mistakes, and neither is in the top 10.

Karl-Anthony Towns of the Minnesota Timberwolves had a total of 31 victories and 51 defeats, the fewest wins of all the predicted players. It is also worth noting that Towns finished 16th in the All-NBA voting, 4 points behind DeAndre Jordan, who came in 15th. Gordon Hayward, for his part, had his statistically best year: it was his only season with more than 20 points per game, and the only year he was selected for the NBA All-Star Game.

We see that DeAndre Jordan has a probability of 10.64% and occupies the same position as Karl-Anthony Towns (the error above). DeAndre Jordan is a very defense-oriented player, so his statistics were not outstanding, but he has a very good reputation in the league. His team won 20 more games than Karl-Anthony Towns's team that year, and entered the playoffs near the top of the table.

5. CONCLUSIONS:

In conclusion, our model achieves very high reliability, particularly for the 5 players on the first team. Overall, as we saw in the results, it makes the correct prediction for 37 of 45 players (82.22%). Of these 8 errors, the model assigns a probability above 50% in only two cases.

We find that a team's total victories matter, and they are not among our variables. This is where we think our model makes most of its mistakes. According to our hypothesis, if we had a database from which to extract this variable, along with a variable indicating whether the team made the playoffs, our results would improve significantly.


Written by Guillem Miralles

Hello! My name is Guillem and I am passionate about data science :) Contact: guillemmiralles1@gmail.com & https://www.linkedin.com/in/guillemmiralles/
