How We Built Our Win Probability Model for WNBA Totals in The Action Network App
Stephen Gosling/NBAE via Getty Images. Pictured: Kia Vaughn #1 and Alanna Smith #11 of the Phoenix Mercury
By William Doyle and Carlon Brown
The last two decades have seen a major shift in how people approach data. This began to seriously pervade the sports world as early as 2002, with the success of that year’s Oakland Athletics.
The next data invasion has already begun in our sports betting world, with live win probability in The Action Network app, PRO Systems, and more. It will continue to grow as sports betting becomes legal in more states.
The demand for sports betting-related content, products and analysis continues to grow. This includes offerings for newly covered leagues, such as the WNBA. The Action Network has seen an almost three-fold year-over-year increase in the number of users who track WNBA picks in the app.
Due to that increased demand for WNBA betting content and analysis over the past year, we decided to build a win probability model for the league. It's not in the app yet, but we're hoping to get it live soon.
The current model is a Random Forest regressor that predicts the final total of a given WNBA game. Evaluated on the held-out test set, the model achieved an R^2 = 0.94.
Here’s how we got there.
The Action Network has historical play-by-play data for multiple professional and amateur sports leagues, including the WNBA. In total, 694 complete WNBA games were gathered with an average of 387 plays per game.
Statistics not already included in the play-by-play data were calculated, for example a running points-per-minute (all statistics, including calculated ones, are listed in Table I). All play-by-play data and associated statistics were preprocessed to properly handle missing/null values before model training and testing.
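As a rough illustration of the kind of calculated statistic described above, here is a minimal sketch of a running points-per-minute feature derived from play-by-play rows. The field names (`elapsed_seconds`, `total_points`) are hypothetical stand-ins, not The Action Network's actual schema:

```python
# Hypothetical sketch: derive a running points-per-minute statistic from
# play-by-play rows. Field names are illustrative, not the real schema.

def running_points_per_minute(plays):
    """For each play, compute cumulative combined points per minute elapsed.

    `plays` is a list of dicts holding the cumulative combined score
    ("total_points") and seconds elapsed ("elapsed_seconds") at that play.
    Early plays with zero elapsed time fall back to 0.0 to avoid division
    by zero -- one simple way to handle the missing/null cases mentioned
    above.
    """
    ppm = []
    for play in plays:
        minutes = play["elapsed_seconds"] / 60.0
        ppm.append(play["total_points"] / minutes if minutes > 0 else 0.0)
    return ppm
```

The same pattern extends to any other running statistic (fouls per minute, turnovers per minute, and so on) computed over the play-by-play stream.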
In addition, each game’s closing total line from a single sportsbook was gathered. Each statistic except for the end of game total in Table I was a feature input in the model. The target value was the end of game total.
The model’s input was a 1-dimensional array of 14 features, and its output was a single value: the predicted end-of-game total.
Figure I. Schematic Representation
Multiple candidate models were trained on the training set. The top three models with no hyperparameter tuning were selected, and a random search over hyperparameters was performed on each.
The Random Forest Regressor performed best. An exhaustive grid search was performed with the Random Forest model, and the best hyperparameters were selected.
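The two-stage search described above can be sketched with scikit-learn. The feature matrix, target, and hyperparameter grids below are placeholders for illustration only; they are not the actual 14 features or the grids used in the real model:

```python
# Sketch of random search followed by grid search on a random forest,
# using synthetic stand-in data (500 "plays" x 14 features).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 14))                  # stand-in feature matrix
y = X[:, 0] * 10 + 150 + rng.normal(size=500)   # stand-in game totals

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stage 1: broad random search over hyperparameters.
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [None, 5, 10, 20]},
    n_iter=5, cv=3, random_state=0,
)
random_search.fit(X_train, y_train)

# Stage 2: exhaustive grid search in the neighborhood of the best
# random-search settings (a single point here for brevity).
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [random_search.best_params_["n_estimators"]],
                "max_depth": [random_search.best_params_["max_depth"]]},
    cv=3,
)
grid_search.fit(X_train, y_train)
print(grid_search.score(X_test, y_test))  # R^2 on the held-out test set
```

Random search cheaply narrows the hyperparameter space; the exhaustive grid search then refines within the promising region, which is much faster than grid-searching the full space from the start.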
The model performed well with an R^2 = 0.94; however, it predicts the total of the game, not the probability of a bet winning. A probability of 0.5 was assumed for the first few plays of the game (i.e., the opening tip and jump balls) before adjusting to the model’s prediction.
To convert from a predicted win total to a probability, we needed a line from a sportsbook. All closing lines were gathered from one book for this analysis, however this conversion can be achieved with any book line for any game.
Taking the difference between the predicted total l_t and the book line b, and dividing it by the clock remaining in the game r_t, provided a simple and intuitive measure of how likely a total bet is to cash. After some scaling of the denominator, the resulting value was passed through a sigmoid function to force it between 0 and 1, giving it an estimated-probability interpretation.
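Read literally, the conversion is sigmoid((l_t − b) / (k · r_t)) for some scaling constant k. A minimal sketch, where the value of k is our assumption (the article only says the denominator was scaled):

```python
# One reading of the total-to-probability conversion described above.
# The scaling constant k is an illustrative assumption.
import math

def over_probability(predicted_total, book_line, minutes_remaining, k=0.5):
    """Estimated probability that the over cashes.

    Early in the game (lots of clock left) the sigmoid's argument is small
    and the probability stays near 0.5; as the clock runs down, the same
    edge over the line pushes the probability toward 0 or 1.
    """
    if minutes_remaining <= 0:
        # Game over: the outcome is decided outright.
        return 1.0 if predicted_total > book_line else 0.0
    z = (predicted_total - book_line) / (k * minutes_remaining)
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes to (0, 1)
```

Note that when the prediction sits exactly on the line, the function returns 0.5 regardless of the clock, which matches the neutral prior assumed for the opening plays.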
Results and Discussion
As stated above, the final random forest model achieved an R^2 = 0.94. Feature importances were investigated (Table I below) to identify which inputs were most useful in predicting the game total. It was not surprising that points per minute was the single most important predictor.
It also makes sense that two-point percentage and clock remaining were important predictors. Surprisingly, the points-per-minute gradient carried almost no feature importance, even though the underlying statistic it was derived from was the most powerful predictor in the model.
We also originally thought a close-game indicator would help flag more competitive games, where the total may run higher than usual, but this was not the case.
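For readers wanting to reproduce this kind of analysis, importances like those in Table I can be read directly off a fitted scikit-learn random forest. The data and feature names below are illustrative stand-ins, not the model's actual inputs:

```python
# Reading feature importances off a fitted random forest. The synthetic
# target makes the first feature dominate by construction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["points_per_minute", "two_point_pct", "clock_remaining"]
X = rng.normal(size=(300, 3))
y = 5 * X[:, 0] + X[:, 1] + rng.normal(size=300)  # feature 0 dominates

model = RandomForestRegressor(random_state=0).fit(X, y)

# feature_importances_ sums to 1.0; larger values mean the feature drove
# more (and better) splits across the forest's trees.
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

One caveat worth keeping in mind: impurity-based importances can understate correlated features, which may partly explain results like the near-zero importance of the points-per-minute gradient.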
Table I. All Statistics & Feature Importances
There is still room for improvement in the model, most notably by expanding the feature space. For example, instead of simply counting fouls, turnovers and other statistics across both teams, one could break them out by team. The model might then identify when a team is close to entering the bonus/double bonus, along with other relationships that only emerge when the statistics are split.
In addition, more WNBA play-by-play data may allow for more flexible models to be more accurate than the random forest.
Figure II. Probability vs. Time Graphs (6 WNBA Games)
With play-by-play data, one can model WNBA totals and predict the final total of any given game. For many games (Figure II), the model’s probability converged very rapidly to 5% or 95%. In those games, the model was beating the live lines offered by sportsbooks by a significant margin.
Additional analysis is still needed on how often this rapid convergence creates profitable betting opportunities for bettors. Similar models can also be constructed for point spreads and moneylines; it may even be possible to build the point spread model and obtain the moneyline for free.
Future models could determine games with potentially elevated variance between the actual total and the line total. If such a model could find such games at rates slightly higher than chance, it would further improve the likelihood of the proposed model’s profitability, due to its rapid convergence.
Such a model could be built from prior team information and individual player stats, as well as recent performances of players and teams. The next data invasion is here, and it will keep expanding as sports betting becomes legal across the country.
Data & Analytics: Kyle Western
Engineering: Akshay Patel, Daniel Hood, Justyn Laufenberg, & Sam Huffman
Executive: Caroline Smith, Melissa Betts, TAN, & TCG