Predicting NBA Scores Part 1

Our baseline model

Feb 24, 2024

The Plan

We’re starting a long-running series in which we take one model, interrogate it, and iteratively improve it. We typically prefer more obscure subjects, like modeling how consistent players are, but for this series we’re going mainstream: predicting NBA final score lines. Why did we decide to model something fairly monotonous? Two reasons:

It’s complicated enough to keep us busy through 2025.
Everyone is interested in it.

The Starting Model

Our starting model is extremely simple and has many flaws. It’s a hierarchical model in which each team’s offensive and defensive strengths are learned simultaneously, with the team-level strengths drawn from shared league-wide distributions. As always, the full model code is at the bottom of the post.
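For readers who prefer notation, here is one way to write down what the model code at the bottom of the post specifies. The symbols are our shorthand and do not appear in the code: $h(g)$ and $a(g)$ are the home and away teams in game $g$, $\eta$ is the home advantage, $\mu_{\text{off}}$ is the league-average offense, and the second argument of each normal is a standard deviation, as in Stan.

```latex
\begin{aligned}
y^{\text{home}}_g &\sim \mathcal{N}\!\left(\eta + \theta^{\text{off}}_{h(g)} + \theta^{\text{def}}_{a(g)},\ \sigma\right) \\
y^{\text{away}}_g &\sim \mathcal{N}\!\left(\theta^{\text{off}}_{a(g)} + \theta^{\text{def}}_{h(g)},\ \sigma\right) \\
\theta^{\text{off}}_t &\sim \mathcal{N}\!\left(\mu_{\text{off}},\ \tau_{\text{off}}\right), \qquad
\theta^{\text{def}}_t \sim \mathcal{N}\!\left(0,\ \tau_{\text{def}}\right) \\
\mu_{\text{off}} &\sim \mathcal{N}(116,\ 10), \qquad \eta \sim \mathcal{N}(2,\ 2) \\
\tau_{\text{off}},\ \tau_{\text{def}},\ \sigma &\sim \text{Half-Cauchy}(0,\ 5)
\end{aligned}
```

Because a team’s defensive term is added to its opponent’s expected points, a lower $\theta^{\text{def}}$ means a better defense.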

In this post, we’ll show the model and what it outputs. In later posts, we’ll dive into its specific limitations and work through them one at a time.

One thing to note is that this model predicts each team’s points in regulation only. In the future, overtime might be incorporated. Surely this won’t cause confusion in later posts when we stop mentioning this detail.

Game Predictions

Because the model is fully Bayesian, we get point estimates (pun), along with the uncertainty around every estimate. So we get not just which team the model thinks will win, but that team’s win probability. And not just how many points a team is likely to win by, but the whole distribution: what’s the percent chance they win by 10 points? What’s the percent chance they win by 11 points? And so on.
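To make that concrete, here is a minimal sketch of how win probabilities and margins could be read off posterior draws. It is not the code behind this post. It assumes the draws for one match-up have been exported as a list of dicts, and the key names, function names and numbers are all made up for illustration. The means follow the likelihood in the model code: home advantage plus home offense plus away defense for the home side, and away offense plus home defense for the away side.

```python
import random


def simulate_margins(draws, rng):
    """Return one simulated (home - away) regulation margin per posterior draw.

    `draws` is a list of dicts, one per posterior draw, with the hypothetical
    keys off_home, def_home, off_away, def_away, home_adv and sigma.
    """
    margins = []
    for d in draws:
        home_mean = d["home_adv"] + d["off_home"] + d["def_away"]
        away_mean = d["off_away"] + d["def_home"]
        # Add game-level noise so the result is a predictive draw,
        # not just a draw of the expected score.
        home_pts = rng.gauss(home_mean, d["sigma"])
        away_pts = rng.gauss(away_mean, d["sigma"])
        margins.append(home_pts - away_pts)
    return margins


def summarize(margins):
    """Turn simulated margins into the quantities discussed above."""
    n = len(margins)
    return {
        "home_win_prob": sum(m > 0 for m in margins) / n,
        "home_by_10_plus": sum(m >= 10 for m in margins) / n,
        "expected_margin": sum(margins) / n,
    }


# Made-up draws standing in for real posterior samples.
rng = random.Random(0)
fake_draws = [
    {
        "off_home": rng.gauss(118, 1.5),
        "def_home": rng.gauss(-1, 1.5),
        "off_away": rng.gauss(114, 1.5),
        "def_away": rng.gauss(1, 1.5),
        "home_adv": rng.gauss(2, 0.5),
        "sigma": abs(rng.gauss(11, 0.5)),
    }
    for _ in range(4000)
]
summary = summarize(simulate_margins(fake_draws, rng))
print(summary)
```

Since the model treats points as Gaussian, simulated margins are continuous. The sketch therefore reports "wins by at least 10"; a chance of winning by exactly 10 would require rounding the simulated scores first.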

Predictions for tonight’s match-ups:

Bayesian Power Rankings

Although we have a deeply limited and flawed model, we can still get offensive and defensive power rankings out of it. As a reminder, everyone can publish power rankings, but we are the only outlet that puts error bars on our power rankings.
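As a rough sketch of how rankings with error bars might be assembled, suppose the posterior draws of the centered strengths (`theta_offense_centered` and `theta_defense_centered` in the model code) have been exported as plain lists per team. The function name, the 90% interval, the team labels and the numbers below are all assumptions for illustration, not the code that produced our charts.

```python
import random
import statistics


def rank_teams(draws_by_team, higher_is_better=True, lower=0.05, upper=0.95):
    """Rank teams by posterior mean and attach a crude empirical interval.

    `draws_by_team` maps a team label to a list of posterior draws of its
    centered strength (a hypothetical input format).
    """
    rows = []
    for team, draws in draws_by_team.items():
        ordered = sorted(draws)
        last = len(ordered) - 1
        lo = ordered[int(lower * last)]
        hi = ordered[int(upper * last)]
        rows.append((team, statistics.fmean(draws), lo, hi))
    # Best team first: highest mean for offense, lowest mean for defense.
    return sorted(rows, key=lambda r: r[1], reverse=higher_is_better)


# Made-up draws for three placeholder teams, not real posterior output.
rng = random.Random(0)
fake_offense = {
    "Team A": [rng.gauss(3.0, 1.2) for _ in range(2000)],
    "Team B": [rng.gauss(0.5, 1.2) for _ in range(2000)],
    "Team C": [rng.gauss(-3.5, 1.2) for _ in range(2000)],
}
for team, mean, lo, hi in rank_teams(fake_offense):
    print(f"{team}: {mean:+.1f} (90% interval {lo:+.1f} to {hi:+.1f})")
```

For the defensive table, pass `higher_is_better=False`: in this model a team’s defensive strength is added to its opponent’s expected points, so lower values mean better defense.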

Offensive Power Rankings

Defensive Power Rankings

Looking Ahead

There’s plenty of low-hanging fruit to tackle with this model. If there’s something in particular you hate about it, let us know and we can prioritize that.

The Model

// Hierarchical IRT regression
//
// This models the points of home and away teams
// as a function of the latent offensive and defensive 
// strength of the teams.

data {
    // Number of games
    int<lower=1> N_games;

    // Number of teams in the league
    int<lower=1> N_teams;

    // Home and away points scored in each game
    array[N_games] int<lower=0> home_points;
    array[N_games] int<lower=0> away_points;

    // Team index for each game
    array[N_games] int<lower=1, upper=N_teams> home_team;
    array[N_games] int<lower=1, upper=N_teams> away_team;
}

parameters {
    // Latent offensive and defensive strength of each team
    // Hierarchical prior
    vector[N_teams] theta_offense;
    vector[N_teams] theta_defense;
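    // Only the offensive mean is estimated. Defensive strengths are
    // centered at 0 in the model block, so the league-average score
    // is carried by the offense and the two means stay identifiable.
    real theta_offense_bar;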
    real<lower=0> sigma_offense_bar;
    real<lower=0> sigma_defense_bar;

    // Noise in the points (same for home and away teams)
    real<lower=0> sigma_points;

    real home_field_advantage;
}

model {

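    // Priors
    // League-average offensive strength, in points per game
    theta_offense_bar ~ normal(116, 10);

    // Home field advantage, about 2 points
    home_field_advantage ~ normal(2, 2);

    // Spread of team strengths across the league
    // (half-Cauchy, because the sigmas are constrained to be positive)
    sigma_offense_bar ~ cauchy(0, 5);
    sigma_defense_bar ~ cauchy(0, 5);

    // Individual team strength
    // Offense varies around the league average; defense varies around 0
    // and shifts the opponent's expected points up or down
    theta_offense ~ normal(theta_offense_bar, sigma_offense_bar);
    theta_defense ~ normal(0, sigma_defense_bar);

    // Gaussian noise in the points (half-Cauchy prior on its scale)
    sigma_points ~ cauchy(0, 5);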

    // Likelihood
    for(game in 1:N_games) {
        // Team points modeled as gaussian
        real home_points_regression = home_field_advantage
            + theta_offense[home_team[game]]
            + theta_defense[away_team[game]];
        real away_points_regression = theta_offense[away_team[game]]
            + theta_defense[home_team[game]];
        home_points[game] ~ normal(home_points_regression, sigma_points);
        away_points[game] ~ normal(away_points_regression, sigma_points);
    }
}

generated quantities {

    // Remove the mean from the latent variables so each set of
    // strengths is centered at zero (vector minus scalar is elementwise)
    vector[N_teams] theta_defense_centered = theta_defense - mean(theta_defense);
    vector[N_teams] theta_offense_centered = theta_offense - mean(theta_offense);
}