Linear Regression: NBA Edition

February 2, 2017

As a longtime basketball fan, I’ve always been saddened by the eventual decline of once-great players. I’ve often wondered whether age really matters in the NBA, or whether some coaches value experience more. In this post, I’d like to explore whether age and number of games played can predict future playing time. My hypothesis is that older players tend to see less playing time. As with all good supervised machine learning algorithms, we need good data. In this example, I query the data with the help of the NBA’s API documentation from seemethere. The data comes from the 2015-16 NBA season, and the individual data points are averages of every NBA player’s last 10 games.

To start, we should download the data and put it somewhere accessible:

# In this example, I download the file to my ~/Downloads directory
$ cd ~/Downloads
$ curl "" > nba_players_2016.json

# Upon opening this data, you'll realize it's not a CSV, but I wrote Node.js code to parse it into one for your convenience
$ git clone [email protected]:bigthyme/ml-lessons.git
$ cd ml-lessons/linear-regression
$ ls -al

Next, we should pick our modeling technique. We’ll be using the closed-form normal equation and comparing that to the results from gradient descent. Because I’m learning from Professor Ng’s course, I’ll be writing my algorithms in Octave. Thus, it may be prudent to follow this tutorial on Octave before following my examples.
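
For reference, the closed-form solution is usually written as below, where X is the design matrix (with a leading column of ones for the intercept), y is the vector of targets, and theta is the parameter vector. This is the standard textbook form, not something derived from this dataset.

```latex
\theta = (X^{T} X)^{-1} X^{T} y
```

When X^T X is invertible, this gives the same result as multiplying y by the pseudo-inverse of X, which is what Octave's pinv computes.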

Let’s create a file named normalEqn.m where we will write out our normal equation in Octave:

% declare a normalEqn function in Octave
function [theta] = normalEqn(X, y)

% initialize theta to a column vector with one entry per column of X
theta = zeros(size(X, 2), 1);

% solve the least-squares problem using the pseudo-inverse of the training data
theta = pinv(X) * y;

end
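
Before pointing this at real data, a quick sanity check on synthetic data can help confirm the math. The sketch below is not part of the original repo: the feature values and the "true" theta are made up, and the targets are generated without noise so the exact parameters should come back.

```octave
% Hypothetical, made-up features (NOT NBA data): five rows, two columns
X_raw = [1 2; 2 1; 3 5; 4 3; 5 8];
m = size(X_raw, 1);

% Add the intercept column, as in the post
X = [ones(m, 1) X_raw];

% Assumed "true" parameters used only to generate noiseless targets
true_theta = [2; 3; -1];
y = X * true_theta;

% Pseudo-inverse form (what normalEqn.m does)
theta_pinv = pinv(X) * y;

% Textbook normal-equation form, solved with backslash instead of an explicit inverse
theta_normal = (X' * X) \ (X' * y);

disp(theta_pinv');
disp(theta_normal');
```

Both vectors should match true_theta up to floating-point error, since the synthetic X has full column rank.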


For clarity, I’m going to create another file for calling the normalEqn function and adding useful logs:

printf('Loading up dataset for testing...\n');
data = csvread('player-defense_2015.csv');

% grab the proper columns from the csv
X = data(:, 1:2);

% grab the mins played column
y = data(:, 3);
m = length(y);

% Add intercept term to X
X = [ones(m, 1) X];

% Calculate the parameters from the normal equation
theta = normalEqn(X, y);

% Display the normal equation's result
fprintf('Theta computed from the normal equations: \n');
fprintf(' %f \n', theta);
% theta should be
% theta =
%   13.3900
%    9.5308
%    3.3927

It turns out this model is hard to interpret: 13.39 minutes doesn’t tell us much about an individual player, though we could say that 13 minutes is generally quite low for a full-time NBA player. To improve the existing model, we could have tracked each player separately throughout their career and run the algorithm on that individual’s dataset.

But let’s say we did in fact have more explicit data on an individual player. How would we further validate our theta value given this model? One way is to use a different algorithm to double-check our work. If we take the iterative approach of gradient descent, we can visualize the model and verify that the cost function is being minimized. To use gradient descent, you have to make an educated guess at the learning rate (alpha), along with the number of iterations the model needs to be useful. In the graph below, I used the following variables:

alpha = 0.01;
number_of_iterations = 500;
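
For readers who want to see the loop itself, here is a minimal sketch of batch gradient descent using the alpha and iteration count above. It is not the code behind the graph: the data is the same kind of made-up synthetic set (not NBA stats), and the names X_raw and J_history are my own placeholders. Each step moves theta against the gradient of the squared-error cost J = sum((X*theta - y).^2) / (2*m).

```octave
% Hypothetical synthetic data (NOT NBA data), with noiseless targets
X_raw = [1 2; 2 1; 3 5; 4 3; 5 8];
m = size(X_raw, 1);
X = [ones(m, 1) X_raw];
y = X * [2; 3; -1];

alpha = 0.01;
number_of_iterations = 500;

theta = zeros(size(X, 2), 1);
J_history = zeros(number_of_iterations, 1);

for iter = 1:number_of_iterations
  % Prediction errors for the current theta
  errors = X * theta - y;

  % Simultaneous update of every parameter
  theta = theta - (alpha / m) * (X' * errors);

  % Record the cost after this step so it can be plotted later
  J_history(iter) = sum((X * theta - y) .^ 2) / (2 * m);
end

printf('Cost after first step: %f\n', J_history(1));
printf('Cost after last step:  %f\n', J_history(end));
% plot(1:number_of_iterations, J_history);  % uncomment to draw the cost curve
```

If the cost curve rises or oscillates instead of falling, alpha is probably too large for the data; if it falls very slowly, more iterations or scaled features may be needed.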

[Figure: Gradient Descent Test]

As you can see, with our current data the cost does trend in a good direction but doesn’t in fact converge well (read more about how to further debug results from this technique). Again, we could have dramatically improved the model, but I think it’s OK to run models and find out they don’t work well. As I continue to run this type of analysis, I hope to get better at invalidating models more quickly from a cursory glance.
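
One commonly suggested thing to try when gradient descent converges slowly is scaling the features so they share a similar range. Whether that is the issue with this dataset is not established here, so treat the following as a sketch under that assumption. The ages and games-played values are invented for illustration.

```octave
% Hypothetical [age, games played] rows (invented values, NOT from the dataset)
X_raw = [22 82; 25 60; 31 75; 36 41; 39 12];

% Per-column mean and standard deviation
mu = mean(X_raw);
sigma = std(X_raw);

% Subtract the mean and divide by the standard deviation, column by column
X_norm = bsxfun(@rdivide, bsxfun(@minus, X_raw, mu), sigma);

disp(X_norm);
```

Keep mu and sigma around: any new example has to be scaled with the same values before multiplying by a theta learned on X_norm.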
