In 2015, MLB introduced Statcast to all 30 stadiums. This system monitors player and ball movement by combining two tracking systems: a TrackMan Doppler radar and HD ChyronHego cameras. This has provided a wealth of new information to teams, but has also introduced many new terms to broadcasting parlance. Two specific terms, exit velocity and launch angle, have been used quite frequently since, with good reason - they’re very evocative of the action happening on the field.
The exit velocity is the speed of the ball off the bat, and the launch angle is the vertical angle of the ball off the bat (high values are popups, near-zero values are horizontal, negative values are into the ground). When these started becoming more popular, I found myself thinking quite often, “how do I know if this is good or not?” Exit velocity is fairly easy to conceptualize, but launch angle is less transparent. This led me to try plotting these two variables using hit outcome as a figure of merit. The chart shown uses data from the 2018 season.
There are some macro-trends of note here:
A while later, I came back to this plot and considered how one might model these trends. The pockets immediately made me think about k-Nearest Neighbors (kNN) classification, which was my first attempted model. I also took a look at some other classification models. The models I’ve investigated here[1] are:
The results of each are shown below. These are all using very cookie-cutter models from Scikit-Learn, with nothing fancy added just yet.
This gave the following takeaways:
An appropriate benchmark for this is a bit non-obvious. Because this is an imbalanced dataset[2], a sensible way to check whether modeling has gained us anything is to benchmark against simply choosing the majority label every time. In the dataset, 62% of the outcomes are outs, so any accuracy above 62% is a gain - all three models perform better than this. In many imbalanced classification problems, the type of error matters heavily (with the canonical example being cancer treatment, where false positives/negatives impact treatment approaches), but it’s a bit less obvious which type of error is “worse” for this problem - probably the best approach is to make sure the model is conservative toward higher-valued hits. The confusion matrix gives a good idea of the misclassifications.
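As a rough sketch of that benchmark, the majority-label accuracy and a confusion matrix can be computed with nothing but the standard library. The function names and the ten-row toy labels below are made up for illustration; they are not the 2018 data or the notebook's code.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


# Toy labels (hypothetical, NOT real Statcast outcomes).
classes = ["out", "single", "double", "home_run"]
y_true = ["out"] * 6 + ["single"] * 2 + ["double", "home_run"]
y_pred = ["out"] * 5 + ["single"] * 3 + ["single", "home_run"]

baseline = majority_baseline_accuracy(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"baseline={baseline:.2f} model={accuracy:.2f}")
for name, row in zip(classes, confusion_matrix(y_true, y_pred, classes)):
    print(f"{name:>9}: {row}")
```

A model only earns its keep if its accuracy clears the baseline, and the off-diagonal entries of the matrix show which outcomes it confuses with which.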
What we can draw from this:
Up until now, this model has been built by looking at just the launch angle and exit velocity. If you think about launch angle as the angle in the z direction, and the exit velocity as the magnitude of the velocity vector, we’ve effectively[3] parametrized how high and how far a ball is hit, while completely ignoring the third dimension, y, or the spray angle. This was done to address the original question, “when broadcasters speak about these values, what should I take from it?”
The third dimension is a very non-trivial factor though. The corners are between 300 and 350 ft from home, while straight-away is between 390 and 440 ft. First, we should take a look at how the current models are handling this (plots shown for gBDT):
The spray angle distribution looks similar for outs and singles, but the features for doubles and home runs are not at all present. This means that adding this variable will provide the models with additional information, which should help improve the accuracy. Adding this additional variable, retraining and retesting gives the following:
Adding this variable helped out the kNN and BDT models considerably. Both increased their correct predictions on outs, singles, doubles, and home runs. Most strikingly, the prediction for doubles went from 17% accurate to 43% accurate for the BDT (24% to 49% for kNN) - this is a huge improvement. Interestingly, the SVC went from an overly liberal decision boundary on singles to an overly conservative one, which actually caused a worse overall accuracy, despite having more information.
Ultimately, of the models tested, the best at classifying hit outcomes based on launch angle, exit velocity, and spray angle is a BDT, with 78% accuracy. Future work on this model will focus on developing the BDT further. I’ll be following up this post with several more as I develop the BDT, which I’ll link here as they’re posted. Some topics I’m going to address:
The code for this post can be found in this Jupyter Notebook.
[1] Multi-class logistic regression was also attempted, but outcomes were far worse than other models, so it was scrapped quite quickly. I also tried a Random Forest classifier in addition to the gBDT, but found the BDT to be better so elected to use it as my tree-based model. More complicated models will be addressed in future posts, but for just a few inputs, simple models are ideal.
[2] A later post will look at methods of balancing the data, by resampling methods or by expanding the sample size.
[3] Having launch angle and exit velocity allows you to break down your velocity into \(x\) (away from the batter) and \(z\) (toward the sky) components. From there, you can use the constant acceleration from gravity to derive hangtime, and then use that to evaluate distance traversed in \(x\). This is all done via Physics 1 kinematic equations: first \(v_z t = - \frac{1}{2} g t^2\), where \(g\) is the acceleration of gravity (-9.8 m/s\(^2\)), and \(v_z\) is the \(z\) component of velocity (this ignores the height of the batter, but is close enough). Solve this for hangtime \(t\). Then plug that into \(x = v_x t\) to find the distance traversed. From a model perspective, everything else in these equations is constant, so this information is effectively encoded.
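The kinematics in footnote [3] can be sketched in a few lines of Python. This is only an illustration of those two equations: the function name and the unit conversions are my own additions, and like the footnote it assumes no air resistance and a ball launched from ground level, so it will overestimate real-world distances.

```python
import math

G = -9.8               # acceleration of gravity in m/s^2 (negative, as in the footnote)
MPH_TO_MPS = 0.44704   # miles per hour -> meters per second
M_TO_FT = 3.28084      # meters -> feet


def projectile_estimate(exit_velocity_mph, launch_angle_deg):
    """Return (hangtime in s, distance in ft) for a drag-free batted ball."""
    v = exit_velocity_mph * MPH_TO_MPS
    theta = math.radians(launch_angle_deg)
    v_x = v * math.cos(theta)  # component away from the batter
    v_z = v * math.sin(theta)  # component toward the sky
    if v_z <= 0:
        # Hit flat or into the ground: no hangtime in this simple model.
        return 0.0, 0.0
    # Non-zero solution of v_z * t = -(1/2) * g * t^2
    hangtime = -2.0 * v_z / G
    distance_ft = v_x * hangtime * M_TO_FT
    return hangtime, distance_ft


t, d = projectile_estimate(100, 25)
print(f"hangtime: {t:.2f} s, drag-free distance: {d:.0f} ft")
```

Since both outputs are fixed functions of the two inputs, a flexible classifier given exit velocity and launch angle already has access to this information without being handed the distance explicitly.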