Tyler James Burch

Particle Physics // Machine Learning // Music // Baseball


Classifying MLB Hit Outcomes - Part 1: Model Selection

21 Apr 2020

Understanding Launch Angle and Exit Velocity

In 2015, MLB introduced Statcast to all 30 stadiums. This system monitors player and ball movement by combining two tracking systems: a Trackman Doppler radar and HD ChyronHego cameras. It has provided a wealth of new information to teams, but has also introduced many new terms into broadcasting parlance. Two specific terms, exit velocity and launch angle, have been used quite frequently since, with good reason - they’re very evocative of the action happening on the field.

Mike Trout Hitting Metrics

The exit velocity is the speed of the ball off the bat, and the launch angle is the vertical angle at which it leaves the bat (high values are popups, near-zero values are horizontal, negative values are into the ground). When these terms started becoming more popular, I often found myself thinking, “how do I know if this is good or not?” Exit velocity is fairly easy to conceptualize, but launch angle is less transparent. This led me to plot the two variables against each other, using hit outcome as a figure of merit. The chart shown uses data from the 2018 season.

Hit outcomes by Launch Angle and Launch Speed

There are some macro-trends of note here:

Employing in a Prediction Model

A while later, I came back to this plot and considered how one might model these trends. The pockets immediately made me think of k-Nearest Neighbors (kNN) classification, which was the first model I attempted. I also took a look at some other classification models. The models I’ve investigated here [1] are:

The results of each are shown below. These are all using very cookie-cutter models from Scikit-Learn, with nothing fancy added just yet.

Predicted hit outcomes from various models
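For reference, fitting off-the-shelf Scikit-Learn classifiers of the kinds discussed in this post (kNN, SVC, and a gradient-boosted decision tree) might look roughly like the sketch below. This is not the notebook's code: the data is synthetic, the labelling rule is a made-up toy, and the use of default hyperparameters is my assumption about what "cookie-cutter" means.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for Statcast data (assumption: NOT real batted balls).
rng = np.random.default_rng(0)
n = 600
launch_angle = rng.uniform(-30, 60, n)    # degrees
exit_velocity = rng.uniform(50, 115, n)   # mph
X = np.column_stack([launch_angle, exit_velocity])

# Toy labelling rule, invented for illustration only:
# 0 = out, 1 = single, 2 = home run.
y = np.zeros(n, dtype=int)
y[(launch_angle > 5) & (launch_angle < 20) & (exit_velocity > 80)] = 1
y[(launch_angle >= 20) & (launch_angle < 35) & (exit_velocity > 100)] = 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Default hyperparameters throughout, nothing tuned.
models = {
    "kNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "gBDT": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # plain accuracy
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Swapping the synthetic arrays for real launch angle, exit velocity, and outcome columns is the only change needed to try this on actual data.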

This gave the following takeaways:

An appropriate benchmark for this is a bit non-obvious. Because this is an imbalanced dataset [2], a reasonable way to check whether we’ve gained anything by modeling in the first place is to benchmark against simply choosing the majority label every time. In the dataset, 62% of the outcomes are outs, so any accuracy above 62% is a gain - all three models perform better than this. In many imbalanced classification problems, the type of error matters heavily (with the canonical example being cancer treatment, where false positives/negatives impact treatment approaches), but it’s a bit less obvious which type of error is “worse” for this problem - probably the best approach is to make sure the model is conservative toward higher-valued hits. The confusion matrix gives a good idea of misclassifications.

Confusion matrices for the various models evaluated
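The majority-label benchmark and a row-normalized confusion matrix can be sketched as follows. Again this is an illustration on synthetic data, not the notebook's code: only the 62% share of outs comes from the post, while the other class proportions and the two noisy features are made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# 0 = out, 1 = single, 2 = double, 3 = home run.
# Only the 62% share of outs is from the post; the rest is assumed.
labels = [0, 1, 2, 3]
y = rng.choice(labels, size=2000, p=[0.62, 0.25, 0.08, 0.05])

# Two made-up features that carry some signal about the label.
X = y[:, None] + rng.normal(scale=0.8, size=(2000, 2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: always predict the most frequent training label (outs).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# A real model has to beat that number to be worth anything.
knn = KNeighborsClassifier().fit(X_train, y_train)
knn_pred = knn.predict(X_test)
knn_acc = accuracy_score(y_test, knn_pred)

# normalize="true" divides each row by the number of true samples in
# that class, so the diagonal is the per-class fraction predicted correctly.
cm = confusion_matrix(y_test, knn_pred, labels=labels, normalize="true")

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"kNN accuracy:      {knn_acc:.3f}")
print(np.round(cm, 2))
```

Reading the matrix row by row shows where each true outcome ends up, which is more informative than a single accuracy number when outs dominate the sample.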

What we can draw from this:

Moving from 2D to 3D

Up until now, this model has been built by looking at just the launch angle and exit velocity. If you think about launch angle as the angle toward the z direction, and the exit velocity as the magnitude of the velocity vector, we’ve effectively [3] parametrized how high and how far a ball is hit, while completely ignoring the third dimension, y, or the spray angle. This was done to address the original question, “when broadcasters speak about these values, what should I take from it?”

The third dimension is a very non-trivial factor though. The corners are between 300 and 350 ft from home, while straight-away center is between 390 and 440 ft. First, we should take a look at how the current models are handling this (plots shown for gBDT):

Spray angle for various events by hit type

The spray angle distribution looks similar for outs and singles, but the features for doubles and home runs are not at all present. This means that adding this variable will provide the models with additional information, which should help improve the accuracy. Adding this additional variable, retraining and retesting gives the following:

Predicted hit outcomes for the various models evaluated with spray angle

Confusion matrices for the various models evaluated with spray angle

Adding this variable helped out the kNN and BDT models considerably. Both increased their correct predictions on outs, singles, doubles, and home runs. Most strikingly, the prediction for doubles went from 17% accurate to 43% accurate for the BDT (24% to 49% for kNN) - this is a huge improvement. Interestingly, the SVC went from an overly liberal decision boundary on singles to an overly conservative one, which actually caused a worse overall accuracy, despite having more information.

Summary and Future Work

Ultimately, of the models tested, the one that best classifies hit outcomes based on launch angle, exit velocity, and spray angle is a BDT, with 78% accuracy. Future work on this model will focus on developing the BDT further. I’ll be following up this post with several more as I develop the BDT, which I’ll link here as they’re posted. Some topics I’m going to address:

The code for this post can be found in this Jupyter Notebook.


[1] Multi-class logistic regression was also attempted, but outcomes were far worse than other models, so it was scrapped quite quickly. I also tried a Random Forest classifier in addition to the gBDT, but found the BDT to be better so elected to use it as my tree-based model. More complicated models will be addressed in future posts, but for just a few inputs, simple models are ideal.

[2] A later post will look at methods of balancing the data, by resampling methods or by expanding the sample size.

[3] Having launch angle and exit velocity allows you to break down your velocity into \(x\) (away from the batter) and \(z\) (toward the sky) components. From there, you can use the constant acceleration from gravity to derive hangtime, and then use that to evaluate distance traversed in \(x\). This is all done via physics 1 kinematic equations: first \(v_z t = - \frac{1}{2} g t^2\), where \(g\) is the acceleration of gravity (\(-9.8\) m/s\(^2\)), and \(v_z\) is the \(z\) component of velocity (this ignores the height of the batter, but is close enough). Solve this for hangtime, \(t = -2 v_z / g\). Then plug that into \(x = v_x t\) to find the distance traversed. From a model perspective, everything else in these equations is constant, so this information is effectively encoded.
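The kinematics in this footnote can be sketched in a few lines. The function name and the choice of input units (mph and degrees) are my own assumptions for illustration. Like the equations above, this ignores air resistance and the height of contact, so the distances it gives are idealized and run longer than real fly balls.

```python
import math

G = -9.8             # m/s^2, gravity, negative as in the footnote
MPH_TO_MS = 0.44704  # exact conversion, miles per hour -> metres per second
M_TO_FT = 1 / 0.3048 # exact conversion, metres -> feet


def hangtime_and_distance(exit_velocity_mph, launch_angle_deg):
    """Idealized (drag-free) hang time in s and carry distance in ft.

    Hypothetical helper: assumes the ball leaves from ground level.
    """
    v = exit_velocity_mph * MPH_TO_MS
    theta = math.radians(launch_angle_deg)
    v_x = v * math.cos(theta)  # away from the batter
    v_z = v * math.sin(theta)  # toward the sky

    # A ball hit level or downward never gets airborne in this model.
    if v_z <= 0:
        return 0.0, 0.0

    t = -2 * v_z / G           # from v_z * t = -(1/2) * g * t^2
    x = v_x * t                # distance traversed in metres
    return t, x * M_TO_FT


t, d = hangtime_and_distance(100, 30)
print(f"hang time: {t:.2f} s, distance: {d:.0f} ft")
```

Because the only inputs are exit velocity and launch angle, any model given those two features already has access to this information, which is the point the footnote makes.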