Interoperability is the name of the game
Accessing Public Baseball Data in Julia
tl;dr - see this GitHub gist
For a couple years now, I’ve been super interested in the Julia language. One issue I had when when I was doing public-facing baseball work, is that there are great libraries in both Python (pybaseball) and R (baseballr) for loading in baseball data, but no such library for Julia (yet!). Luckily, Julia has great interoperability support, so we can utilize those libraries to pull baseball data into Julia DataFrames - it just takes a little bit of massaging.
Prerequisite: a working Python installation with pybaseball installed, which can be installed via pip. I recommend creating a designated Python virtual environment to work with Julia, and when you build PyCall, set
ENV["PYTHON"] = venv/bin/python3. Activate that virtual environment and run
pip install pybaseball
For interoperability with Python, Julia has PyCall.jl. Once loaded into Julia, use
pyimport to load pybaseball into your Julia session. The methods within pybaseball return Pandas Dataframes, which If you’re interested in using Pandas.jl, the conversion is straightforward, however it’s not trivial to get to Julia’s DataFrames. The approach I’ve found is to immediately use the
pandas.DataFrame.to_csv, method without a file to get the dataframe as a string. Then, read that in as an IOBuffer to CSV.jl, and sink it to a Juila Dataframe.
And for an example plot…
Prerequisite: a working R installation with baseballr installed. Open R and run:
Interoperability with R is done via RCall.jl. RCall can load R libraries via the
@rlibrary macro, which can then be used to call
baseballr (provided the library is installed). Once the library is loaded, then you can call functions via an R string, and use
rcopy to migrate an R dataframe to a Julia one.
And there you have it!
Hopefully this enables some easier baseball analysis for others in Julia. Of course, all this work can be circumnavigated by saving dataframes from respective packages as CSVs and reading them in via
CSV.jl, but who wants a million csvs laying around? There’s probably much more performant ways to go about this, but these approaches seem the quickest and most clear to me - if you have ideas or suggestions, feel free to reach out, or possibly comment on the git gist above.
I got a job!
Revisiting some old work, and handling some heteroscadasticity
Using a Bayesian GLM in order to see if a lack of fans translates to a lack of home-field advantage
An analytical solution plus some plots in R (yes, you read that right, R)
okay… I made a small mistake
Creating a practical application for the hit classifier (along with some reflections on the model development)
Diving into resampling to sort out a very imbalanced class problem
Or, ‘how I learned the word pneumonoultramicroscopicsilicovolcanoconiosis’
Amping up the hit outcome model with feature engineering and hyperparameter optimization
Can we classify the outcome of a baseball hit based on the hit kinematics?
A summary of my experience applying to work in MLB Front Offices over the 2019-2020 offseason
Busting out the trusty random number generator
Perhaps we’re being a bit hyperbolic
Revisiting more fake-baseball for 538
A deep-dive into Lance Lynn’s recent dominance
Fresh-off-the-press Higgs results!
How do theoretical players stack up against Joe Dimaggio?
I went to Pittsburgh to talk Higgs
If baseball isn’t random enough, let’s make it into a dice game
Or: how to summarize a PhD’s worth of work in 8 minutes
Double the Higgs, double the fun!
A data-driven summary of the 2018 Reddit /r/Baseball Trade Deadline Game
A 2017 player analysis of Tommy Pham