Accessing Public Baseball Data in Julia

10 Jan 2022

tl;dr - see this GitHub gist


For a couple of years now, I’ve been super interested in the Julia language. One issue I had when I was doing public-facing baseball work is that there are great libraries in both Python (pybaseball) and R (baseballr) for loading baseball data, but no such library for Julia (yet!). Luckily, Julia has great interoperability support, so we can use those libraries to pull baseball data into Julia DataFrames; it just takes a little bit of massaging.


Prerequisite: a working Python installation with pybaseball installed, which can be installed via pip. I recommend creating a dedicated Python virtual environment to work with Julia: activate it and run pip install pybaseball. Then, before you build PyCall, set ENV["PYTHON"] to the full path of that environment's interpreter as a string (for example, the venv/bin/python3 inside it) so that PyCall links against it.
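As a minimal sketch of that setup step, the snippet below points PyCall at a virtual environment and rebuilds it. The environment location (~/venvs/julia-baseball) is a hypothetical path chosen for illustration; substitute wherever your own environment lives.

```julia
using Pkg

# Hypothetical path: replace with the python3 binary inside your own virtual environment.
ENV["PYTHON"] = joinpath(homedir(), "venvs", "julia-baseball", "bin", "python3")

# Rebuild PyCall so it is configured against the interpreter named above.
Pkg.build("PyCall")

# Load PyCall (restart Julia first if it was already loaded) and
# print the interpreter path it was built against.
using PyCall
println(PyCall.python)
```

If the printed path is not the one inside your virtual environment, the build most likely ran before ENV["PYTHON"] was set.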

For interoperability with Python, Julia has PyCall.jl. Once it is loaded, use pyimport to load pybaseball into your Julia session. The methods within pybaseball return pandas DataFrames. If you’re interested in using Pandas.jl, the conversion is straightforward; however, it’s not trivial to get to Julia’s DataFrames. The approach I’ve found is to immediately call the pandas.DataFrame.to_csv method without a file path, which returns the dataframe as a CSV string. Then, wrap that string in an IOBuffer, read it with CSV.jl, and sink it into a Julia DataFrame.

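using DataFrames, PyCall, CSV
pybaseball = pyimport("pybaseball")
# statcast returns a pandas DataFrame, which stays a PyObject on the Julia side
python_df = pybaseball.statcast("2021-04-06")
# to_csv() without a path returns the CSV text as a string; parse it from memory
julia_df = CSV.read(IOBuffer(python_df.to_csv()), DataFrame)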
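If you pull data more than once, the same round trip can be wrapped in a small function. This is only a sketch: pandas_to_dataframe is a hypothetical helper name, and passing index=false (so the pandas row index does not show up as an extra column) is my own assumption about what is usually wanted, not something pybaseball requires. The tiny hand-built frame is there just to exercise the helper without hitting the network.

```julia
using CSV, DataFrames, PyCall

# Hypothetical helper: converts a pandas DataFrame (held as a PyObject) to a
# Julia DataFrame by round-tripping through CSV text kept in memory.
function pandas_to_dataframe(pandas_df::PyObject)
    # index=false drops the pandas row index so it does not become a column.
    csv_text = pandas_df.to_csv(index=false)
    return CSV.read(IOBuffer(csv_text), DataFrame)
end

# Small illustrative frame built directly with pandas.
pd = pyimport("pandas")
small = pd.DataFrame(Dict("player" => ["a", "b"], "hits" => [1, 2]))
julia_small = pandas_to_dataframe(small)
```

The same call should work on the result of pybaseball.statcast, since that is also a pandas DataFrame.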

And for an example plot…

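using StatsPlots
# Keep only batted-ball events; rows with a missing event are dropped first
batted = filter(
    row -> row[:events] in ["field_out", "single", "double", "triple", "home_run"],
    dropmissing(julia_df, :events)
)
# Assumes the Statcast columns :launch_speed and :launch_angle; grouping by event is a guess at the intended plot
@df batted StatsPlots.scatter(
    :launch_speed, :launch_angle, group=:events,
    xlabel="Launch Speed",
    ylabel="Launch Angle"
)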


Prerequisite: a working R installation with baseballr installed. Open R and run: devtools::install_github("BillPetti/baseballr").

Interoperability with R is done via RCall.jl. RCall can load R libraries via the @rlibrary macro, which makes baseballr available from Julia (provided the library is installed in R). Once the library is loaded, you can call its functions via an R string and use rcopy to convert the resulting R data frame into a Julia DataFrame.

using RCall
@rlibrary baseballr
# Run the R call inside an R string, then copy the resulting data frame into Julia
julia_df = rcopy(R"baseballr::scrape_statcast_savant(start_date = '2021-04-06', end_date = '2021-04-06')")
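To avoid hard-coding the date inside the R string, RCall lets you interpolate Julia variables into R code with $. The sketch below wraps the call above in a function; statcast_day and its game_date argument are hypothetical names of my own, and it assumes baseballr is installed and still exports scrape_statcast_savant.

```julia
using RCall, DataFrames

# Hypothetical wrapper: fetches one day of Statcast data through baseballr.
function statcast_day(game_date::AbstractString)
    # $game_date is interpolated by RCall and arrives in R as a character value.
    r_df = R"baseballr::scrape_statcast_savant(start_date = $game_date, end_date = $game_date)"
    # rcopy converts the R data frame into a Julia DataFrame.
    return rcopy(r_df)
end

julia_df = statcast_day("2021-04-06")
```

Keeping the R call in one place also means there is only one line to change if the baseballr function name or arguments differ in your installed version.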

And there you have it!

Hopefully this enables some easier baseball analysis for others in Julia. Of course, all this work can be circumvented by saving dataframes from the respective packages as CSVs and reading them in via CSV.jl, but who wants a million CSVs lying around? There are probably much more performant ways to go about this, but these approaches seem the quickest and clearest to me. If you have ideas or suggestions, feel free to reach out, or comment on the GitHub gist above.