Tyler James Burch

Applying research methods, statistics, machine learning, and Bayesian modeling to solve problems in professional baseball analytics.

Accessing Public Baseball Data in Julia

tl;dr - see this GitHub gist

Background

For a couple of years now, I've been super interested in the Julia language. One issue I had when I was doing public-facing baseball work is that there are great libraries in both Python (pybaseball) and R (baseballr) for loading baseball data, but no such library for Julia (yet!). Luckily, Julia has great interoperability support, so we can use those libraries to pull baseball data into Julia DataFrames - it just takes a little bit of massaging.

pybaseball

Prerequisite: a working Python installation with pybaseball installed, which can be installed via pip. I recommend creating a dedicated Python virtual environment to work with Julia: activate it and run pip install pybaseball. Then, when you build PyCall, set ENV["PYTHON"] to the path of that environment's interpreter (for example, venv/bin/python3).
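As a rough sketch of that PyCall setup step, the snippet below points PyCall at a virtual environment's interpreter and rebuilds the package. The path is a made-up placeholder, so substitute the location of your own environment.

```julia
using Pkg

# Hypothetical location of the virtual environment's interpreter; adjust to your setup.
ENV["PYTHON"] = joinpath(homedir(), ".venvs", "baseball", "bin", "python3")

# Rebuild PyCall so it links against the interpreter named in ENV["PYTHON"].
Pkg.build("PyCall")
```

After the build finishes, restart Julia so the new configuration is picked up.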

For interoperability with Python, Julia has PyCall.jl. Once it is loaded into Julia, use pyimport to load pybaseball into your Julia session. The methods within pybaseball return pandas DataFrames. If you're interested in using Pandas.jl, the conversion is straightforward; however, it's not trivial to get to Julia's DataFrames. The approach I've found is to immediately call the pandas.DataFrame.to_csv method without a file path, which returns the dataframe as a CSV string. Then, wrap that string in an IOBuffer, read it with CSV.jl, and sink it to a Julia DataFrame.

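using DataFrames, PyCall, CSV
pybaseball = pyimport("pybaseball")
# Returns a pandas DataFrame of Statcast data for the given date
python_df = pybaseball.statcast("2021-04-06")
# to_csv() with no path returns the CSV as a string; read it through an IOBuffer
julia_df = CSV.read(IOBuffer(python_df.to_csv()), DataFrame)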
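To see the string-to-DataFrame step in isolation, here is a minimal sketch that runs without Python. The csv_text value is a made-up stand-in for what python_df.to_csv() returns, and demo_df is just an illustrative name. It also shows one side effect worth knowing about: pandas writes its row index as an unnamed first column by default, which CSV.jl reads as an auto-named column.

```julia
using CSV, DataFrames

# Stand-in for the string that python_df.to_csv() would return; the values are made up.
csv_text = """
,events,launch_speed,launch_angle
0,single,98.4,12
1,field_out,87.1,35
"""

# IOBuffer makes the in-memory string look like a file to CSV.read.
demo_df = CSV.read(IOBuffer(csv_text), DataFrame)

# The unnamed pandas index lands in the first column; drop it.
select!(demo_df, Not(1))
```

Alternatively, passing index=false to to_csv on the Python side should keep that column from being written in the first place.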

And for an example plot…

using StatsPlots
@df filter(
    row -> row[:events] in ["field_out", "single", "double",  "triple", "home_run"], 
    dropmissing(julia_df, :events)
    ) StatsPlots.scatter(
        :launch_speed, 
        :launch_angle, 
        group=:events, 
        alpha=0.5, 
        xlabel="Launch Speed", 
        ylabel="Exit Angle"
    )


baseballr

Prerequisite: a working R installation with baseballr installed. Open R and run: devtools::install_github("BillPetti/baseballr").

Interoperability with R is done via RCall.jl. RCall can load R libraries via the @rlibrary macro, which can then be used to call baseballr (provided the library is installed). Once the library is loaded, you can call its functions via an R string and use rcopy to convert the resulting R data frame into a Julia DataFrame.

using RCall
@rlibrary baseballr
julia_df = rcopy(R"baseballr::scrape_statcast_savant(start_date = '2021-04-06', end_date = '2021-04-06')")
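If you want to pass dates in from Julia rather than hard-coding them in the R string, RCall's R strings support interpolating Julia values with $. The sketch below wraps that in a small helper. statcast_range is a hypothetical name of my own, not part of RCall or baseballr, and the block assumes R and baseballr are installed as described above.

```julia
using RCall, DataFrames

# Hypothetical helper (not part of RCall or baseballr):
# fetch one date range and return it as a Julia DataFrame.
function statcast_range(start_date::AbstractString, end_date::AbstractString)
    # `$name` inside an R string substitutes the Julia value into the R call.
    r_df = R"baseballr::scrape_statcast_savant(start_date = $start_date, end_date = $end_date)"
    # rcopy converts the R data frame into a Julia DataFrame.
    return rcopy(r_df)
end

julia_df = statcast_range("2021-04-06", "2021-04-06")
```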

And there you have it!

Hopefully this enables some easier baseball analysis for others in Julia. Of course, all this work can be circumvented by saving dataframes from the respective packages as CSVs and reading them in via CSV.jl, but who wants a million CSVs lying around? There are probably much more performant ways to go about this, but these approaches seem the quickest and clearest to me - if you have ideas or suggestions, feel free to reach out, or comment on the GitHub gist above.

Cite This Post

@misc{burch2022baseball-in-julia,
  author       = {Tyler James Burch},
  title        = {Accessing Public Baseball Data in Julia},
  year         = {2022},
  month        = {January},
  howpublished = {\url{https://tylerjamesburch.com/blog/baseball/baseball-in-julia}},
}
