<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://tylerjamesburch.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://tylerjamesburch.com/" rel="alternate" type="text/html" /><updated>2026-04-20T12:28:09+00:00</updated><id>https://tylerjamesburch.com/feed.xml</id><title type="html">Tyler James Burch</title><subtitle>Lead Data Analyst for the Boston Red Sox. Baseball analytics, data science, and particle physics research.</subtitle><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><entry><title type="html">Weather Effects on Boston Marathon Times</title><link href="https://tylerjamesburch.com/blog/statistics/weather-effects-boston-marathon" rel="alternate" type="text/html" title="Weather Effects on Boston Marathon Times" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://tylerjamesburch.com/blog/statistics/weather-effects-boston-marathon</id><content type="html" xml:base="https://tylerjamesburch.com/blog/statistics/weather-effects-boston-marathon"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<h2 id="the-boston-marathon---so-hot-right-now">The Boston Marathon - So Hot Right Now</h2>

<p>Every year I make the short trek one block from Fenway to Kenmore Square to watch exhausted runners grinding through the final mile, occasionally with a cowbell in hand. While I love watching the marathon, I also love running myself, though my longest race to date is just a half marathon. And, unlike most runners, I actually have a small preference for warm-weather running; 70-75°F is my sweet spot. Research, however, does not back me here - most published literature indicates that <a href="https://doi.org/10.1097/00005768-199709000-00018">around 50°F is optimal for running</a>.</p>

<p>This year marks the fiftieth anniversary of the 1976 Boston Marathon, which peaked at 91.9°F, too hot even for me. It earned the name “Run for the Hoses.” Jack Fultz won that race in 2:20:19, <strong>nine minutes slower</strong> than the course record at the time. Bill Rodgers, who’d run 2:09:55 the year before, dropped out, as did Tom Fleming, who had placed third in 1975.</p>

<h2 id="thermal-fluctuations">Thermal Fluctuations</h2>

<p>Below is the average race time of the top 3 Boston Marathon runners over time.</p>

<figure class="">
  <img src="/blogimages/boston-marathon-1976/top3_timeline.png" alt="Top-3 timeline with LOWESS" /><figcaption>
      Top-3 mean finishing time by year, with LOWESS trend. 1976 visible as the outlier near center.

    </figcaption></figure>

<p>This confirms that runners have been getting better over time, though non-linearly. Between 1950 and the turn of the century, gains were substantial; after that, the rate of improvement flattened. There’s also considerable year-over-year variation, which makes sense: conditions vary widely. In fact, the 1976 race sticks out like a sore thumb in the middle of the plot.</p>

<p>This made me curious: how much of that variation can be attributed to weather conditions on race day? That’s the question I dive into in this post.</p>

<h2 id="prior-work">Prior work</h2>

<p>This is hardly a novel problem; plenty of work has been done on it. Three studies to highlight:</p>

<ul>
  <li><strong><a href="https://doi.org/10.1249/mss.0b013e31802d3aba">Ely, Cheuvront, Roberts &amp; Montain (2007)</a></strong> — pools seven major marathons (including Boston) and divides race-day conditions into WBGT quartiles. For elite men (top-3 finishers), slowdowns go 1.7% → 2.5% → 3.3% → 4.5% as you move from cool to hot. This is the closest thing the literature has to a canonical “how much does heat cost” table.</li>
  <li><strong><a href="https://doi.org/10.1038/s41612-024-00637-x">Wang et al. (2024)</a></strong> — the world’s top-96 individual marathon athletes (top 16 per continent) followed across events. Models a linear performance degradation above 15°C with slope ~0.39 min/°C for men, 0.71 min/°C for women. This is the first model I attempt to replicate.</li>
  <li><strong><a href="https://doi.org/10.1097/00005768-199709000-00018">Galloway &amp; Maughan (1997)</a></strong> — not a marathon study, actually. Eight cyclists rode to exhaustion at four temperatures in a lab. Peak endurance came at 10.5°C (~51°F); by 30°C performance had collapsed. This study models the performance degradation as a quadratic function, which I replicate in the second model. See also <a href="https://doi.org/10.2165/00007256-200737040-00032">Maughan, Watson &amp; Shirreffs (2007)</a> for the thermoregulation review that carries the same physiology over to distance running.</li>
</ul>

<h2 id="my-research">My research</h2>

<p>These studies lay a solid groundwork, but I wanted to dig a bit deeper. Specifically, a few questions:</p>

<ul>
  <li>The literature suggests both linear and quadratic models, which can <a href="/blog/statistics/polynomial-regression-bambi">vary wildly as you get to extreme values</a>. I was curious what this looked like, and what a less constrained fit would look like.</li>
  <li>Both have pretty strongly constrained functional forms. I was interested in a more data-driven approach, namely fitting the weather effect via a spline based model.</li>
</ul>

<p>So I fit three Bayesian models, each on log-finishing-time. To control for improving athletic performance, each uses a random walk, and each controls for precipitation. The models differ in how race-day temperature enters:</p>

<ol>
  <li><strong>Wang-replication model</strong>: a linear hinge above 59°F, replicating Wang 2024’s functional form.</li>
  <li><strong>Physiology model</strong>: a quadratic hinge above the same knot, motivated by the Galloway study.</li>
  <li><strong>Spline model</strong>: replaces the hinge with a thin-plate spline, effectively letting the data pick the shape.</li>
</ol>

<p>Each of the three models below is fit in <code class="language-plaintext highlighter-rouge">brms</code> with <code class="language-plaintext highlighter-rouge">cmdstanr</code>. Response is <code class="language-plaintext highlighter-rouge">log_seconds</code> (reader-facing numbers are back-transformed). All three include a state-space level (described below) to absorb improving performance over time, plus precipitation controls. Details on priors, convergence, and specifications live in the collapsible <a href="#technical-details">technical details</a> at the bottom.</p>

<h3 id="controlling-for-athletic-performance-improvement">Controlling for athletic-performance improvement</h3>

<p>Runners have gotten faster over the last century. Training methods modernized, footwear evolved, the global elite pool expanded. To measure weather effects, we need to account for that improvement: what would performance have looked like absent weather?</p>

<p>I chose a simple approach: a <strong>random walk</strong>. Each year’s typical finishing time is the previous year’s, plus a small stochastic step. The size of that step is estimated as part of the model fit. Notably, this imposes no specific shape, just one wandering step per year.</p>
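<p>In symbols, with \(\Delta t_y\) the years elapsed since the previous observed race:</p>

\[\ell_y = \ell_{y-1} + \varepsilon_y, \qquad \varepsilon_y \sim \mathcal{N}\left(0,\ \sigma_\ell^2 \, \Delta t_y\right)\]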

<p>This is a good fit for the following reasons:</p>

<ul>
  <li><strong>Uncertainty grows when you extrapolate.</strong> Predicting ten years past the data accumulates ten years of step variance; predicting one year ahead accumulates one. That matches reality — we know less about the far future than the well-measured middle.</li>
  <li><strong>Year gaps are handled naturally.</strong> 1918 (WWI), 2020 (COVID), and 2021 (October race, omitted from the analysis) are missing from the data. The random walk weights its steps by the time elapsed, so the missing years cost us proportional information without breaking anything.</li>
</ul>
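<p>To build intuition, here is a minimal simulation (not the fitted model, and with made-up numbers) of a regularized random walk whose innovations scale with the square root of elapsed time, so a multi-year gap carries proportionally more variance:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal sketch: a regularized random walk on the log-time scale.
# The sqrt(dt) scaling means an n-year gap accumulates n years of step variance.
set.seed(1)
years   &lt;- c(1970:1975, 1979:1984)  # hypothetical series with a 1976-1978 gap
dt      &lt;- c(1, diff(years))        # years elapsed since the previous race
sigma_l &lt;- 0.01                     # step SD (estimated in the real model)
level   &lt;- cumsum(rnorm(length(years), mean = 0, sd = sigma_l * sqrt(dt)))
plot(years, level, type = "b", xlab = "year", ylab = "latent level (log s)")
</code></pre></div></div>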

<p>Here’s what the fitted year level looks like:</p>

<figure class="">
  <img src="/blogimages/boston-marathon-1976/latent_slope_trace.png" alt="Year-level fitted finishing time — random-walk posterior" /><figcaption>
      Random-walk posterior for the year level. Red line = model’s best guess of the typical finishing time absent weather; bands = 80% and 95% credible intervals; dots = observed top-3 mean.

    </figcaption></figure>

<p>The red line is the model’s best guess of the typical finishing time for each year absent any weather effect; the bands are 80% and 95% credible intervals. The dots are the actual observed top-3 mean for each year.</p>

<p>If you haven’t worked with random walks often, you might ask: “the line just follows the dots — isn’t this just memorizing the data?” The random walk is <em>regularized</em>: each year’s step is constrained in size by a prior the data has to push against, so the line can track the broad pattern but can’t absorb every wiggle. The years where dots sit visibly above the red band (1976 again, plus 2012 and 2017) are years where <em>something other than the long-run trend</em> is at work, and that residual is what the weather model picks up. If the line went through every dot, there’d be nothing left for the weather term to explain.</p>

<p>In summary, this term captures the bulk of the year-over-year pattern not explained by the other model terms (like weather), and is temporally regularized so it can’t change too much from one year to the next.</p>

<p>One note: this plot highlights the 1953 drop from ~156 minutes to ~139 minutes. That marathon was a perfect mix of a few factors: a <a href="https://graphics.boston.com/marathon/history/1953.shtml?">25-knot (29 mph) tailwind</a>, 43°F weather, a fast field, and <a href="https://en.wikipedia.org/wiki/List_of_winners_of_the_Boston_Marathon">a course that was found to be over 1,000 yards short</a>.</p>

<p>The actual <code class="language-plaintext highlighter-rouge">brms</code> implementation does this via a custom Stan block; the code is in the <a href="#technical-details">technical details</a> at the bottom.</p>

<h3 id="the-wang-replication-model--linear-hinge">The Wang-replication model — linear hinge</h3>

<p>First I reproduce the Wang model. The “hinge” is a kink at a fixed knot temperature: below the knot, there is no temperature effect — finishing time is flat. Above the knot, finishing time grows linearly with each additional degree. The biological story is that thermoregulation is essentially free until core-temperature rise becomes a problem, after which performance degrades steadily.</p>

<p>Explicitly:</p>

\[\log(\text{seconds}) \sim \ell_{\text{year}} + \beta \max(T - k, 0) + \gamma \cdot \text{precip}\]

<p>Knot \(k\) at 59°F (15°C). Linear above the knot, exactly flat below by construction; \(\ell_{\text{year}}\) is the random-walk year level.</p>

<p>In <code class="language-plaintext highlighter-rouge">brms</code> syntax:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bf</span><span class="p">(</span><span class="w">
  </span><span class="n">log_seconds</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">I</span><span class="p">(</span><span class="n">pmax</span><span class="p">(</span><span class="n">tmax_f</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">KNOT_F</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">precip_day</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">precip_missing</span><span class="p">,</span><span class="w">
  </span><span class="n">sigma</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<figure class="">
  <img src="/blogimages/boston-marathon-1976/marginal_temp_curve_C.png" alt="Wang-replication model marginal curve" /><figcaption>
      Wang-replication model: marginal temperature curve with 95% credible ribbon. Slowdown plotted relative to a 50°F reference.

    </figcaption></figure>

<p>A quick note on how to read this and the next two plots: the y-axis is the predicted <em>slowdown relative to a 50°F reference</em>. Each curve is plotted with the model’s prediction at 50°F subtracted off, so all three pass through zero at 50°F — that’s a reporting anchor, not a model constraint. (The hinge models do have a structural flat region, but it lives below their 59°F knot, not specifically at 50°F.) 50°F is roughly the long-run Patriots’ Day average, so it’s a natural “benign weather” baseline to compare hot or cold years against.</p>
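<p>A minimal sketch of that anchoring, assuming a fitted object named <code class="language-plaintext highlighter-rouge">fit_wang</code> (a hypothetical name) and glossing over the custom state-space term, which complicates prediction in practice:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: marginal temperature curve relative to the 50°F reference.
library(brms)
grid &lt;- data.frame(tmax_f = 40:95, precip_day = 0, precip_missing = 0)
mu   &lt;- posterior_epred(fit_wang, newdata = grid)  # draws x grid, log-seconds
ref  &lt;- mu[, grid$tmax_f == 50]                    # each draw's value at 50°F
slowdown_min &lt;- (exp(mu) - exp(ref)) / 60          # slowdown in minutes
curve_ci &lt;- apply(slowdown_min, 2, quantile, probs = c(0.025, 0.5, 0.975))
</code></pre></div></div>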

<p>The ribbon is the posterior 95% credible interval. The slope above 15°C comes out to 0.30 min/°C (95% credible interval [0.14, 0.46]), somewhat shallower than Wang’s pooled 0.39.<sup id="fnref:ci-note" role="doc-noteref"><a href="#fn:ci-note" class="footnote" rel="footnote">1</a></sup> This converts to about <strong>10 seconds per °F above 59</strong>. The intervals overlap comfortably and the gap is consistent with the structural differences in the analyses - Boston is a single course, while Wang’s pooled estimate considers variation across many marathons.</p>
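<p>The unit conversion to that headline number is worth making explicit:</p>

\[0.30\ \tfrac{\text{min}}{^\circ \text{C}} \times 60\ \tfrac{\text{s}}{\text{min}} = 18\ \tfrac{\text{s}}{^\circ \text{C}}, \qquad 18\ \tfrac{\text{s}}{^\circ \text{C}} \div 1.8\ \tfrac{^\circ \text{F}}{^\circ \text{C}} = 10\ \tfrac{\text{s}}{^\circ \text{F}}\]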

<p>The curve is exactly flat below 59°F because the model says it has to be: there’s no below-knot coefficient. By design, this model only makes claims above the hinge. The other two models allow for varying behavior for cold weather too.</p>

<h3 id="the-physiology-model--quadratic-hinge">The Physiology model — quadratic hinge</h3>

<p>Next is the physiology-based model, which assumes the temperature response is quadratic. As in the Wang replication, the knot is pinned at 59°F, but this model allows variation both above and below it:</p>

\[\log(\text{seconds}) \sim \ell_{\text{year}} + \beta_1 \max(k - T, 0) + \beta_2 \max(T - k, 0)^2 + \beta_3 \cdot \text{precip}\]

<p>This approach is motivated by <a href="https://doi.org/10.1097/00005768-199709000-00018">Galloway &amp; Maughan (1997)</a>, a cycling study which found an inverted-U relationship with performance peaking at ~10.5°C. We shift the knot to 15°C to align with Wang’s runner-based study, keep the quadratic form above the knot, and add a linear term below it so cold days can matter too.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bf</span><span class="p">(</span><span class="w">
  </span><span class="n">log_seconds</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">I</span><span class="p">(</span><span class="n">pmax</span><span class="p">(</span><span class="n">KNOT_F</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">tmax_f</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">I</span><span class="p">(</span><span class="n">pmax</span><span class="p">(</span><span class="n">tmax_f</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">KNOT_F</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">precip_day</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">precip_missing</span><span class="p">,</span><span class="w">
  </span><span class="n">sigma</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<figure class="">
  <img src="/blogimages/boston-marathon-1976/marginal_temp_curve_B.png" alt="Physiology model marginal curve" /><figcaption>
      Physiology (quadratic hinge) model: marginal temperature curve with 95% credible ribbon. Note the aggressive bend above the 59°F knot.

    </figcaption></figure>

<p>Despite being allowed to vary below the knot, the curve remains approximately flat there. Above the knot we see strong quadratic behavior; however, how tightly the curve bends to reach the lone 92°F day is a bit alarming, and smells overfit to me.</p>

<h3 id="the-spline-model">The Spline model</h3>

<p>Last, I fit a model of my own crafting, with less structure than the literature imposes: the data drives the curve rather than a strict functional form.</p>

\[\log(\text{seconds}) \sim \ell_{\text{year}} + s(T_{\text{anomaly}}) + \beta \cdot \text{precip}\]

<p>Here, we just have the performance and precipitation controls, and a thin-plate spline on temperature. A thin-plate spline is a flexible curve fit through the data with a built-in penalty for getting too wiggly. One small change from the previous two models: the spline takes temperature as an <em>anomaly</em> (race-day max minus the day-of-year climatological average) rather than the raw °F. The hinge models needed a raw-scale variable to anchor the 59°F knot; the spline doesn’t have a fixed knot, so the climate-normalized version is the natural input (Boston’s Patriots’-Day climatology only varies by a few degrees across April, so the practical difference is small).</p>
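<p>A minimal sketch of the anomaly construction, assuming a hypothetical <code class="language-plaintext highlighter-rouge">normals</code> table built from the NOAA 1991–2020 daily climate normals:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: race-day max temperature minus the day-of-year climatological normal.
library(dplyr)
races &lt;- races |&gt;
  mutate(doy = as.integer(format(race_date, "%j"))) |&gt;
  left_join(normals, by = "doy") |&gt;                 # adds tmax_normal_f
  mutate(tmax_anomaly_f = tmax_f - tmax_normal_f)
</code></pre></div></div>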

<p>This model also adds a second equation letting the residual variance depend smoothly on year: the within-year top-3 SD shrinks across the century as the elite field densifies and pacing professionalizes.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bf</span><span class="p">(</span><span class="w">
  </span><span class="n">log_seconds</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">tmax_anomaly_f</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">precip_day</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">precip_missing</span><span class="p">,</span><span class="w">
  </span><span class="n">sigma</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<figure class="">
  <img src="/blogimages/boston-marathon-1976/marginal_temp_curve_A.png" alt="Spline model marginal curve" /><figcaption>
      Spline model: marginal temperature curve with 95% credible ribbon. Flexible thin-plate spline, no fixed knot.

    </figcaption></figure>

<p>The Spline model agrees with the other two below ~80°F where the data is dense. Above 85°F the curve bends less aggressively than the Physiology model’s quadratic — with one data point above 90°F, the thin-plate prior pulls toward a simpler shape.</p>

<h2 id="so-whats-the-answer">So What’s the Answer?</h2>

<p>So we’ve got three different formulations to answer our question. However, we want one final answer, not three. Often, modelers will just select a “best” model. If we go by that heuristic, the Spline model wins through <a href="https://mc-stan.org/loo/">Leave One Out (LOO) cross-validation</a>, a measure of how well the model predicts held-out observations. The “LOO Δ vs best (in SE)” shown below is the gap to the best model expressed in standard errors of that gap. Spline wins individual leave-one-out: Physiology is modestly worse (~2 SE), Wang-replication is clearly worse (~6 SE).</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>LOO Δ vs best (in SE)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Spline</td>
      <td>0 (best)</td>
    </tr>
    <tr>
      <td>Physiology (quadratic hinge)</td>
      <td>−2.1</td>
    </tr>
    <tr>
      <td>Wang-replication (linear hinge)</td>
      <td>−5.9</td>
    </tr>
  </tbody>
</table>

<p>However, I wanted to be a bit careful here. Notably, we’re in the territory of outliers and extrapolation, dangerous waters for statistical models. In this regime, I err on the side of conservatism where possible. There is plenty of literature showing that a many-models approach often beats a single, highly predictive model. That’s what I chose to do here, known as <strong>model stacking</strong>: build a weighted mixture where each model’s weight is chosen to maximize the predictive performance of the combined model, following <a href="https://doi.org/10.1214/17-BA1091">Yao, Vehtari, Simpson &amp; Gelman (2018)</a>.</p>
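<p>A sketch of both steps with the <code class="language-plaintext highlighter-rouge">loo</code> package, assuming three fitted <code class="language-plaintext highlighter-rouge">brms</code> models under hypothetical names:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: LOO comparison, then stacking weights (Yao et al. 2018).
library(brms)
library(loo)
loos &lt;- list(
  spline = loo(fit_spline),
  phys   = loo(fit_phys),
  wang   = loo(fit_wang)
)
loo_compare(loos)                             # ELPD differences vs the best model
loo_model_weights(loos, method = "stacking")  # mixture weights for the blend
</code></pre></div></div>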

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Stacking weight (95% bootstrap CI, B=200)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Spline</td>
      <td>0.18 [0.14, 0.27]</td>
    </tr>
    <tr>
      <td>Physiology (quadratic hinge)</td>
      <td>0.55 [0.50, 0.65]</td>
    </tr>
    <tr>
      <td>Wang-replication (linear hinge)</td>
      <td>0.27 [0.17, 0.31]</td>
    </tr>
  </tbody>
</table>

<p>Stacking can favor a model that isn’t the best individual predictor, and we see that here — the largest contributor to the blend is the quadratic Physiology model, not the Spline. Stacking asks “if I’m allowed to combine the three models into one weighted prediction, what weights minimize the <em>combined</em> prediction error?” A model can be a slightly worse solo predictor and still earn high mixture weight if it makes its mistakes in <em>different places</em> than the others. Uncorrelated errors cancel out when blended. The Physiology model is more aggressive at the tails, predicting a higher heat penalty than the other two. The Spline and linear hinge models make similar predictions to each other, so they contribute redundant information. The Physiology model’s distinct behavior is what earns it a higher weight in the blend.</p>

<p>That said, the data-rich region appropriately dominates the fit, but we then assume that fit extrapolates cleanly to the sparse region. The mixture’s weights reflect predictive accuracy in mild conditions, and carrying that accuracy into the heat tail is an assumption, not a guarantee.
For the rest of this post I report the <strong>stacked mixture</strong> as the headline. The Spline is the natural single-model starting point (it’s the best individual predictor), but for an extrapolation question with a single anchoring data point at 92°F, leaning on a blend that incorporates the mechanism-committed Physiology and Wang-replication models is the more defensible read.</p>
<h3 id="effect-of-weather-on-final-race-times">Effect of Weather on Final Race Times</h3>

<figure class="">
  <img src="/blogimages/boston-marathon-1976/year_by_year_effect.png" alt="Stacked year-by-year weather effect" /><figcaption>
      Stacked mixture’s weather-only contribution to each year’s top-3 mean finish time. 1976 and 2017 stand clearly above the pack; 2018’s effect is driven by cold rather than heat.

    </figcaption></figure>

<p>For every race-year in the panel, the stacked model estimates the weather-only contribution to that year’s top-3 average finish time: the difference between what the model predicts happened and what it would have predicted on a 50°F day. Note that this contribution is bounded below at zero: each model’s marginal curve has its minimum at the 50°F reference, so any departure from the reference is non-negative by construction.</p>
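<p>A sketch of that computation for a single model; the stacked version repeats it per model and blends the draws by the stacking weights (<code class="language-plaintext highlighter-rouge">fit</code> is a hypothetical stand-in):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: weather-only contribution = prediction at observed weather
# minus prediction at the 50°F, dry-day reference, for every year.
cf &lt;- transform(races, tmax_f = 50, precip_day = 0, precip_missing = 0)
d  &lt;- exp(posterior_epred(fit, newdata = races)) -
      exp(posterior_epred(fit, newdata = cf))   # draws x years, in seconds
weather_effect_min &lt;- colMeans(d) / 60          # posterior-mean minutes per year
</code></pre></div></div>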

<p>Most years cluster within a minute of zero. Two stand clearly above: 1976 and 2017. 2018 is particularly interesting: there, the model predicts a slowdown from cold weather rather than heat.</p>

<h3 id="the-1976-counterfactual">The 1976 counterfactual</h3>

<p>In an alternate universe where it was a 50°F Patriots’ Day in 1976, the stacked mixture estimates the 1976 top-3 mean would have been <strong>faster by 6:27</strong> (95% credible interval 2:47–10:52).</p>

<p>For context, by model: the Physiology model alone says 7:21 (the largest of the three), the Wang-replication model says 5:44, and the Spline model says 5:20.</p>

<figure class="">
  <img src="/blogimages/boston-marathon-1976/counterfactual_1976_kde.png" alt="1976 heat-cost posterior densities" /><figcaption>
      Posterior densities for the 1976 heat cost, by model. Stacked mixture headline: 6:27 faster on a 50°F day (95% CI 2:47–10:52).

    </figcaption></figure>

<p>The figure shows each model’s full posterior density side-by-side. The model identifies a <em>cohort-level</em> heat effect — every finisher in the field gets shifted by the same amount, so the counterfactual time is the same for each place by construction:</p>

<table>
  <thead>
    <tr>
      <th>Place</th>
      <th>Runner</th>
      <th>Actual time</th>
      <th>Counterfactual time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Jack Fultz</td>
      <td>2:20:19</td>
      <td>2:13:52</td>
    </tr>
    <tr>
      <td>2</td>
      <td>Mario Cuevas</td>
      <td>2:21:13</td>
      <td>2:14:46</td>
    </tr>
    <tr>
      <td>3</td>
      <td>Jose DeJesus</td>
      <td>2:22:10</td>
      <td>2:15:43</td>
    </tr>
  </tbody>
</table>

<p>Observed top-3 mean: <strong>2:21:14</strong>; the model’s 50°F counterfactual top-3 mean is <strong>2:14:47</strong>.</p>

<h3 id="2026-prediction">2026 prediction</h3>

<p>I still have 40 minutes until I’m technically allowed to burn the midnight oil, but I did do my last data pull as I wrapped up this piece at 11:20 pm in Boston, the night before the race. The NWS forecast for Hopkinton is 48°F, near-perfect conditions. Under that weather, the stacked mixture predicts basically no temperature-related penalty and a top-3 mean of <strong>2:05:10</strong> with a 95% posterior predictive interval of 1:57:54–2:12:29. Obviously this is an incredibly wide interval. It would be remarkable to beat the world record of 2:00:35 on Boston’s famously difficult course, yet 11% of the posterior probability mass puts the top-3 average below that mark. That is to be expected, though, given we formulated this as a weather model, not a performance-forecasting model.</p>

<figure class="">
  <img src="/blogimages/boston-marathon-1976/predict_2026_kde.png" alt="2026 KDE overlay" /><figcaption>
      Posterior predictive densities for 2026 top-3 mean, by model. All three bunch tightly given the near-perfect 48°F forecast.

    </figcaption></figure>

<p>Of course, these curves all look similar, which is a result of the forecast being near-perfect.</p>

<p><span style="font-size:1.5em; font-weight:bold; display:block; margin:2em 0;">
Good luck to all the runners out there!
</span></p>

<h2 id="caveats">Caveats</h2>

<p class="notice--warning"><strong>Temperature is a proxy.</strong> The best marathon-performance studies use WBGT (wet-bulb globe temperature), which incorporates humidity, radiation, and wind. Blue Hill doesn’t have continuous dewpoint data until 2006 or wind data until 1998 at Logan. I kept maximum temperature + a precipitation indicator for simplicity.</p>

<p class="notice--warning"><strong>DNF selection.</strong> The 1976 top-3 is the 3 best <em>finishers</em>, not the 3 best in the field. Rodgers and Fleming dropped out — athletes who ran 2:09 and 2:12 on 50°F in 1975 and 1977. This selection bias affects the study, but I kept the top-3 framing following the precedent set by Ely 2007.</p>

<h2 id="technical-details">Technical details</h2>

<details>
  <summary><strong>Full model specifications, priors, and convergence</strong></summary>

  <p>All models fit in <code class="language-plaintext highlighter-rouge">brms</code> with the <code class="language-plaintext highlighter-rouge">cmdstanr</code> backend, 4 chains × 2000 post-warmup draws (2000 warmup), <code class="language-plaintext highlighter-rouge">adapt_delta = 0.995</code>. The state-space level uses a non-centered parameterization (sample standardized innovations <code class="language-plaintext highlighter-rouge">tilde_l ~ N(0,1)</code>, build <code class="language-plaintext highlighter-rouge">ll[y]</code> deterministically) to avoid Neal’s funnel.</p>

  <h3 id="state-space-level-shared-across-all-three-models">State-space level (shared across all three models)</h3>

  <div class="language-stan highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">parameters</span> <span class="p">{</span>
  <span class="kt">real</span> <span class="nv">ll_init</span><span class="p">;</span>
  <span class="kt">vector</span><span class="p">[</span><span class="nv">Y</span><span class="p">]</span> <span class="nv">tilde_l</span><span class="p">;</span>
  <span class="kt">real</span><span class="o">&lt;</span><span class="na">lower</span><span class="o">=</span><span class="mi">0</span><span class="o">&gt;</span> <span class="nv">sigma_l</span><span class="p">;</span>
<span class="p">}</span>
<span class="nn">transformed parameters</span> <span class="p">{</span>
  <span class="kt">vector</span><span class="p">[</span><span class="nv">Y</span><span class="p">]</span> <span class="nv">ll</span><span class="p">;</span>
  <span class="nv">ll</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="nv">ll_init</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="nv">y</span> <span class="kr">in</span> <span class="mi">2</span><span class="o">:</span><span class="nv">Y</span><span class="p">)</span> <span class="p">{</span>
    <span class="nv">ll</span><span class="p">[</span><span class="nv">y</span><span class="p">]</span> <span class="o">=</span> <span class="nv">ll</span><span class="p">[</span><span class="nv">y</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="nv">sigma_l</span> <span class="o">*</span> <span class="nb">sqrt</span><span class="p">(</span><span class="nv">dt</span><span class="p">[</span><span class="nv">y</span><span class="p">])</span> <span class="o">*</span> <span class="nv">tilde_l</span><span class="p">[</span><span class="nv">y</span><span class="p">];</span>
  <span class="p">}</span>
<span class="p">}</span>
<span class="nn">model</span> <span class="p">{</span>
  <span class="nv">ll_init</span>  <span class="o">~</span> <span class="nb">normal</span><span class="p">(</span><span class="mi">9</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">);</span>
  <span class="nv">tilde_l</span>  <span class="o">~</span> <span class="nb">std_normal</span><span class="p">();</span>
  <span class="nv">sigma_l</span>  <span class="o">~</span> <span class="nb">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div>  </div>

  <p>Per-row mean predictor adds <code class="language-plaintext highlighter-rouge">ll[year_idx[n]]</code>. The <code class="language-plaintext highlighter-rouge">√dt</code> scaling on the innovation honors race-cancellation gaps (1918, 2020, 2021). Note this is a simplification: the full Harvey 1989 covariance for a level-only random walk is the same, but if we’d been able to fit a local linear trend, the gap variance would be underestimated by ~1 year × σ_β per gap. Documented for completeness.</p>

  <h3 id="spline-model--flexible-spline-mean--year-only-sigma-sub-model">Spline model — flexible spline (mean) + year-only sigma sub-model</h3>

  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bf</span><span class="p">(</span><span class="w">
  </span><span class="n">log_seconds</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">tmax_anomaly_f</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">precip_day</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">precip_missing</span><span class="p">,</span><span class="w">
  </span><span class="n">sigma</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div>  </div>

  <p>The <code class="language-plaintext highlighter-rouge">~ 0</code> suppresses brms’s automatic intercept, since the state-space level <code class="language-plaintext highlighter-rouge">ll[1]</code> carries the baseline. The sigma sub-model is <code class="language-plaintext highlighter-rouge">s(year)</code> only — I tested adding <code class="language-plaintext highlighter-rouge">s(tmax_anomaly)</code> to the variance equation, but at three observations per year there’s no power for within-year heat-variance signal (posterior dominated by prior). The s(year) term retains a real 4× signal across the century (within-year SD shrinks as the elite field densifies).</p>

  <h3 id="physiology-model--quadratic-hinge">Physiology model — quadratic hinge</h3>

  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bf</span><span class="p">(</span><span class="w">
  </span><span class="n">log_seconds</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">I</span><span class="p">(</span><span class="n">pmax</span><span class="p">(</span><span class="n">knot_f</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">tmax_f</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">I</span><span class="p">(</span><span class="n">pmax</span><span class="p">(</span><span class="n">tmax_f</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">knot_f</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">precip_day</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">precip_missing</span><span class="p">,</span><span class="w">
  </span><span class="n">sigma</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div>  </div>

  <h3 id="wang-replication-model--linear-hinge">Wang-replication model — linear hinge</h3>

  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bf</span><span class="p">(</span><span class="w">
  </span><span class="n">log_seconds</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">I</span><span class="p">(</span><span class="n">pmax</span><span class="p">(</span><span class="n">tmax_f</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">knot_f</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">precip_day</span><span class="w"> </span><span class="o">+</span><span class="w">
                </span><span class="n">precip_missing</span><span class="p">,</span><span class="w">
  </span><span class="n">sigma</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div>  </div>

  <p>Knot fixed at 59°F across the Physiology and Wang-replication models.</p>

  <h3 id="priors">Priors</h3>

  <table>
    <thead>
      <tr>
        <th>Parameter</th>
        <th>Prior</th>
        <th>Justification</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Intercept</td>
        <td><code class="language-plaintext highlighter-rouge">normal(9, 0.3)</code></td>
        <td>log(7920 s) ≈ 8.98; ±1σ covers roughly the 1:40–2:55 elite range</td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">s(year)</code> SD</td>
        <td><code class="language-plaintext highlighter-rouge">normal(0, 0.30)</code></td>
        <td>Total 1927→2025 year effect ≈ 0.4 on log scale</td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">s(tmax_anomaly)</code> SD (Spline model)</td>
        <td><code class="language-plaintext highlighter-rouge">normal(0, 0.18)</code></td>
        <td>Prior predictive 97.5% tail at 90°F vs 50°F is ~30 min slowdown</td>
      </tr>
      <tr>
        <td>Hinge coefficients (Physiology + Wang-replication)</td>
        <td><code class="language-plaintext highlighter-rouge">normal(0, 0.05)</code></td>
        <td>Above-knot effects on log scale; ±2σ covers ±10% per °F departure</td>
      </tr>
      <tr>
        <td>precip indicator</td>
        <td><code class="language-plaintext highlighter-rouge">normal(0, 0.05)</code></td>
        <td>Wet vs dry ±2σ covers ~±10% effect on finishing time</td>
      </tr>
    </tbody>
  </table>

  <h3 id="convergence">Convergence</h3>

  <table>
    <thead>
      <tr>
        <th>Model</th>
        <th>max Rhat</th>
        <th>min bulk-ESS</th>
        <th>min tail-ESS</th>
        <th>divergent transitions</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Spline</td>
        <td>1.004</td>
        <td>1407</td>
        <td>2380</td>
        <td>0</td>
      </tr>
      <tr>
        <td>Physiology</td>
        <td>1.002</td>
        <td>1225</td>
        <td>2163</td>
        <td>0</td>
      </tr>
      <tr>
        <td>Wang-replication</td>
        <td>1.001</td>
        <td>1476</td>
        <td>3112</td>
        <td>0</td>
      </tr>
    </tbody>
  </table>

</details>

<details>
  <summary><strong>Prior predictive and posterior predictive checks</strong></summary>

  <p><img src="/blogimages/boston-marathon-1976/fig5_prior_predictive.png" alt="Prior predictive" /></p>

  <p>The prior predictive generates plausible top-3 time distributions across the TMAX range.</p>

  <p><img src="/blogimages/boston-marathon-1976/fig6_ppc_combined.png" alt="PPC combined" /></p>

  <p>Posterior predictive checks for all three models. Per-model PPCs:</p>

  <p><img src="/blogimages/boston-marathon-1976/fig6_ppc_A.png" alt="PPC — Spline model" />
<img src="/blogimages/boston-marathon-1976/fig6_ppc_B.png" alt="PPC — Physiology model" />
<img src="/blogimages/boston-marathon-1976/fig6_ppc_C.png" alt="PPC — Wang-replication model" /></p>

</details>

<details>
  <summary><strong>LOO comparison and stacking weight bootstrap</strong></summary>

  <p>The ELPD differences below are enormous in absolute terms; values this large are the expected signature of the Pareto-k &gt; 0.7 degeneracy described after the table, and only the ELPD-diff / SE-diff ratio is meaningful.</p>

  <table>
    <thead>
      <tr>
        <th>Model</th>
        <th>ELPD-diff</th>
        <th>SE-diff</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Spline</td>
        <td>0.0</td>
        <td>0.0</td>
      </tr>
      <tr>
        <td>Physiology</td>
        <td>-2168949.5</td>
        <td>1014288.0</td>
      </tr>
      <tr>
        <td>Wang-replication</td>
        <td>-8524375.6</td>
        <td>1454564.9</td>
      </tr>
    </tbody>
  </table>

  <p>Absolute ELPD-LOO and p-LOO are omitted because state-space LOO is degenerate here — every observation has Pareto-k &gt; 0.7, the latent year level is informed by all years, and leaving out a single observation fundamentally changes the inference. The ELPD-diff / SE-diff ratios above are still useful as a <em>relative</em> ranking signal, but the absolute predictive log scores are not interpretable on their usual scale.</p>

  <p>Stacking weights with bootstrapped 95% CIs (200 bootstrap samples, resampling races with replacement):</p>

  <table>
    <thead>
      <tr>
        <th>Model</th>
        <th>Stacking weight (95% bootstrap CI, B=200)</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Spline</td>
        <td>0.18 [0.14, 0.27]</td>
      </tr>
      <tr>
        <td>Physiology (quadratic hinge)</td>
        <td>0.55 [0.50, 0.65]</td>
      </tr>
      <tr>
        <td>Wang-replication (linear hinge)</td>
        <td>0.27 [0.17, 0.31]</td>
      </tr>
    </tbody>
  </table>

  <p>The Physiology model’s CI doesn’t overlap the other two, identifying it as the preferred mixture component even on this small panel — corroborating the mechanism-anchored choice independently.</p>
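  <p>For reference, a sketch of how such a bootstrap could run, reusing the pointwise LOO log scores (object names are hypothetical):</p>

  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: resample years with replacement, recompute stacking weights, repeat.
set.seed(42)
lpd &lt;- sapply(loos, function(l) l$pointwise[, "elpd_loo"])  # n_years x 3
w_boot &lt;- replicate(200, {
  idx &lt;- sample(nrow(lpd), replace = TRUE)
  as.numeric(loo::stacking_weights(lpd[idx, ]))
})
apply(w_boot, 1, quantile, probs = c(0.025, 0.975))  # per-model 95% CI
</code></pre></div></div>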

</details>

<details>
  <summary><strong>Knot sensitivity</strong></summary>

  <p>The Physiology and Wang-replication models both fix the hinge knot at 59°F (15°C) a priori from Galloway/Wang; the sensitivity sweep below is reported for transparency, not for slope selection. Sweep at 55/57/59/61/63°F (Physiology-model heat cost shown):</p>

  <table>
    <thead>
      <tr>
        <th>Knot (°F)</th>
        <th>1976 top-3 mean heat cost (min)</th>
        <th>95% CI</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>55</td>
        <td>7.2</td>
        <td>[3.4, 11.2]</td>
      </tr>
      <tr>
        <td>57</td>
        <td>7.2</td>
        <td>[3.3, 11.3]</td>
      </tr>
      <tr>
        <td>59</td>
        <td>7.4</td>
        <td>[3.2, 11.5]</td>
      </tr>
      <tr>
        <td>61</td>
        <td>7.2</td>
        <td>[2.9, 11.6]</td>
      </tr>
      <tr>
        <td>63</td>
        <td>7.4</td>
        <td>[2.9, 11.8]</td>
      </tr>
    </tbody>
  </table>

  <p>The Physiology-model headline is robust across knots — within 0.2 min of the canonical 59°F value, well inside the within-knot CI width. Knot location is not a meaningful researcher degree of freedom for the headline.</p>

  <p>For the Wang comparison, the Wang-replication model’s above-knot slope per knot:</p>

  <table>
    <thead>
      <tr>
        <th>Knot (°F)</th>
        <th>Above-knot slope (min/°C)</th>
        <th>95% CI</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>55</td>
        <td>0.24</td>
        <td>[0.10, 0.39]</td>
      </tr>
      <tr>
        <td>57</td>
        <td>0.27</td>
        <td>[0.12, 0.42]</td>
      </tr>
      <tr>
        <td>59</td>
        <td>0.30</td>
        <td>[0.14, 0.47]</td>
      </tr>
      <tr>
        <td>61</td>
        <td>0.33</td>
        <td>[0.15, 0.51]</td>
      </tr>
      <tr>
        <td>63</td>
        <td>0.36</td>
        <td>[0.16, 0.56]</td>
      </tr>
    </tbody>
  </table>

  <p>The Wang-replication model’s slope grows monotonically with knot (more leverage from a sharper threshold). At the canonical 59°F we get 0.30 min/°C; the slope at knot=63°F approaches Wang’s pooled 0.39. The CIs overlap Wang’s value at every knot we tested.</p>

</details>

<details>
  <summary><strong>Across-model marginal curve (envelope)</strong></summary>

  <p><img src="/blogimages/boston-marathon-1976/marginal_temp_curve.png" alt="Marginal curve envelope" /></p>

  <p><img src="/blogimages/boston-marathon-1976/marginal_temp_curve_decomposed.png" alt="Marginal curve decomposed" /></p>

  <p>The envelope is the pointwise union of the three models’ 95% intervals across TMAX. The decomposed view shows the three curves separately for comparison.</p>

</details>

<h2 id="acknowledgments-and-sources">Acknowledgments and sources</h2>

<ul>
  <li>NOAA <a href="https://www.ncei.noaa.gov/cdo-web/datasets">GHCN-Daily</a> for Blue Hill weather records and <a href="https://www.ncei.noaa.gov/products/us-climate-normals">1991–2020 Climate Normals</a> for the climatology baseline.</li>
  <li><a href="https://github.com/adrian3/Boston-Marathon-Data-Project">adrian3/Boston-Marathon-Data-Project</a> for pre-2019 race results.</li>
  <li>The <a href="https://www.baa.org/">Boston Athletic Association</a> for historical results 2019–2025 and race-date records.</li>
  <li><a href="https://paul-buerkner.github.io/brms/">Paul Bürkner’s <code class="language-plaintext highlighter-rouge">brms</code></a> and the Stan developers.</li>
  <li><a href="https://www.bostonglobe.com/about/staff-list/staff/matt-porter/">Matt Porter</a> for getting my wheels spinning on the problem.</li>
</ul>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:ci-note" role="doc-endnote">
      <p>All “95% CI” values in this post are 95% Bayesian credible intervals (the central 95% of the posterior), not frequentist confidence intervals. I’ll abbreviate as “95% CI” in tables for compactness. <a href="#fnref:ci-note" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><category term="Statistics" /><category term="bayesian" /><category term="brms" /><category term="sports" /><category term="running" /><category term="weather" /><summary type="html"><![CDATA[How much did the 1976 'Run for the Hoses' actually slow the field?]]></summary></entry><entry><title type="html">Forecasting March Madness 2026 - Latent Skills Models</title><link href="https://tylerjamesburch.com/blog/statistics/march-madness-2026" rel="alternate" type="text/html" title="Forecasting March Madness 2026 - Latent Skills Models" /><published>2026-03-16T00:00:00+00:00</published><updated>2026-03-16T00:00:00+00:00</updated><id>https://tylerjamesburch.com/blog/statistics/march-madness-2026</id><content type="html" xml:base="https://tylerjamesburch.com/blog/statistics/march-madness-2026"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p><em>For live updates throughout the tournament, see <a href="/march-madness-2026/">the dashboard</a>.</em></p>

<h2 id="background">Background</h2>

<p>Every March, 68 college basketball teams from 31 conferences get thrown into a single-elimination bracket, and \(2^{63}\) (roughly 9.2 quintillion) possible outcomes follow. I find bracketology to be a really fun statistical problem: individual games are often fairly predictable, but with that many possible outcomes, low-probability upsets are virtually guaranteed to sneak through <em>somewhere</em>. Everyone’s bracket breaks. That’s why <a href="https://en.wikipedia.org/wiki/Paul_the_Octopus">even an octopus</a> can put together competitive predictions for single-elimination tournaments.</p>

<p>For this year, I fit a model to forecast the NCAA March Madness tournaments. The approach builds on the same philosophy as my <a href="https://tylerjamesburch.com/blog/statistics/hockey-bayes">NHL team strength model</a>, a hierarchical Bayesian model for head-to-head matchups where we have historical scores. That model decomposes team quality into attack and defense parameters, and the same decomposition is a natural fit for basketball. By separating offense and defense, we can see how Houston’s defense-first approach is fundamentally different from Alabama’s strong offensive profile. Simulating the bracket knowing <em>how</em> a team wins, not just <em>that</em> it wins, provides a richer, more digestible understanding of the predictions.</p>

<h2 id="the-model">The Model</h2>

<h3 id="step-1-bradley-terry-winloss">Step 1: Bradley-Terry (Win/Loss)</h3>

<p>The classic <a href="https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model">Bradley-Terry model</a> assigns each team \(i\) a latent strength \(\theta_i\) and models the probability that team \(i\) beats team \(j\) as a function of the difference in their strengths. In its typical form, the only outcome is binary: who won. This is a workhorse model for pairwise comparisons (<a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo ratings</a> are a special case) and it works fine for generating bracket probabilities.</p>
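<p>In its standard form, the win probability is a logistic function of the strength gap:</p>

\[P(i \text{ beats } j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} = \text{logit}^{-1}(\theta_i - \theta_j)\]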

<p>But basketball gives us more than just wins and losses. Beating a team by 30 tells you something very different than winning by 1, but the basic Bradley-Terry model sees those as equivalent.</p>

<h3 id="step-2-score-differential">Step 2: Score Differential</h3>

<p>A natural extension is to model the score <em>margin</em> directly:</p>

\[\text{margin}_{ij} \sim \text{Normal}(\theta_i - \theta_j + \alpha \cdot \text{home}_i, \sigma)\]

<p>Now \(\theta_i\) captures how many points team \(i\) is above or below average, and the likelihood uses the full margin rather than collapsing it to a binary outcome. Home court advantage \(\alpha\) enters linearly. I stood this up initially, and it achieved a Brier score of 0.189 across 10 seasons of held-out tournament predictions.</p>

<p>But this model still assigns each team a single “strength” number. Michigan and Houston might have the same \(\theta\), but they get there in completely different ways. The single strength parameter can’t see that.</p>

<h3 id="step-3-offense-defense-decomposition">Step 3: Offense-Defense Decomposition</h3>

<p>In this approach, each game produces <em>two</em> observations, one score per team. By observing each team’s score separately, you can naturally split team strength into offense and defense:</p>

\[\text{score}_i \sim \text{Normal}(\mu + \text{off}_i - \text{def}_j + \alpha \cdot \text{home}_i, \sigma)\]

\[\text{score}_j \sim \text{Normal}(\mu + \text{off}_j - \text{def}_i - \alpha \cdot \text{home}_i, \sigma)\]

<p>Each team gets two parameters: an offensive strength \(\text{off}_i\) (how many points they generate above average) and a defensive strength \(\text{def}_i\) (how many points they <em>prevent</em> above average). The global intercept \(\mu\) anchors the average score per team per game, estimated at about 70 points. Home court advantage \(\alpha\) applies in the usual way.</p>

<p>Note that if you subtract the second equation from the first, you recover the margin model. Any information the margin model could learn, this model can as well. What it gains is the ability to distinguish Houston’s defense-first 77-55 wins from Michigan’s offense-first 101-83 wins, even when both are comfortable victories.</p>
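<p>Concretely, subtracting the second score equation from the first gives</p>

\[\text{score}_i - \text{score}_j = \underbrace{(\text{off}_i + \text{def}_i)}_{\theta_i} - \underbrace{(\text{off}_j + \text{def}_j)}_{\theta_j} + 2\alpha \cdot \text{home}_i\]

<p>so the margin model’s single strength is recovered as \(\theta = \text{off} + \text{def}\), with a doubled home-court coefficient.</p>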

<p>This is the same framework <a href="https://discovery.ucl.ac.uk/id/eprint/16040/1/16040.pdf">Baio &amp; Blangiardo</a> used for modeling the Premier League, adapted for college basketball. The key insight from that work carries over: by modeling scores directly instead of margins, you can identify teams that win by outscoring opponents versus teams that win by shutting them down.</p>

<h3 id="hierarchical-structure-and-the-lkj-correlation">Hierarchical Structure and the LKJ Correlation</h3>

<p>Not all teams are created equal in terms of schedule strength. A mid-major team that went 28-3 against weak competition: how good are they really? I handle this with a hierarchical prior. Each team’s offense and defense are drawn from conference-level distributions:</p>

\[\begin{bmatrix} \text{off}_i \\ \text{def}_i \end{bmatrix} = \begin{bmatrix} \mu^{\text{off}}_{c[i]} \\ \mu^{\text{def}}_{c[i]} \end{bmatrix} + L \cdot z_i\]

<p>where \(L\) is the Cholesky factor of a \(2 \times 2\) covariance matrix with an LKJ(\(\eta=2\)) prior on the correlation, and \(z_i \sim \text{Normal}(0, 1)\). Conference-level means for offense (\(\mu^{\text{off}}_c\)) and defense (\(\mu^{\text{def}}_c\)) are estimated separately. The SEC might produce strong defenses while the Big Ten generates potent offenses, or vice versa.</p>

<p>The LKJ prior lets the model learn whether offense and defense trade off or co-occur within a conference. The non-centered parameterization via \(z_i\) is the standard trick for sampling efficiency in hierarchical models.</p>

<p>The ultimate result is that we get conference-level effects: a dominant team in a weak conference gets pulled down slightly, and an underperforming team in a strong conference gets a boost.</p>

<h3 id="on-the-likelihood">On the likelihood</h3>

<p>One decision to make in this process is the likelihood for number of points scored in a game. Here we balance pragmatism and expressiveness.</p>

<p>In basketball, points are discrete units so a natural question is why not use a Poisson or negative binomial likelihood? A couple of reasons:</p>

<ul>
  <li>Basketball scores are sums of 1/2/3-point possessions, not a genuine counting process. A team’s score of 78 isn’t “78 events occurred” the way 3 goals in a hockey game is.</li>
  <li>At a mean of ~70 points, the Gaussian puts essentially zero mass below zero, and by the central limit theorem it’s a good approximation to a sum of many small discrete contributions anyway.</li>
  <li>The continuous approximation (Gaussian) is simpler and easier to fit, which was helpful given I had one night to put this together.</li>
</ul>

<p>With individual scores instead of margins, outliers become a concern. Blowout games could pull the model around. To handle this empirically, I fit both a Gaussian and a Student-t likelihood (which is more robust to outliers) and compared them via LOO cross-validation:</p>

<p><img src="/blogimages/march-madness-2026/loo_comparison.png" alt="LOO model comparison" /></p>

<p>The ELPD difference was negligible, slightly favoring the Gaussian. The Student-t model estimated \(\nu \approx 53\) degrees of freedom, which is practically indistinguishable from a Gaussian, so I kept the Gaussian for simplicity.</p>
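<p>A minimal sketch of that comparison, assuming both models were fit with <code class="language-plaintext highlighter-rouge">idata_kwargs={"log_likelihood": True}</code> and that the two inference data objects are named <code class="language-plaintext highlighter-rouge">idata_gaussian</code> and <code class="language-plaintext highlighter-rouge">idata_student_t</code> (illustrative names, not from the original code):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import arviz as az

# PSIS-LOO expected log predictive density for each candidate likelihood;
# az.compare ranks the models and reports the ELPD difference and its SE
comparison = az.compare({"gaussian": idata_gaussian, "student_t": idata_student_t})
print(comparison)
</code></pre></div></div>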

<h3 id="full-model-specification">Full Model Specification</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="n">coords</span><span class="o">=</span><span class="n">coords</span><span class="p">)</span> <span class="k">as</span> <span class="n">model</span><span class="p">:</span>
    <span class="n">mu_intercept</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"mu_intercept"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">70</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>

    <span class="n">sigma_off_conf</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">HalfNormal</span><span class="p">(</span><span class="s">"sigma_off_conf"</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
    <span class="n">sigma_def_conf</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">HalfNormal</span><span class="p">(</span><span class="s">"sigma_def_conf"</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
    <span class="n">mu_off_conf</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"mu_off_conf"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="n">sigma_off_conf</span><span class="p">,</span> <span class="n">dims</span><span class="o">=</span><span class="s">"conference"</span><span class="p">)</span>
    <span class="n">mu_def_conf</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"mu_def_conf"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="n">sigma_def_conf</span><span class="p">,</span> <span class="n">dims</span><span class="o">=</span><span class="s">"conference"</span><span class="p">)</span>

    <span class="n">sd_dist</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">HalfNormal</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">sigma</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
    <span class="n">chol</span><span class="p">,</span> <span class="n">corr</span><span class="p">,</span> <span class="n">stds</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">LKJCholeskyCov</span><span class="p">(</span><span class="s">"lkj"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">sd_dist</span><span class="o">=</span><span class="n">sd_dist</span><span class="p">,</span>
                                          <span class="n">compute_corr</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"z"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">n_teams</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>

    <span class="n">team_effects</span> <span class="o">=</span> <span class="n">pt</span><span class="p">.</span><span class="n">stack</span><span class="p">([</span><span class="n">mu_off_conf</span><span class="p">[</span><span class="n">conf_of_team</span><span class="p">],</span>
                             <span class="n">mu_def_conf</span><span class="p">[</span><span class="n">conf_of_team</span><span class="p">]],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">pt</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">chol</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
    <span class="n">off</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Deterministic</span><span class="p">(</span><span class="s">"off"</span><span class="p">,</span> <span class="n">team_effects</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">dims</span><span class="o">=</span><span class="s">"team"</span><span class="p">)</span>
    <span class="n">deff</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Deterministic</span><span class="p">(</span><span class="s">"def"</span><span class="p">,</span> <span class="n">team_effects</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">dims</span><span class="o">=</span><span class="s">"team"</span><span class="p">)</span>

    <span class="n">alpha</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"alpha"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mf">3.5</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mf">2.0</span><span class="p">)</span>
    <span class="n">sigma</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">HalfNormal</span><span class="p">(</span><span class="s">"sigma"</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>

    <span class="n">mu_score_i</span> <span class="o">=</span> <span class="n">mu_intercept</span> <span class="o">+</span> <span class="n">off</span><span class="p">[</span><span class="n">team_i</span><span class="p">]</span> <span class="o">-</span> <span class="n">deff</span><span class="p">[</span><span class="n">team_j</span><span class="p">]</span> <span class="o">+</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">home</span>
    <span class="n">mu_score_j</span> <span class="o">=</span> <span class="n">mu_intercept</span> <span class="o">+</span> <span class="n">off</span><span class="p">[</span><span class="n">team_j</span><span class="p">]</span> <span class="o">-</span> <span class="n">deff</span><span class="p">[</span><span class="n">team_i</span><span class="p">]</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">home</span>

    <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"score_i"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="n">mu_score_i</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="n">sigma</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">score_i</span><span class="p">,</span> <span class="n">dims</span><span class="o">=</span><span class="s">"game"</span><span class="p">)</span>
    <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"score_j"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="n">mu_score_j</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="n">sigma</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">score_j</span><span class="p">,</span> <span class="n">dims</span><span class="o">=</span><span class="s">"game"</span><span class="p">)</span>
</code></pre></div></div>

<p>Sampled with <a href="https://github.com/pymc-devs/nutpie">nutpie</a>: 4 chains, 2,000 draws each after 2,000 tuning steps. Zero divergences, \(\hat{R} \leq 1.01\) for all parameters.</p>
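<p>For reference, one way to invoke nutpie from PyMC (a sketch; the seed shown is illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import arviz as az

with model:
    idata = pm.sample(
        draws=2000,
        tune=2000,
        chains=4,
        nuts_sampler="nutpie",  # Rust-based NUTS implementation
        random_seed=2026,
    )

# Convergence checks: divergence count and split R-hat
print(int(idata.sample_stats["diverging"].sum()))
print(az.summary(idata, var_names=["alpha", "sigma"]))
</code></pre></div></div>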

<h2 id="data">Data</h2>

<p>The model is fit on the <strong>2025-26 regular season</strong>: all 5,647 Division I games across 365 teams and 31 conferences. Data comes from the <a href="https://www.kaggle.com/competitions/march-machine-learning-mania-2026">Kaggle March ML Mania 2026</a> competition dataset. Each game provides: which teams played, both scores, and whether it was a home game, away game, or neutral site.</p>

<h3 id="posterior-predictive-check">Posterior Predictive Check</h3>

<p><img src="/blogimages/march-madness-2026/ppc_scores.png" alt="Posterior predictive scores" /></p>

<p>The posterior predictive distribution matches the observed score distribution well.</p>
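<p>The check itself is only a couple of lines in PyMC and ArviZ (a sketch, reusing the <code class="language-plaintext highlighter-rouge">model</code> and <code class="language-plaintext highlighter-rouge">idata</code> from above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import arviz as az

with model:
    # Draw replicated scores from the posterior predictive distribution
    pm.sample_posterior_predictive(idata, extend_inferencedata=True)

# Overlay replicated score densities on the observed score distribution
az.plot_ppc(idata, var_names=["score_i", "score_j"], num_pp_samples=200)
</code></pre></div></div>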

<h2 id="results">Results</h2>

<h3 id="overall-team-strength">Overall Team Strength</h3>

<p>The overall strength of a team is the sum of its offensive and defensive contributions: \(\text{off}_i + \text{def}_i\).</p>
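<p>As a sketch, this ranking falls straight out of the posterior (note the model registers the defensive effect under the name <code class="language-plaintext highlighter-rouge">"def"</code>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Posterior mean of overall strength, off + def, per team
strength = (idata.posterior["off"] + idata.posterior["def"]).mean(dim=("chain", "draw"))
top30 = strength.to_series().sort_values(ascending=False).head(30)
print(top30)
</code></pre></div></div>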

<p><img src="/blogimages/march-madness-2026/team_strengths_top30.png" alt="Top 30 team strengths" /></p>

<p>Michigan and Duke sit at the top, followed by Arizona and Florida. These are the four number one seeds. Some may call this chalky; I call it a great sanity check.</p>

<h3 id="the-offense-defense-decomposition">The Offense-Defense Decomposition</h3>

<p><img src="/blogimages/march-madness-2026/off_def_scatter.png" alt="Offense vs defense scatter" /></p>

<p>Every dot is a team; grey dots are teams that didn’t make the tournament. The x-axis is offensive strength (higher = generates more points above average), the y-axis is defensive strength (higher = allows fewer points than average). Teams in the upper-right are good at both; teams in the lower-left are bad at both.</p>

<p>A few teams to highlight:</p>

<ul>
  <li><strong>Houston</strong> (overall 23.6): off = 7.0, def = 16.6. This is a defense-first team by a wide margin. Elite defensive rating, merely decent offense.</li>
  <li><strong>Alabama</strong> (overall 21.0): off = 20.7, def = 0.3. The highest offensive rating of any tournament team, but paired with a defense essentially at the league average.</li>
  <li><strong>Duke</strong> (overall 28.0): off = 12.0, def = 16.0. Also defense-first, but with a more balanced profile than Houston.</li>
</ul>

<p><img src="/blogimages/march-madness-2026/off_def_rankings.png" alt="Top 20 by offense and defense" /></p>

<p>The side-by-side rankings make the decomposition concrete. The lists of top-20 offenses and top-20 defenses are quite different. A team can rank in the top 5 offensively while barely cracking the top 20 defensively, or vice versa.</p>

<p>Physically, \(\text{off}_i\) is how many points above average team \(i\) produces per game, and \(\text{def}_i\) is how many points below average they hold opponents to. These are per-game quantities, not per-possession, so they bake in pace alongside efficiency.</p>

<h3 id="conference-effects-offense-vs-defense">Conference Effects: Offense vs Defense</h3>

<p><img src="/blogimages/march-madness-2026/conference_off_def.png" alt="Conference offensive vs defensive effects" /></p>

<p>Conferences generally don’t lean dramatically in one direction or the other - good conferences are good, bad conferences are bad. The correlation between conference-level offensive and defensive effects is 0.83, which is very strong.</p>

<h2 id="shooters-shoot---simpsons-paradox-in-the-wild">Shooters Shoot - Simpson’s Paradox in the Wild</h2>

<p>One thing worth noting in the plot above: the marginal correlation between offense and defense across all 365 teams is <strong>+0.38</strong>. That’s positive: teams that are good at offense tend to also be good at defense. Intuitively, this makes sense: programs with good coaching, recruiting, and resources tend to be good at everything.</p>

<p>But the LKJ model parameter, the <em>within-conference</em> correlation that the model learned, is <strong>-0.20</strong>. That’s negative. Within a conference, offense and defense <em>trade off</em>. This is <a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox">Simpson’s paradox</a>. The relationship reverses when you condition on the confounding variable (conference membership). Here’s why:</p>

<p><strong>Between conferences</strong>, the correlation is <strong>+0.83</strong>. Strong conferences (SEC, Big 12, Big Ten) produce teams that are above average on <em>both</em> offense and defense. Weak conferences produce teams below average on both. This between-group correlation is strong and positive, and when you pool all teams together ignoring conference, it dominates the marginal relationship.</p>

<p><strong>Within a conference</strong>, though, there’s also a structural component: games are zero-sum at the score level. When Team A runs up the score on Team B, that same game hurts Team B’s defensive numbers. One team’s strong offensive showing is simultaneously a poor defensive showing for the opponent, and conference opponents play each other repeatedly, effectively inducing a negative correlation. The modest negative correlation of about <strong>-0.20</strong> within conferences largely shakes out from this zero-sum structure.</p>

<p>The LKJ model parameter of <strong>-0.20</strong> captures precisely this within-conference structure. That’s what it’s designed to do: after the conference means (\(\mu^{\text{off}}_c\), \(\mu^{\text{def}}_c\)) absorb the between-conference variation, the LKJ correlation models the <em>residual</em> relationship between offense and defense at the team level.</p>
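<p>You can see the paradox directly in the posterior means without any model machinery. A sketch, assuming a dataframe <code class="language-plaintext highlighter-rouge">teams</code> with columns <code class="language-plaintext highlighter-rouge">off</code>, <code class="language-plaintext highlighter-rouge">def</code>, and <code class="language-plaintext highlighter-rouge">conference</code> (hypothetical names):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Marginal correlation: pool all 365 teams together
print(teams["off"].corr(teams["def"]))  # roughly +0.38

# Between-conference correlation: correlate the conference means
conf_means = teams.groupby("conference")[["off", "def"]].mean()
print(conf_means["off"].corr(conf_means["def"]))  # roughly +0.83

# Within-conference correlation: demean each team by its conference first
resid = teams[["off", "def"]] - teams.groupby("conference")[["off", "def"]].transform("mean")
print(resid["off"].corr(resid["def"]))  # roughly -0.20, the sign flips
</code></pre></div></div>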

<h3 id="home-court-advantage">Home Court Advantage</h3>

<p><img src="/blogimages/march-madness-2026/home_court.png" alt="Home court advantage posterior" /></p>

<p>The model estimates \(\alpha\) at <strong>1.44 ± 0.11 points</strong>. At first glance this looks low, but remember that \(\alpha\) appears in <em>both</em> score equations with opposite signs: the home team’s expected score goes up by \(\alpha\), and the away team’s goes down by \(\alpha\). The net effect on the margin is \(2\alpha \approx 2.9\) points, in line with the published literature. <a href="https://kenpom.com/">KenPom’s estimate</a> is ~3.5 raw points; ours comes out slightly lower because the model accounts for team quality differences simultaneously.</p>

<p>This parameter matters for training but zeroes out for tournament predictions, since all men’s tournament games are on neutral courts. I’ll come back to why this distinction matters for the women’s tournament later.</p>

<h2 id="simulating-the-tournament">Simulating the Tournament</h2>

<p>Win probabilities come from the Gaussian CDF applied to the strength difference. For two teams \(i\) and \(j\) on a neutral court:</p>

\[P(i \text{ beats } j) = \Phi\left(\frac{(\text{off}_i - \text{def}_j) - (\text{off}_j - \text{def}_i)}{\sigma\sqrt{2}}\right)\]

<p>The \(\sqrt{2}\) comes from the fact that we’re comparing the <em>difference</em> of two independent score random variables, each with variance \(\sigma^2\).</p>

<p>For each of 10,000 simulations, I draw a complete set of team strengths from the posterior and play through the 68-team bracket. If only I could spam all of them to my bracket pool.</p>
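<p>A sketch of the core step, propagating posterior uncertainty by evaluating the win probability separately for each posterior draw (the matchup shown is illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.stats import norm

def win_prob(off_i, def_i, off_j, def_j, sigma):
    """P(team i beats team j) on a neutral court, per posterior draw."""
    mu_margin = (off_i - def_j) - (off_j - def_i)
    return norm.cdf(mu_margin / (sigma * np.sqrt(2)))

# Flatten chains and draws into a single sample dimension
post = idata.posterior.stack(sample=("chain", "draw"))
p = win_prob(
    post["off"].sel(team="Michigan").values,
    post["def"].sel(team="Michigan").values,
    post["off"].sel(team="Duke").values,
    post["def"].sel(team="Duke").values,
    post["sigma"].values,
)
print(p.mean())  # posterior-averaged win probability
</code></pre></div></div>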

<h3 id="championship-odds">Championship Odds</h3>

<p><img src="/blogimages/march-madness-2026/championship_odds.png" alt="Championship probabilities" /></p>

<p>Again, the four number one seeds rise to the top, together claiming nearly 50% of the championship probability. That still leaves a roughly 50% chance that none of them wins the tournament.</p>

<h3 id="advancement-probabilities">Advancement Probabilities</h3>

<p><img src="/blogimages/march-madness-2026/advancement_heatmap.png" alt="Tournament advancement probabilities" /></p>

<p>The advancement heatmap tells the full story. Notice how the probability drops off dramatically at each round. Even #1 seeds only have ~50% probability of making the Final Four. As the tournament progresses, you’re playing better competition, and you have more and more hurdles to clear. If you have a 10% chance of losing in the first round and a 20% chance in the second, that compounds to a 28% chance of going out in the first two rounds, worse than one-in-four.</p>
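<p>Spelled out, survival probabilities multiply:</p>

\[1 - (1 - 0.10)(1 - 0.20) = 1 - 0.9 \times 0.8 = 0.28\]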

<p>Below is the breakdown by region:</p>

<p><img src="/blogimages/march-madness-2026/bracket_forecast.png" alt="Regional bracket forecasts" /></p>

<h3 id="upset-watch">Upset Watch</h3>

<p><img src="/blogimages/march-madness-2026/upset_probabilities.png" alt="Upset probabilities" /></p>

<p>Worth noting that the model is fairly chalky here. It will generally agree with the committee’s seeding, since it’s estimating overall team strength from the same regular season data. However, it does see Texas A&amp;M as the one 10 seed favored to beat a 7 in the first round (though at just over 50%, it’s effectively a coin flip).</p>

<h3 id="tail-outcomes">Tail Outcomes</h3>

<p>The most fun part of running 10,000 brackets: the tail outcomes. These are things that <em>could</em> happen; the model assigns them nonzero probability, even if they’re unlikely.</p>

<p><strong>Deepest Cinderella runs across 10,000 simulations:</strong></p>
<ul>
  <li><strong>Northern Iowa (12)</strong> won the championship in 2 simulations</li>
  <li><strong>NC State (11)</strong> and <strong>SMU (11)</strong> each won the championship</li>
  <li><strong>Cal Baptist (13)</strong> reached the Championship game</li>
  <li><strong>Tennessee St (15)</strong> reached the Championship game</li>
  <li><strong>Siena (16)</strong> reached the Elite Eight</li>
</ul>

<h2 id="historical-validation">Historical Validation</h2>

<p>I validated by fitting the offense-defense model on each regular season from 2015–2025 (skipping 2020’s COVID cancellation) and predicting that year’s tournament.</p>

<p><img src="/blogimages/march-madness-2026/calibration.png" alt="Calibration plot" /></p>

<p><strong>Results across 669 tournament games:</strong></p>
<ul>
  <li><strong>Brier score: 0.189</strong></li>
  <li><strong>Accuracy: 70.4%</strong></li>
  <li><strong>Log loss: 0.555</strong></li>
</ul>
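<p>For reference, all three metrics are one-liners given the predicted probabilities and outcomes (a sketch with hypothetical arrays <code class="language-plaintext highlighter-rouge">p</code> and <code class="language-plaintext highlighter-rouge">y</code>, where <code class="language-plaintext highlighter-rouge">y</code> is 1 when the first-listed team won):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# p: predicted P(first team wins) per game; y: actual outcome (1 or 0).
# Both hypothetical arrays of length 669.
brier = np.mean((p - y) ** 2)
accuracy = np.mean((p &gt; 0.5) == y)
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
</code></pre></div></div>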

<p>It is worth calling out that this Brier score is essentially what a single-strength-parameter model achieves. There’s a no-free-lunch dynamic here: for picking winners, the decomposition collapses back to the expected margin, so predictive accuracy is constrained by the same information. That said, the decomposition paints a much clearer picture of team structure, making the model interpretable and useful in a way a single-parameter model can’t match.</p>

<p>The calibration plot shows the model is <em>reasonably</em> well-calibrated. There is a slight hint of the characteristic “s-curve” shape that comes from overfitting, which could be improved upon in the future.</p>

<h2 id="the-womens-tournament">The Women’s Tournament</h2>

<p>I fit an independent model on the women’s regular season using the identical offense-defense specification. The results tell a very different story from the men’s tournament.</p>

<h3 id="uconn-and-the-chalkiness-gap">UConn and the Chalkiness Gap</h3>

<p>The men’s tournament has two co-favorites separated by 1.3 percentage points. The women’s tournament has <em>UConn</em> and everyone else:</p>

<p><img src="/blogimages/march-madness-2026/championship_comparison.png" alt="Championship odds comparison" /></p>

<table>
  <thead>
    <tr>
      <th>Team</th>
      <th>Seed</th>
      <th>P(Champion)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Connecticut</td>
      <td>1</td>
      <td>33.8%</td>
    </tr>
    <tr>
      <td>South Carolina</td>
      <td>1</td>
      <td>18.1%</td>
    </tr>
    <tr>
      <td>UCLA</td>
      <td>1</td>
      <td>15.6%</td>
    </tr>
    <tr>
      <td>Texas</td>
      <td>1</td>
      <td>13.8%</td>
    </tr>
    <tr>
      <td>LSU</td>
      <td>2</td>
      <td>13.1%</td>
    </tr>
  </tbody>
</table>

<p>UConn’s 34-0 record translates to a posterior strength distribution that clearly leads the pack; they are elite on both offense and defense.</p>

<p><img src="/blogimages/march-madness-2026/team_strengths_top30_womens.png" alt="Women's team strengths" /></p>

<p>The structural difference between the men’s and women’s fields is stark. The men’s top 30 is a smooth gradient where each team’s HDI overlaps with the teams around it. The women’s field has a clear break: five teams in one tier, Texas somewhere in between, and then the rest of the tournament field packed into a narrow band below.</p>

<p>Talent in women’s college basketball is less equitably distributed. The top programs separate themselves from the field by a wider margin than their men’s counterparts, and the model sees it directly in the regular season results. One symptom: the correlation between offense and defense is much stronger in the women’s field.</p>

<p><img src="/blogimages/march-madness-2026/off_def_scatter_womens.png" alt="Women's offense vs defense" /></p>

<p>We can quantify the concentration with a <a href="https://en.wikipedia.org/wiki/Gini_coefficient">Gini coefficient</a> over championship probabilities (0 means every team has equal odds, 1 means a single team wins every simulation). The women’s field is substantially more top-heavy: a Gini of <strong>0.93</strong> compared to <strong>0.79</strong> for the men’s. Put differently, only 29 women’s teams won the title in any of 10,000 simulations, compared to 46 on the men’s side.</p>
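<p>The Gini computation itself is a few lines over each field’s vector of championship probabilities (a sketch; <code class="language-plaintext highlighter-rouge">champ_probs</code> is a hypothetical array summing to 1):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def gini(probs):
    """Gini coefficient of a probability vector (0 = uniform, 1 = winner-take-all)."""
    x = np.sort(np.asarray(probs))  # ascending order
    n = len(x)
    # Standard formulation in terms of the ordered cumulative shares
    return (2 * np.arange(1, n + 1) - n - 1).dot(x) / (n * x.sum())

print(gini(champ_probs))
</code></pre></div></div>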

<h3 id="womens-advancement-probabilities">Women’s Advancement Probabilities</h3>

<p><img src="/blogimages/march-madness-2026/advancement_heatmap_womens.png" alt="Women's advancement probabilities" /></p>

<p>The advancement heatmap makes the concentration visible. The women’s top seeds are a sea of deep red through the Sweet 16 and beyond, while probability drops off far more sharply for everyone else.</p>

<h3 id="home-court-advantage-this-time-it-actually-matters">Home Court Advantage: This Time it Actually Matters</h3>

<p>The women’s model estimates \(\alpha\) at <strong>1.37 ± 0.10</strong>, almost identical to the men’s 1.44. The posteriors overlap almost entirely, suggesting that home court is roughly the same phenomenon across genders (a margin effect of about 2.7-2.9 points).</p>

<p><img src="/blogimages/march-madness-2026/home_court.png" alt="Home court advantage comparison" /></p>

<p>It is worth noting that this affects predictions differently than in the men’s tournament. In the men’s tournament, every game is neutral-site, so home court zeros out entirely. In the women’s tournament, <strong>the top 16 seeds host rounds 1 and 2 on their home courts</strong>. That ~2.8-point margin advantage compounds on top of the strength differential that already favors higher seeds, making the early rounds even more lopsided.</p>

<p><img src="/blogimages/march-madness-2026/bracket_forecast_womens.png" alt="Women's bracket forecast" /></p>

<h2 id="the-brackets">The Brackets</h2>

<p>Based on the most likely outcome of each matchup:</p>

<h3 id="mens">Men’s</h3>

<p><strong>Final Four:</strong> Duke, Florida, Michigan, Arizona</p>

<p><strong>Championship Game:</strong> Michigan vs. Duke</p>

<p><strong>Champion:</strong> Michigan (16.2%, meaning we’re 83.8% sure this is wrong)</p>

<h3 id="womens">Women’s</h3>

<p><strong>Final Four:</strong> UConn, South Carolina, UCLA, Texas</p>

<p><strong>Championship Game:</strong> UConn vs. South Carolina</p>

<p><strong>Champion:</strong> UConn (33.8%, meaning we’re 66.2% sure this is wrong)</p>

<p>The contrast captures the structural difference between the two fields. The men’s champion is nearly a coin-flip among three or four teams. The women’s champion is the clearest favorite the model produces, but 33.8% still means there’s a 66.2% chance that UConn loses. Single-elimination tournaments are designed to produce uncertainty, and even a dominant team can only be so dominant across six consecutive games.</p>

<h2 id="following-along">Following Along</h2>

<p>I’ve put together an <a href="/march-madness-2026/">interactive dashboard</a> that will update daily as the tournament progresses, so you can watch the model’s predictions shift as games are played and teams are eliminated. Note that the dashboard re-runs the statistical model every morning, so its predictions may differ slightly from those here, since I do not fix the random seed.</p>

<p>I also submitted these predictions to the <a href="https://www.kaggle.com/competitions/march-machine-learning-mania-2026">Kaggle March ML Mania 2026</a> competition to see how they stack up against other approaches. By no means do I expect this to win; it’s a pedagogy-first exercise, and more feature-dense approaches will likely perform better. That said, it will be a fun performance comparison.</p>

<h2 id="code">Code</h2>

<p>All code for this analysis is available on <a href="https://github.com/tjburch/march-madness">GitHub</a>.</p>]]></content><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><category term="Statistics" /><category term="bayesian" /><category term="pymc" /><category term="sports" /><category term="march-madness" /><summary type="html"><![CDATA[Building a Bayesian offense-defense model for the 2026 NCAA Tournament, finding a Simpson's paradox hiding in the correlation, and what running the same model on both tournaments reveals about the structure of the game]]></summary></entry><entry><title type="html">March Madness 2026 — Interactive Forecast Dashboard</title><link href="https://tylerjamesburch.com/march-madness-2026/" rel="alternate" type="text/html" title="March Madness 2026 — Interactive Forecast Dashboard" /><published>2026-03-16T00:00:00+00:00</published><updated>2026-03-16T00:00:00+00:00</updated><id>https://tylerjamesburch.com/march-madness-dashboard</id><content type="html" xml:base="https://tylerjamesburch.com/march-madness-2026/"><![CDATA[<div id="mm-dashboard">

  <div class="mm-header">
    <p class="mm-subtitle">
      Bracket forecasts from a latent offense+defense skill hierarchical model fit with PyMC.
      <br />
      <a href="/blog/statistics/march-madness-2026">Read the methodology</a>
    </p>
    <p class="mm-kaggle-line" id="kaggle-card" style="display:none;">
      <a href="https://www.kaggle.com/competitions/march-machine-learning-mania-2026/leaderboard" target="_blank" rel="noopener">Kaggle Leaderboard</a>
      <span class="mm-kaggle-sep">—</span>
      <span class="mm-kaggle-stats" id="kaggle-stats"></span>
      <br />
      <span class="mm-kaggle-note">Public leaderboard; will differ from per-game Brier</span>
    </p>
  </div>

  <div class="mm-controls">
    <div class="mm-gender-toggle" id="gender-toggle">
      <button class="mm-toggle-btn active" data-gender="M">Men's</button>
      <button class="mm-toggle-btn" data-gender="W">Women's</button>
    </div>
    <div class="mm-date-selector" id="date-selector">
      <label for="snapshot-date">Snapshot:</label>
      <select id="snapshot-date"></select>
    </div>
    <span class="mm-last-updated" id="last-updated"></span>
  </div>

  <div class="mm-loading" id="loading-indicator">
    <div class="mm-spinner"></div>
    Loading forecast data...
  </div>

  <div class="mm-content" id="dashboard-content" style="display:none;">

    <section class="mm-section" id="section-championship">
      <h2>Championship Odds</h2>
      <div id="championship-chart"></div>
    </section>

    <section class="mm-section" id="section-heatmap">
      <h2>Advancement Probabilities</h2>
      <div id="advancement-heatmap"></div>
    </section>

    <section class="mm-section" id="section-brackets">
      <h2>Region Brackets</h2>
      <div class="mm-region-tabs" id="region-tabs">
        <button class="mm-region-tab active" data-region="W">West</button>
        <button class="mm-region-tab" data-region="X">East</button>
        <button class="mm-region-tab" data-region="Y">South</button>
        <button class="mm-region-tab" data-region="Z">Midwest</button>
      </div>
      <div id="region-bracket-content"></div>
    </section>

    <section class="mm-section" id="section-team-detail" style="display:none;">
      <h2 id="team-detail-title">Team Deep Dive</h2>
      <div class="mm-team-header" id="team-detail-header"></div>
      <div class="mm-team-charts">
        <div id="team-advancement-chart"></div>
        <div id="team-timeline-chart"></div>
      </div>
    </section>

    <section class="mm-section" id="section-predictions">
      <h2>Predictions vs. Reality</h2>
      <div id="predictions-content"></div>
    </section>

  </div>
</div>

<script src="https://cdn.plot.ly/plotly-2.35.0.min.js"></script>

<script src="/assets/js/march-madness-dashboard.js"></script>

<div style="text-align: center; margin-top: 2em; padding-top: 1em; border-top: 1px solid #e0e0e0;">
<script type="text/javascript" src="https://cdnjs.buymeacoffee.com/1.0.0/button.prod.min.js" data-name="bmc-button" data-slug="tylerjamesburch" data-color="#f0a400" data-emoji="" data-font="Comic" data-text="Buy me a coffee" data-outline-color="#000000" data-font-color="#000000" data-coffee-color="#FFDD00"></script>
</div>]]></content><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><category term="Statistics" /><category term="bayesian" /><category term="interactive" /><category term="march-madness" /><summary type="html"><![CDATA[Live Bayesian bracket predictions for the 2026 NCAA Tournament, updated daily.]]></summary></entry><entry><title type="html">2024 Rewind: Orthogonal Polynomial Regression in Bambi</title><link href="https://tylerjamesburch.com/blog/statistics/orthogonal-polynomial-regression-bambi" rel="alternate" type="text/html" title="2024 Rewind: Orthogonal Polynomial Regression in Bambi" /><published>2026-02-23T00:00:00+00:00</published><updated>2026-02-23T00:00:00+00:00</updated><id>https://tylerjamesburch.com/blog/statistics/orthogonal-polynomial-regression-bambi</id><content type="html" xml:base="https://tylerjamesburch.com/blog/statistics/orthogonal-polynomial-regression-bambi"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>This is the second of two notebooks I wrote and contributed to <a href="https://bambinos.github.io/bambi/">Bambi’s</a> example documentation back in 2024. The first post, covering polynomial regression basics, is <a href="/blog/statistics/polynomial-regression-bambi">here</a>. This one goes deeper into what happens when you use the <code class="language-plaintext highlighter-rouge">poly</code> keyword in a Bambi formula: specifically, the orthogonalization that happens under the hood.</p>

<p>The original notebook lives in the <a href="https://bambinos.github.io/bambi/notebooks/orthogonal_polynomial_reg.html">Bambi docs</a>. What follows is the content, lightly adapted for this blog.</p>

<hr />

<h1 id="orthogonal-polynomial-regression">Orthogonal Polynomial Regression</h1>

<p>While the content here can stand alone, it is a companion to the <a href="/blog/statistics/polynomial-regression-bambi">polynomial regression post</a>, which contains additional useful examples.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">warnings</span>

<span class="kn">import</span> <span class="nn">arviz</span> <span class="k">as</span> <span class="n">az</span>
<span class="kn">import</span> <span class="nn">bambi</span> <span class="k">as</span> <span class="n">bmb</span>
<span class="kn">import</span> <span class="nn">formulae</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">scipy</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>

<span class="n">SEED</span> <span class="o">=</span> <span class="mi">1234</span>
<span class="n">az</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span><span class="s">"arviz-darkgrid"</span><span class="p">)</span>
<span class="n">warnings</span><span class="p">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s">"ignore"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="revisiting-polynomial-regression">Revisiting Polynomial Regression</h2>

<p>To start, we’ll recreate the projectile motion data defined in the <a href="/blog/statistics/polynomial-regression-bambi">polynomial regression post</a> with \(x_0 = 1.5\) \(m\) and \(v_0 = 7\) \(m\)/\(s\). This will follow:</p>

\[x_f = \frac{1}{2} g t^2 + v_0 t + x_0\]

<p>Where \(g\) will be the acceleration of gravity on Earth, \(-9.81\) \(m\)/\(s^2\). First we’ll generate the data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span> <span class="o">=</span> <span class="o">-</span><span class="mf">9.81</span>
<span class="n">v0</span> <span class="o">=</span> <span class="mi">7</span>
<span class="n">x0</span> <span class="o">=</span> <span class="mf">1.5</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">x_projectile</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">g</span> <span class="o">*</span> <span class="n">t</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">v0</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">x0</span>
<span class="n">noise</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="n">x_projectile</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">x_obs_projectile</span> <span class="o">=</span> <span class="n">x_projectile</span> <span class="o">+</span> <span class="n">noise</span>
<span class="n">df_projectile</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"t"</span><span class="p">:</span> <span class="n">t</span><span class="p">,</span> <span class="s">"x"</span><span class="p">:</span> <span class="n">x_obs_projectile</span><span class="p">,</span> <span class="s">"x_true"</span><span class="p">:</span> <span class="n">x_projectile</span><span class="p">})</span>
<span class="n">df_projectile</span> <span class="o">=</span> <span class="n">df_projectile</span><span class="p">[</span><span class="n">df_projectile</span><span class="p">[</span><span class="s">"x"</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">]</span>

<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df_projectile</span><span class="p">.</span><span class="n">t</span><span class="p">,</span> <span class="n">df_projectile</span><span class="p">.</span><span class="n">x</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Observed Displacement'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"C0"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_projectile</span><span class="p">.</span><span class="n">t</span><span class="p">,</span> <span class="n">df_projectile</span><span class="p">.</span><span class="n">x_true</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'True Function'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"C1"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Time (s)'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Displacement (m)'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylim</span><span class="p">(</span><span class="n">bottom</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/projectile-motion-data.png" alt="Projectile motion data" /></p>

<p>Putting this into Bambi, we set \(\beta_2 = \frac{g}{2}\), \(\beta_1 = v_0\), and \(\beta_0 = x_0\), then perform the following regression:</p>

\[x_f = \beta_2 t^2 + \beta_1 t + \beta_0\]

<p>We expect to recover \(\beta_2 = -4.905\), \(\beta_1 = 7\), \(\beta_0 = 1.5\) from our fit. We start with the approach from the other notebook where we explicitly tell formulae to calculate coefficients on \(t^2\) and \(t\).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_projectile_all_terms</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="s">"x ~ I(t**2) + t + 1"</span><span class="p">,</span> <span class="n">df_projectile</span><span class="p">)</span>
<span class="n">fit_projectile_all_terms</span> <span class="o">=</span> <span class="n">model_projectile_all_terms</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span>
    <span class="n">idata_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"log_likelihood"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span>
<span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">fit_projectile_all_terms</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
sigma      0.173  0.014   0.148    0.199      0.000    0.000    2978.0    2514.0    1.0
Intercept  1.455  0.057   1.346    1.558      0.001    0.001    2553.0    2663.0    1.0
I(t ** 2) -4.999  0.097  -5.181   -4.817      0.002    0.002    2067.0    1996.0    1.0
t          7.186  0.161   6.872    7.483      0.003    0.003    2120.0    1967.0    1.0
</code></pre></div></div>

<p>The parameters are recovered as anticipated.</p>

<p>If you want to include <em>all</em> terms of a variable up to a given degree, you can also use the keyword <code class="language-plaintext highlighter-rouge">poly</code>. So if we want the linear and quadratic effects, as in this case, we would designate <code class="language-plaintext highlighter-rouge">poly(t, 2)</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_projectile_poly</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="s">"x ~ poly(t, 2) + 1"</span><span class="p">,</span> <span class="n">df_projectile</span><span class="p">)</span>
<span class="n">fit_projectile_poly</span> <span class="o">=</span> <span class="n">model_projectile_poly</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span>
    <span class="n">idata_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"log_likelihood"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span>
<span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">fit_projectile_poly</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
sigma          0.173  0.014   0.147    0.201      0.000    0.000    4781.0    3156.0    1.0
Intercept      2.883  0.019   2.849    2.919      0.000    0.000    6952.0    3211.0    1.0
poly(t, 2)[0] -3.792  0.173  -4.109   -3.458      0.002    0.003    6298.0    2977.0    1.0
poly(t, 2)[1] -8.987  0.171  -9.307   -8.669      0.002    0.003    6499.0    3087.0    1.0
</code></pre></div></div>

<p>Now there are fitted coefficients for \(t\) and \(t^2\), but wait, those aren’t the parameters we used! What’s going on here?</p>

<h2 id="the-poly-keyword">The <code class="language-plaintext highlighter-rouge">poly</code> Keyword</h2>

<p>To fully understand what’s going on under the hood, we must wade into some linear algebra. When the <code class="language-plaintext highlighter-rouge">poly</code> keyword is used, instead of directly using the values of \(x, x^2, x^3, \dots, x^n\), it converts them into <em>orthogonal polynomials</em>. When including effects from multiple polynomial terms, there will generally be correlation between them. Including all of them in a model can be a problem from a fitting perspective due to multicollinearity. By orthogonalizing, the correlation is removed by design.</p>
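<p>A quick illustration of the multicollinearity, using the time variable from the projectile data above (the raw powers are strongly correlated, and driving this correlation to zero is precisely what <code class="language-plaintext highlighter-rouge">poly</code> is for):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>t_raw = df_projectile["t"].to_numpy()
# Correlation between t and t^2 on [0, 2] is close to 1
print(np.corrcoef(t_raw, t_raw**2)[0, 1])
</code></pre></div></div>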

<p>As it turns out, it’s difficult to find any information on <em>how</em> the orthogonalization is performed. <a href="https://github.com/bambinos/formulae/blob/b00f53da4b092ea13eeeabe92866736e97d56db0/formulae/transforms.py#L400-L426">Here is the implementation for <code class="language-plaintext highlighter-rouge">poly</code> in formulae</a>, but to fully understand it, I went into the <a href="https://svn.r-project.org/R/trunk/src/library/stats/R/contr.poly.R">source code for the R Stats library</a>, where <code class="language-plaintext highlighter-rouge">poly</code> is defined as a function for use on any vector.</p>

<p>Here’s a step-by-step summary, along with a toy example for \(x^4\).</p>

<ul>
  <li>The data is first centered around the mean for stability</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span>

<span class="n">mean</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">X_centered</span> <span class="o">=</span> <span class="n">X</span> <span class="o">-</span> <span class="n">mean</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Array: </span><span class="si">{</span><span class="n">X</span><span class="si">}</span><span class="s">, mean: </span><span class="si">{</span><span class="n">mean</span><span class="si">}</span><span class="s">.</span><span class="se">\n</span><span class="s">Centered: </span><span class="si">{</span><span class="n">X_centered</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Array: [1 2 3 4 5], mean: 3.0.
Centered: [-2. -1.  0.  1.  2.]
</code></pre></div></div>

<ul>
  <li>A <em>Vandermonde matrix</em> is created. This just takes the input data and generates a matrix where columns represent increasing polynomial degrees. In this example, the first column is \(x^0\), a constant term. The second is \(x^1\), or the centered data. The third column is \(x^2\), the fourth is \(x^3\), the last is \(x^4\).</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">degree</span> <span class="o">=</span> <span class="mi">4</span>
<span class="n">simple_vander</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vander</span><span class="p">(</span><span class="n">X_centered</span><span class="p">,</span> <span class="n">N</span><span class="o">=</span><span class="n">degree</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">increasing</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">simple_vander</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1., -2.,  4., -8., 16.],
       [ 1., -1.,  1., -1.,  1.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  2.,  4.,  8., 16.]])
</code></pre></div></div>

<ul>
  <li>QR decomposition is performed. There are <a href="https://en.wikipedia.org/wiki/QR_decomposition">several methods for doing this in practice</a>, the most common being the <a href="https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process">Gram-Schmidt process</a>. Here I just take advantage of the <a href="https://numpy.org/doc/stable/reference/generated/numpy.linalg.qr.html">Numpy implementation</a>. We decompose the matrix above into two components: an orthogonal matrix \(Q\) and an upper triangular matrix \(R\).</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">q</span><span class="p">,</span> <span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">qr</span><span class="p">(</span><span class="n">simple_vander</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Orthogonal matrix Q:</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">q</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="mi">4</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Upper triangular matrix R:</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">r</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="mi">4</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Orthogonal matrix Q:
 [[-0.4472 -0.6325  0.5345 -0.3162 -0.1195]
 [-0.4472 -0.3162 -0.2673  0.6325  0.4781]
 [-0.4472 -0.     -0.5345  0.     -0.7171]
 [-0.4472  0.3162 -0.2673 -0.6325  0.4781]
 [-0.4472  0.6325  0.5345  0.3162 -0.1195]]

Upper triangular matrix R:
 [[ -2.2361  -0.      -4.4721  -0.     -15.2053]
 [  0.       3.1623   0.      10.7517   0.    ]
 [  0.       0.       3.7417   0.      16.5702]
 [  0.       0.       0.       3.7947   0.    ]
 [  0.       0.       0.       0.      -2.8685]]
</code></pre></div></div>

<ul>
  <li>Last, take the dot product of \(Q\) with the diagonal elements of \(R\); this scales \(Q\) to the magnitude of the polynomial degrees in \(R\). The result serves as our transformation matrix, which maps input data into the space defined by the orthogonal polynomials.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">diagonal</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">r</span><span class="p">))</span>  <span class="c1"># First call gets elements, second creates diag matrix
</span><span class="n">transformation_matrix</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">diagonal</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">transformation_matrix</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="mi">4</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[[ 1.     -2.      2.     -1.2     0.3429]
 [ 1.     -1.     -1.      2.4    -1.3714]
 [ 1.     -0.     -2.      0.      2.0571]
 [ 1.      1.     -1.     -2.4    -1.3714]
 [ 1.      2.      2.      1.2     0.3429]]
</code></pre></div></div>

<ul>
  <li>From the transformation matrix, we get squared norms (<code class="language-plaintext highlighter-rouge">norm2</code>), which give us the scale of each polynomial. We also get the value by which we need to shift each polynomial to match the centered data (<code class="language-plaintext highlighter-rouge">alpha</code>).</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">norm2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">transformation_matrix</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="n">weighted_sums</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span>
    <span class="p">(</span><span class="n">transformation_matrix</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">X_centered</span><span class="p">,</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)),</span>
    <span class="n">axis</span><span class="o">=</span><span class="mi">0</span>
<span class="p">)</span>
<span class="n">normalized_sums</span> <span class="o">=</span> <span class="n">weighted_sums</span> <span class="o">/</span> <span class="n">norm2</span>
<span class="n">adjusted_sums</span> <span class="o">=</span> <span class="n">normalized_sums</span> <span class="o">+</span> <span class="n">mean</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="n">adjusted_sums</span><span class="p">[:</span><span class="n">degree</span><span class="p">]</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Norm2: </span><span class="si">{</span><span class="n">norm2</span><span class="si">}</span><span class="se">\n</span><span class="s">alpha: </span><span class="si">{</span><span class="n">alpha</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Norm2: [ 5.         10.         14.         14.4         8.22857143]
alpha: [3. 3. 3. 3.]
</code></pre></div></div>

<ul>
  <li>Finally, we iteratively apply this to all desired polynomial degrees, shifting the data and scaling by the squared norms appropriately to maintain orthogonality with the prior term.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">transformed_X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">full</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="n">degree</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">nan</span><span class="p">)</span>
<span class="n">transformed_X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">transformed_X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">X</span> <span class="o">-</span> <span class="n">alpha</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">degree</span><span class="p">):</span>
    <span class="n">transformed_X</span><span class="p">[:,</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
        <span class="p">(</span><span class="n">X</span> <span class="o">-</span> <span class="n">alpha</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">*</span> <span class="n">transformed_X</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span> <span class="o">-</span>
        <span class="p">(</span><span class="n">norm2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">/</span> <span class="n">norm2</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="o">*</span> <span class="n">transformed_X</span><span class="p">[:,</span> <span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>
    <span class="p">)</span>

<span class="n">transformed_X</span> <span class="o">/=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">norm2</span><span class="p">)</span>
<span class="n">transformed_X</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 4.47213595e-01, -6.32455532e-01,  5.34522484e-01,
        -3.16227766e-01,  1.19522861e-01],
       [ 4.47213595e-01, -3.16227766e-01, -2.67261242e-01,
         6.32455532e-01, -4.78091444e-01],
       [ 4.47213595e-01,  0.00000000e+00, -5.34522484e-01,
         2.34055565e-16,  7.17137166e-01],
       [ 4.47213595e-01,  3.16227766e-01, -2.67261242e-01,
        -6.32455532e-01, -4.78091444e-01],
       [ 4.47213595e-01,  6.32455532e-01,  5.34522484e-01,
         3.16227766e-01,  1.19522861e-01]])
</code></pre></div></div>

<p>This is now a matrix of orthogonalized polynomials of X. The first column is just a constant, the second corresponds to the input \(x\), the next to \(x^2\), and so on. In most implementations the constant term is dropped, giving us the following final matrix.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">transformed_X</span><span class="p">[:,</span><span class="mi">1</span><span class="p">:]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[-6.32455532e-01,  5.34522484e-01, -3.16227766e-01,
         1.19522861e-01],
       [-3.16227766e-01, -2.67261242e-01,  6.32455532e-01,
        -4.78091444e-01],
       [ 0.00000000e+00, -5.34522484e-01,  2.34055565e-16,
         7.17137166e-01],
       [ 3.16227766e-01, -2.67261242e-01, -6.32455532e-01,
        -4.78091444e-01],
       [ 6.32455532e-01,  5.34522484e-01,  3.16227766e-01,
         1.19522861e-01]])
</code></pre></div></div>

<p>The approach shown in this derivation has been reproduced below as a Scikit-Learn-style class, where the <code class="language-plaintext highlighter-rouge">fit</code> method calculates the coefficients and the <code class="language-plaintext highlighter-rouge">transform</code> method returns the orthogonalized data. It is also available <a href="https://gist.github.com/tjburch/062547b3600f81db73b40feb044bab2a#file-orthogonalpolynomialtransformer-py">at this gist</a>, including the typical <code class="language-plaintext highlighter-rouge">BaseEstimator</code>, <code class="language-plaintext highlighter-rouge">TransformerMixin</code> inheritances.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">OrthogonalPolynomialTransformer</span><span class="p">:</span>
    <span class="s">"""Transforms input data using orthogonal polynomials."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">degree</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">degree</span> <span class="o">=</span> <span class="n">degree</span> <span class="o">+</span> <span class="mi">1</span>  <span class="c1"># Account for constant term
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">alpha</span> <span class="o">=</span> <span class="bp">None</span>

    <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="s">"""Calculate transformation matrix, extract norm2 and alpha."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">alpha</span> <span class="o">=</span> <span class="bp">None</span>

        <span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">X</span><span class="p">).</span><span class="n">flatten</span><span class="p">()</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">degree</span> <span class="o">&gt;=</span> <span class="nb">len</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">X</span><span class="p">)):</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="s">"'degree' must be less than the number of unique data points."</span>
            <span class="p">)</span>

        <span class="n">mean</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
        <span class="n">X_centered</span> <span class="o">=</span> <span class="n">X</span> <span class="o">-</span> <span class="n">mean</span>

        <span class="n">vandermonde</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vander</span><span class="p">(</span><span class="n">X_centered</span><span class="p">,</span> <span class="n">N</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">degree</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">increasing</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="n">Q</span><span class="p">,</span> <span class="n">R</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">qr</span><span class="p">(</span><span class="n">vandermonde</span><span class="p">)</span>

        <span class="n">diagonal</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">R</span><span class="p">))</span>
        <span class="n">transformation_matrix</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Q</span><span class="p">,</span> <span class="n">diagonal</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">transformation_matrix</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

        <span class="n">weighted_sums</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span>
            <span class="p">(</span><span class="n">transformation_matrix</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">X_centered</span><span class="p">,</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)),</span>
            <span class="n">axis</span><span class="o">=</span><span class="mi">0</span>
        <span class="p">)</span>
        <span class="n">normalized_sums</span> <span class="o">=</span> <span class="n">weighted_sums</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span>
        <span class="n">adjusted_sums</span> <span class="o">=</span> <span class="n">normalized_sums</span> <span class="o">+</span> <span class="n">mean</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">alpha</span> <span class="o">=</span> <span class="n">adjusted_sums</span><span class="p">[:</span><span class="bp">self</span><span class="p">.</span><span class="n">degree</span><span class="p">]</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
        <span class="s">"""Iteratively apply up to 'degree'."""</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">X</span><span class="p">).</span><span class="n">flatten</span><span class="p">()</span>
        <span class="n">transformed_X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">empty</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="bp">self</span><span class="p">.</span><span class="n">degree</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>

        <span class="n">transformed_X</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">degree</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
            <span class="n">transformed_X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">X</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">alpha</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">degree</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">degree</span><span class="p">):</span>
                <span class="n">transformed_X</span><span class="p">[:,</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
                    <span class="p">(</span><span class="n">X</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">alpha</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">*</span> <span class="n">transformed_X</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span> <span class="o">-</span>
                    <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">norm2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="o">*</span> <span class="n">transformed_X</span><span class="p">[:,</span> <span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>
                <span class="p">)</span>

        <span class="n">transformed_X</span> <span class="o">/=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">norm2</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">transformed_X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:</span><span class="bp">self</span><span class="p">.</span><span class="n">degree</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">fit_transform</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
</code></pre></div></div>

<p>An example call is shown below. It’s worth noting that in this implementation the constant term is not returned: the first column corresponds to \(x\), the second to \(x^2\), and the third to \(x^3\).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span>
<span class="n">poly3</span> <span class="o">=</span> <span class="n">OrthogonalPolynomialTransformer</span><span class="p">(</span><span class="n">degree</span><span class="o">=</span><span class="mi">3</span><span class="p">).</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">poly3</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[-6.32455532e-01,  5.34522484e-01, -3.16227766e-01],
       [-3.16227766e-01, -2.67261242e-01,  6.32455532e-01],
       [ 0.00000000e+00, -5.34522484e-01,  2.34055565e-16],
       [ 3.16227766e-01, -2.67261242e-01, -6.32455532e-01],
       [ 6.32455532e-01,  5.34522484e-01,  3.16227766e-01]])
</code></pre></div></div>

<p>This matches what you get when calling the equivalent <code class="language-plaintext highlighter-rouge">poly</code> function in R (called below with degree 4, so R returns one additional column):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w">
                 </span><span class="m">1</span><span class="w">          </span><span class="m">2</span><span class="w">             </span><span class="m">3</span><span class="w">          </span><span class="m">4</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">,]</span><span class="w"> </span><span class="m">-6.324555e-01</span><span class="w">  </span><span class="m">0.5345225</span><span class="w"> </span><span class="m">-3.162278e-01</span><span class="w">  </span><span class="m">0.1195229</span><span class="w">
</span><span class="p">[</span><span class="m">2</span><span class="p">,]</span><span class="w"> </span><span class="m">-3.162278e-01</span><span class="w"> </span><span class="m">-0.2672612</span><span class="w">  </span><span class="m">6.324555e-01</span><span class="w"> </span><span class="m">-0.4780914</span><span class="w">
</span><span class="p">[</span><span class="m">3</span><span class="p">,]</span><span class="w"> </span><span class="m">-3.288380e-17</span><span class="w"> </span><span class="m">-0.5345225</span><span class="w">  </span><span class="m">9.637305e-17</span><span class="w">  </span><span class="m">0.7171372</span><span class="w">
</span><span class="p">[</span><span class="m">4</span><span class="p">,]</span><span class="w">  </span><span class="m">3.162278e-01</span><span class="w"> </span><span class="m">-0.2672612</span><span class="w"> </span><span class="m">-6.324555e-01</span><span class="w"> </span><span class="m">-0.4780914</span><span class="w">
</span><span class="p">[</span><span class="m">5</span><span class="p">,]</span><span class="w">  </span><span class="m">6.324555e-01</span><span class="w">  </span><span class="m">0.5345225</span><span class="w">  </span><span class="m">3.162278e-01</span><span class="w">  </span><span class="m">0.1195229</span><span class="w">
</span></code></pre></div></div>

<p>or, most relevantly, from formulae:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">formulae_poly</span> <span class="o">=</span> <span class="n">formulae</span><span class="p">.</span><span class="n">transforms</span><span class="p">.</span><span class="n">Polynomial</span><span class="p">()</span>
<span class="n">formulae_poly</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[-0.63245553,  0.53452248, -0.31622777,  0.11952286],
       [-0.31622777, -0.26726124,  0.63245553, -0.47809144],
       [ 0.        , -0.53452248, -0.        ,  0.71713717],
       [ 0.31622777, -0.26726124, -0.63245553, -0.47809144],
       [ 0.63245553,  0.53452248,  0.31622777,  0.11952286]])
</code></pre></div></div>

<p>As an example, we apply the transformer to \(x\) over the domain 0 to 10:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">x2</span> <span class="o">=</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span>

<span class="n">transformer</span> <span class="o">=</span> <span class="n">OrthogonalPolynomialTransformer</span><span class="p">(</span><span class="n">degree</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">x_orthogonalized</span> <span class="o">=</span> <span class="n">transformer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x_orth</span> <span class="o">=</span> <span class="n">x_orthogonalized</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">x2_orth</span> <span class="o">=</span> <span class="n">x_orthogonalized</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">x</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="n">x_orth</span><span class="p">,</span> <span class="n">x2_orth</span><span class="p">]).</span><span class="n">T</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'x'</span><span class="p">,</span> <span class="s">'$x^2$'</span><span class="p">,</span> <span class="s">'$x$ Orth'</span><span class="p">,</span> <span class="s">'$x^2$ Orth'</span><span class="p">])</span>
<span class="n">correlation_matrix</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">corr</span><span class="p">()</span>
<span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">correlation_matrix</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'Reds'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/correlation-heatmap.png" alt="Correlation heatmap" /></p>

<p>We now see that the orthogonalized versions of \(x\) and \(x^2\) are no longer correlated with each other. We can also verify the orthogonality directly, with the quick check below.</p>
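<p>A minimal sanity check, reusing the <code class="language-plaintext highlighter-rouge">x_orth</code> and <code class="language-plaintext highlighter-rouge">x2_orth</code> columns from above: orthogonal, zero-mean columns should have both a dot product and a correlation of essentially zero.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Dot product of the orthogonalized columns is zero up to floating-point error
print(np.dot(x_orth, x2_orth))

# Raw x and x^2 are strongly correlated; the orthogonalized versions are not
print(np.corrcoef(x, x2)[0, 1])            # close to 1
print(np.corrcoef(x_orth, x2_orth)[0, 1])  # ~0
</code></pre></div></div>

<p>Next, we construct a response variable and plot against it.</p>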

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y</span> <span class="o">=</span> <span class="mi">3</span> <span class="o">*</span> <span class="n">x2</span> <span class="o">+</span> <span class="n">x</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">axs</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span> <span class="n">sharey</span><span class="o">=</span><span class="s">'row'</span><span class="p">)</span>

<span class="n">plots</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="s">'x'</span><span class="p">,</span> <span class="bp">False</span><span class="p">),</span>
    <span class="p">(</span><span class="n">x2</span><span class="p">,</span> <span class="s">'$x^2$'</span><span class="p">,</span> <span class="bp">False</span><span class="p">),</span>
    <span class="p">(</span><span class="n">x_orth</span><span class="p">,</span> <span class="s">'Orthogonalized $x$'</span><span class="p">,</span> <span class="bp">True</span><span class="p">),</span>
    <span class="p">(</span><span class="n">x2_orth</span><span class="p">,</span> <span class="s">'Orthogonalized $x^2$'</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
<span class="p">]</span>

<span class="k">for</span> <span class="n">ax</span><span class="p">,</span> <span class="n">plot_data</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">axs</span><span class="p">.</span><span class="n">flat</span><span class="p">,</span> <span class="n">plots</span><span class="p">):</span>
    <span class="n">x_val</span><span class="p">,</span> <span class="n">xlabel</span> <span class="o">=</span> <span class="n">plot_data</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">plot_data</span><span class="p">)</span> <span class="o">==</span> <span class="mi">3</span> <span class="ow">and</span> <span class="n">plot_data</span><span class="p">[</span><span class="mi">2</span><span class="p">]:</span>
        <span class="n">sns</span><span class="p">.</span><span class="n">regplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x_val</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">line_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"color"</span><span class="p">:</span> <span class="s">"C1"</span><span class="p">})</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x_val</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="n">xlabel</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'y'</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">plot_data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s">'Orthogonalized $x^2$'</span><span class="p">:</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">axvline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'k'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/orthogonalized-scatter.png" alt="Orthogonalized scatter plots" /></p>

<p>The top half shows the response variable against \(x\) and \(x^2\); this should look familiar.</p>

<p>The bottom half shows the new orthogonalized polynomial terms. First, you’ll notice the domain is centered at 0 and more compressed than the original scale; both the centering and the rescaling happen as part of the orthogonalization. Otherwise, the \(x\) term is unchanged. Recall from the construction that the first-order term is only shifted and scaled, and each subsequent term is built to be orthogonal to the lower-degree terms.</p>

<p>I’ve shown a linear fit on top of the first-order term. What you’ll notice is that the orthogonalized \(x^2\) corresponds to the residuals of this line. At the lowest values of \(y\), the fit is poor, and this is where the orthogonalized \(x^2\) is highest. As the first-order term crosses the linear fit, the orthogonalized \(x^2\) crosses zero, then goes negative as the data dip under the linear fit. It crosses zero one more time, and the fit is once again poor at the highest values shown. Since the orthogonalized \(x^2\) is proportional to the residuals of the first-order fit, plotting it against those residuals should show a linear trend.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">,</span> <span class="n">r_value</span><span class="p">,</span> <span class="n">p_value</span><span class="p">,</span> <span class="n">std_err</span> <span class="o">=</span> <span class="n">scipy</span><span class="p">.</span><span class="n">stats</span><span class="p">.</span><span class="n">linregress</span><span class="p">(</span><span class="n">x_orth</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>

<span class="n">y_pred</span> <span class="o">=</span> <span class="n">intercept</span> <span class="o">+</span> <span class="n">slope</span> <span class="o">*</span> <span class="n">x_orth</span>
<span class="n">residuals</span> <span class="o">=</span> <span class="n">y</span> <span class="o">-</span> <span class="n">y_pred</span>

<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x_orth</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Original data'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_orth</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'C1'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Fitted line'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'$x$ Orth'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'y'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'$x$ Orth vs y with Linear Fit'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x2_orth</span><span class="p">,</span> <span class="n">residuals</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'$x^2$ Orth'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Residuals'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'$x^2$ Orth vs Residuals'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/residuals-linear-trend.png" alt="Residuals linear trend" /></p>

<p>And, in fact, the linear trend holds when plotting the orthogonalized \(x^2\) against the residuals. We can quantify this with the quick regression below.</p>
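<p>A quick check, reusing the <code class="language-plaintext highlighter-rouge">x2_orth</code> and <code class="language-plaintext highlighter-rouge">residuals</code> arrays computed above. Since \(y\) here is exactly quadratic, the residuals lie along the orthogonalized \(x^2\) direction up to floating-point error, so the correlation should be essentially 1 in magnitude.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Regress the residuals of the linear fit on the orthogonalized x^2 column
result = scipy.stats.linregress(x2_orth, residuals)
print(result.rvalue)  # essentially +/-1 for this exactly-quadratic y
</code></pre></div></div>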

<p>We can take this a degree higher and look at a cubic term.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x3</span> <span class="o">=</span> <span class="n">x</span><span class="o">**</span><span class="mi">3</span>
<span class="n">x2</span> <span class="o">=</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span>
<span class="n">y_cubic</span> <span class="o">=</span> <span class="mf">2.5</span> <span class="o">*</span> <span class="n">x3</span> <span class="o">-</span> <span class="mi">15</span> <span class="o">*</span> <span class="n">x2</span> <span class="o">+</span> <span class="mi">55</span> <span class="o">*</span> <span class="n">x</span>

<span class="n">transformer</span> <span class="o">=</span> <span class="n">OrthogonalPolynomialTransformer</span><span class="p">(</span><span class="n">degree</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">x_orthogonalized</span> <span class="o">=</span> <span class="n">transformer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x_orth</span> <span class="o">=</span> <span class="n">x_orthogonalized</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">x2_orth</span> <span class="o">=</span> <span class="n">x_orthogonalized</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">x3_orth</span> <span class="o">=</span> <span class="n">x_orthogonalized</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">]</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">axs</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">8</span><span class="p">),</span> <span class="n">sharey</span><span class="o">=</span><span class="s">'row'</span><span class="p">)</span>

<span class="n">plots</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="s">'x'</span><span class="p">,</span> <span class="s">'x vs y'</span><span class="p">,</span> <span class="bp">False</span><span class="p">),</span>
    <span class="p">(</span><span class="n">x2</span><span class="p">,</span> <span class="s">'$x^2$'</span><span class="p">,</span> <span class="s">'$x^2$ vs y'</span><span class="p">,</span> <span class="bp">False</span><span class="p">),</span>
    <span class="p">(</span><span class="n">x3</span><span class="p">,</span> <span class="s">'$x^3$'</span><span class="p">,</span> <span class="s">'$x^3$ vs y'</span><span class="p">,</span> <span class="bp">False</span><span class="p">),</span>
    <span class="p">(</span><span class="n">x_orth</span><span class="p">,</span> <span class="s">'$x$ Orth'</span><span class="p">,</span> <span class="s">'$x$ Orth vs y'</span><span class="p">,</span> <span class="bp">True</span><span class="p">),</span>
    <span class="p">(</span><span class="n">x2_orth</span><span class="p">,</span> <span class="s">'$x^2$ Orth'</span><span class="p">,</span> <span class="s">'$x^2$ Orth vs y'</span><span class="p">,</span> <span class="bp">True</span><span class="p">),</span>
    <span class="p">(</span><span class="n">x3_orth</span><span class="p">,</span> <span class="s">'$x^3$ Orth'</span><span class="p">,</span> <span class="s">'$x^3$ Orth vs y'</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
<span class="p">]</span>

<span class="k">for</span> <span class="n">ax</span><span class="p">,</span> <span class="n">plot_data</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">axs</span><span class="p">.</span><span class="n">flat</span><span class="p">,</span> <span class="n">plots</span><span class="p">):</span>
    <span class="n">x_val</span><span class="p">,</span> <span class="n">xlabel</span><span class="p">,</span> <span class="n">title</span> <span class="o">=</span> <span class="n">plot_data</span><span class="p">[:</span><span class="mi">3</span><span class="p">]</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">plot_data</span><span class="p">)</span> <span class="o">==</span> <span class="mi">4</span> <span class="ow">and</span> <span class="n">plot_data</span><span class="p">[</span><span class="mi">3</span><span class="p">]:</span>
        <span class="n">sns</span><span class="p">.</span><span class="n">regplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x_val</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y_cubic</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">line_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"color"</span><span class="p">:</span> <span class="s">"C1"</span><span class="p">})</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x_val</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y_cubic</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="n">xlabel</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="n">title</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">title</span> <span class="ow">in</span> <span class="p">(</span><span class="s">'$x^2$ Orth vs y'</span><span class="p">,</span> <span class="s">'$x^3$ Orth vs y'</span><span class="p">):</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">axvline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'k'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/cubic-scatter.png" alt="Cubic scatter plots" /></p>

<p>At the cubic level, it’s a bit more difficult to see the trends; however, the procedure is still the same. We can model each subsequent term against the residuals of the prior one, and since this data was constructed from a cubic function, the plot of the orthogonalized \(x^3\) term against the residuals of the \(x^2\) fit is linear.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">,</span> <span class="n">r_value</span><span class="p">,</span> <span class="n">p_value</span><span class="p">,</span> <span class="n">std_err</span> <span class="o">=</span> <span class="n">scipy</span><span class="p">.</span><span class="n">stats</span><span class="p">.</span><span class="n">linregress</span><span class="p">(</span><span class="n">x_orth</span><span class="p">,</span> <span class="n">y_cubic</span><span class="p">)</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">intercept</span> <span class="o">+</span> <span class="n">slope</span> <span class="o">*</span> <span class="n">x_orth</span>
<span class="n">residuals</span> <span class="o">=</span> <span class="n">y_cubic</span> <span class="o">-</span> <span class="n">y_pred</span>

<span class="n">slope_res</span><span class="p">,</span> <span class="n">intercept_res</span><span class="p">,</span> <span class="n">r_value_res</span><span class="p">,</span> <span class="n">p_value_res</span><span class="p">,</span> <span class="n">std_err_res</span> <span class="o">=</span> <span class="n">scipy</span><span class="p">.</span><span class="n">stats</span><span class="p">.</span><span class="n">linregress</span><span class="p">(</span>
    <span class="n">x2_orth</span><span class="p">,</span> <span class="n">residuals</span>
<span class="p">)</span>
<span class="n">residuals_pred</span> <span class="o">=</span> <span class="n">intercept_res</span> <span class="o">+</span> <span class="n">slope_res</span> <span class="o">*</span> <span class="n">x2_orth</span>
<span class="n">second_order_residuals</span> <span class="o">=</span> <span class="n">residuals</span> <span class="o">-</span> <span class="n">residuals_pred</span>

<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>

<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x_orth</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y_cubic</span><span class="p">,</span> <span class="n">hue</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">x_orth</span><span class="p">)),</span>
                <span class="n">palette</span><span class="o">=</span><span class="s">"viridis"</span><span class="p">,</span> <span class="n">legend</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_orth</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Linear Model'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'$x$ Orth'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'y'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'$x$ Orth vs y with Linear Fit'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x2_orth</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">residuals</span><span class="p">,</span> <span class="n">hue</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">x2_orth</span><span class="p">)),</span>
                <span class="n">palette</span><span class="o">=</span><span class="s">"viridis"</span><span class="p">,</span> <span class="n">legend</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x2_orth</span><span class="p">,</span> <span class="n">residuals_pred</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'$x^2$ Orth'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Residuals'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'$x^2$ Orth vs Residuals'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'grey'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">,</span> <span class="n">zorder</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x3_orth</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">second_order_residuals</span><span class="p">,</span> <span class="n">hue</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">x3_orth</span><span class="p">)),</span>
                <span class="n">palette</span><span class="o">=</span><span class="s">"viridis"</span><span class="p">,</span> <span class="n">legend</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'$x^3$ Orth'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Second Order Residuals'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'$x^3$ Orth vs Second Order Residuals'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'grey'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">,</span> <span class="n">zorder</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">annotate</span><span class="p">(</span><span class="s">'Point hue denotes index'</span><span class="p">,</span>
             <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="mf">0.99</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">),</span> <span class="n">ha</span><span class="o">=</span><span class="s">'right'</span><span class="p">,</span> <span class="n">xycoords</span><span class="o">=</span><span class="s">'axes fraction'</span><span class="p">,</span>
             <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/cubic-residuals.png" alt="Cubic residuals" /></p>

<p>The main takeaway of this deep dive is the following: <strong>The <code class="language-plaintext highlighter-rouge">poly</code> keyword when used in a formula creates orthogonal polynomials. This is well-suited for fitting statistical models, since it eliminates the risk of multicollinearity between terms.</strong></p>

<p>This wasn’t used in the other notebook since we were trying to recover the parameters associated with each term. However, if you’re building a statistical model, especially one in which prediction is the focus, orthogonal polynomials may be the appropriate choice. The condition-number comparison below makes the multicollinearity point concrete.</p>
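<p>A minimal sketch, reusing the <code class="language-plaintext highlighter-rouge">x = np.linspace(0, 10, 100)</code> array from earlier; the exact condition numbers are illustrative, but an orthonormal design matrix will always sit near 1.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Raw design matrix: columns x, x^2, x^3 (dropping the constant column)
raw_design = np.vander(x, N=4, increasing=True)[:, 1:]

# Orthogonalized design matrix of the same degree
orth_design = OrthogonalPolynomialTransformer(degree=3).fit_transform(x)

print(np.linalg.cond(raw_design))   # large: the columns are nearly collinear
print(np.linalg.cond(orth_design))  # ~1: the columns are orthonormal
</code></pre></div></div>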

<p>As one final note, the formulae version of <code class="language-plaintext highlighter-rouge">poly</code> does include a <code class="language-plaintext highlighter-rouge">raw</code> argument, which allows you to get the non-orthogonalized versions of each polynomial term. You can call that in Bambi like <code class="language-plaintext highlighter-rouge">bmb.Model("y ~ poly(x, 4, raw=True)", df)</code>.</p>
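<p>Mirroring the earlier formulae call, and assuming <code class="language-plaintext highlighter-rouge">raw</code> is accepted as a keyword in the direct call just as it is in the formula string, a hedged sketch:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumed to mirror poly(x, 4, raw=True) in a formula string:
# returns the plain powers of X rather than the orthogonalized columns
formulae_poly = formulae.transforms.Polynomial()
formulae_poly(X, 4, raw=True)
</code></pre></div></div>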

<h2 id="orthogonal-polynomials-in-practice">Orthogonal Polynomials in Practice</h2>

<p>In order to see the <code class="language-plaintext highlighter-rouge">poly</code> keyword in action, we’ll take a look at the cars dataset. This dataset, bundled with Seaborn, includes information on cars manufactured between 1970 and 1982. First we’ll load it in and take a look at the included variables.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_mpg</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s">"mpg"</span><span class="p">)</span>
<span class="n">df_mpg</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    mpg  cylinders  displacement  horsepower  weight  acceleration  model_year origin                       name
0  18.0          8         307.0       130.0    3504          12.0          70    usa  chevrolet chevelle malibu
1  15.0          8         350.0       165.0    3693          11.5          70    usa          buick skylark 320
2  18.0          8         318.0       150.0    3436          11.0          70    usa         plymouth satellite
3  16.0          8         304.0       150.0    3433          12.0          70    usa              amc rebel sst
4  17.0          8         302.0       140.0    3449          10.5          70    usa                ford torino
</code></pre></div></div>

<p>In this example, we’ll take a look at how a car’s fuel efficiency (<code class="language-plaintext highlighter-rouge">mpg</code>) relates to its <code class="language-plaintext highlighter-rouge">horsepower</code> (hp).</p>

<p>To start, we’ll just plot the joint distribution, as well as the distribution of the response variable.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_mpg</span> <span class="o">=</span> <span class="n">df_mpg</span><span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="s">"mpg"</span><span class="p">])</span>

<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">14</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">regplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"mpg"</span><span class="p">,</span> <span class="n">line_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"color"</span><span class="p">:</span> <span class="s">"firebrick"</span><span class="p">})</span>

<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">histplot</span><span class="p">(</span><span class="n">df_mpg</span><span class="p">[</span><span class="s">"mpg"</span><span class="p">],</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">"black"</span><span class="p">,</span> <span class="n">kde</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'MPG'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Count'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Histogram of MPG'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/mpg-hp-joint.png" alt="MPG vs horsepower joint distribution" /></p>

<p>Immediately, we see that the linear fit doesn’t model this data particularly well; the relationship exhibits some nonlinearity. We’ll use a polynomial regression to see if we can improve the fit and capture that curvature. As a benchmark, we first fit a linear model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mpg_hp_linear_mod</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="s">"mpg ~ horsepower"</span><span class="p">,</span> <span class="n">df_mpg</span><span class="p">)</span>
<span class="n">mpg_hp_linear_fit</span> <span class="o">=</span> <span class="n">mpg_hp_linear_mod</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span>
    <span class="n">idata_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"log_likelihood"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span>
<span class="p">)</span>
<span class="n">mpg_hp_linear_mod</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">mpg_hp_linear_fit</span><span class="p">,</span> <span class="n">kind</span><span class="o">=</span><span class="s">"response"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="p">[.</span><span class="mi">68</span><span class="p">,</span> <span class="p">.</span><span class="mi">95</span><span class="p">]:</span>
    <span class="n">bmb</span><span class="p">.</span><span class="n">interpret</span><span class="p">.</span><span class="n">plot_predictions</span><span class="p">(</span>
        <span class="n">mpg_hp_linear_mod</span><span class="p">,</span>
        <span class="n">mpg_hp_linear_fit</span><span class="p">,</span>
        <span class="s">"horsepower"</span><span class="p">,</span>
        <span class="n">pps</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">legend</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">prob</span><span class="o">=</span><span class="n">p</span><span class="p">,</span>
        <span class="n">ax</span><span class="o">=</span><span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">()</span>
    <span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"mpg"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'True Data'</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/linear-fit-predictions.png" alt="Linear fit predictions" /></p>

<p>Looking at this plot with the 68% and 95% CIs shown, the fit looks <em>okay</em>. Most notably, at about 160 hp, the data diverge from the fit pretty drastically. The fit at low hp values isn’t particularly good either; quite a bit of the data falls outside our 95% CI. This becomes even clearer when we look at the residuals from the mean of the model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predicted_mpg</span> <span class="o">=</span> <span class="n">mpg_hp_linear_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"mu"</span><span class="p">].</span><span class="n">mean</span><span class="p">((</span><span class="s">"chain"</span><span class="p">,</span> <span class="s">"draw"</span><span class="p">))</span>
<span class="n">residuals</span> <span class="o">=</span> <span class="n">df_mpg</span><span class="p">[</span><span class="s">"mpg"</span><span class="p">]</span> <span class="o">-</span> <span class="n">predicted_mpg</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">residuals</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">"Residuals"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Residuals for linear model'</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/linear-residuals.png" alt="Linear model residuals" /></p>

<p>This is definitely not the flat band we would ideally like to see.</p>
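
<p>To put a number on that curvature (a quick check, not in the original notebook; the choice of five bins is arbitrary), we can bin the residuals by horsepower and look at the bin means. A well-specified model should have bin means near zero across the whole range.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># bin residuals by horsepower; systematically nonzero bin means indicate
# curvature that the linear term cannot capture
resid_series = pd.Series(np.asarray(residuals), index=df_mpg.index)
hp_bins = pd.cut(df_mpg["horsepower"], bins=5)
print(resid_series.groupby(hp_bins, observed=True).mean())
</code></pre></div></div>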

<p>Next, we fit a polynomial regression that includes a quadratic term.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mpg_hp_sq_mod</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="s">"mpg ~ poly(horsepower, 2)"</span><span class="p">,</span> <span class="n">df_mpg</span><span class="p">)</span>
<span class="n">mpg_hp_sq_fit</span> <span class="o">=</span> <span class="n">mpg_hp_sq_mod</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span>
    <span class="n">idata_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"log_likelihood"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span>
<span class="p">)</span>
<span class="n">mpg_hp_sq_mod</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">mpg_hp_sq_fit</span><span class="p">,</span> <span class="n">kind</span><span class="o">=</span><span class="s">"response"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="p">[.</span><span class="mi">68</span><span class="p">,</span> <span class="p">.</span><span class="mi">95</span><span class="p">]:</span>
    <span class="n">bmb</span><span class="p">.</span><span class="n">interpret</span><span class="p">.</span><span class="n">plot_predictions</span><span class="p">(</span>
        <span class="n">mpg_hp_sq_mod</span><span class="p">,</span>
        <span class="n">mpg_hp_sq_fit</span><span class="p">,</span>
        <span class="s">"horsepower"</span><span class="p">,</span>
        <span class="n">pps</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">legend</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">prob</span><span class="o">=</span><span class="n">p</span><span class="p">,</span>
        <span class="n">ax</span><span class="o">=</span><span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">()</span>
    <span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"mpg"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'True Data'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Quadratic Fit"</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/quadratic-fit-predictions.png" alt="Quadratic fit predictions" /></p>

<p>Visually, this looks better. In particular, at high values the model follows the pattern in the data much more closely, since including the polynomial term allows for curvature. Generating the same residual plot gives the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predicted_mpg</span> <span class="o">=</span> <span class="n">mpg_hp_sq_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"mu"</span><span class="p">].</span><span class="n">mean</span><span class="p">((</span><span class="s">"chain"</span><span class="p">,</span> <span class="s">"draw"</span><span class="p">))</span>
<span class="n">residuals</span> <span class="o">=</span> <span class="n">df_mpg</span><span class="p">[</span><span class="s">"mpg"</span><span class="p">]</span> <span class="o">-</span> <span class="n">predicted_mpg</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">residuals</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">"Residuals"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Residuals for quadratic model'</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/quadratic-residuals.png" alt="Quadratic model residuals" /></p>

<p>This is far closer to flat than before.</p>

<p>For a more rigorous comparison, we can look at the difference in expected log pointwise predictive density (ELPD) between the models, estimated via leave-one-out cross-validation (LOO-CV).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">compare</span><span class="p">({</span><span class="s">"Linear"</span><span class="p">:</span> <span class="n">mpg_hp_linear_fit</span><span class="p">,</span> <span class="s">"Quadratic"</span><span class="p">:</span> <span class="n">mpg_hp_sq_fit</span><span class="p">})</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>           rank     elpd_loo     p_loo  elpd_diff    weight         se       dse  warning scale
Quadratic     0 -1137.318163  4.249899    0.00000  0.915146  18.118029   0.00000    False   log
Linear        1 -1181.836583  3.362876   44.51842  0.084854  15.124216  10.37422    False   log
</code></pre></div></div>

<p>The quadratic model performs better by LOO-CV: its ELPD is higher by roughly 44.5 units, several times the standard error of the difference.</p>
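
<p>As a rough check on that conclusion (a sketch, leaning on the normal approximation behind the reported standard error of the ELPD difference), we can compare the difference to its standard error:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmp_pair = az.compare({"Linear": mpg_hp_linear_fit, "Quadratic": mpg_hp_sq_fit})
# the ELPD difference (~44.5) is roughly four times its standard error (~10.4),
# so the improvement is unlikely to be estimation noise
print(cmp_pair.loc["Linear", "elpd_diff"] / cmp_pair.loc["Linear", "dse"])
</code></pre></div></div>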

<h3 id="cautionary-tales">Cautionary Tales</h3>

<p>Last, we’re going to investigate a couple of pitfalls with polynomial regression.</p>

<h4 id="fitting-too-many-polynomial-degrees">Fitting too many polynomial degrees</h4>

<p>Typically, when fitting a statistical model, you want to come to your data with a hypothesis and motivate your polynomial degree based on domain knowledge and expertise with the data. Instead of being principled, we’re going to throw caution to the wind, iteratively fit models of degree 1 through 9, and then see which performs best.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">poly_fits</span><span class="p">,</span> <span class="n">poly_models</span> <span class="o">=</span> <span class="p">{},</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">degree</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">):</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="sa">f</span><span class="s">"mpg ~ poly(horsepower, </span><span class="si">{</span><span class="n">degree</span><span class="si">}</span><span class="s">)"</span><span class="p">,</span> <span class="n">df_mpg</span><span class="p">)</span>
    <span class="n">fit</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span>
        <span class="n">idata_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"log_likelihood"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span>
    <span class="p">)</span>
    <span class="n">poly_models</span><span class="p">[</span><span class="sa">f</span><span class="s">"Poly</span><span class="si">{</span><span class="n">degree</span><span class="si">}</span><span class="s">"</span><span class="p">]</span> <span class="o">=</span> <span class="n">model</span>
    <span class="n">poly_fits</span><span class="p">[</span><span class="sa">f</span><span class="s">"Poly</span><span class="si">{</span><span class="n">degree</span><span class="si">}</span><span class="s">"</span><span class="p">]</span> <span class="o">=</span> <span class="n">fit</span>

<span class="nb">cmp</span> <span class="o">=</span> <span class="n">az</span><span class="p">.</span><span class="n">compare</span><span class="p">(</span><span class="n">poly_fits</span><span class="p">)</span>
<span class="nb">cmp</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>       rank     elpd_loo      p_loo  elpd_diff    weight         se       dse  warning scale
Poly7     0 -1133.197215   9.458617   0.000000  0.649372  18.880097   0.000000    False   log
Poly6     1 -1134.193525   8.909358   0.996310  0.000000  18.614197   1.827800     True   log
Poly8     2 -1134.296370  10.653330   1.099154  0.000000  18.876017   0.609013    False   log
Poly5     3 -1134.866504   7.696208   1.669289  0.000000  18.554120   3.508807    False   log
Poly9     4 -1135.238197  12.004611   2.040982  0.000000  18.951758   1.579663    False   log
Poly2     5 -1137.318129   4.249865   4.120914  0.000000  18.118010   6.509055    False   log
Poly3     6 -1137.990983   5.376918   4.793768  0.285760  18.402214   7.003322    False   log
Poly4     7 -1138.858924   7.061455   5.661709  0.000000  18.308157   6.200712    False   log
Poly1     8 -1181.870350   3.400270  48.673135  0.064868  15.124141  11.007970    False   log
</code></pre></div></div>

<p>A 7th-degree polynomial seems to do better than the quadratic one we fit before. But notice that most of the ELPD values are very similar. Let’s make a plot so we can more easily grasp how different the models really are according to ELPD. We’ll use <code class="language-plaintext highlighter-rouge">az.plot_compare</code> and add a blue band marking models whose ELPD differs from the first-ranked model by less than 4; models that close are essentially indistinguishable by ELPD.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ax</span> <span class="o">=</span> <span class="n">az</span><span class="p">.</span><span class="n">plot_compare</span><span class="p">(</span><span class="nb">cmp</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">plot_ic_diff</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">legend</span><span class="o">=</span><span class="bp">False</span><span class="p">);</span>
<span class="n">best_loo</span> <span class="o">=</span> <span class="nb">cmp</span><span class="p">[</span><span class="s">"elpd_loo"</span><span class="p">].</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">ax</span><span class="p">.</span><span class="n">axvspan</span><span class="p">(</span><span class="n">best_loo</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="n">best_loo</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"C0"</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/elpd-comparison.png" alt="ELPD comparison" /></p>

<p>We can see that <code class="language-plaintext highlighter-rouge">Poly6</code>, <code class="language-plaintext highlighter-rouge">Poly8</code>, <code class="language-plaintext highlighter-rouge">Poly5</code> and <code class="language-plaintext highlighter-rouge">Poly9</code> are all within 4 units of the best model. Moreover, all models except <code class="language-plaintext highlighter-rouge">Poly1</code> have overlapping standard errors.</p>

<p>Overall, this is telling us that there is no clear gain in predictive performance once we move beyond a quadratic model. If we want to pick a single model, we need another criterion to decide. If we have no reason to prefer a more complex model, choosing the simpler one (<code class="language-plaintext highlighter-rouge">Poly2</code> in this example) is a good heuristic.</p>
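
<p>One way to encode that heuristic programmatically is sketched below. Note that the 4-unit band is itself a rule of thumb, and <code class="language-plaintext highlighter-rouge">Poly2</code> sits just outside it at about 4.1 in the table above, so we loosen the cutoff slightly; using <code class="language-plaintext highlighter-rouge">p_loo</code> as the complexity measure is also our choice here, not a universal rule.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># keep models within ~4.5 ELPD units of the best, then pick the one with the
# smallest effective number of parameters (p_loo) as the simplest
close_enough = cmp[cmp["elpd_diff"].le(4.5)]
simplest = close_enough.sort_values("p_loo").index[0]
print(simplest)  # Poly2, given the comparison table above
</code></pre></div></div>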

<p>Before deciding, let’s make a couple more plots. First, let’s see what the degree-7 residuals look like!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">best_model</span> <span class="o">=</span> <span class="n">poly_models</span><span class="p">[</span><span class="s">"Poly7"</span><span class="p">]</span>
<span class="n">best_fit</span> <span class="o">=</span> <span class="n">poly_fits</span><span class="p">[</span><span class="s">"Poly7"</span><span class="p">]</span>
<span class="n">best_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">best_fit</span><span class="p">,</span> <span class="n">kind</span><span class="o">=</span><span class="s">"response"</span><span class="p">)</span>

<span class="n">predicted_mpg</span> <span class="o">=</span> <span class="n">best_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"mu"</span><span class="p">].</span><span class="n">mean</span><span class="p">((</span><span class="s">"chain"</span><span class="p">,</span> <span class="s">"draw"</span><span class="p">))</span>
<span class="n">residuals</span> <span class="o">=</span> <span class="n">df_mpg</span><span class="p">[</span><span class="s">"mpg"</span><span class="p">]</span> <span class="o">-</span> <span class="n">predicted_mpg</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">residuals</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">"Residuals"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Residuals for degree 7 model'</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/poly7-residuals.png" alt="Poly7 residuals" /></p>

<p>Hey, that looks pretty good; the residuals appear nice and flat. But before we go full steam ahead with this model, let’s take a look at the posterior predictive distribution.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="p">[.</span><span class="mi">68</span><span class="p">,</span> <span class="p">.</span><span class="mi">95</span><span class="p">]:</span>
    <span class="n">bmb</span><span class="p">.</span><span class="n">interpret</span><span class="p">.</span><span class="n">plot_predictions</span><span class="p">(</span>
        <span class="n">best_model</span><span class="p">,</span>
        <span class="n">best_fit</span><span class="p">,</span>
        <span class="s">"horsepower"</span><span class="p">,</span>
        <span class="n">pps</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">legend</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">prob</span><span class="o">=</span><span class="n">p</span><span class="p">,</span>
        <span class="n">ax</span><span class="o">=</span><span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">()</span>
    <span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"mpg"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'True Data'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Best Fit Model: 7th Degree Polynomial"</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/poly7-predictions.png" alt="Poly7 predictions" /></p>

<p>Uh-oh. While this model gave the best ELPD and a nice residual plot, it’s obviously overfit, as expected given that we already showed its difference from the quadratic model is small. Given our knowledge of how cars operate, we expect fuel efficiency to decrease at higher horsepower, and the 7th-degree polynomial is clearly not consistent with that. First, at low values it increases before starting the decreasing trend. Second, it turns back upward at the high end of the data, latching strongly onto a couple of points that are likely driven by noise.</p>

<p>This behavior evokes the classic quote,</p>

<blockquote>
  <p>“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” - John von Neumann</p>
</blockquote>

<p>The takeaway here is that <strong>as you fit higher polynomial degrees, you increase the risk of overfitting</strong>.</p>

<h4 id="extrapolation-of-polynomial-models">Extrapolation of polynomial models</h4>

<p>With any model, we should be careful when extrapolating and ensure our assumptions hold, but this applies especially to polynomial regression: higher-order terms grow without bound, so predictions can blow up quickly outside the fitting domain.</p>
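
<p>To build some intuition for how fast this happens, here’s an illustrative calculation (the numbers are round approximations of this dataset’s horsepower range, not exact values):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># observed horsepower tops out around 230; extrapolating to 300 multiplies the
# raw degree-7 basis term by about 6.4, amplifying any coefficient noise
x_edge, x_out = 230.0, 300.0
print((x_out / x_edge) ** 7)
</code></pre></div></div>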

<p>For example, with the quadratic fit we saw the drop in mpg flatten out at higher horsepower; looking closely at the posterior predictive of the quadratic model, you can even see the fit begin to rise again at the edge of the data. If we extend the model beyond those bounds, the curvature of a second-degree polynomial fully reverses the effect, implying that higher horsepower leads to <em>better</em> mpg.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">extrapolate_x_hp</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">250</span><span class="p">)</span>
<span class="n">mpg_hp_sq_mod</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">mpg_hp_sq_fit</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"horsepower"</span><span class="p">:</span> <span class="n">extrapolate_x_hp</span><span class="p">}))</span>

<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"mpg"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'True Data'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
    <span class="n">extrapolate_x_hp</span><span class="p">,</span>
    <span class="n">mpg_hp_sq_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"mu"</span><span class="p">].</span><span class="n">mean</span><span class="p">((</span><span class="s">"chain"</span><span class="p">,</span> <span class="s">"draw"</span><span class="p">)),</span>
    <span class="n">color</span><span class="o">=</span><span class="s">"red"</span><span class="p">,</span>
    <span class="n">label</span><span class="o">=</span><span class="s">"Extrapolated Fit"</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlim</span><span class="p">(</span><span class="n">left</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">right</span><span class="o">=</span><span class="n">extrapolate_x_hp</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">frameon</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/quadratic-extrapolation.png" alt="Quadratic extrapolation" /></p>

<p>This is simply untrue based on what we know about cars and what we’ve seen in the data, so you would <em>not</em> want to use the model outside of the intended domain. If extrapolation is the goal, you would want a more appropriate specification; something like an exponential or inverse relationship may be appropriate, so that the fit approaches zero at high horsepower while never predicting values below zero.</p>
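
<p>A minimal sketch of one such alternative (not part of the original notebook, and assuming the Gaussian family accepts a log link via Bambi’s <code class="language-plaintext highlighter-rouge">link</code> argument): with a log link, the mean is strictly positive, and a negative slope makes it decay toward zero at high horsepower.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># exponential-decay-style specification: mu = exp(b0 + b1 * horsepower),
# which can never dip below zero no matter how far we extrapolate
mpg_hp_exp_mod = bmb.Model("mpg ~ horsepower", df_mpg, family="gaussian", link="log")
mpg_hp_exp_fit = mpg_hp_exp_mod.fit(random_seed=SEED)
</code></pre></div></div>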

<p>Extrapolation issues are not unique to polynomial regression; for example, linear regression also produces forbidden values when we extrapolate too far.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mpg_hp_linear_mod</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span>
    <span class="n">mpg_hp_linear_fit</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"horsepower"</span><span class="p">:</span> <span class="n">extrapolate_x_hp</span><span class="p">})</span>
<span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"mpg"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'True Data'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
    <span class="n">extrapolate_x_hp</span><span class="p">,</span>
    <span class="n">mpg_hp_linear_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"mu"</span><span class="p">].</span><span class="n">mean</span><span class="p">((</span><span class="s">"chain"</span><span class="p">,</span> <span class="s">"draw"</span><span class="p">)),</span>
    <span class="n">color</span><span class="o">=</span><span class="s">"red"</span><span class="p">,</span>
    <span class="n">label</span><span class="o">=</span><span class="s">"Predicted"</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span>
    <span class="n">extrapolate_x_hp</span><span class="p">,</span> <span class="n">plt</span><span class="p">.</span><span class="n">ylim</span><span class="p">()[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">0</span><span class="p">,</span>
    <span class="n">color</span><span class="o">=</span><span class="s">'grey'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"MPG Forbidden region"</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlim</span><span class="p">(</span><span class="n">left</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">right</span><span class="o">=</span><span class="n">extrapolate_x_hp</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylim</span><span class="p">(</span><span class="n">bottom</span><span class="o">=</span><span class="n">mpg_hp_linear_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"mu"</span><span class="p">].</span><span class="n">mean</span><span class="p">((</span><span class="s">"chain"</span><span class="p">,</span> <span class="s">"draw"</span><span class="p">)).</span><span class="nb">min</span><span class="p">())</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">frameon</span><span class="o">=</span><span class="bp">False</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/linear-extrapolation.png" alt="Linear extrapolation" /></p>

<p>However, it is highlighted in this notebook because, due to the nature of polynomial regression, fits can be especially unstable outside the fitting domain. Just for fun, to wrap up this notebook, we’ll take a look at what the 7th-order “best model” does outside of where we fit it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">extrapolate_x_hp</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">,</span> <span class="mi">250</span><span class="p">)</span>
<span class="n">best_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">best_fit</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"horsepower"</span><span class="p">:</span> <span class="n">extrapolate_x_hp</span><span class="p">}))</span>

<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"horsepower"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"mpg"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'True Data'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
    <span class="n">extrapolate_x_hp</span><span class="p">,</span>
    <span class="n">best_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"mu"</span><span class="p">].</span><span class="n">mean</span><span class="p">((</span><span class="s">"chain"</span><span class="p">,</span> <span class="s">"draw"</span><span class="p">)),</span>
    <span class="n">color</span><span class="o">=</span><span class="s">"red"</span><span class="p">,</span>
    <span class="n">label</span><span class="o">=</span><span class="s">"Extrapolated Fit"</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span>
    <span class="n">extrapolate_x_hp</span><span class="p">,</span> <span class="n">plt</span><span class="p">.</span><span class="n">ylim</span><span class="p">()[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">0</span><span class="p">,</span>
    <span class="n">color</span><span class="o">=</span><span class="s">'grey'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"MPG Forbidden region"</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlim</span><span class="p">(</span><span class="n">left</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">right</span><span class="o">=</span><span class="n">extrapolate_x_hp</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylim</span><span class="p">(</span><span class="n">bottom</span><span class="o">=</span><span class="n">best_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"mu"</span><span class="p">].</span><span class="n">mean</span><span class="p">((</span><span class="s">"chain"</span><span class="p">,</span> <span class="s">"draw"</span><span class="p">)).</span><span class="nb">min</span><span class="p">())</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">frameon</span><span class="o">=</span><span class="bp">False</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/blogimages/orthogonal-polynomial-regression/poly7-extrapolation.png" alt="Poly7 extrapolation" /></p>

<p>Yikes.</p>]]></content><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><category term="Statistics" /><category term="python" /><category term="bayesian" /><category term="bambi" /><category term="statistics" /><category term="regression" /><summary type="html"><![CDATA[A deep dive into what orthogonal polynomials actually do under the hood, contributed to Bambi's examples]]></summary></entry><entry><title type="html">2024 Rewind: Polynomial Regression in Bambi</title><link href="https://tylerjamesburch.com/blog/statistics/polynomial-regression-bambi" rel="alternate" type="text/html" title="2024 Rewind: Polynomial Regression in Bambi" /><published>2026-02-16T00:00:00+00:00</published><updated>2026-02-16T00:00:00+00:00</updated><id>https://tylerjamesburch.com/blog/statistics/polynomial-regression-bambi</id><content type="html" xml:base="https://tylerjamesburch.com/blog/statistics/polynomial-regression-bambi"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>Back in 2024, I wrote a couple of example notebooks that got merged into the <a href="https://bambinos.github.io/bambi/">Bambi</a> documentation. For those unfamiliar, Bambi is a library for fitting Bayesian regression models using a formula-based interface on top of PyMC (the closest thing in Python to <code class="language-plaintext highlighter-rouge">brms</code>, in my opinion). I realized I never migrated the content here, so I thought it was time to do so.</p>

<p>This post covers polynomial regression. The original notebook lives in the <a href="https://bambinos.github.io/bambi/notebooks/polynomial_regression.html">Bambi docs</a>.</p>

<p>What follows is the content from the notebook, lightly adapted for this blog format.</p>

<hr />

<h1 id="polynomial-regression">Polynomial Regression</h1>

<p>Unlike many other examples shown in Bambi, there aren’t specific polynomial methods or families implemented – most of the interesting behavior for polynomial regression occurs within the formula definition. Regardless, there are some nuances that are useful to be aware of.</p>

<p>This example uses the kinematic equations from classical mechanics as a backdrop. Specifically, an object in motion experiencing constant acceleration can be described by the following:</p>

\[x_f = \frac{1}{2} a t^2 + v_0 t + x_0\]

<p>where \(x_0\) and \(x_f\) are the initial and final locations, \(v_0\) is the initial velocity, and \(a\) is acceleration.</p>

<h2 id="a-falling-ball">A falling ball</h2>

<p>First, we’ll consider a simple falling ball, released from 50 meters. In this situation, \(v_0 = 0\) \(m\)/\(s\), \(x_0 = 50\) \(m\) and \(a = g\), the acceleration due to gravity, \(-9.81\) \(m\)/\(s^2\). So dropping out the \(v_0 t\) component, the equation takes the form:</p>

\[x_f = \frac{1}{2} g t^2 + x_0\]
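
<p>As a quick sanity check on the setup: after \(t = 2\) \(s\), the ball should sit near \(\frac{1}{2}(-9.81)(2)^2 + 50 \approx 30.4\) \(m\), which is where the simulated data below should end up at the right edge.</p>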

<p>We’ll start by simulating data for the first 2 seconds of motion. We will also assume Gaussian measurement error with \(\sigma = 0.3\).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">warnings</span>

<span class="kn">import</span> <span class="nn">arviz</span> <span class="k">as</span> <span class="n">az</span>
<span class="kn">import</span> <span class="nn">bambi</span> <span class="k">as</span> <span class="n">bmb</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="n">SEED</span> <span class="o">=</span> <span class="mi">1234</span>
<span class="n">az</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span><span class="s">"arviz-darkgrid"</span><span class="p">)</span>
<span class="n">warnings</span><span class="p">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s">"ignore"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span> <span class="o">=</span> <span class="o">-</span><span class="mf">9.81</span>  <span class="c1"># acceleration due to gravity (m/s^2)
</span><span class="n">t</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>  <span class="c1"># time in seconds
</span><span class="n">inital_height</span> <span class="o">=</span> <span class="mi">50</span>
<span class="n">x_falling</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">g</span> <span class="o">*</span> <span class="n">t</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">inital_height</span>

<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">(</span><span class="n">SEED</span><span class="p">)</span>
<span class="n">noise</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="n">x_falling</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">x_obs_falling</span> <span class="o">=</span> <span class="n">x_falling</span> <span class="o">+</span> <span class="n">noise</span>
<span class="n">df_falling</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"t"</span><span class="p">:</span> <span class="n">t</span><span class="p">,</span> <span class="s">"x"</span><span class="p">:</span> <span class="n">x_obs_falling</span><span class="p">})</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">x_obs_falling</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Observed Displacement"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"C0"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">x_falling</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"True Function"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"C1"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">"Time (s)"</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"Displacement (m)"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">();</span>
</code></pre></div></div>

<p><img src="/blogimages/polynomial-regression/falling-ball-data.png" alt="Falling ball data" /></p>

<p>Casting the equation \(x_f = \frac{1}{2} g t^2 + x_0\) into a regression context, fitting:</p>

\[x_f = \beta_0 + \beta_1 t^2\]

<p>We let time (\(t\)) be the independent variable and final location (\(x_f\)) be the response/dependent variable. This makes our coefficients map directly onto \(g\) and \(x_0\): the intercept \(\beta_0\) corresponds exactly to \(x_0\), the initial height, and \(\beta_1 = \frac{1}{2} g\), so \(g = 2\beta_1\). Because the regressor is \(t^2\) rather than \(t\), we’re doing <em>polynomial regression</em>. We can put this into Bambi via the following, optionally including the <code class="language-plaintext highlighter-rouge">+ 1</code> to emphasize that we choose to include the intercept.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_falling</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="s">"x ~ I(t**2) + 1"</span><span class="p">,</span> <span class="n">df_falling</span><span class="p">)</span>
<span class="n">results_falling</span> <span class="o">=</span> <span class="n">model_falling</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">idata_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"log_likelihood"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span><span class="p">)</span>
</code></pre></div></div>

<p>The term <code class="language-plaintext highlighter-rouge">I(t**2)</code> tells the formula parser to evaluate the expression inside <code class="language-plaintext highlighter-rouge">I()</code> as ordinary Python rather than as formula syntax. To include <em>just the \(t^2\) term</em>, you can express it in any of the following ways:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">I(t**2)</code></li>
  <li><code class="language-plaintext highlighter-rouge">{t**2}</code></li>
  <li>Square the data directly, and pass it as a new column</li>
</ul>

<p>To verify, we’ll fit the other two versions as well.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_falling_variation1</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span>
    <span class="s">"x ~ {t**2} + 1"</span><span class="p">,</span>  <span class="c1"># Using {t**2} syntax
</span>    <span class="n">df_falling</span>
<span class="p">)</span>
<span class="n">results_variation1</span> <span class="o">=</span> <span class="n">model_falling_variation1</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span><span class="p">)</span>

<span class="n">model_falling_variation2</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span>
    <span class="s">"x ~ tsquared + 1"</span><span class="p">,</span>  <span class="c1"># Using data with the t variable squared
</span>    <span class="n">df_falling</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tsquared</span><span class="o">=</span><span class="n">t</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">results_variation2</span> <span class="o">=</span> <span class="n">model_falling_variation2</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"I{t**2} coefficient: "</span><span class="p">,</span> <span class="nb">round</span><span class="p">(</span><span class="n">results_falling</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"I(t ** 2)"</span><span class="p">].</span><span class="n">values</span><span class="p">.</span><span class="n">mean</span><span class="p">(),</span> <span class="mi">4</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"{t**2} coefficient: "</span><span class="p">,</span> <span class="nb">round</span><span class="p">(</span><span class="n">results_variation1</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"I(t ** 2)"</span><span class="p">].</span><span class="n">values</span><span class="p">.</span><span class="n">mean</span><span class="p">(),</span> <span class="mi">4</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"tsquared coefficient: "</span><span class="p">,</span> <span class="nb">round</span><span class="p">(</span><span class="n">results_variation2</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"tsquared"</span><span class="p">].</span><span class="n">values</span><span class="p">.</span><span class="n">mean</span><span class="p">(),</span> <span class="mi">4</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I{t**2} coefficient:  -4.8476
{t**2} coefficient:  -4.8476
tsquared coefficient:  -4.8476
</code></pre></div></div>

<p>Each of these provides identical results, giving roughly \(-4.85\) for \(\beta_1 = g/2\). Doubling it recovers \(g \approx -9.7\) \(m\)/\(s^2\), close to the \(-9.81\) \(m\)/\(s^2\) acceleration that generated the data. Looking at our model summary,</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">results_falling</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>             mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
sigma       0.336  0.025   0.289    0.381      0.000    0.000    5977.0    2861.0    1.0
Intercept  49.961  0.051  49.870   50.058      0.001    0.001    5997.0    3145.0    1.0
I(t ** 2)  -4.848  0.028  -4.899   -4.799      0.000    0.000    5704.0    2844.0    1.0
</code></pre></div></div>

<p>We see that \(g/2\) (true value \(-4.905\)), the original height of \(x_0 = 50\) \(m\), and the injected noise (\(\sigma \approx 0.34\) versus the true \(0.3\)) are all recovered to good precision.</p>
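
<p>As a quick visual check (not in the original notebook), we can plot each marginal posterior against the true generating value:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># reference lines at the true values: x0 = 50, g/2 = -4.905, sigma = 0.3
az.plot_posterior(
    results_falling,
    var_names=["Intercept", "I(t ** 2)", "sigma"],
    ref_val=[50, g / 2, 0.3],
)
</code></pre></div></div>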

<p>We can then use the model to answer some questions. For example, when will the ball land? That corresponds to \(x_f = 0\):</p>

\[0 = \frac{1}{2} g t^2 + x_0\]

\[t = \sqrt{-2 x_0 / g}\]

<p>(Since \(g\) is negative, the quantity under the square root is positive; the code below folds the sign flip into <code class="language-plaintext highlighter-rouge">calculated_g</code>.)</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">calculated_x0</span> <span class="o">=</span> <span class="n">results_falling</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"Intercept"</span><span class="p">].</span><span class="n">values</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">calculated_g</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="n">results_falling</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"I(t ** 2)"</span><span class="p">].</span><span class="n">values</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">calculated_land</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">calculated_x0</span> <span class="o">/</span> <span class="n">calculated_g</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"The ball will land at </span><span class="si">{</span><span class="nb">round</span><span class="p">(</span><span class="n">calculated_land</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="si">}</span><span class="s"> seconds"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The ball will land at 3.21 seconds
</code></pre></div></div>

<p>Or, to propagate the parameter uncertainty, we can use the full posterior,</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">calculated_x0_posterior</span> <span class="o">=</span> <span class="n">results_falling</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"Intercept"</span><span class="p">].</span><span class="n">values</span>
<span class="n">calculated_g_posterior</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="n">results_falling</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"I(t ** 2)"</span><span class="p">].</span><span class="n">values</span>
<span class="n">calculated_land_posterior</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">calculated_x0_posterior</span> <span class="o">/</span> <span class="n">calculated_g_posterior</span><span class="p">)</span>
<span class="n">lower_est</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">quantile</span><span class="p">(</span><span class="n">calculated_land_posterior</span><span class="p">,</span> <span class="mf">0.025</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">upper_est</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">quantile</span><span class="p">(</span><span class="n">calculated_land_posterior</span><span class="p">,</span> <span class="mf">0.975</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"The ball landing will be measured between </span><span class="si">{</span><span class="n">lower_est</span><span class="si">}</span><span class="s"> and </span><span class="si">{</span><span class="n">upper_est</span><span class="si">}</span><span class="s"> seconds"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The ball landing will be measured between 3.2 and 3.23 seconds
</code></pre></div></div>
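<p>As an aside, <code class="language-plaintext highlighter-rouge">np.quantile</code> gives an equal-tailed interval; if we preferred the highest-density interval that ArviZ reports elsewhere in this post, a one-liner on the same posterior array would do:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># 95% highest-density interval of the landing-time posterior (shape: chains x draws)
az.hdi(calculated_land_posterior, hdi_prob=0.95)
</code></pre></div></div>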

<h2 id="projectile-motion">Projectile Motion</h2>

<p>Next, instead of a ball strictly falling, imagine one thrown straight upward. In this case, we add the initial velocity term back into the equation.</p>

\[x_f = \frac{1}{2} g t^2 + v_0 t + x_0\]

<p>We’ll simulate a ball tossed upward at 7 m/s, starting 1.5 meters above ground level, keeping only the observations at or above ground level.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">v0</span> <span class="o">=</span> <span class="mi">7</span>
<span class="n">x0</span> <span class="o">=</span> <span class="mf">1.5</span>
<span class="n">x_projectile</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">g</span> <span class="o">*</span> <span class="n">t</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">v0</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">x0</span>
<span class="n">noise</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="n">x_projectile</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">x_obs_projectile</span> <span class="o">=</span> <span class="n">x_projectile</span> <span class="o">+</span> <span class="n">noise</span>
<span class="n">df_projectile</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"t"</span><span class="p">:</span> <span class="n">t</span><span class="p">,</span> <span class="s">"tsq"</span><span class="p">:</span> <span class="n">t</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="s">"x"</span><span class="p">:</span> <span class="n">x_obs_projectile</span><span class="p">,</span> <span class="s">"x_true"</span><span class="p">:</span> <span class="n">x_projectile</span><span class="p">})</span>
<span class="n">df_projectile</span> <span class="o">=</span> <span class="n">df_projectile</span><span class="p">[</span><span class="n">df_projectile</span><span class="p">[</span><span class="s">"x"</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">]</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df_projectile</span><span class="p">.</span><span class="n">t</span><span class="p">,</span> <span class="n">df_projectile</span><span class="p">.</span><span class="n">x</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Observed Displacement"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"C0"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_projectile</span><span class="p">.</span><span class="n">t</span><span class="p">,</span> <span class="n">df_projectile</span><span class="p">.</span><span class="n">x_true</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'True Function'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"C1"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">"Time (s)"</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"Displacement (m)"</span><span class="p">,</span> <span class="n">ylim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="bp">None</span><span class="p">))</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">();</span>
</code></pre></div></div>

<p><img src="/blogimages/polynomial-regression/projectile-motion-data.png" alt="Projectile motion data" /></p>

<p>Modeling this using Bambi, we must include the linear term on time to capture the initial velocity. We’ll do the following regression,</p>

\[x_f = \beta_0 + \beta_1 t + \beta_2 t^2\]

<p>which then maps the solved coefficients to the following: \(\beta_0 = x_0\), \(\beta_1 = v_0\), and \(\beta_2 = \frac{g}{2}\).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_projectile_all_terms</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="s">"x ~ I(t**2) + t + 1"</span><span class="p">,</span> <span class="n">df_projectile</span><span class="p">)</span>
<span class="n">fit_projectile_all_terms</span> <span class="o">=</span> <span class="n">model_projectile_all_terms</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span>
    <span class="n">idata_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"log_likelihood"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="n">target_accept</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span>
<span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">fit_projectile_all_terms</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
sigma      0.202  0.017   0.171    0.234      0.000    0.000    2723.0    2328.0    1.0
Intercept  1.561  0.066   1.441    1.687      0.001    0.001    2058.0    2550.0    1.0
I(t ** 2) -4.867  0.114  -5.079   -4.649      0.003    0.002    1667.0    1966.0    1.0
t          6.909  0.189   6.553    7.262      0.005    0.003    1694.0    2039.0    1.0
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hdi</span> <span class="o">=</span> <span class="n">az</span><span class="p">.</span><span class="n">hdi</span><span class="p">(</span><span class="n">fit_projectile_all_terms</span><span class="p">.</span><span class="n">posterior</span><span class="p">,</span> <span class="n">hdi_prob</span><span class="o">=</span><span class="mf">0.95</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Initial height: </span><span class="si">{</span><span class="n">hdi</span><span class="p">[</span><span class="s">'Intercept'</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span><span class="n">hdi</span><span class="o">=</span><span class="s">'lower'</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> to "</span>
      <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">hdi</span><span class="p">[</span><span class="s">'Intercept'</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span><span class="n">hdi</span><span class="o">=</span><span class="s">'higher'</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> meters (True: </span><span class="si">{</span><span class="n">x0</span><span class="si">}</span><span class="s"> m)"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Initial velocity: </span><span class="si">{</span><span class="n">hdi</span><span class="p">[</span><span class="s">'t'</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span><span class="n">hdi</span><span class="o">=</span><span class="s">'lower'</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> to "</span>
      <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">hdi</span><span class="p">[</span><span class="s">'t'</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span><span class="n">hdi</span><span class="o">=</span><span class="s">'higher'</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> meters per second (True: </span><span class="si">{</span><span class="n">v0</span><span class="si">}</span><span class="s"> m/s)"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Acceleration: </span><span class="si">{</span><span class="mi">2</span><span class="o">*</span><span class="n">hdi</span><span class="p">[</span><span class="s">'I(t ** 2)'</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span><span class="n">hdi</span><span class="o">=</span><span class="s">'lower'</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> to "</span>
      <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="mi">2</span><span class="o">*</span><span class="n">hdi</span><span class="p">[</span><span class="s">'I(t ** 2)'</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span><span class="n">hdi</span><span class="o">=</span><span class="s">'higher'</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> meters per second squared (True: </span><span class="si">{</span><span class="n">g</span><span class="si">}</span><span class="s"> m/s^2)"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Initial height: 1.43 to 1.69 meters (True: 1.5 m)
Initial velocity: 6.54 to 7.28 meters per second (True: 7 m/s)
Acceleration: -10.16 to -9.27 meters per second squared (True: -9.81 m/s^2)
</code></pre></div></div>

<p>Once again, we are able to recover all of our input parameters.</p>

<p>Instead of writing out each term directly, you can include all polynomial terms up to a given degree with the <code class="language-plaintext highlighter-rouge">poly</code> keyword. We don’t use it in this notebook for two reasons. First, by default <code class="language-plaintext highlighter-rouge">poly</code> orthogonalizes the terms, which makes it ill-suited to this example: our coefficients have physical meaning, and we want to interpret them directly. Orthogonalization can be disabled with the <code class="language-plaintext highlighter-rouge">raw</code> argument, but, second, later examples apply different effects to the \(t\) term versus the \(t^2\) term, which is not easy to express with <code class="language-plaintext highlighter-rouge">poly</code>. Still, just to show the results match when using <code class="language-plaintext highlighter-rouge">raw=True</code>, we’ll fit the same model as above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_poly_raw</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="s">"x ~ poly(t, 2, raw=True)"</span><span class="p">,</span> <span class="n">df_projectile</span><span class="p">)</span>
<span class="n">fit_poly_raw</span> <span class="o">=</span> <span class="n">model_poly_raw</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">idata_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"log_likelihood"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">fit_poly_raw</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                          mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
sigma                    0.201  0.017   0.172    0.234      0.000    0.000    3066.0    2205.0    1.0
Intercept                1.561  0.067   1.437    1.682      0.001    0.001    2535.0    2154.0    1.0
poly(t, 2, raw=True)[0]  6.911  0.196   6.556    7.280      0.004    0.004    2092.0    2075.0    1.0
poly(t, 2, raw=True)[1] -4.870  0.118  -5.095   -4.653      0.003    0.002    2059.0    2166.0    1.0
</code></pre></div></div>

<p>We see the same results, where <code class="language-plaintext highlighter-rouge">poly(t, 2, raw=True)[0]</code> corresponds to the coefficient on \(t\) (\(v_0\) in our example), and <code class="language-plaintext highlighter-rouge">poly(t, 2, raw=True)[1]</code> is the coefficient on \(t^2\) (\(\frac{g}{2}\)).</p>
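<p>To see concretely why the default (orthogonalized) <code class="language-plaintext highlighter-rouge">poly</code> is a poor fit for this example, here is a rough numpy illustration using a QR-based orthogonalization (a sketch of the idea, not necessarily the exact construction Bambi uses): the raw \(t\) and \(t^2\) columns are strongly correlated, while the orthogonalized columns are not, at the cost that their coefficients no longer map cleanly onto \(v_0\) and \(g/2\).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

t_grid = np.linspace(0, 3, 50)
X_raw = np.column_stack([t_grid, t_grid**2])   # raw polynomial design matrix
Q, _ = np.linalg.qr(X_raw - X_raw.mean(0))     # center, then orthogonalize via QR
print(np.corrcoef(X_raw, rowvar=False)[0, 1])  # near 1: raw columns are collinear
print(np.corrcoef(Q, rowvar=False)[0, 1])      # near 0: orthogonalized columns
</code></pre></div></div>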

<h2 id="measuring-gravity-on-a-new-planet">Measuring gravity on a new planet</h2>

<p>In the next example, you’ve been recruited to join the space program as a research scientist, looking to directly measure the gravity on a new planet, PlanetX. You don’t know anything about this planet or its safety, so you have time for one, and only one, throw of a ball. However, you’ve perfected your throwing mechanics, and can achieve the same initial velocity wherever you are. To baseline, you make a toss on planet Earth, warm up your spacecraft and stop at Mars to make a toss, then travel far away, and make a toss on PlanetX.</p>

<p>First we simulate data for this experiment.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">simulate_throw</span><span class="p">(</span><span class="n">v0</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">noise_std</span><span class="p">,</span> <span class="n">time_step</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span> <span class="n">max_time</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">1234</span><span class="p">):</span>
    <span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
    <span class="n">times</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">max_time</span><span class="p">,</span> <span class="n">time_step</span><span class="p">)</span>
    <span class="n">heights</span> <span class="o">=</span> <span class="n">v0</span> <span class="o">*</span> <span class="n">times</span> <span class="o">-</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">g</span> <span class="o">*</span> <span class="n">times</span><span class="o">**</span><span class="mi">2</span>
    <span class="n">heights_with_noise</span> <span class="o">=</span> <span class="n">heights</span> <span class="o">+</span> <span class="n">rng</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">noise_std</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">times</span><span class="p">))</span>
    <span class="n">valid_indices</span> <span class="o">=</span> <span class="n">heights_with_noise</span> <span class="o">&gt;=</span> <span class="mi">0</span>
    <span class="k">return</span> <span class="n">times</span><span class="p">[</span><span class="n">valid_indices</span><span class="p">],</span> <span class="n">heights_with_noise</span><span class="p">[</span><span class="n">valid_indices</span><span class="p">],</span> <span class="n">heights</span><span class="p">[</span><span class="n">valid_indices</span><span class="p">]</span>

<span class="c1"># Define the parameters
</span><span class="n">v0</span> <span class="o">=</span> <span class="mi">20</span>  <span class="c1"># Initial velocity (m/s)
</span><span class="n">g_planets</span> <span class="o">=</span> <span class="p">{</span><span class="s">"Earth"</span><span class="p">:</span> <span class="mf">9.81</span><span class="p">,</span> <span class="s">"Mars"</span><span class="p">:</span> <span class="mf">3.72</span><span class="p">,</span> <span class="s">"PlanetX"</span><span class="p">:</span> <span class="mf">6.0</span><span class="p">}</span>
<span class="n">noise_std</span> <span class="o">=</span> <span class="mf">1.5</span>

<span class="c1"># Generate data
</span><span class="n">records</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">planet</span><span class="p">,</span> <span class="n">g</span> <span class="ow">in</span> <span class="n">g_planets</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
    <span class="n">times</span><span class="p">,</span> <span class="n">heights</span><span class="p">,</span> <span class="n">heights_true</span> <span class="o">=</span> <span class="n">simulate_throw</span><span class="p">(</span><span class="n">v0</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">noise_std</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">time</span><span class="p">,</span> <span class="n">height</span><span class="p">,</span> <span class="n">height_true</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">times</span><span class="p">,</span> <span class="n">heights</span><span class="p">,</span> <span class="n">heights_true</span><span class="p">):</span>
        <span class="n">records</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="n">planet</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">height</span><span class="p">,</span> <span class="n">height_true</span><span class="p">])</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">records</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Planet"</span><span class="p">,</span> <span class="s">"Time"</span><span class="p">,</span> <span class="s">"Height"</span><span class="p">,</span> <span class="s">"Height_true"</span><span class="p">])</span>
<span class="n">df</span><span class="p">[</span><span class="s">"Planet"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"Planet"</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="s">"category"</span><span class="p">)</span>
</code></pre></div></div>

<p>And drawing those trajectories,</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>

<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">planet</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">"Planet"</span><span class="p">].</span><span class="n">cat</span><span class="p">.</span><span class="n">categories</span><span class="p">):</span>
    <span class="n">subset</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">"Planet"</span><span class="p">]</span> <span class="o">==</span> <span class="n">planet</span><span class="p">]</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">subset</span><span class="p">[</span><span class="s">"Time"</span><span class="p">],</span> <span class="n">subset</span><span class="p">[</span><span class="s">"Height_true"</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="sa">f</span><span class="s">"C</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">subset</span><span class="p">[</span><span class="s">"Time"</span><span class="p">],</span> <span class="n">subset</span><span class="p">[</span><span class="s">"Height"</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="n">planet</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="sa">f</span><span class="s">"C</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="n">ax</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span>
    <span class="n">xlabel</span><span class="o">=</span><span class="s">"Time (seconds)"</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"Height (meters)"</span><span class="p">,</span>
    <span class="n">title</span><span class="o">=</span><span class="s">"Trajectory Comparison"</span><span class="p">,</span> <span class="n">ylim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Planet"</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/blogimages/polynomial-regression/planet-trajectories.png" alt="Planet trajectories" /></p>

<p>We now aim to model this data. We again use the following equation (calling displacement \(h\) for height):</p>

\[h = \frac{1}{2} g_{p} t^2 + v_{0} t\]

<p>where \(g_p\) now has a subscript to indicate the planet that we’re throwing from.</p>

<p>In Bambi, we’ll do the following:</p>

<p><code class="language-plaintext highlighter-rouge">Height ~ I(Time**2):Planet + Time + 0</code></p>

<p>which corresponds one-to-one with the above formula. The intercept is eliminated since the throw starts from \(h=0\).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">planet_model</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="s">"Height ~ I(Time**2):Planet + Time + 0"</span><span class="p">,</span> <span class="n">df</span><span class="p">)</span>
<span class="n">planet_model</span><span class="p">.</span><span class="n">build</span><span class="p">()</span>
<span class="n">planet_fit</span> <span class="o">=</span> <span class="n">planet_model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">chains</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">idata_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"log_likelihood"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span><span class="p">)</span>
</code></pre></div></div>

<p>The model has been fit. Let’s look at how well we recovered the simulated parameters.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">planet_fit</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
sigma                          1.759  0.147   1.498    2.044      0.003    0.003    2054.0    1938.0    1.0
I(Time ** 2):Planet[Earth]    -4.998  0.075  -5.145   -4.865      0.002    0.001    1833.0    2431.0    1.0
I(Time ** 2):Planet[Mars]     -1.884  0.022  -1.925   -1.844      0.001    0.000    1428.0    1763.0    1.0
I(Time ** 2):Planet[PlanetX]  -3.017  0.036  -3.087   -2.953      0.001    0.001    1519.0    1729.0    1.0
Time                          20.128  0.166  19.827   20.449      0.004    0.003    1393.0    1714.0    1.0
</code></pre></div></div>

<p>Converting the coefficients back to physical values of \(g\),</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hdi</span> <span class="o">=</span> <span class="n">az</span><span class="p">.</span><span class="n">hdi</span><span class="p">(</span><span class="n">planet_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">,</span> <span class="n">hdi_prob</span><span class="o">=</span><span class="mf">0.95</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"g for Earth: </span><span class="si">{</span><span class="mi">2</span><span class="o">*</span><span class="n">hdi</span><span class="p">[</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet</span><span class="s">'].sel(</span><span class="si">{</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet_dim</span><span class="s">'</span><span class="si">:</span><span class="s">'Earth'</span><span class="p">,</span> <span class="s">'hdi'</span><span class="si">:</span><span class="s">'lower'</span><span class="si">}</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> "</span>
      <span class="sa">f</span><span class="s">"to </span><span class="si">{</span><span class="mi">2</span><span class="o">*</span><span class="n">hdi</span><span class="p">[</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet</span><span class="s">'].sel(</span><span class="si">{</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet_dim</span><span class="s">'</span><span class="si">:</span><span class="s">'Earth'</span><span class="p">,</span> <span class="s">'hdi'</span><span class="si">:</span><span class="s">'higher'</span><span class="si">}</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> "</span>
      <span class="sa">f</span><span class="s">"meters (True: -9.81 m)"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"g for Mars: </span><span class="si">{</span><span class="mi">2</span><span class="o">*</span><span class="n">hdi</span><span class="p">[</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet</span><span class="s">'].sel(</span><span class="si">{</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet_dim</span><span class="s">'</span><span class="si">:</span><span class="s">'Mars'</span><span class="p">,</span> <span class="s">'hdi'</span><span class="si">:</span><span class="s">'lower'</span><span class="si">}</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> "</span>
      <span class="sa">f</span><span class="s">"to </span><span class="si">{</span><span class="mi">2</span><span class="o">*</span><span class="n">hdi</span><span class="p">[</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet</span><span class="s">'].sel(</span><span class="si">{</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet_dim</span><span class="s">'</span><span class="si">:</span><span class="s">'Mars'</span><span class="p">,</span> <span class="s">'hdi'</span><span class="si">:</span><span class="s">'higher'</span><span class="si">}</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> "</span>
      <span class="sa">f</span><span class="s">"meters (True: -3.72 m)"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"g for PlanetX: </span><span class="si">{</span><span class="mi">2</span><span class="o">*</span><span class="n">hdi</span><span class="p">[</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet</span><span class="s">'].sel(</span><span class="si">{</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet_dim</span><span class="s">'</span><span class="si">:</span><span class="s">'PlanetX'</span><span class="p">,</span> <span class="s">'hdi'</span><span class="si">:</span><span class="s">'lower'</span><span class="si">}</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> "</span>
      <span class="sa">f</span><span class="s">"to </span><span class="si">{</span><span class="mi">2</span><span class="o">*</span><span class="n">hdi</span><span class="p">[</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet</span><span class="s">'].sel(</span><span class="si">{</span><span class="s">'I(Time ** 2)</span><span class="si">:</span><span class="n">Planet_dim</span><span class="s">'</span><span class="si">:</span><span class="s">'PlanetX'</span><span class="p">,</span> <span class="s">'hdi'</span><span class="si">:</span><span class="s">'higher'</span><span class="si">}</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> "</span>
      <span class="sa">f</span><span class="s">"meters (True: -6.0 m)"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Initial velocity: </span><span class="si">{</span><span class="n">hdi</span><span class="p">[</span><span class="s">'Time'</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span><span class="n">hdi</span><span class="o">=</span><span class="s">'lower'</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> to </span><span class="si">{</span><span class="n">hdi</span><span class="p">[</span><span class="s">'Time'</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span><span class="n">hdi</span><span class="o">=</span><span class="s">'higher'</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> "</span>
      <span class="sa">f</span><span class="s">"meters per second (True: 20 m/s)"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>g for Earth: -10.29 to -9.71 meters (True: -9.81 m)
g for Mars: -3.85 to -3.68 meters (True: -3.72 m)
g for PlanetX: -6.18 to -5.90 meters (True: -6.0 m)
Initial velocity: 19.80 to 20.45 meters per second (True: 20 m/s)
</code></pre></div></div>

<p>We can see that we’re pretty close to recovering most of the parameters, but the fit isn’t great. Plotting the posteriors for \(g\) against the true values,</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">earth_posterior</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="n">planet_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"I(Time ** 2):Planet"</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span>
    <span class="p">{</span><span class="s">"I(Time ** 2):Planet_dim"</span><span class="p">:</span> <span class="s">"Earth"</span><span class="p">})</span>
<span class="n">planetx_posterior</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="n">planet_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"I(Time ** 2):Planet"</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span>
    <span class="p">{</span><span class="s">"I(Time ** 2):Planet_dim"</span><span class="p">:</span> <span class="s">"PlanetX"</span><span class="p">})</span>
<span class="n">mars_posterior</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="n">planet_fit</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"I(Time ** 2):Planet"</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span>
    <span class="p">{</span><span class="s">"I(Time ** 2):Planet_dim"</span><span class="p">:</span> <span class="s">"Mars"</span><span class="p">})</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">axs</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_posterior</span><span class="p">(</span><span class="n">earth_posterior</span><span class="p">,</span> <span class="n">ref_val</span><span class="o">=</span><span class="mf">9.81</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Posterior $g$ on Earth"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_posterior</span><span class="p">(</span><span class="n">mars_posterior</span><span class="p">,</span> <span class="n">ref_val</span><span class="o">=</span><span class="mf">3.72</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Posterior $g$ on Mars"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_posterior</span><span class="p">(</span><span class="n">planetx_posterior</span><span class="p">,</span> <span class="n">ref_val</span><span class="o">=</span><span class="mf">6.0</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Posterior $g$ on PlanetX"</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/blogimages/polynomial-regression/gravity-posteriors-no-prior.png" alt="Gravity posteriors without prior" /></p>

<p>The fit seems to work, more or less, but certainly could be improved.</p>

<h3 id="adding-a-prior">Adding a prior</h3>

<p>But, we can do better! We have a <a href="https://en.wikipedia.org/wiki/Gravity_of_Earth">very good idea of the acceleration due to gravity on Earth</a> and <a href="https://en.wikipedia.org/wiki/Gravity_of_Mars">Mars</a>, so why not use that information? From an experimental standpoint, we can treat those throws as calibration measurements, giving us information on the resolution of our detector and our throwing apparatus. With informative priors constraining the Earth and Mars gravity parameters, the model can more precisely estimate the unknown PlanetX gravity, since less uncertainty propagates from the calibration planets.</p>

<p>For Earth, \(g\) ranges from about 9.78 \(m\)/\(s^2\) at the Equator to 9.83 \(m\)/\(s^2\) at the Poles. So we can add a very strong prior,</p>

\[g_{\text{Earth}} \sim \text{Normal}(-9.81, 0.025)\]

<p>For Mars, we know the mean value is about 3.72 \(m\)/\(s^2\). There’s less information on local variation readily available by a cursory search, <em>however</em> we know that the radius of Mars is about half that of Earth, so \(\sigma = \frac{0.025}{2} = 0.0125\) might make sense, but to be conservative we’ll round that up to \(\sigma = 0.02\).</p>

\[g_{\text{Mars}} \sim \text{Normal}(-3.72, 0.02)\]

<p>For PlanetX, we must use a very loose prior. We know the ball took longer to fall than on Earth, but not as long as on Mars, so we can split the difference, then set a very wide \(\sigma\) value.</p>

\[g_{\text{PlanetX}} \sim \text{Normal}(\frac{-9.81 - 3.72}{2}, 3) = \text{Normal}(-6.77, 3)\]

<p>Since these correspond to \(g/2\), we’ll divide all values by 2 when putting them into Bambi. Additionally, we know the balls landed eventually, so \(g\) <em>must be</em> negative. We’ll truncate the upper limit of the distribution at 0.</p>

<p>Now, for defining this in Bambi, the term of interest is <code class="language-plaintext highlighter-rouge">I(Time ** 2):Planet</code>. Often you set one prior that applies to all groups; however, to set each group individually, you can pass a list to the <code class="language-plaintext highlighter-rouge">bmb.Prior</code> definition. <a href="https://github.com/bambinos/bambi/issues/778">The broadcasting rules from PyMC apply here</a>, so it could equivalently take a numpy array. Note that the priors are matched to groups alphabetically by group name.</p>
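<p>Because the list is matched positionally to the alphabetically-sorted category levels, it’s worth a quick sanity check that the order is what we expect:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Expect: Index(['Earth', 'Mars', 'PlanetX'], dtype='object')
print(df["Planet"].cat.categories)
</code></pre></div></div>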

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">priors</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"I(Time ** 2):Planet"</span><span class="p">:</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Prior</span><span class="p">(</span>
        <span class="s">"TruncatedNormal"</span><span class="p">,</span>
        <span class="n">mu</span><span class="o">=</span><span class="p">[</span>
            <span class="o">-</span><span class="mf">9.81</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span>  <span class="c1"># Earth
</span>            <span class="o">-</span><span class="mf">3.72</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span>  <span class="c1"># Mars
</span>            <span class="o">-</span><span class="mf">6.77</span><span class="o">/</span><span class="mi">2</span>   <span class="c1"># PlanetX
</span>        <span class="p">],</span>
        <span class="n">sigma</span><span class="o">=</span><span class="p">[</span>
            <span class="mf">0.025</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span>  <span class="c1"># Earth
</span>            <span class="mf">0.02</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span>   <span class="c1"># Mars
</span>            <span class="mi">3</span><span class="o">/</span><span class="mi">2</span>       <span class="c1"># PlanetX
</span>        <span class="p">],</span>
        <span class="n">upper</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
    <span class="p">)}</span>

<span class="n">planet_model_with_prior</span> <span class="o">=</span> <span class="n">bmb</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span>
    <span class="s">'Height ~ I(Time**2):Planet + Time + 0'</span><span class="p">,</span>
    <span class="n">df</span><span class="p">,</span>
    <span class="n">priors</span><span class="o">=</span><span class="n">priors</span>
<span class="p">)</span>

<span class="n">planet_model_with_prior</span><span class="p">.</span><span class="n">build</span><span class="p">()</span>
<span class="n">idata</span> <span class="o">=</span> <span class="n">planet_model_with_prior</span><span class="p">.</span><span class="n">prior_predictive</span><span class="p">()</span>
<span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">idata</span><span class="p">.</span><span class="n">prior</span><span class="p">,</span> <span class="n">kind</span><span class="o">=</span><span class="s">"stats"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                 mean       sd   hdi_3%  hdi_97%
sigma                          14.466   13.809    0.025   36.595
I(Time ** 2):Planet[Earth]     -4.905    0.012   -4.928   -4.883
I(Time ** 2):Planet[Mars]      -1.860    0.010   -1.880   -1.841
I(Time ** 2):Planet[PlanetX]   -3.622    1.509   -6.360   -0.915
Time                            0.520   14.788  -26.565   27.992
</code></pre></div></div>

<p>Here we’ve sampled the prior predictive and can see that our priors are correctly assigned to the associated planets.</p>

<p>Next we fit the model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">planet_fit_with_prior</span> <span class="o">=</span> <span class="n">planet_model_with_prior</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span>
    <span class="n">chains</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">idata_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"log_likelihood"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">SEED</span>
<span class="p">)</span>
<span class="n">planet_model_with_prior</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">planet_fit_with_prior</span><span class="p">,</span> <span class="n">kind</span><span class="o">=</span><span class="s">"pps"</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">planet_fit_with_prior</span><span class="p">)[</span><span class="mi">0</span><span class="p">:</span><span class="mi">5</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
sigma                          1.759  0.142   1.495    2.024      0.002    0.002    3333.0    2373.0    1.0
I(Time ** 2):Planet[Earth]    -4.907  0.012  -4.929   -4.884      0.000    0.000    4360.0    2943.0    1.0
I(Time ** 2):Planet[Mars]     -1.862  0.009  -1.879   -1.847      0.000    0.000    2054.0    2614.0    1.0
I(Time ** 2):Planet[PlanetX]  -2.985  0.023  -3.025   -2.940      0.000    0.000    2282.0    2772.0    1.0
Time                          19.960  0.075  19.827   20.103      0.002    0.001    2025.0    2249.0    1.0
</code></pre></div></div>

<p>We see some improvements here! Off the cuff, these look better: the \(v_0\) coefficient on <code class="language-plaintext highlighter-rouge">Time</code> now covers the true value of 20 m/s.</p>

<p>Now taking a look at the effects before and after adding the prior on the gravities,</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">earth_posterior_2</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="n">planet_fit_with_prior</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"I(Time ** 2):Planet"</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span>
    <span class="p">{</span><span class="s">"I(Time ** 2):Planet_dim"</span><span class="p">:</span> <span class="s">"Earth"</span><span class="p">})</span>
<span class="n">mars_posterior_2</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="n">planet_fit_with_prior</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"I(Time ** 2):Planet"</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span>
    <span class="p">{</span><span class="s">"I(Time ** 2):Planet_dim"</span><span class="p">:</span> <span class="s">"Mars"</span><span class="p">})</span>
<span class="n">planetx_posterior_2</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="n">planet_fit_with_prior</span><span class="p">.</span><span class="n">posterior</span><span class="p">[</span><span class="s">"I(Time ** 2):Planet"</span><span class="p">].</span><span class="n">sel</span><span class="p">(</span>
    <span class="p">{</span><span class="s">"I(Time ** 2):Planet_dim"</span><span class="p">:</span> <span class="s">"PlanetX"</span><span class="p">})</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">axs</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span> <span class="n">sharex</span><span class="o">=</span><span class="s">'col'</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_posterior</span><span class="p">(</span><span class="n">earth_posterior</span><span class="p">,</span> <span class="n">ref_val</span><span class="o">=</span><span class="mf">9.81</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Earth $g$ - No Prior"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_posterior</span><span class="p">(</span><span class="n">mars_posterior</span><span class="p">,</span> <span class="n">ref_val</span><span class="o">=</span><span class="mf">3.72</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Mars $g$ - No Prior"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_posterior</span><span class="p">(</span><span class="n">planetx_posterior</span><span class="p">,</span> <span class="n">ref_val</span><span class="o">=</span><span class="mf">6.0</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"PlanetX $g$ - No Prior"</span><span class="p">)</span>

<span class="n">az</span><span class="p">.</span><span class="n">plot_posterior</span><span class="p">(</span><span class="n">earth_posterior_2</span><span class="p">,</span> <span class="n">ref_val</span><span class="o">=</span><span class="mf">9.81</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Earth $g$ - Priors Used"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_posterior</span><span class="p">(</span><span class="n">mars_posterior_2</span><span class="p">,</span> <span class="n">ref_val</span><span class="o">=</span><span class="mf">3.72</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Mars $g$ - Priors Used"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_posterior</span><span class="p">(</span><span class="n">planetx_posterior_2</span><span class="p">,</span> <span class="n">ref_val</span><span class="o">=</span><span class="mf">6.0</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"PlanetX $g$ - Priors Used"</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/blogimages/polynomial-regression/gravity-posteriors-comparison.png" alt="Gravity posteriors comparison" /></p>

<p>Adding the prior gives smaller uncertainties for Earth and Mars by design; however, the estimate for PlanetX has also improved by injecting our knowledge into the model.</p>]]></content><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><category term="Statistics" /><category term="python" /><category term="bayesian" /><category term="bambi" /><category term="statistics" /><category term="regression" /><summary type="html"><![CDATA[Overview of polynomial regression using Bambi, through projectile motion and fictitious planets]]></summary></entry><entry><title type="html">linear-term - a TUI for Linear</title><link href="https://tylerjamesburch.com/blog/misc/linear-term" rel="alternate" type="text/html" title="linear-term - a TUI for Linear" /><published>2026-01-26T00:00:00+00:00</published><updated>2026-01-26T00:00:00+00:00</updated><id>https://tylerjamesburch.com/blog/misc/linear-term</id><content type="html" xml:base="https://tylerjamesburch.com/blog/misc/linear-term"><![CDATA[<h1 id="background">Background</h1>

<p>A little over a decade ago, toward the start of my Ph.D., I started programming for real work for the first time. I was doing data analysis in C++ using <a href="https://en.wikipedia.org/wiki/ROOT">ROOT</a> (and yes, data analysis in C++ is as awful as it sounds). At the time, my advisor was the first person to introduce me to a terminal and to emacs. I’m pretty sure when I wasn’t looking, he added <code class="language-plaintext highlighter-rouge">alias emacs='emacs -nw'</code> just so I wouldn’t even have to use the emacs GUI.</p>

<p>In 2018, I switched to VSCode. The integrated terminal made it feel like a one-stop-shop - everything in one program, no context switching, plus a rich editing experience. But I’ve always missed parts of a terminal-only workflow, and lately I’ve been drifting back to it.</p>

<p>This has made me a sucker for terminal-based tooling. One friction point I noticed recently: there isn’t a good terminal user interface (TUI) for <a href="https://linear.app">Linear</a> project management, which forces me to context-switch into the native app to log progress, comment on issues, etc. So this past weekend I hacked together my own.</p>

<h1 id="linear-term">linear-term</h1>

<p><code class="language-plaintext highlighter-rouge">linear-term</code> is the TUI I put together (repo <a href="https://github.com/tjburch/linear-term">here</a>). It was built using <code class="language-plaintext highlighter-rouge">textual</code> in python. The original design was intended to look similar to the native app, but within the terminal.</p>

<p><img src="/blogimages/linear-term/main-view.png" alt="Main view" /></p>

<p>It’s a 3-panel layout: the center shows your issues, the right panel shows issue details once selected, and the left panel has filtering options. You can cycle through the panels via <code class="language-plaintext highlighter-rouge">TAB</code>, or jump to one directly with <code class="language-plaintext highlighter-rouge">F1</code>, <code class="language-plaintext highlighter-rouge">F2</code>, <code class="language-plaintext highlighter-rouge">F3</code>.</p>
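
<p>For a sense of how little code <code class="language-plaintext highlighter-rouge">textual</code> needs for this kind of layout, here’s a minimal three-panel sketch. This is a toy under my own naming, not the actual <code class="language-plaintext highlighter-rouge">linear-term</code> source:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from textual.app import App, ComposeResult
from textual.containers import Horizontal
from textual.widgets import Footer, Header, Static

class MiniBoard(App):
    """Toy three-panel layout in the spirit of linear-term."""

    BINDINGS = [("q", "quit", "Quit")]

    def compose(self) -&gt; ComposeResult:
        yield Header()
        with Horizontal():
            yield Static("filters")   # left panel
            yield Static("issues")    # center panel
            yield Static("details")   # right panel
        yield Footer()

if __name__ == "__main__":
    MiniBoard().run()
</code></pre></div></div>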

<p>I also created a kanban board view, accessible via <code class="language-plaintext highlighter-rouge">b</code>, where you can look at issues by their status.</p>

<p><img src="/blogimages/linear-term/kanban-view.png" alt="Kanban board view" /></p>

<p>I also added some CLI tools. There are existing Linear CLIs out there, but I wanted this to be enough of a one-stop-shop that you didn’t have to install a bunch of other tools.</p>

<p>For example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>linear-term list <span class="nt">--mine</span>
TJB-1 <span class="o">[</span>Backlog] <span class="nt">---</span> Get familiar with Linear @Tyler Burch
TJB-4 <span class="o">[</span>Done] <span class="nt">---</span> Import your data @Tyler Burch
TJB-3 <span class="o">[</span>In Progress] <span class="nt">---</span> Connect your tools @Tyler Burch
TJB-2 <span class="o">[</span>Todo] <span class="nt">---</span> Set up your teams @Tyler Burch
</code></pre></div></div>

<p>I mainly put this together for my own use, but please feel free to use it if you’re interested. Happy to hear feedback, and to take contributions too.</p>]]></content><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><category term="Misc" /><category term="python" /><category term="tools" /><category term="cli" /><category term="terminal" /><summary type="html"><![CDATA[A terminal user interface for Linear project management]]></summary></entry><entry><title type="html">2025 Year in Review</title><link href="https://tylerjamesburch.com/blog/personal/year-in-review" rel="alternate" type="text/html" title="2025 Year in Review" /><published>2025-12-31T00:00:00+00:00</published><updated>2025-12-31T00:00:00+00:00</updated><id>https://tylerjamesburch.com/blog/personal/year-in-review</id><content type="html" xml:base="https://tylerjamesburch.com/blog/personal/year-in-review"><![CDATA[<p>In case anyone is keeping score, it’s been a while since I’ve posted anything to this blog (assuming, like most people, you consider 2 calendar years to be a long time). I’m going to chalk it up to being <em>really</em> locked in.</p>

<p>Several times this year, I’ve had an idea for a post that never came to fruition. I figured a good way to wrap up the year would be to collect some of those loose threads into a single lightning-round collection of thoughts.</p>

<p>So, without further ado, here’s what I did in 2025:</p>

<hr />

<h3 id="i-wrote-a-lot-of-r">I wrote a lot of R</h3>

<p>A little over a year ago, I switched groups within our analytics team, and now have a strong requirement to use R for building projects. As a result, 2025 was the first year where I wrote more lines of R than any other language.</p>

<p>A few reflections after a year of working in R full-time:</p>

<ol>
  <li>
    <p>I used to advise people who wanted to break into baseball analytics that it didn’t matter what language they learned, just pick R or Python and learn as much as you can. If a job requires you to switch, you can learn the other on the job, but prioritize knowledge depth while learning. This year has reinforced that opinion; good design patterns are ubiquitous, and learning on the fly has not been arduous. Living in a world with LLMs helps too (more on that later).</p>
  </li>
  <li>
    <p>That being said, I’m now of the opinion that it’s near impossible to be a proper statistician and not engage with <em>some</em> R. There are several non-negotiable statistics libraries that plainly don’t have reasonable non-R alternatives (cough <code class="language-plaintext highlighter-rouge">mgcv</code> cough). Certainly, there are ways around it if you go the Python route, but often it’s fitting a square peg in a round hole.</p>
  </li>
  <li>
    <p>I still prefer working in Python, especially for production code. R has so many idiosyncrasies that still frustrate me. The pain point that continues to plague me more than any is silent failures and tragically uninformative error messages. (Seriously, what does <code class="language-plaintext highlighter-rouge">object of type 'closure' is not subsettable</code> even mean?)</p>
  </li>
</ol>

<hr />

<h3 id="i-made-some-small-contributions-to-tidymodels">I made some small contributions to tidymodels</h3>

<p>Speaking of R, after some initial skepticism, I’ve grown pretty fond of the <code class="language-plaintext highlighter-rouge">tidymodels</code> framework.</p>

<div style="display: flex; justify-content: center;">
<blockquote class="bluesky-embed" data-bluesky-uri="at://did:plc:ct7boh6ncbhxjuyrebejsov2/app.bsky.feed.post/3ljr5567d2224" data-bluesky-cid="bafyreibpl3csknoxqosnv3zphztxqbgsqkwy2l62kefh7qey5uvktafowi">
<p lang="en">I'm here to walk back this take after 3 months. The upfront pain provides a lot of really good guardrails against doing really stupid statistical malpractice, and also makes downstream stuff (e.g. model tuning) trivially easy.</p>
&mdash; <a href="https://bsky.app/profile/did:plc:ct7boh6ncbhxjuyrebejsov2?ref_src=embed">Tyler Burch (@tylerjamesburch.com)</a> <a href="https://bsky.app/profile/did:plc:ct7boh6ncbhxjuyrebejsov2/post/3ljr5567d2224?ref_src=embed">March 7, 2025</a>
</blockquote>
</div>
<script async="" src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>

<p>Following my personal policy of helping out the libraries that help me, I made a couple of small PRs to <code class="language-plaintext highlighter-rouge">tidymodels</code>, probably the most interesting being <a href="https://github.com/tidymodels/tune/pull/1007">adding fold weighting to the <code class="language-plaintext highlighter-rouge">tune</code> package</a>.</p>

<p>For context, <code class="language-plaintext highlighter-rouge">tune</code> handles hyperparameter tuning within <code class="language-plaintext highlighter-rouge">tidymodels</code>. For each candidate hyperparameter set, you evaluate performance across resampling folds and take the set with the best average performance. Prior to this change, each fold contributed equally to that average, regardless of how much data it represented. That assumption is fine in a classic K-Fold setup, but if you switch to something like expanding window CV where folds have variable sizes, it seemed to miss the mark.</p>

<p>With this change, folds can now be weighted (naturally by training set size in the expanding window example) so later, more informative folds carry more influence when selecting hyperparameters. It’s a very small tweak, but it fixed a real issue I ran into on a project.</p>
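
<p>For intuition, here’s a minimal sketch of the weighting idea in Python rather than R. The fold values and sizes are made up, and this is not the <code class="language-plaintext highlighter-rouge">tune</code> implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Hypothetical per-fold RMSE from an expanding-window CV,
# alongside each fold's training-set size
fold_rmse = np.array([0.92, 0.88, 0.85, 0.81])
train_sizes = np.array([100, 200, 300, 400])

unweighted = fold_rmse.mean()                          # every fold counts equally
weighted = np.average(fold_rmse, weights=train_sizes)  # bigger folds count more
print(round(unweighted, 3), round(weighted, 3))        # 0.865 0.847
</code></pre></div></div>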

<hr />

<h3 id="ive-thought-a-lot-about-forecasting-problems">I’ve thought a lot about forecasting problems</h3>

<p>Related to the above, more than before, I’ve found myself approaching problems through a forecasting lens. The largest project I worked on this year wasn’t explicitly a forecasting task, but required conditioning on time due to distributional shift.</p>

<p>One of the clearest implications was for cross-validation. I have increasingly used rolling or expanding window setups whenever time could even plausibly matter, even if the problem isn’t explicitly framed as forecasting.</p>

<p>Lately, I’ve found myself defaulting to the assumption that time is relevant unless proven otherwise. Instead of asking whether I should use time-aware cross-validation, I enter with the posture that I need to be convinced it’s safe not to. Stationarity assumptions are helpful when true, but frequently break down in real-world problems, and ignoring that can lead to misleading performance estimates. Perhaps it’s a bit overcautious, but a little extra care here has made me more comfortable with the results I’m delivering.</p>
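
<p>As a concrete example of that posture, here’s a minimal expanding-window setup using scikit-learn (toy data, and Python rather than R):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy time-ordered data
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Each split trains on an expanding window and tests on the next
# block, so validation never peeks at the future
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print(f"train 0-{train_idx[-1]}, test {test_idx[0]}-{test_idx[-1]}")
</code></pre></div></div>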

<hr />

<h3 id="i-used-a-lot-of-tokens">I used a lot of tokens</h3>

<p>Agentic AI was impossible to ignore in 2025. I’ve long been bought in on codebase-aware LLMs. I find copy/pasting from ChatGPT both painfully slow and prone to stripping useful information. Because of this, I was a pretty early adopter of Cursor; I started using it in August of 2023. The tab complete was enough for me to buy in, and when agent-based edits came along, it became an important part of my workflow.</p>

<p>These days I’m using some combination of Cursor, Codex, and Claude Code for scaffolding boilerplate, generating prototypes (especially front-ends), quickly testing hypotheses, and making publication-quality plots faster than I could myself. The domain of tasks where it’s faster to prompt than to tweak, dissect, debug, etc. has grown way faster than I expected.</p>

<p>I don’t have any novel insights in this domain that haven’t been said elsewhere. The advice that I think about most day-to-day in my workflow is:</p>

<ul>
  <li><strong>Be meticulous about context</strong>. Keep context window usage under 50% if possible. The smaller the haystack, the easier the needle is to find.</li>
  <li><strong>Spend time on prompting well</strong>. An extra 5 minutes on a good prompt can save an hour of debugging.</li>
  <li><strong>Be judicious about correctness</strong>. One “wrong” line in the context window can yield hundreds of bad lines of code, or subtle unexpected bugs. Clear, correct prompts are key.</li>
<li><strong>Let the tool actually run code and see output.</strong> Agents are better about this in other languages, but I find that, particularly in R, they have to be pushed to do it. Add logging statements so the tool can iterate with itself and find bugs.</li>
</ul>

<p>One thing I have noticed is that most of the public conversation around LLM tooling is focused on traditional software engineering, where specs are far clearer than in data analysis workflows. The projects I work on are fuzzier, and the road from question to answer (or from idea to predictions) is not typically well-defined from the outset. I haven’t seen nearly as much written about what works well in this environment, and am still figuring it out day by day for myself.</p>

<hr />

<h3 id="i-loosened-my-grip-on-statistical-dogma">I loosened my grip on statistical dogma</h3>

<p>My first experience reading an honest stats text was McElreath’s <em>Statistical Rethinking</em> back in 2020 (10/10 recommend). Much of my early experience doing statistics was through a decidedly Bayesian lens, and I used to feel pretty strongly that this was the best way to do things when possible.</p>

<p>In 2025, I let go of a lot of those biases. At the end of the day, I’m a practitioner; I need answers to problems. I’m not debating or writing about the philosophy of statistics, and in many of the problems I work on trying to wedge Bayesian inference into the solution can be a hindrance more than a value add.</p>

<p>If a frequentist framework can get me to effectively the same answer more quickly, I’ve become much more comfortable using it. In a lot of applied settings, the difference between a Bayesian posterior and a well-validated frequentist estimate is functionally negligible, while the difference in iteration speed isn’t. The Bayesian paradigm still shapes how I think about uncertainty, but I don’t reach for that machinery unless it provides something that will actually be used and I’m willing to wait for MCMC chains to finish sampling before I can give an answer.</p>

<p>One of my repeated lines throughout this year has been “just use the right tool for the job,” which could even be a hidden thread behind this post. For me, the “right” tool is the one that answers the real question under real constraints and produces results that stakeholders can understand and act on. In practice, that’s often the approach that generates the most business value (or in my case, wins the most games) as quickly and correctly as possible.</p>

<hr />

<h3 id="i-changed-a-lot-of-diapers">I changed a lot of diapers</h3>

<p>Above all else, I welcomed a second child into the world in May. At the time of writing this, I’m parenting a teething 7-month-old who gives the best smiles, as well as a curious and fiercely independent daughter who turned 3 yesterday.</p>

<p>I have a tendency to value myself entirely based on my work and the things I produce. Fatherhood is a constant forcing function to get out of that mindset and to enjoy life outside of sheer production. While it has exhausting moments, it has made me appreciate the day-to-day moments so much more.</p>

<p>Right now I’m most appreciating the sense of wonder from my toddler. I dread air travel, airports are awful, logistics are a nightmare, I could go on for hours. But hearing my daughter say “I’m so excited” while getting on a plane, and watching her stare with awe out the window during takeoff, has made me stop for a few moments and appreciate how cool life really can be. There are countless places where getting a chance to look through her lens has made me far more appreciative of the little things in life.</p>

<div style="display: flex; justify-content: center;">
<img src="/blogimages/2025_postseason.png" style="width: 60%;" />
</div>

<p>See you soon, hopefully before another 2 years go by, but no promises.</p>]]></content><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><category term="Personal" /><category term="R" /><category term="statistics" /><category term="forecasting" /><category term="LLM" /><category term="personal" /><summary type="html"><![CDATA[A lightning-round collection of loose threads from 2025]]></summary></entry><entry><title type="html">2023 Reading List</title><link href="https://tylerjamesburch.com/blog/personal/reading-list" rel="alternate" type="text/html" title="2023 Reading List" /><published>2023-12-29T00:00:00+00:00</published><updated>2023-12-29T00:00:00+00:00</updated><id>https://tylerjamesburch.com/blog/personal/reading-list</id><content type="html" xml:base="https://tylerjamesburch.com/blog/personal/reading-list"><![CDATA[<h2 id="books-read-in-2023">Books Read in 2023:</h2>

<ul>
  <li><em>How Not to Be Wrong: The Power of Mathematical Thinking</em> - Jordan Ellenberg</li>
  <li><em>Fluent in 3 Months: How Anyone at Any Age Can Learn to Speak Any Language from Anywhere in the World</em> - Benny Lewis</li>
  <li><em>Zak George’s Dog Training Revolution</em> - Zak George</li>
  <li><em>Winning Fixes Everything: How Baseball’s Brightest Minds Created Sports’ Biggest Mess</em> - Evan Drellich</li>
  <li><em>Heaven and Hell: A History of the Afterlife</em> - Bart D. Ehrman</li>
  <li><em>My Life as a Quant: Reflections on Physics and Finance</em> - Emanuel Derman</li>
  <li><em>The Checklist Manifesto: How to Get Things Right</em> - Atul Gawande</li>
  <li><em>The Big Short: Inside the Doomsday Machine</em> - Michael Lewis</li>
</ul>]]></content><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><category term="Personal" /><category term="books" /><category term="reading" /><summary type="html"><![CDATA[Books I read in 2023]]></summary></entry><entry><title type="html">2023 NHL Playoff Predictions</title><link href="https://tylerjamesburch.com/blog/statistics/nhl-predictions" rel="alternate" type="text/html" title="2023 NHL Playoff Predictions" /><published>2023-04-30T00:00:00+00:00</published><updated>2023-04-30T00:00:00+00:00</updated><id>https://tylerjamesburch.com/blog/statistics/nhl-predictions</id><content type="html" xml:base="https://tylerjamesburch.com/blog/statistics/nhl-predictions"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<h2 id="background">Background</h2>

<p>Earlier this NHL season, I posted <a href="http://tylerjamesburch.com/blog/statistics/hockey-bayes">a Bayesian hierarchical model for NHL scoring</a>, aiming to understand the skill of the Bruins based on their first 21 games (in which they went 18-3). That model has since been expanded to better reflect how NHL games actually work (specifically the overtime structure), fit on all of the 2022-23 season’s data to estimate the goal creation and suppression parameters for each team, and then used to project the remainder of the playoffs, which can be found <a href="http://nhl-projections.tylerjamesburch.com">here</a>.</p>

<h2 id="methodology">Methodology</h2>

<h3 id="original-model">Original Model</h3>

<p>The base of the model remains unchanged from the <a href="https://discovery.ucl.ac.uk/id/eprint/16040/1/16040.pdf">Baio and Blangiardo</a> model. For regulation scoring, I fit the following model for goal scoring, \(y = (y_{g0}, y_{g1})\), as a Poisson process:</p>

\[y_{gj} | \theta_{gj} \sim \text{Poisson}(\theta_{gj})\]

<p>for game \(g\), with \(j \in \{0, 1\}\) indexing home ice (\(j = 0\) for the home team, \(j = 1\) for the visitor). In their paper, the rate parameter \(\theta_{gj}\) is given by the following:</p>

\[\log \theta_{g0} = \alpha + h + a_{hg} - d_{vg}\]

\[\log \theta_{g1} = \alpha + a_{vg} - d_{hg}\]

<p>Where \(h\) is the home ice advantage, \(a\) is the “attack strength,” \(d\) is the “defense strength,” and the subscripts \(h\) and \(v\) denote “home” and “visitor” respectively. Last, \(\alpha\) is a flat intercept. In words: the home scoring rate scales with the attack skill of the home team, minus the defense of the away team, plus home ice advantage. The away scoring rate is analogous, with no advantage.</p>
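
<p>To make that structure concrete, here’s a small numeric sketch; the parameter values are made up for illustration, not fitted estimates:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Illustrative values only
alpha, h = 1.0, 0.1      # intercept, home-ice advantage
a_h, d_h = 0.15, 0.05    # home team attack / defense strength
a_v, d_v = 0.05, 0.10    # visiting team attack / defense strength

theta_home = np.exp(alpha + h + a_h - d_v)  # ~3.16 expected home goals
theta_away = np.exp(alpha + a_v - d_h)      # ~2.72 expected away goals
print(theta_home, theta_away)
</code></pre></div></div>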

<h3 id="overtime">Overtime</h3>

<p>However, some crucial updates have been made, specifically accounting for the overtime mechanics in hockey. For games that reach overtime, the \(\theta\) parameters are scaled down to the relevant time frame:</p>

\[\theta_{h,o} = \theta_h \times \frac{1}{K},\]

\[\theta_{a,o} = \theta_a \times \frac{1}{K}.\]

<p>\(K\) represents the scaling factor for overtime expectations. For the regular season, \(K = 12\), since overtime is 5 minutes, one-twelfth of a 60-minute game. In playoff games, \(K\) is set to 3, as each 20-minute overtime period is one-third of a regulation game. This assumption implies that the goal creation and suppression parameters remain the same during overtime, a necessary compromise given the relatively small dataset of OT games.</p>

<p>I also introduce a custom likelihood function, which compares the observed home and away overtime goals to the expected overtime goal rates (<code class="language-plaintext highlighter-rouge">ot_home_theta</code> and <code class="language-plaintext highlighter-rouge">ot_away_theta</code>). This function is applied only to games that went into overtime. It allows only 3 possible outcomes:</p>

<ol>
  <li>No goals scored by either team: (0, 0)</li>
  <li>Home team scores 1 goal and away team scores 0 goals: (1, 0)</li>
  <li>Home team scores 0 goals and away team scores 1 goal: (0, 1)</li>
</ol>

<p>For each allowed outcome \((h_g, a_g)\), we calculate the log-likelihood of the observed home and away overtime goals as follows:</p>

\[\text{loglikelihood}_{h_g, a_g} = \log P(y_{h,ot} | \theta_{h,ot}, h_g) + \log P(y_{a,ot} | \theta_{a,ot}, a_g)\]

<p>Here, \(P(y_{h,ot})\) and \(P(y_{a,ot})\) represent the probabilities of observing the home and away overtime goals, given their respective expected goal rates and the allowed outcome. And recall, these are Poisson distributed.</p>

\[y_{h,ot} \mid \theta_{h,ot}, h_g \sim \text{Poisson}(\theta_{h,ot} \cdot h_g)\]

\[y_{a,ot} \mid \theta_{a,ot}, a_g \sim \text{Poisson}(\theta_{a,ot} \cdot a_g)\]

<p>Then, for the custom likelihood function, the log-sum-exp of the log-likelihoods is taken:</p>

\[\text{OT goals likelihood} = \log \left(\sum_{(h_g, a_g) \in \text{outcomes}} \exp\left({\text{loglikelihood}_{h_g, a_g}}\right)\right)\]

<p>Here, \(\exp(\text{loglikelihood}_{h_g, a_g})\) simply represents the likelihood of observing the home and away overtime goals, given their respective expected goal rates and the allowed outcome.</p>

<h3 id="shootouts">Shootouts</h3>

<p>For regular season games, if the score is still tied after overtime, a shootout model is then introduced, modeling the probability of the home team winning the shootout with a familiar logistic regression. I introduce team-specific coefficients for shootout success and failure, denoted by \(so_o\) (success) and \(so_d\) (failure), as well as an intercept term \(so_i\) and a home advantage term \(so_{adv}\). Then, we calculate the probability of the home team winning the shootout using the logistic function:</p>

\[\text{logit}(so_{P_\text{home}}) = so_i + (so_{o,h} - so_{o,a}) + (so_{d,h} - so_{d,a}) + so_{adv} \cdot h_i\]

<p>Finally, we model the <code class="language-plaintext highlighter-rouge">shootout_winner</code> variable as a Bernoulli random variable with probability \(so_{P_\text{home}}\):</p>

\[\text{shootout winner} \sim \text{Bernoulli}(so_{P_\text{home}})\]

<p>This shootout model is conditioned only on games that went to a shootout.</p>

<p>Frankly, I believe all this is far more transparent with code. So without further ado,</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">def</span> <span class="nf">overtime_goals_likelihood</span><span class="p">(</span><span class="n">observed_ot_h_goals</span><span class="p">,</span> <span class="n">observed_ot_a_goals</span><span class="p">,</span> <span class="n">ot_h_theta</span><span class="p">,</span> <span class="n">ot_a_theta</span><span class="p">):</span>
    <span class="n">allowed_outcomes</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span>
    <span class="n">likelihoods</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">h_goals</span><span class="p">,</span> <span class="n">a_goals</span> <span class="ow">in</span> <span class="n">allowed_outcomes</span><span class="p">:</span>
        <span class="n">h_likelihood</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">logp</span><span class="p">(</span><span class="n">pm</span><span class="p">.</span><span class="n">Poisson</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">ot_h_theta</span><span class="p">),</span> <span class="n">observed_ot_h_goals</span> <span class="o">*</span> <span class="n">h_goals</span><span class="p">)</span>
        <span class="n">a_likelihood</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">logp</span><span class="p">(</span><span class="n">pm</span><span class="p">.</span><span class="n">Poisson</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">ot_a_theta</span><span class="p">),</span> <span class="n">observed_ot_a_goals</span> <span class="o">*</span> <span class="n">a_goals</span><span class="p">)</span>
        <span class="n">likelihoods</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">h_likelihood</span> <span class="o">+</span> <span class="n">a_likelihood</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">logsumexp</span><span class="p">(</span><span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">likelihoods</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

    
<span class="n">home_idx</span><span class="p">,</span> <span class="n">teams</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">factorize</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">"home_team"</span><span class="p">],</span> <span class="n">sort</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">away_idx</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">factorize</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">"away_team"</span><span class="p">],</span> <span class="n">sort</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">coords</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"team"</span><span class="p">:</span> <span class="n">teams</span><span class="p">,</span>
    <span class="s">"match"</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)),</span>
<span class="p">}</span>

<span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="n">coords</span><span class="o">=</span><span class="n">coords</span><span class="p">)</span> <span class="k">as</span> <span class="n">model</span><span class="p">:</span>
    <span class="c1"># Global model parameters
</span>    <span class="n">intercept</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"intercept"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
    <span class="n">home</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"home"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>

    <span class="c1"># Hyperpriors for attacks and defs
</span>    <span class="n">sd_att</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">HalfCauchy</span><span class="p">(</span><span class="s">"sd_att"</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">)</span>
    <span class="n">sd_def</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">HalfCauchy</span><span class="p">(</span><span class="s">"sd_def"</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">)</span>

    <span class="c1"># Team-specific model parameters
</span>    <span class="n">atts_star</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"atts_star"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="n">sd_att</span><span class="p">,</span> <span class="n">dims</span><span class="o">=</span><span class="s">"team"</span><span class="p">)</span>
    <span class="n">defs_star</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"defs_star"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="n">sd_def</span><span class="p">,</span> <span class="n">dims</span><span class="o">=</span><span class="s">"team"</span><span class="p">)</span>

    <span class="c1"># Demeaned team-specific parameters
</span>    <span class="n">atts</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Deterministic</span><span class="p">(</span><span class="s">"atts"</span><span class="p">,</span> <span class="n">atts_star</span> <span class="o">-</span> <span class="n">at</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">atts_star</span><span class="p">),</span> <span class="n">dims</span><span class="o">=</span><span class="s">"team"</span><span class="p">)</span>
    <span class="n">defs</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Deterministic</span><span class="p">(</span><span class="s">"defs"</span><span class="p">,</span> <span class="n">defs_star</span> <span class="o">-</span> <span class="n">at</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">defs_star</span><span class="p">),</span> <span class="n">dims</span><span class="o">=</span><span class="s">"team"</span><span class="p">)</span>

    <span class="c1"># Expected goals for home and away teams during regulation
</span>    <span class="n">home_theta</span> <span class="o">=</span> <span class="n">at</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">intercept</span> <span class="o">+</span> <span class="n">home</span> <span class="o">+</span> <span class="n">atts</span><span class="p">[</span><span class="n">home_idx</span><span class="p">]</span> <span class="o">-</span> <span class="n">defs</span><span class="p">[</span><span class="n">away_idx</span><span class="p">])</span>
    <span class="n">away_theta</span> <span class="o">=</span> <span class="n">at</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">intercept</span> <span class="o">+</span> <span class="n">atts</span><span class="p">[</span><span class="n">away_idx</span><span class="p">]</span> <span class="o">-</span> <span class="n">defs</span><span class="p">[</span><span class="n">home_idx</span><span class="p">])</span>

    <span class="c1"># Likelihood (Poisson distribution) for regulation goals
</span>    <span class="n">home_points</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Poisson</span><span class="p">(</span><span class="s">"home_points"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="n">home_theta</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">'home_goals'</span><span class="p">],</span> <span class="n">dims</span><span class="o">=</span><span class="s">"match"</span><span class="p">)</span>
    <span class="n">away_points</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Poisson</span><span class="p">(</span><span class="s">"away_points"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="n">away_theta</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">'away_goals'</span><span class="p">],</span> <span class="n">dims</span><span class="o">=</span><span class="s">"match"</span><span class="p">)</span>

    <span class="c1"># Overtime and shootout deterministics
</span>    <span class="n">overtime</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">'home_goals'</span><span class="p">]</span> <span class="o">==</span> <span class="n">data</span><span class="p">[</span><span class="s">'away_goals'</span><span class="p">]</span>
    <span class="n">shootout</span> <span class="o">=</span> <span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'home_goals_ot'</span><span class="p">]</span> <span class="o">==</span> <span class="n">data</span><span class="p">[</span><span class="s">'away_goals_ot'</span><span class="p">])</span> <span class="o">&amp;</span> <span class="n">overtime</span>

    <span class="c1"># Expected goals for home and away teams during overtime (scaled down by 1/12)
</span>    <span class="n">ot_home_theta</span> <span class="o">=</span> <span class="n">home_theta</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">/</span> <span class="mi">12</span><span class="p">)</span>
    <span class="n">ot_away_theta</span> <span class="o">=</span> <span class="n">away_theta</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">/</span> <span class="mi">12</span><span class="p">)</span>

    <span class="c1"># Likelihood (custom likelihood function) for overtime goals
</span>    <span class="k">if</span> <span class="n">overtime</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">pm</span><span class="p">.</span><span class="n">Potential</span><span class="p">(</span><span class="s">"ot_goals_constraint"</span><span class="p">,</span>
                    <span class="n">overtime_goals_likelihood</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">home_goals_ot</span><span class="p">,</span> <span class="n">data</span><span class="p">.</span><span class="n">away_goals_ot</span><span class="p">,</span> <span class="n">ot_home_theta</span><span class="p">,</span> <span class="n">ot_away_theta</span><span class="p">))</span>

    <span class="c1"># Shootout model (conditioned on games that went to shootout)
</span>    <span class="n">so_coeff_o</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"so_coeff_o"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">dims</span><span class="o">=</span><span class="s">"team"</span><span class="p">)</span>  <span class="c1"># Offensive shootout coefficient
</span>    <span class="n">so_coeff_d</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"so_coeff_d"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">dims</span><span class="o">=</span><span class="s">"team"</span><span class="p">)</span>  <span class="c1"># Defensive shootout coefficient
</span>    <span class="n">so_coeff_h</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"so_coeff_h"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># Home advantage coefficient
</span>    <span class="n">so_intercept</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"so_intercept"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># Intercept term
</span>
    <span class="n">so_logit</span> <span class="o">=</span> <span class="p">(</span><span class="n">so_intercept</span> <span class="o">+</span>
                <span class="n">so_coeff_o</span><span class="p">[</span><span class="n">home_idx</span><span class="p">[</span><span class="n">shootout</span><span class="p">]]</span> <span class="o">-</span> <span class="n">so_coeff_o</span><span class="p">[</span><span class="n">away_idx</span><span class="p">[</span><span class="n">shootout</span><span class="p">]]</span> <span class="o">+</span>
                <span class="n">so_coeff_d</span><span class="p">[</span><span class="n">home_idx</span><span class="p">[</span><span class="n">shootout</span><span class="p">]]</span> <span class="o">-</span> <span class="n">so_coeff_d</span><span class="p">[</span><span class="n">away_idx</span><span class="p">[</span><span class="n">shootout</span><span class="p">]]</span> <span class="o">+</span>
                <span class="n">so_coeff_h</span> <span class="o">*</span> <span class="n">home</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">shootout</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">so_prob</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">invlogit</span><span class="p">(</span><span class="n">so_logit</span><span class="p">)</span>
        <span class="n">shootout_winner</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="s">"shootout_winner"</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">so_prob</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">'shootout_winner'</span><span class="p">][</span><span class="n">shootout</span><span class="p">])</span>

    <span class="n">trace</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">4000</span><span class="p">,</span> <span class="n">tune</span><span class="o">=</span><span class="mi">3000</span><span class="p">)</span>
<span class="k">return</span> <span class="n">model</span><span class="p">,</span> <span class="n">trace</span>
</code></pre></div></div>

<h2 id="playoff-predictions">Playoff Predictions</h2>

<p>To predict playoff games, we employ a simulation-based approach using the model’s posterior estimates. For each game, posterior samples of the attack and defense strengths for both the home and away teams are extracted, along with the other model parameters, and the scoring \(\theta\) values are calculated. Additionally, using a scaling factor of \(K=3\), possible overtime scoring is calculated. Then, for each set of sampled parameters, we calculate the probability of the home team winning in either regulation or overtime. That mean probability is compared to a uniform random draw on \([0, 1]\) to simulate whether the home team wins.</p>

<p>This formulation is run for all potential matchups in the cup, and once a team hits 4 wins in a series, they advance. 500 simulations of the entire tournament are run daily, and the probability reported is the fraction of simulations in which a team wins a given round.</p>
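
<p>For a flavor of the series logic, here’s a stripped-down sketch. It assumes a constant per-game home win probability, rather than the posterior-driven, home/away-alternating version in the repo:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(42)

def simulate_series(p_win, n_sims=500):
    """Fraction of simulated best-of-7 series won by a team with a
    constant per-game win probability; first to 4 wins advances."""
    advanced = 0
    for _ in range(n_sims):
        wins = losses = 0
        while wins &lt; 4 and losses &lt; 4:
            if rng.random() &lt; p_win:
                wins += 1
            else:
                losses += 1
        advanced += wins == 4
    return advanced / n_sims

print(simulate_series(0.55))  # a 55% per-game edge gives roughly a 61% series win rate
</code></pre></div></div>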

<p>All code can be found <a href="https://github.com/tjburch/nhl-predictions">here</a>.</p>]]></content><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><category term="Statistics" /><category term="hockey" /><category term="sports" /><summary type="html"><![CDATA[Who will win this year's cup?]]></summary></entry><entry><title type="html">2022 Reading List</title><link href="https://tylerjamesburch.com/blog/personal/reading-list" rel="alternate" type="text/html" title="2022 Reading List" /><published>2022-12-31T00:00:00+00:00</published><updated>2022-12-31T00:00:00+00:00</updated><id>https://tylerjamesburch.com/blog/personal/reading-list</id><content type="html" xml:base="https://tylerjamesburch.com/blog/personal/reading-list"><![CDATA[<h2 id="books-read-in-2022">Books Read in 2022:</h2>

<ul>
  <li><em>Deep Work</em> - Cal Newport</li>
  <li><em>Freakonomics Revised and Expanded</em> - Steven D. Levitt and Stephen J. Dubner</li>
  <li><em>Weapons of Math Destruction</em> - Cathy O’Neil</li>
  <li><em>The Man Who Solved the Market: How Jim Simons Launched the Quant Revolution</em> - Gregory Zuckerman</li>
  <li><em>Shape: The Hidden Geometry of Information, Biology, Strategy, Democracy, and Everything Else</em> - Jordan Ellenberg</li>
  <li><em>The Hitchhiker’s Guide to the Galaxy</em> - Douglas Adams</li>
  <li><em>The Arm: Inside the Billion-Dollar Mystery of the Most Valuable Thing in Sports</em> - Jeff Passan</li>
  <li><em>Peak: Secrets from the New Science of Expertise</em> -  Robert Pool, Anders Ericsson</li>
  <li><em>The Simplest Baby Book in the World: The Illustrated, Grab-and-Do Guide for a Healthy, Happy Baby</em> - Stephen Gross</li>
</ul>]]></content><author><name>Tyler James Burch</name><email>burcht11@gmail.com</email></author><category term="Personal" /><category term="books" /><category term="reading" /><summary type="html"><![CDATA[Books I read in 2022]]></summary></entry></feed>