Tyler James Burch

Particle Physics // Machine Learning // Music // Baseball


Fivethirtyeight Riddler: Can The Riddler Bros. Beat Joe DiMaggio’s Hitting Streak?

12 May 2019

This weekend I took on fivethirtyeight’s weekly Riddler question again. The original problem text can be found here.

The problem statement:

Five brothers join the Riddler Baseball Independent Society, or RBIs. Each of them enjoys a lengthy career of 20 seasons, with 160 games per season and four plate appearances per game. (To make this simple, assume each plate appearance results in a hit or an out, so there are no sac flies or walks to complicate this math.)

Given that their batting averages are .200, .250, .300, .350 and .400, what are each brother’s chances of beating DiMaggio’s 56-game hitting streak at some point in his career? (Streaks can span across seasons.)

By the way, their cousin has a .500 average, but he will get tossed from the league after his 10th season when he tests positive for performance enhancers. What are his chances of beating the streak?

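There are two steps to this problem. First, find the probability of getting at least one hit in a game, which is straightforward given the batting average (BA) and four plate appearances per game: it is one minus the probability of going hitless in all four, 1 - (1 - BA)^4.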
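As a quick illustration of this first step, here is a minimal sketch that evaluates the per-game hit probability for each player. It assumes every plate appearance is an independent trial that succeeds with probability equal to the batting average; the function name `game_hit_probability` is made up for this example.

```python
# Probability of at least one hit in a game.
# Assumption: plate appearances are independent, each a hit with probability BA.
PLATE_APPEARANCES = 4  # per game, from the problem statement


def game_hit_probability(batting_average, plate_appearances=PLATE_APPEARANCES):
    """Return P(at least one hit) = 1 - P(hitless in every plate appearance)."""
    return 1.0 - (1.0 - batting_average) ** plate_appearances


# The five brothers plus the .500 cousin.
for ba in (0.200, 0.250, 0.300, 0.350, 0.400, 0.500):
    print(f"BA {ba:.3f}: P(hit in game) = {game_hit_probability(ba):.4f}")
```

Even the .200 hitter gets a hit in more than half of his games, which is why long streaks are not as hopeless as the raw batting averages suggest.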

The next step asks, “What is the probability of getting a streak of length X in a fixed number of attempts?” As it turns out, finding a closed-form solution is not trivial; see the discussions on askamathematician and math.stackexchange.
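Even without a closed form, the streak probability can be computed exactly by stepping through the games and tracking the length of the current streak. The sketch below is one such recursion. It is not the method used in this post, just a possible cross-check. It assumes games are independent and that "beating" the record means a streak of at least 57 games; the name `streak_probability` is hypothetical.

```python
def streak_probability(p_game, n_games, streak_length):
    """P(at least one run of `streak_length` consecutive successes in n_games)."""
    # state[j] = probability that the current streak is exactly j games long
    # and the target streak has not been reached so far.
    state = [0.0] * streak_length
    state[0] = 1.0
    reached = 0.0
    for _ in range(n_games):
        new_state = [0.0] * streak_length
        # A hitless game resets any unfinished streak to zero.
        new_state[0] = sum(state) * (1.0 - p_game)
        # A game with a hit extends the streak by one.
        for j in range(streak_length - 1):
            new_state[j + 1] = state[j] * p_game
        # Extending a streak of length streak_length - 1 reaches the target.
        reached += state[streak_length - 1] * p_game
        state = new_state
    return reached


TARGET = 57  # assumption: beating a 56-game streak needs 57 straight games
for ba, seasons in ((0.200, 20), (0.250, 20), (0.300, 20),
                    (0.350, 20), (0.400, 20), (0.500, 10)):
    p_game = 1.0 - (1.0 - ba) ** 4  # four plate appearances per game
    prob = streak_probability(p_game, seasons * 160, TARGET)
    print(f"BA {ba:.3f}, {seasons} seasons: {prob:.3e}")
```

Each call does work proportional to the number of games times the streak length, so it runs quickly even for a full 3,200-game career.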

But this is why we have computers. I wrote a simulation that plays out the careers of players with the indicated batting averages and career lengths, and counts how often those players beat DiMaggio’s hit streak. The results are as shown:

The likelihood of a player beating DiMaggio’s record can be estimated as the fraction of simulated careers that beat the record. The plot shows only the [5%, 95%] range, so that outliers do not skew it. The results:

To validate the simulation, I plotted the final simulated BA of each player and checked that it did, in fact, line up with the BA in the problem statement. It did, with a coefficient of variation (std/mean) between 0.01 and 0.02.
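For readers who want to try this themselves, here is a minimal sketch of a career simulation along these lines, including the batting-average check. It is not the original code behind the results in this post: the function names, the seed, and the small number of simulated careers are all assumptions chosen to keep the example short and fast.

```python
import random
import statistics

GAMES_PER_SEASON = 160
PLATE_APPEARANCES = 4
RECORD = 56  # DiMaggio's streak; beating it means 57 or more


def simulate_career(batting_average, seasons, rng):
    """Simulate one career; return (longest hitting streak, realised BA)."""
    n_games = seasons * GAMES_PER_SEASON
    hits = 0
    longest = current = 0
    for _ in range(n_games):
        # Each plate appearance is a hit with probability batting_average.
        game_hits = sum(rng.random() < batting_average
                        for _ in range(PLATE_APPEARANCES))
        hits += game_hits
        if game_hits > 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a hitless game ends the streak
    return longest, hits / (n_games * PLATE_APPEARANCES)


def estimate(batting_average, seasons, n_careers, seed=0):
    """Return (fraction of careers beating the record, mean BA, CV of BA)."""
    rng = random.Random(seed)
    results = [simulate_career(batting_average, seasons, rng)
               for _ in range(n_careers)]
    beat_fraction = sum(longest > RECORD for longest, _ in results) / n_careers
    bas = [ba for _, ba in results]
    mean_ba = statistics.mean(bas)
    cv = statistics.stdev(bas) / mean_ba  # coefficient of variation (std/mean)
    return beat_fraction, mean_ba, cv


# Small demo run; far more careers are needed for stable estimates.
for ba in (0.300, 0.400):
    beat, mean_ba, cv = estimate(ba, seasons=20, n_careers=50)
    print(f"BA {ba:.3f}: beat fraction {beat:.3f}, "
          f"simulated BA {mean_ba:.4f}, CV {cv:.4f}")
```

With only 50 careers per player, rare events will often show up as a fraction of zero, so a real estimate needs many more simulated careers (and probably a vectorized implementation) than this demo uses.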