How to do Bayesian shrinkage estimation on proportion data in Stan

Attention conservation notice: In which I once again document my slow march towards Bayesian fundamentalism. Not of interest unless you are interested in shrinkage estimation and Stan.


As I describe in my essay on trying to determine the best director on IMDb, you usually can’t trust the average rating of a director with only a few movies to be a good estimate of that director’s “actual” quality.

Instead, a good strategy here is to shift the average rating from each director back to the overall median, but to shift it less the more movies that person has directed. This is known as shrinkage estimation, and in my opinion it’s one of the most underused statistical techniques (relative to how useful it is).
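To make the idea concrete, here is a minimal Python sketch of that kind of shrinkage. It is only an illustration: the director names and ratings are made up, and the function `shrink` and its constant `k` (how many movies it takes before the raw average gets half the weight) are my own assumptions, not something taken from the essay.

```python
from statistics import median


def shrink(avg, n, center, k=5.0):
    """Pull `avg` toward `center`; the raw average gets more weight as n grows.

    `k` is a hypothetical tuning constant: at n == k the raw average and
    the center are weighted equally.
    """
    weight = n / (n + k)
    return weight * avg + (1.0 - weight) * center


# Made-up directors: name -> (average rating, number of movies)
directors = {"A": (9.0, 1), "B": (8.0, 20), "C": (6.5, 4)}

# Shrink toward the overall median of the raw averages.
center = median(avg for avg, _ in directors.values())
shrunk = {name: shrink(avg, n, center) for name, (avg, n) in directors.items()}

for name, (avg, n) in directors.items():
    print(f"{name}: raw={avg:.2f} movies={n} shrunk={shrunk[name]:.2f}")
```

Director A, with a single movie, gets pulled most of the way back to the median, while a director with 20 movies would keep most of the raw average.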

For the past few weeks I’ve been trying to learn the Bayesian modeling language Stan, and I came across a pretty good beta-binomial model for shrinkage estimation written in it (described in 1, 2). Here’s the model, which uses batting averages from baseball players.

To see how much shrinkage the model applies, I plotted each player’s “actual” (or “raw”) average against the model’s estimated average, and colored the data points by the log of the at bats (lighter blue = more at bats).

[Figure: raw batting average versus model-estimated average, points colored by log of at bats]

As you can see, players with more at bats have less shrinkage. At the extreme, two players who are 0/1 on the season still have an estimated average of ~0.26 (which is the median of the “actual” batting averages).
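You can see why a 0/1 player lands so close to the center with a little arithmetic. The sketch below is not the Stan model; it is the textbook beta-binomial posterior mean with a prior I fixed by hand. The values `ALPHA = 26` and `BETA = 74` are assumptions, chosen only so that the prior mean is 0.26, whereas a hierarchical model would estimate the prior from the data.

```python
def posterior_mean(hits, at_bats, alpha, beta):
    """Posterior mean of a Beta(alpha, beta) prior after binomial data."""
    return (alpha + hits) / (alpha + beta + at_bats)


# Assumed prior with mean 26 / (26 + 74) = 0.26; not fitted values.
ALPHA, BETA = 26.0, 74.0

# Made-up season lines: (hits, at bats)
for hits, at_bats in [(0, 1), (3, 10), (150, 500)]:
    raw = hits / at_bats
    est = posterior_mean(hits, at_bats, ALPHA, BETA)
    print(f"{hits}/{at_bats}: raw={raw:.3f} estimated={est:.3f}")
```

With this prior, a single hitless at bat moves the estimate only from 26/100 to 26/101, while 500 at bats outweigh the prior five to one.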

Notably, fewer players have their averages lowered by the shrinkage estimation than raised by it. Perhaps managers are inclined to give players a few more shots at it until they prove that their early success was just a fluke.
