How to do Bayesian shrinkage estimation on proportion data in Stan

Attention conservation notice: In which I once again document my slow march towards Bayesian fundamentalism. Not of interest unless you are interested in shrinkage estimation and Stan.

As I describe in my essay on trying to determine the best director on imdb, you usually can’t trust average ratings from directors with a small number of movies to be a good estimate of the “actual” quality of that director.

Instead, a good strategy here is to shift the average rating from each director back to the overall median, but to shift it less the more movies that person has directed. This is known as shrinkage estimation, and in my opinion it’s one of the most underused statistical techniques (relative to how useful it is).

The past few weeks I’ve been trying to learn the Bayesian modeling language Stan, and I came across a pretty good model for shrinkage estimation using a beta-binomial approach in this language (described in 1, 2). Here’s the model, which uses batting averages from baseball players.

In order to determine the amount of shrinkage in this model, I plotted the “actual” (or “raw”) average versus the estimated average using this model, and colored the data points by the log of the at bats (lighter blue = more at bats).

Screen Shot 2016-03-10 at 7.39.03 PM

As you can see, players with more at bats have less shrinkage. At the extreme, two players who are 0/1 on the season still have an estimated average of ~ 0.26 (which is the median of the “actual” batting averages).

Notably, there are fewer players whose averages are decreased due to the shrinkage estimation than the reverse. Perhaps managers are inclined to give players a few more shots at it until they prove that their early success was just a fluke.

A clinical trial for omental transposition in early stage AD

A couple of years ago I wrote about treating AD with omental transposition, a radical therapy with success in ~ 35% of patients in one case series. Today I just noticed that there is a non-randomized, single-arm clinical trial on its use in patients with early stage AD (MoCA score 11-18), in Salt Lake City, UT. Estimated study completion date: May 2019.

This is especially interesting because they have a relatively thorough explanation of how the surgery works. In the general surgery portion of the procedure, an omental flap is created, which receives blood supply from the right gastric and gastroepiploic arteries. Next, a subcutaneous tunnel is created that travels up the chest wall and neck to behind the ear.

In the neurosurgery portion of the procedure, a portion of bone is removed near the temporal-frontal area, followed by removal of the dura and arachnoid membrane. The omentum is then placed on the parietal-temporal-frontal area of one cerebral hemisphere, and connected to the dura via a suture.

Besides this tissue grafting approach, other neurosurgical approaches to Alzheimer’s have included:

  1. CSF shunts (to the atria or ventricles)
  2. Intraventricular infusions (of bethanecol, NGF, or GM1)
  3. Gene therapy with infusion of NGF-expressing cells
  4. Electrical stimulation (of the vagus nerve, nucleus basalis of Meynert, or the fornix)

To what extent does whole genome sequencing add value above SNP arrays?

Attention conservation notice: I wrote this as the final essay for my course on personal genomics with Michael Linderman at Mt Sinai. The main question of the essay was: what is the point, in 2016, of getting your whole genome sequencing (WGS) data, if you already have your SNP data? Overall, I found analyzing my WGS data an interesting experience, but the vast majority of known genomic info is still at the SNP level, and there are some bugs in contemporary variant callers that make WGS calls more likely to be false-positives, as I experienced first-hand.

This fall, I was lucky enough to be a part of a course at ISMMS where we learned about genomics by analyzing our own whole genome sequencing results, which was graciously paid for by the school [1]. Amazingly, the cost of genome sequencing dropped from $5000 to $1500 (or even $1000) in just this past year, but it’s still a significant investment in our education by ISMMS and I appreciate it. According to the course director, Michael Linderman, there’s only on the order of around 1000 people with access to their own whole genome sequencing results and the ability to interpret them, which puts us in a pretty small, fortunate group. That said, it probably won’t be a small group for long, since just over the past few months, Veritas Genetics announced that it will offer WGS alongside analysis commercially for a new low-price of $999 [2].

I already had access to single-nucleotide polymorphism (SNP) array results from 23&Me, so a very basic question was what kind of data I could get from having my whole genome sequenced that I didn’t already have access to. First, some terminology: the difference between a SNP and a normal genetic variant is that the alternate allele of a SNP must be present in at least 1% of the population. Not surprisingly, most of the papers published about genomics on PubMed study the effect of SNPs, in large part because those are the variants for which there is sufficient power to address biomedical questions robustly. So I already had access to the majority of the well-studied variants through my SNP data. So from one perspective, going from the ~300,000 SNPs that I got from 23&Me to the ~3,000,000,000 base pair calls in the human genome seems like a classic case of the big data trap: collecting more data without any point. And I’ll freely admit that I’ve fallen victim to this tendency at least a few times in my life.

Upon a little bit more literature and soul searching about what I expected to learn, it became apparent that what whole genome sequencing is best at is detecting very private variants – that is, unsurprisingly, things that are present in less than 1% of the population. Any such rare variants that I would found might be present just in my immediate family a countable number of generations back, or they might even be found only in me. But these rare variants can add up to a fairly non-trivial number. As it turns out, the average person has about 100 heterozygous loss of function variants, which includes stop insertions, frameshift mutations, splicing mutations, and large deletions [3]. And since my dad was on the older side when I was born, and older male age is associated with more new genetic variants [4], I knew that I was liable to have an especially large burden of new variants.

On the big day when our sequences had been finally aligned and the variants had been called, the first thing I did was to filter those variants down to the 2000 or so ones that were most likely to be damaging. I scanned down the gene list meticulously, looking for gene names that I recognized. Since I had to memorize a fairly large number of disease-causing genes during my preclinical med school courses, I figured recognizing a gene name would in general be a bad sign. I was relieved and felt lucky to discover no major disease-causing mutations in genes that I knew would cause major disease, such as the cancer-promoting genes BRCA1/2 [5]. Overall this process was not very efficient, but it was pretty fun.

The next time that I sat down to analyze my genetic variants, I decided to filter for variants that were likely to have an effect on the way I think. So I intersected the genes in which I had predicted function-altering variants with another list from a study [6] that measured which genes have the highest RNA expression – a proxy for “are made the most” – in neurons. Here’s a plot of the results:

Screen Shot 2016-02-22 at 6.31.02 PM

The green dot represents the gene in which I have a predicted damaging mutation with the strongest expression in neurons, which is the gene SYN2. The protein that this gene codes for is thought to be selectively produced in synapses, where it probably plays a role in synaptic vesicle transport [7]. Synaptic vesicles, in turn, are what neurotransmitters are stored in before are they are released into the synaptic cleft to communicate with the postsynaptic neuron. You might think of them as the “cargo trucks” of the synapse, storing and carrying around the payload of neurotransmitters before they are sent to the next neuron. So naturally, I became curious about what the effect of that variant might be.

First, I took a look at what my actual predicted variant in the SYN2 gene was. Specifically, I was predicted to have a frameshift mutation, due to the deletion of a CGCGA sequence at chromosome 3, position 12,046,269. In general, frameshift mutations are pretty cool. DNA is made into proteins three nucleotides at a time, so mutations in multiples of three only alter a small number of amino acids. But if a frameshift mutation messes up this three nucleotide reading frame, then the whole rest of the protein is totally different. What was predicted to happen in my version of the SYN2 protein is that, 66 nucleotides later after the frameshift, a new stop signal was introduced. So I would have 22 amino acids in my version of SYN2 that are not found in most people, and then the protein was predicted to end. Although it’s fun to speculate that maybe those 22 amino acids could turn me into a mutant supergenius if I could just learn how to tap into its mythical synaptic powers, most likely my predicted mutant version of SYN2 would be simply degraded. And since I’m predicted to be heterozygous for the mutation, my non-mutated version of SYN2 could simply pick up the slack. That said, in the absence of compensation, I’d be expected to have ~50% less of this key synaptic protein than the average person.

Naturally, next I did a search for the functional role of a loss of function mutation in SYN2. The first paper I found [7] had the suddenly ominous title: “SYN2 is an autism predisposing gene: loss-of-function mutations alter synaptic vesicle cycling and axon outgrowth.” Specifically, this paper showed that two missense (amino-acid changing) and one frameshift mutation were found in male individuals with autism spectrum disorder, but none were found in male controls with autism spectrum disorder. They also showed that neurons lacking SYN2 have a lower number of synaptic vesicles ready to be released from their synapses, which is consistent with the predicted role of SYN2. I had some qualms about this paper, like the fact that they extrapolated from SYN2 homozygous knock-out mouse studies to humans that were heterozygous for a loss-of-function variant in SYN2, and indeed the mouse study that they built upon did not find a phenotype in SYN2 heterozygous knock-out mice [8]. But overall, this study was a sign that my predicted frameshift mutation might really be playing a significant functional role.

Given that I also had access to SNP data from both of my parents through 23&Me, my next step was to find out which of them I inherited the predicted SYN2 frameshift variant from, so that I could figure out which of my parents I would be able to subsequently blame for all of my problems. But this is where things took another unexpected turn. In order to discover which of my parents was the culprit, I had to analyze the raw reads in the Integrated Genome Viewer (IGV), to find another tagging SNP that I could also see in the data from 23&Me. But when I actually looked at the reads, what I discovered here instead was way more homozygous variation (seen via the single-colored vertical lines) relative to the reference genome than I expected:

Screen Shot 2016-02-22 at 6.30.38 PM

This homozygosity of the variants is surprising and makes us suspicious that maybe there’s something going on other than just the mutation – maybe there was a problem in aligning my reads to the reference genome. And indeed, for technical reasons that are beyond the scope of this essay, in class we aligned to the hg19 build of the reference genome, which as it turns out, happens to differ from the hg38 reference genome at this region pretty substantially. And when I aligned one of the individual sequencing reads against the hg38 reference at this location, what I detected was not a deletion, but rather an insertion of 12 base pairs. Since 3 divided by 12 is a whole number, 4, that means that this is an in-frame mutation, which is much less likely to have the serious loss-of-function effect that a frameshift mutation would. And indeed, looking at the DNA sequence that was inserted, it appears that the insertion is probably due to a tandem repeat, with one mismatch:

Screen Shot 2016-02-22 at 6.29.56 PM

So, to recapitulate, analyzing the raw reads using the updated reference genome, I found out that likely I do not have a frameshift mutation in SYN2 after all. That said, the potential presence of a tandem repeat expansion within the coding sequence of SYN2 – leading to four extra amino acids in that protein – is itself pretty interesting and could still have some sort of a biological effect. After all, this protein is likely a key component of the cargo truck for my neurotransmitters.

In summary, I think I can say that if you’ve had your SNP data analyzed, that’s going make up the lion’s share of digestible information. However, there are likely to be some interesting things for you to learn from having your WGS data analyzed as well. First, although I didn’t/haven’t yet found any rare variants in my genome that might significantly increase my risk of disease in a potentially actionable way, I certainly could have. You don’t bring a life jacket on a boat because you think you’re going to fall overboard – you bring it because you might. Second, it was enlightening to learn first-hand about the lack of adequate tools for analyzing genomes, especially at the variant calling and variant analysis steps. We really are in the Wild West era of genomics. This is both exciting and motivating. I now have a better idea of what it is like to have a likely false positive variant call like I had with SYN2.

Finally, getting your genome sequenced isn’t just about your own health – it’s also about your family’s health and the health of society at large. For example, I’m also in the process of donating my whole genome sequencing data to the Personal Genome Project (I’ve already put up my VCF file). If you have access to SNP data and/or you want to try to have your whole genome sequenced, and you are willing to make the data publically available, then you should consider joining too. I think that by pooling genome and phenotype data in an open way, we’re going to make some discoveries that will improve human health in a big way.


[1]: Linderman MD, Bashir A, Diaz GA, et al. Preparing the next generation of genomicists: a laboratory-style course in medical genomics. BMC Med Genomics. 2015;8:47.

[2]: whole-genome-barrier-300150585.html

[3]: Macarthur DG, Balasubramanian S, Frankish A, et al. A systematic survey of loss- of-function variants in human protein-coding genes. Science. 2012;335(6070):823- 8.

[4]: Kong A, Frigge ML, Masson G, et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature. 2012;488(7412):471-5.

[5]: You might be wondering, if you’re male, then why are you worried about a BRCA mutation? Well, although BRCA mutations are much more dangerous in women, they can also increase the risk of certain cancer types in men. For example, according to one study with 1000 participants, there is around a 5-fold increased risk for prostate cancer in men with a BRCA2 mutation. See: Kote-jarai Z, Leongamornlert D, Saunders E, et al. BRCA2 is a moderate penetrance gene contributing to young-onset prostate cancer: implications for genetic testing in prostate cancer patients. Br J Cancer. 2011;105(8):1230-4.

[6]: Zhang Y, Chen K, Sloan SA, et al. An RNA-sequencing transcriptome and splicing database of glia, neurons, and vascular cells of the cerebral cortex. J Neurosci. 2014;34(36):11929-47.

[7]: Corradi A, Fadda M, Piton A, et al. SYN2 is an autism predisposing gene: loss-of- function mutations alter synaptic vesicle cycling and axon outgrowth. Hum Mol Genet. 2014;23(1):90-103.

[8]: Greco B, Managò F, Tucci V, Kao HT, Valtorta F, Benfenati F. Autism-related behavioral abnormalities in synapsin knockout mice. Behav Brain Res. 2013;251:65- 74.


Five years of spaced repetition flashcards

Attention conservation notice: Excessive navel-gazing.

First, some shameless Anki nerd bragging:

Screen Shot 2016-02-05 at 12.16.42 PM

Screen Shot 2016-02-05 at 12.16.49 PM

Screen Shot 2016-02-05 at 12.16.59 PM

Screen Shot 2016-02-05 at 12.17.06 PM

Screen Shot 2016-02-05 at 12.17.22 PM

Although I’ve been doing spaced repetition via Anki for 5 years now, I actually wanted to start doing SR flashcards about 8 years ago — approximately speaking, since 5/6/08, when I read this Wired article about Piotr Wozniak. Wozniak truly devoted himself to spaced repetition. He just generally seemed like an interesting person, his technique seemed like a good idea, and I wanted to do it [1].

For the next 2.5 to 3 years I had a pretty constant low-level of guilt/anxiety about how I should be doing SR flashcards. This anxiety spiked whenever I forgot something I had previously learned or clicked on a purple Wikipedia link. Of course, after a year or so, I thought it was too late and that I was a lost cause with respect to spaced repetition.

Coincidentally, I now have a flashcard for how to solve this exact problem. It’s called “future weaponry” — instead of thinking about what you could have done with a particular tool/knowledge/ability in the past, choose to think about what you will be able to do with it in the future.

Spaced repetition flashcards weren’t really feasible, though, until I got a smartphone. Say what you want about smartphones with respect to productivity overall, but the ability to do SR flashcards on them while walking and waiting around is insanely crucial and underrated. And this basically requires easy syncing via an internet connection.

Looking back at my cards from when I started, they’re pretty terrible. For example, I used tons of cloze deletion cards, a lazy and less effective way of making flashcards, rather than thinking about what knowledge I really wanted to retain and framing it in question form. I was also obsessed with memorizing math equations even though these have been pretty much completely useless.

That said, by far the most important thing for me back then was to maintain motivation, and I’ve been able to do that so far. You can find some of my spaced repetition flashcards cards for statistics, R programming, and other topics here.

The spike in flashcards that you see the middle of this time period was due to studying during the first two years of med school, for which I made and did a lot of flashcards. I think studying this way actually made me less good at any individual exam, since other cramming methods are more effective, but my hope is that it will help me to retain the knowledge better in the long run.

During this time, my friends and I were also studying for a big med school exam, called Step 1, and which we referred to as “D-Day.” Watching the documentary Somm, that is definitely what our lives were like, especially the us-against-the-world feeling. Spaced repetition flashcards were a big part of it. On any given day we usually had 400+ flashcards to do, which we called “slaying the beast.” We all ultimately did passed and did pretty well on Step 1, although looking back it was more about the journey.

Since starting full-time on my PhD program a year and a half ago, I’ve been using SR flashcards for two purposes:

  1. Learning programming languages, both the syntax and the concepts. The syntax has probably been more useful, but learning vocabulary related to the concepts has also been especially high yield for knowing what to search for.
  2. Learning about my research topics, e.g., Alzheimer’s and genomics. The jury is still out on whether this is a good use of time, but in general, I would say not to sleep on the potential for it if you’re a researcher. A lot of people like to read, but there’s a lot of value to be gained from systematically reflecting upon what you’ve read, especially if you have an imperfect memory like me.

I owe a lot to Damien Elmes, who wrote the open-source, free Anki software. I also owe a lot to Gwern, who wrote about spaced repetition extensively, and who made thousands of his Mnemosyne cards available to anyone to download for free. I downloaded these one day on a whim, converted them to Anki, and that was what really made me think of having my own cards as being a realistic, practical option. Thanks, Damien and Gwern.

[1]: The other thing I remember from that article, 8 years later: I often think about how he tries to minimize how often he drove in cars.

New page on biomedical trade-offs

Throughout my first two years of med school, I was surprised by how many of the most tricky — and to me, most interesting — topics in medicine involved some sort of underlying trade-off. For example, I couldn’t understand dynamic compression of the airways pretty much at all until I realized that it was a prototypical trade-off, in that higher expiration rates help push out CO2-enriched air faster, but also lead to a higher risk of airway collapse. Today I added a new page with a lot of these biomedical trade-offs, which is currently at 16 trade-offs, but I’m planning on adding more as I learn more. Hopefully somebody will find them useful, even if it’s just my future self.

Classic Papers #1: On the diagram, by John Venn

Title: “On the diagrammatic and mechanical representation of propositions and reasonings

Author: John Venn

Journal: Philosophical Magazine

Date published: July 1880

Builds upon: Euler diagrams

Citations (Google Scholar): 336

Citations Since 2010: 152

Best figure: Not satisfied with two sets, he jumped right to the symmetry-preserving extreme of his system and drew out a four set intersection with a place for labels:

Screen Shot 2016-01-12 at 6.32.48 PM

This would hardly be out of place in a genomics article published today.

Best sentence: “The fact is, as I have explained at length in the article above referred to, that the five distinct relations of classes to one another (viz. the inclusion of X in Y, their coextension, the inclusion of Y in X, their intersection, and their mutual exclusion), which are thus pictured by these circular diagrams, rest upon a totally distinct view as to the import of a proposition from that which underlies the statements of common life and common logic.”

Oddest moment: “I have no high estimate myself of the interest or importance of what are sometimes called logical machines, and this on two grounds. In the first place, it is very seldom that intricate logical calculations are practically forced upon us; it is rather we who look about for complicated examples in order to illustrate our rules and methods. In this respect logical calculations stand in marked contrast with those of mathematics, where economical devices of any kind may subserve a really valuable purpose by enabling us to avoid otherwise inevitable labour. Moreover, in the second place, it does not seem to me that any contrivances at present known or likely to be discovered really deserve the name of logical machines. It is but a very small part of the entire process which goes to form a piece of reasoning which they are capable of performing.”

(No wonder Turing proposed his test of whether something was a “true” AI.)

(Runner up: Venn’s use of the word “especial” instead of “special.”)

Lasting impact: This paper is a classic that jumps directly to the tough questions of how to visualize sets and set differences and even directly addresses the utility of a sort of artificial intelligence, or as Venn calls it, a “logical machine.” And of course, it introduced what we now know as the Venn diagram.

Editorial note: This is the first entry in what will hopefully be a series of classic papers cutting across disciplines that I’m interested in. For some reason, papers don’t seem to be discussed as commonly as books in my circles, which is strange because they’re shorter, usually more novel, and more information dense. This series is an attempt to write the blog posts I want to see in the world.

Nine paradoxes with a statistical theme


  1. A drill sergeant always yells at one of his trainees when she messes up. The drill sergeant notices that after he yells at her, her performance improves. Later it turns out that the trainee is deaf, blind, and has no other way of actually noticing that drill sergeant is yelling at her. Ignoring the effect of practice, why might the trainee’s performance have improved anyway?
  2. You have 100 pounds of Martian potatoes, which are 99 percent water by weight. You let them dehydrate until they’re 98 percent water by weight. How much do they weigh now and why?
  3. Imagine that your parents had rolled a six-sided die to decide how many children to have. What did they most likely roll and why?
  4. You have access to planes that have returned from military missions and the distribution of the bullet “wounds” on the planes. Which areas should you recommend to have extra armor?
  5. Why would few people choose to play in a lottery with a small but actual probability of success with an infinite monetary expected value?
  6. Do most people have have the same, more, or fewer friends than their friends have on average and why? 
  7. Hypothetically, say that 80% of people dream in color, and 68% of sexual partners have the same (concordant) coloring of their dreams. If you dream in color, what’s the probability that your partner will too?
  8. Are we biased to think that cars in the lanes next to us are going faster or slower than they really are and why? 
  9. Why is the expression “the smallest positive integer not nameable in under eleven words” paradoxical?


  1. regression to the mean — the screwup is likely a random deviation below the trainee’s average, which will tend to improve on the subsequent iteration just due to random chance, regardless of any action by the drill sergeant (more here
  2. 50 pounds, since the percentage of non-water by weight has doubled, so the overall weight must have halved (more here
  3. they most likely rolled a six, because there’s a higher chance of you existing to observe the event in that case (more here
  4. the areas with no damage, because of selection effects — planes that fell likely suffered an attack in a place that was untouched on those that survived (more here
  5. because the marginal utility of money is diminishing (more here)
  6. fewer, because sampling bias suggests that people with greater numbers of friends have an increased likelihood of being observed among one’s own friends (more here
  7. 80%. Since basic probability theory tells us that 0.8 * 0.8 + 0.2 * 0.2 = 0.68, so we know that the probability of dreaming is color is independent of that of your sexual partner. Therefore, the probability that your partner dreams is color is independent of yours and is simply the base rate. Some people think 68%, perhaps because they are getting wrapped up in the causal story. (more here)
  8. we are biased to think they are going faster, likely because because more time is generally spent being overtaken by other vehicles than is spent in overtaking them (more here)
  9. there are finitely many words, so there are finitely many numbers that can be defined in under eleven words, so there must be such an integer, but since this expression itself is under eleven words, there cannot be any such integer (more here; resolved by assigning priority to the naming process either within or outside of the expression) 
Screen Shot 2015-12-29 at 7.50.09 PM

these are totally Martian potatoes;