To what extent does whole genome sequencing add value above SNP arrays?

Attention conservation notice: I wrote this as the final essay for my course on personal genomics with Michael Linderman at Mt Sinai. The main question of the essay was: what is the point, in 2016, of getting your whole genome sequencing (WGS) data, if you already have your SNP data? Overall, I found analyzing my WGS data an interesting experience, but the vast majority of known genomic info is still at the SNP level, and there are some bugs in contemporary variant callers that make WGS calls more likely to be false positives, as I experienced first-hand.

This fall, I was lucky enough to be a part of a course at ISMMS where we learned about genomics by analyzing our own whole genome sequencing results, which was graciously paid for by the school [1]. Amazingly, the cost of genome sequencing dropped from $5000 to $1500 (or even $1000) in just this past year, but it’s still a significant investment in our education by ISMMS and I appreciate it. According to the course director, Michael Linderman, there are only on the order of 1000 people with access to their own whole genome sequencing results and the ability to interpret them, which puts us in a pretty small, fortunate group. That said, it probably won’t be a small group for long, since just over the past few months, Veritas Genetics announced that it will offer WGS alongside analysis commercially for a new low price of $999 [2].

I already had access to single-nucleotide polymorphism (SNP) array results from 23andMe, so a very basic question was what kind of data I could get from having my whole genome sequenced that I didn’t already have access to. First, some terminology: what distinguishes a SNP from other genetic variants is that the alternate allele of a SNP must be present in at least 1% of the population. Not surprisingly, most of the papers published about genomics on PubMed study the effect of SNPs, in large part because those are the variants for which there is sufficient power to address biomedical questions robustly. So I already had access to the majority of the well-studied variants through my SNP data. From one perspective, then, going from the ~300,000 SNPs that I got from 23andMe to the ~3,000,000,000 base pair calls in the human genome seems like a classic case of the big data trap: collecting more data without a clear purpose. And I’ll freely admit that I’ve fallen victim to this tendency at least a few times in my life.

Upon a little bit more literature and soul searching about what I expected to learn, it became apparent that what whole genome sequencing is best at is detecting very private variants – that is, unsurprisingly, variants that are present in less than 1% of the population. Any such rare variants that I found might be present just in my immediate family a countable number of generations back, or they might even be found only in me. But these rare variants can add up to a fairly non-trivial number. As it turns out, the average person has about 100 heterozygous loss-of-function variants, which include stop-gain variants, frameshift mutations, splicing mutations, and large deletions [3]. And since my dad was on the older side when I was born, and older paternal age is associated with more new genetic variants [4], I knew that I was liable to have an especially large burden of new variants.

On the big day when our sequences had finally been aligned and the variants had been called, the first thing I did was to filter those variants down to the 2000 or so that were most likely to be damaging. I scanned down the gene list meticulously, looking for gene names that I recognized. Since I had to memorize a fairly large number of disease-causing genes during my preclinical med school courses, I figured recognizing a gene name would in general be a bad sign. I was relieved and felt lucky to discover no major disease-causing mutations, such as in the cancer predisposition genes BRCA1/2 [5]. Overall this process was not very efficient, but it was pretty fun.

The next time that I sat down to analyze my genetic variants, I decided to filter for variants that were likely to have an effect on the way I think. So I intersected the genes in which I had predicted function-altering variants with another list from a study [6] that measured which genes have the highest RNA expression – a proxy for “are made the most” – in neurons. Here’s a plot of the results:

Screen Shot 2016-02-22 at 6.31.02 PM
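The filtering step behind this plot boils down to a set intersection followed by a sort. Here’s a minimal sketch; the gene names and expression values are illustrative placeholders, not my actual variant calls or the Zhang et al. numbers:

```python
# Toy sketch of intersecting a damaging-variant gene list with a
# neuron-expression ranking. All values below are made up for
# illustration; they are not real variant calls or FPKM values.

damaging_variant_genes = {"SYN2", "TTN", "MUC16", "OR4F5"}

# Hypothetical neuron expression values in the style of Zhang et al. [6]
neuron_expression = {"SYN2": 120.0, "GAPDH": 95.0, "TTN": 0.4, "SNAP25": 300.0}

# Keep only genes present in both lists, then rank by neuronal expression.
overlap = damaging_variant_genes & set(neuron_expression)
ranked = sorted(overlap, key=lambda gene: neuron_expression[gene], reverse=True)
print(ranked)  # ['SYN2', 'TTN']
```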

The green dot represents the gene in which I have a predicted damaging mutation with the strongest expression in neurons, which is the gene SYN2. The protein that this gene codes for is thought to be selectively produced in synapses, where it probably plays a role in synaptic vesicle transport [7]. Synaptic vesicles, in turn, are what neurotransmitters are stored in before they are released into the synaptic cleft to communicate with the postsynaptic neuron. You might think of them as the “cargo trucks” of the synapse, storing and carrying around the payload of neurotransmitters before they are sent to the next neuron. So naturally, I became curious about what the effect of that variant might be.

First, I took a look at what my actual predicted variant in the SYN2 gene was. Specifically, I was predicted to have a frameshift mutation, due to the deletion of a CGCGA sequence at chromosome 3, position 12,046,269. In general, frameshift mutations are pretty cool. DNA is made into proteins three nucleotides at a time, so insertions or deletions in multiples of three only alter a small number of amino acids. But if a mutation messes up this three-nucleotide reading frame, then the whole rest of the protein is totally different. What was predicted to happen in my version of the SYN2 protein is that, 66 nucleotides after the frameshift, a new stop signal was introduced. So I would have 22 amino acids in my version of SYN2 that are not found in most people, and then the protein was predicted to end. Although it’s fun to speculate that maybe those 22 amino acids could turn me into a mutant supergenius if I could just learn how to tap into their mythical synaptic powers, most likely my predicted mutant version of SYN2 would simply be degraded. And since I’m predicted to be heterozygous for the mutation, my non-mutated version of SYN2 could simply pick up the slack. That said, in the absence of compensation, I’d be expected to have ~50% less of this key synaptic protein than the average person.
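The reading-frame arithmetic can be made concrete with a toy example (a made-up sequence, not the real SYN2 locus): a deletion whose length is a multiple of three leaves every downstream codon intact, while any other length scrambles all of them.

```python
# Toy illustration of why deletion length modulo 3 matters. The
# sequence below is hypothetical, not the actual SYN2 coding sequence.

def codons(seq):
    """Split a DNA sequence into consecutive 3-base codons (remainder dropped)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

reference = "ATGGCTCGCGATTTGAAACCCGGG"   # hypothetical coding sequence

in_frame   = reference[:3] + reference[6:]  # delete 3 bases (GCT)
frameshift = reference[:3] + reference[8:]  # delete 5 bases (GCTCG)

print(codons(reference))
print(codons(in_frame))    # downstream codons identical to the reference
print(codons(frameshift))  # every downstream codon differs from the reference
```

After the in-frame deletion, every codon past the deletion still matches the reference reading frame; after the 5-base deletion, none of them do.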

Naturally, next I did a search for the functional role of a loss of function mutation in SYN2. The first paper I found [7] had the suddenly ominous title: “SYN2 is an autism predisposing gene: loss-of-function mutations alter synaptic vesicle cycling and axon outgrowth.” Specifically, this paper showed that two missense (amino-acid changing) and one frameshift mutation were found in male individuals with autism spectrum disorder, but none were found in male controls. They also showed that neurons lacking SYN2 have a lower number of synaptic vesicles ready to be released from their synapses, which is consistent with the predicted role of SYN2. I had some qualms about this paper, like the fact that they extrapolated from SYN2 homozygous knock-out mouse studies to humans that were heterozygous for a loss-of-function variant in SYN2, and indeed the mouse study that they built upon did not find a phenotype in SYN2 heterozygous knock-out mice [8]. But overall, this study was a sign that my predicted frameshift mutation might really be playing a significant functional role.

Given that I also had access to SNP data from both of my parents through 23andMe, my next step was to find out which of them I inherited the predicted SYN2 frameshift variant from, so that I could figure out which of my parents I would be able to subsequently blame for all of my problems. But this is where things took another unexpected turn. In order to discover which of my parents was the culprit, I had to analyze the raw reads in the Integrated Genome Viewer (IGV), to find another tagging SNP that I could also see in the data from 23andMe. But when I actually looked at the reads, what I discovered here instead was way more homozygous variation (seen via the single-colored vertical lines) relative to the reference genome than I expected:

Screen Shot 2016-02-22 at 6.30.38 PM

This homozygosity of the variants is surprising and made me suspicious that maybe there was something going on other than just the mutation – maybe there was a problem in aligning my reads to the reference genome. And indeed, for technical reasons that are beyond the scope of this essay, in class we aligned to the hg19 build of the reference genome, which, as it turns out, happens to differ from the hg38 reference genome at this region pretty substantially. And when I aligned one of the individual sequencing reads against the hg38 reference at this location, what I detected was not a deletion, but rather an insertion of 12 base pairs. Since 12 divided by 3 is a whole number, 4, this is an in-frame mutation, which is much less likely to have the serious loss-of-function effect that a frameshift mutation would. And indeed, looking at the DNA sequence that was inserted, it appears that the insertion is probably due to a tandem repeat, with one mismatch:
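The check I did by eye can be sketched in a few lines of code. The sequences below are hypothetical stand-ins, not my actual reads; the idea is just to test whether an insertion approximately duplicates the adjacent reference bases, allowing one mismatch:

```python
# Rough sketch of a tandem-repeat check. The insertion and flank
# sequences here are hypothetical placeholders, not real read data.

def hamming(a, b):
    """Number of mismatched positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def looks_like_tandem_repeat(reference_flank, insertion, max_mismatches=1):
    """True if the insertion approximately duplicates the flanking bases."""
    if len(reference_flank) < len(insertion):
        return False
    return hamming(reference_flank[:len(insertion)], insertion) <= max_mismatches

insertion = "GCCGCAGCCGCA"       # hypothetical 12 bp insertion
flank = "GCCGCAGCCTCAGGT"       # hypothetical reference bases after the insertion point

print(len(insertion) % 3 == 0)                     # True: in-frame
print(looks_like_tandem_repeat(flank, insertion))  # True: duplicate with one mismatch
```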

Screen Shot 2016-02-22 at 6.29.56 PM

So, to recapitulate, analyzing the raw reads using the updated reference genome, I found out that likely I do not have a frameshift mutation in SYN2 after all. That said, the potential presence of a tandem repeat expansion within the coding sequence of SYN2 – leading to four extra amino acids in that protein – is itself pretty interesting and could still have some sort of a biological effect. After all, this protein is likely a key component of the cargo truck for my neurotransmitters.

In summary, I think I can say that if you’ve had your SNP data analyzed, that’s going to make up the lion’s share of digestible information. However, there are likely to be some interesting things for you to learn from having your WGS data analyzed as well. First, although I haven’t yet found any rare variants in my genome that might significantly increase my risk of disease in a potentially actionable way, I certainly could have. You don’t bring a life jacket on a boat because you think you’re going to fall overboard – you bring it because you might. Second, it was enlightening to learn first-hand about the lack of adequate tools for analyzing genomes, especially at the variant calling and variant analysis steps. We really are in the Wild West era of genomics. This is both exciting and motivating. I now have a better idea of what it is like to have a likely false positive variant call, as I had with SYN2.

Finally, getting your genome sequenced isn’t just about your own health – it’s also about your family’s health and the health of society at large. For example, I’m also in the process of donating my whole genome sequencing data to the Personal Genome Project (I’ve already put up my VCF file). If you have access to SNP data and/or you want to try to have your whole genome sequenced, and you are willing to make the data publicly available, then you should consider joining too. I think that by pooling genome and phenotype data in an open way, we’re going to make some discoveries that will improve human health in a big way.


[1]: Linderman MD, Bashir A, Diaz GA, et al. Preparing the next generation of genomicists: a laboratory-style course in medical genomics. BMC Med Genomics. 2015;8:47.

[2]: whole-genome-barrier-300150585.html

[3]: MacArthur DG, Balasubramanian S, Frankish A, et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335(6070):823-8.

[4]: Kong A, Frigge ML, Masson G, et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature. 2012;488(7412):471-5.

[5]: You might be wondering, if you’re male, then why are you worried about a BRCA mutation? Well, although BRCA mutations are much more dangerous in women, they can also increase the risk of certain cancer types in men. For example, according to one study with 1000 participants, there is around a 5-fold increased risk for prostate cancer in men with a BRCA2 mutation. See: Kote-jarai Z, Leongamornlert D, Saunders E, et al. BRCA2 is a moderate penetrance gene contributing to young-onset prostate cancer: implications for genetic testing in prostate cancer patients. Br J Cancer. 2011;105(8):1230-4.

[6]: Zhang Y, Chen K, Sloan SA, et al. An RNA-sequencing transcriptome and splicing database of glia, neurons, and vascular cells of the cerebral cortex. J Neurosci. 2014;34(36):11929-47.

[7]: Corradi A, Fadda M, Piton A, et al. SYN2 is an autism predisposing gene: loss-of-function mutations alter synaptic vesicle cycling and axon outgrowth. Hum Mol Genet. 2014;23(1):90-103.

[8]: Greco B, Managò F, Tucci V, Kao HT, Valtorta F, Benfenati F. Autism-related behavioral abnormalities in synapsin knockout mice. Behav Brain Res. 2013;251:65-74.


Five years of spaced repetition flashcards

Attention conservation notice: Excessive navel-gazing.

First, some shameless Anki nerd bragging:

Screen Shot 2016-02-05 at 12.16.42 PM

Screen Shot 2016-02-05 at 12.16.49 PM

Screen Shot 2016-02-05 at 12.16.59 PM

Screen Shot 2016-02-05 at 12.17.06 PM

Screen Shot 2016-02-05 at 12.17.22 PM

Although I’ve been doing spaced repetition via Anki for 5 years now, I actually wanted to start doing SR flashcards about 8 years ago — approximately speaking, since 5/6/08, when I read this Wired article about Piotr Wozniak. Wozniak truly devoted himself to spaced repetition. He just generally seemed like an interesting person, his technique seemed like a good idea, and I wanted to do it [1].

For the next 2.5 to 3 years I had a pretty constant low level of guilt/anxiety about how I should be doing SR flashcards. This anxiety spiked whenever I forgot something I had previously learned or clicked on a purple Wikipedia link. Of course, after a year or so, I thought it was too late and that I was a lost cause with respect to spaced repetition.

Coincidentally, I now have a flashcard for how to solve this exact problem. It’s called “future weaponry” — instead of thinking about what you could have done with a particular tool/knowledge/ability in the past, choose to think about what you will be able to do with it in the future.

Spaced repetition flashcards weren’t really feasible, though, until I got a smartphone. Say what you want about smartphones with respect to productivity overall, but the ability to do SR flashcards on them while walking and waiting around is insanely crucial and underrated. And this basically requires easy syncing via an internet connection.

Looking back at my cards from when I started, they’re pretty terrible. For example, I used tons of cloze deletion cards, a lazy and less effective way of making flashcards, rather than thinking about what knowledge I really wanted to retain and framing it in question form. I was also obsessed with memorizing math equations even though these have been pretty much completely useless.

That said, by far the most important thing for me back then was to maintain motivation, and I’ve been able to do that so far. You can find some of my spaced repetition flashcards cards for statistics, R programming, and other topics here.

The spike in flashcards that you see in the middle of this time period was due to studying during the first two years of med school, for which I made and did a lot of flashcards. I think studying this way actually made me worse at any individual exam, since other cramming methods are more effective, but my hope is that it will help me to retain the knowledge better in the long run.

During this time, my friends and I were also studying for a big med school exam, called Step 1, which we referred to as “D-Day.” Watching the documentary Somm, I realized that is definitely what our lives were like, especially the us-against-the-world feeling. Spaced repetition flashcards were a big part of it. On any given day we usually had 400+ flashcards to do, which we called “slaying the beast.” We all ultimately passed and did pretty well on Step 1, although looking back it was more about the journey.

Since starting full-time on my PhD program a year and a half ago, I’ve been using SR flashcards for two purposes:

  1. Learning programming languages, both the syntax and the concepts. The syntax has probably been more useful, but learning vocabulary related to the concepts has also been especially high yield for knowing what to search for.
  2. Learning about my research topics, e.g., Alzheimer’s and genomics. The jury is still out on whether this is a good use of time, but in general, I would say not to sleep on the potential for it if you’re a researcher. A lot of people like to read, but there’s a lot of value to be gained from systematically reflecting upon what you’ve read, especially if you have an imperfect memory like me.

I owe a lot to Damien Elmes, who wrote the open-source, free Anki software. I also owe a lot to Gwern, who wrote about spaced repetition extensively, and who made thousands of his Mnemosyne cards available to anyone to download for free. I downloaded these one day on a whim, converted them to Anki, and that was what really made me think of having my own cards as being a realistic, practical option. Thanks, Damien and Gwern.

[1]: The other thing I remember from that article, 8 years later: I often think about how he tries to minimize how often he drove in cars.

New page on biomedical trade-offs

Throughout my first two years of med school, I was surprised by how many of the most tricky — and to me, most interesting — topics in medicine involved some sort of underlying trade-off. For example, I couldn’t understand dynamic compression of the airways pretty much at all until I realized that it was a prototypical trade-off, in that higher expiration rates help push out CO2-enriched air faster, but also lead to a higher risk of airway collapse. Today I added a new page with a lot of these biomedical trade-offs, which is currently at 16 trade-offs, but I’m planning on adding more as I learn more. Hopefully somebody will find them useful, even if it’s just my future self.

Classic Papers #1: On the diagram, by John Venn

Title: “On the diagrammatic and mechanical representation of propositions and reasonings”

Author: John Venn

Journal: Philosophical Magazine

Date published: July 1880

Builds upon: Euler diagrams

Citations (Google Scholar): 336

Citations Since 2010: 152

Best figure: Not satisfied with two sets, he jumped right to the symmetry-preserving extreme of his system and drew out a four-set intersection with a place for labels:

Screen Shot 2016-01-12 at 6.32.48 PM

This would hardly be out of place in a genomics article published today.

Best sentence: “The fact is, as I have explained at length in the article above referred to, that the five distinct relations of classes to one another (viz. the inclusion of X in Y, their coextension, the inclusion of Y in X, their intersection, and their mutual exclusion), which are thus pictured by these circular diagrams, rest upon a totally distinct view as to the import of a proposition from that which underlies the statements of common life and common logic.”

Oddest moment: “I have no high estimate myself of the interest or importance of what are sometimes called logical machines, and this on two grounds. In the first place, it is very seldom that intricate logical calculations are practically forced upon us; it is rather we who look about for complicated examples in order to illustrate our rules and methods. In this respect logical calculations stand in marked contrast with those of mathematics, where economical devices of any kind may subserve a really valuable purpose by enabling us to avoid otherwise inevitable labour. Moreover, in the second place, it does not seem to me that any contrivances at present known or likely to be discovered really deserve the name of logical machines. It is but a very small part of the entire process which goes to form a piece of reasoning which they are capable of performing.”

(No wonder Turing proposed his test of whether something was a “true” AI.)

(Runner up: Venn’s use of the word “especial” instead of “special.”)

Lasting impact: This paper is a classic that jumps directly to the tough questions of how to visualize sets and set differences and even directly addresses the utility of a sort of artificial intelligence, or as Venn calls it, a “logical machine.” And of course, it introduced what we now know as the Venn diagram.

Editorial note: This is the first entry in what will hopefully be a series of classic papers cutting across disciplines that I’m interested in. For some reason, papers don’t seem to be discussed as commonly as books in my circles, which is strange because they’re shorter, usually more novel, and more information dense. This series is an attempt to write the blog posts I want to see in the world.

Nine paradoxes with a statistical theme


  1. A drill sergeant always yells at one of his trainees when she messes up. The drill sergeant notices that after he yells at her, her performance improves. Later it turns out that the trainee is deaf, blind, and has no other way of actually noticing that drill sergeant is yelling at her. Ignoring the effect of practice, why might the trainee’s performance have improved anyway?
  2. You have 100 pounds of Martian potatoes, which are 99 percent water by weight. You let them dehydrate until they’re 98 percent water by weight. How much do they weigh now and why?
  3. Imagine that your parents had rolled a six-sided die to decide how many children to have. What did they most likely roll and why?
  4. You have access to planes that have returned from military missions and the distribution of the bullet “wounds” on the planes. Which areas should you recommend to have extra armor?
  5. Why would few people choose to play in a lottery with a small but actual probability of success with an infinite monetary expected value?
  6. Do most people have the same, more, or fewer friends than their friends have on average and why?
  7. Hypothetically, say that 80% of people dream in color, and 68% of sexual partners have the same (concordant) coloring of their dreams. If you dream in color, what’s the probability that your partner will too?
  8. Are we biased to think that cars in the lanes next to us are going faster or slower than they really are and why? 
  9. Why is the expression “the smallest positive integer not nameable in under eleven words” paradoxical?


  1. regression to the mean — the screwup is likely a random deviation below the trainee’s average, which will tend to improve on the subsequent iteration just due to random chance, regardless of any action by the drill sergeant (more here)
  2. 50 pounds, since the percentage of non-water by weight has doubled, so the overall weight must have halved (more here)
  3. they most likely rolled a six, because there’s a higher chance of you existing to observe the event in that case (more here)
  4. the areas with no damage, because of selection effects — planes that fell likely suffered an attack in a place that was untouched on those that survived (more here)
  5. because the marginal utility of money is diminishing (more here)
  6. fewer, because sampling bias means that people with greater numbers of friends have an increased likelihood of being observed among one’s own friends (more here)
  7. 80%. Basic probability theory tells us that 0.8 * 0.8 + 0.2 * 0.2 = 0.68, so the probability of dreaming in color must be independent between partners. Therefore, the probability that your partner dreams in color is simply the base rate. Some people think 68%, perhaps because they get wrapped up in the causal story. (more here)
  8. we are biased to think they are going faster, likely because more time is generally spent being overtaken by other vehicles than is spent in overtaking them (more here)
  9. there are finitely many words, so there are finitely many numbers that can be defined in under eleven words, so there must be such an integer, but since this expression itself is under eleven words, there cannot be any such integer (more here; resolved by assigning priority to the naming process either within or outside of the expression)
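For the two quantitative answers (#2 and #7), the arithmetic can be checked directly:

```python
# Quick numeric checks of answers 2 and 7 above, using only the
# numbers stated in the puzzles.

# 2. Martian potatoes: 100 lbs at 99% water means 1 lb of solids.
# At 98% water, solids are 2% of the total, so total = solids / 0.02.
solids = 100 * (1 - 0.99)
new_weight = solids / (1 - 0.98)
print(round(new_weight, 1))  # 50.0

# 7. If dream coloring were independent between partners with an
# 80% color base rate, concordance would be 0.8*0.8 + 0.2*0.2.
concordance = 0.8 * 0.8 + 0.2 * 0.2
print(round(concordance, 2))  # 0.68 -- matches the stated 68%
```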
Screen Shot 2015-12-29 at 7.50.09 PM

these are totally Martian potatoes

Five takeaways on a history of automated ECG interpretation

Follow-up to Technical debt: probably the main roadblock in applying machine learning to medicine

As a friend points out, the question “can a computer even accurately diagnose an ECG reading?” is one of the most common questions that medical students and doctors ask when the topic of machine learning in medicine comes up.

With that in mind, I found Pentti Rautaharju’s recent take on the history of automated ECG interpretation very enlightening. Here are my five takeaways:

  1. In 1991, a database of 1220 ECGs on clinically validated cases of myocardial infarctions, ventricular hypertrophies, and combinations of the conditions was used to compare 9 ECG programs to 8 cardiologists. Four of the programs were within 10% of the 67% accuracy of the reference cardiologist, and the best program had 3% higher accuracy. Notably, this doesn’t mean that the program was actually better, because we need to take into account multiple hypothesis testing. It does seem to warrant a repeat test, however.
  2. In attempting to follow up on #1, I found the dearth of healthcare provider vs. computerized interpretation comparisons in the literature surprising. I haven’t been able to find many in a search on PubMed. Instead, there’s a fairly large literature of comparisons of people’s ECG interpretation accuracy at different stages of training. Clearly there will be many ways in which computers are worse than providers (e.g., for rare diseases), but it’s important to know when, where, and why.
  3. It was interesting to me, although in hindsight not surprising, that the use of most ECG interpretation programs has been via major ECG manufacturers. It’s disappointing that the methods and accuracy of these were often not published, so that best practices couldn’t be adopted by other teams far away. Theoretically, software is scalable, but it doesn’t seem to have been harnessed in that way in this context. Related question: why is so little software in medicine open-source?
  4. The correction of computer errors in computerized ECG interpretation is called overreading. Medicare only adds $8 for overreading an ECG, and overreaders are often not available locally. Rautaharju suggests, quite interestingly, that it’d be possible for a 24-hour service to exist online where ECG overreading could be performed.
  5. Another area in which computerized approaches have a potential advantage — because they are faster — is in serial ECG interpretation. This speaks to a general point: the more longitudinal data available on a patient, the more value a machine learning-inspired approach is likely to be able to provide.
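As a rough illustration of the multiple-testing caveat in takeaway #1, here is a back-of-the-envelope two-proportion z-test. This is my own sketch, treating the accuracy figures as simple proportions over the 1220 ECGs, which the original study design may not support exactly:

```python
import math

# Is the best program's 3% edge over the 67% reference-cardiologist
# accuracy significant on 1220 ECGs? (Back-of-the-envelope only.)
n = 1220
p_cardiologist = 0.67
p_program = 0.70

# Two-proportion z-test with a pooled proportion estimate.
p_pooled = (p_cardiologist + p_program) / 2
se = math.sqrt(p_pooled * (1 - p_pooled) * 2 / n)
z = (p_program - p_cardiologist) / se
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# z is about 1.6 and p is about 0.11: not significant at the 0.05
# level even before correcting for the 9 programs compared.
print(round(z, 2), round(p_value, 3))
```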


Rautaharju PM. Eyewitness to history: Landmarks in the development of computerized electrocardiography. J Electrocardiol. 2015.


Six weeks of exercise leads to increases in myelination

An interesting study by Thomas et al. recently measured the effect of six weeks of exercise on neuroimaging measures. Contrary to expectations, they found that cerebral blood volume did not significantly change, while white matter volumes did show a significant increase. This suggests that the increases in cognition and hippocampal volume that occur following exercise may be due more to myelin changes than to general blood flow changes in the brain per se.

Since blood perfusion is especially poor in white matter regions, it may be that their study wasn’t able to detect fine-grained improvements in blood flow following exercise (which we would have expected from theory and previous studies). Instead, it’s possible that their study detected increased white matter volume as a key consequence of improved blood flow selectively in the poorly perfused white matter regions. Either way, this is interesting data and helps illuminate the mechanisms behind a healthy amount of aerobic exercise.


Thomas AG, Dennis A, Rawlings NB, et al. Multi-modal characterization of rapid anterior hippocampal volume increase associated with aerobic exercise. Neuroimage. 2015.

Technical debt: probably the main roadblock in applying machine learning to medicine

Attention conservation notice: These are just loose thoughts on a topic that I’ve been thinking about for a few years. It’s important to point out that I’m far from an expert in either machine learning or medicine. Also, I’m not a doctor.

When I was applying to med school four years ago, I often wondered why machine learning wasn’t used more commonly in medicine. I thought to myself, if I can get a statistical recommendation of what movie I will most likely enjoy from Netflix, why can’t I get a statistical recommendation for what treatment will most likely help me at the doctor?

As I went through the first two years of med school and memorized countless facts about physiology and pathophysiology, I occasionally did some more research on the topic. For example, when we were learning about interpreting ECGs, I read about trials showing that the best computer programs for ECG interpretation nearly matched the accuracy of cardiologists. And after we learned about all of the antibiotics with which you treat each of the major infections, I read about the trials showing that a computer-based system outperformed experts in choosing an antimicrobial to treat meningitis.

Lest you think that these are new capacities that medicine hasn’t had a chance to catch up with yet, let’s be clear about the publication dates on those articles: the ECG comparison was published in 1991, and the infectious disease treatment comparison was published in 1979. That’s long enough for one of the authors of the latter paper, Robert Blum, to spend a whole career practicing emergency medicine.

Indeed, over the past 20-40 years, due to improvements in computing and statistics, our capacity to do accurate machine learning seems to have gotten way, way better. Computers now beat humans in checkers and chess. Computers can drive cars in six U.S. states. As a sign of the times, software recruiting agencies, looking for people more interested in a diverse set of topics, report that there is an “epidemic” of interest in machine learning among college graduates.

So why do healthcare professionals still get paid in part for their ability to look at an image and make a judgment call about what is going on in it, or do one of the other countless things in medicine that could theoretically also be done by a computer?

A cynical explanation is that healthcare providers such as doctors are blocking the application of machine learning to medicine out of self-interest. But for a number of reasons, I think that’s misguided [1]. Medicine is not perfect, but I think almost all healthcare providers really do want to give better healthcare to their patients. To wit, many of the articles written about new types of machine learning in medicine are written by healthcare providers who would like to see improvements in clinical practice. Moreover, healthcare providers don’t ultimately call the shots — hospital and reimbursement administrators do. And it seems that most of them would be quite happy — for good reason — to find ways to reliably decrease cost while maintaining quality of care, even if it required decreasing the hours of or even firing some employees.

Instead, I think the main reason that machine learning hasn’t been applied very broadly in medicine is one that I had a vague feeling about but couldn’t really articulate well until I read Sculley et al.’s 2015 paper, “Hidden Technical Debt in Machine Learning Systems” — pdf here.

Technical debt is a wonderful phrase which nicely conveys the idea that in software development you can often make quick progress by cutting corners to ship your code, such as ignoring edge cases or skipping exhaustive unit tests, but those shortcuts will eventually need to be paid off in more time and hours of development down the road.

In their article, Sculley et al. describe how technical debt is particularly pernicious in machine learning, because in addition to the software side, which can decay, machine learning also depends on the data on which the models are trained. That data, too, can decay and develop a host of problems as time goes on. The authors point out a number of specific ways in which machine learning algorithms can decay over time, and here are the four that I think are most relevant to medicine:

  1. Entanglement: This refers to the phenomenon whereby a shift in one input variable is liable to affect the other variables in the system. Since no two input variables are ever truly independent, changing the distribution of one variable is likely to dramatically alter the predictive ability of a pre-specified machine learning algorithm. In medicine, this is a huge problem, because disease dynamics and correlations between medical variables can change rapidly. To take infectious disease as an obvious example, consider the recent increase in chikungunya virus in the U.S. Or consider the impact of a particularly severe flu season, +/- an effective vaccine. And as treatments, diagnosis rates, or risk factors for chronic diseases change, their distributions are also liable to become radically different over short timeframes. Because nearly everything in medicine is correlated with everything else, Sculley et al.’s CACE principle, “Changing Anything Changes Everything,” very much applies to any machine learning algorithm in medicine.
  2. Unstable Data Dependencies. The input data sources for any machine learning algorithm in medicine are likely to be messy and capricious. An important example of this is electronic medical records, especially the patient notes that healthcare providers write about patient encounters. These notes are highly unstructured and nearly impossible to parse, which is why a common summer research project for med students is to read through a bunch of charts, extract the relevant data from each patient, and put it into a spreadsheet. Any NLP approach here will need to be updated for different locations, times, and healthcare providers. And this messy, unstable data is likely to remain a difficult problem for a long time, because the huge need for privacy makes health data less likely to be open and thus standardized.
  3. Feedback Loops. Let’s imagine that a machine learning algorithm started to be used for diagnosis or management of a set of patients. How would this change the input data streams and/or how would it change the output of other machine learning algorithms in medicine? Theoretically, this could and probably should be addressed via a simulation-based “sandbox” that contains the full ecosystem of algorithms in a given health system along with simulated patient encounters. I don’t know of any such system, however.
  4. Abstraction Debt. The lack of abstract ways of understanding the operations of most machine learning algorithms is striking. Indeed, many machine learning competitions are won by teams that use ensembles of methods, none of which can be understood well even individually. Especially in the transition period when machine learning methods become more commonly used in medicine, it’s going to be crucial to figure out useful abstractions so that the people actually applying the algorithms to individual patients can understand them, and thus make sure they are working as expected and debug any problems.
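To make the entanglement point concrete, here’s a toy calculation (my own sketch, not from the Sculley et al. paper): even if a classifier’s parameters are completely frozen, a shift in a single distribution, disease prevalence, changes how much you can trust its positive calls.

```python
# Toy illustration (mine, not from Sculley et al.): a diagnostic classifier
# with fixed sensitivity and specificity sees its positive predictive value
# (PPV) collapse when disease prevalence shifts, even though the model
# itself never changed.

def ppv(sensitivity, specificity, prevalence):
    """Probability that a positive prediction is a true positive."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A "good" flu classifier: 90% sensitive, 95% specific,
# calibrated during a severe flu season (10% prevalence)...
print(round(ppv(0.90, 0.95, 0.10), 2))  # 0.67

# ...then deployed unchanged in a mild season (1% prevalence):
print(round(ppv(0.90, 0.95, 0.01), 2))  # 0.15
```

The model’s sensitivity and specificity never changed; only the world did. And this is the mildest possible version of CACE, since real entanglement between correlated clinical variables is far messier than a single prevalence shift.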

None of these four factors have been robustly addressed by any of the machine learning approaches to medicine that I’ve seen. And what type of system would solve them? For starters, here’s a quick wish list of features for a robust and healthy machine learning algorithm that would be used in medicine:

  1. Consistently higher sensitivity and/or specificity than a realistic, collaborating team of humans [2].
  2. Validated to work on novel, quality-controlled, gold-standard data sets sampled from current patient populations every so often (e.g., every year).
  3. Stably accurate over a number of years, both retrospectively and prospectively.
  4. The methods should be published in the open and peer-reviewable. To me, it doesn’t have to be published in a journal — a paper on a pre-print server would work too. The key is to allow for post-publication peer review.
  5. Open-source, in part so that it can be picked apart for possibly unstable predictor variables.
  6. Alongside open-source, its results must be reproducible. For example, this means that methods which rely on random seeds must have enough samples so that the starting point is irrelevant.
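As a minimal sketch of wishlist item 6 (my own example, not from any particular system), here’s what “enough samples so that the starting point is irrelevant” looks like in practice: report a fixed seed for exact reproducibility, and use enough samples that swapping seeds barely moves the estimate.

```python
import random

# A minimal sketch (mine) of wishlist item 6: report a fixed seed for
# bit-for-bit reproducibility, and use enough samples that the choice of
# seed barely affects the result.

def estimate_accuracy(seed, n):
    """Stand-in for 'evaluate a model on n randomly sampled cases',
    where each case is answered correctly with true probability 0.8."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.8 for _ in range(n)) / n

# Same seed: exactly identical results (reproducibility).
assert estimate_accuracy(42, 10_000) == estimate_accuracy(42, 10_000)

# Different seeds with a large n: nearly identical results
# (the starting point is irrelevant).
print(abs(estimate_accuracy(1, 100_000) - estimate_accuracy(2, 100_000)) < 0.01)  # True
```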

Although none of these wishlist items are easy or common, to me, the hardest part of this is likely #2: the data collection and validation. How are we going to assemble high-quality, pseudo-anonymized, gold-standard data sets that can be used by, at a minimum, authorized researchers? And how can we make sure that this happens year-in and year-out? That won’t be easy.

That said, long-term, I’m quite bullish on machine learning in medical diagnosis and management. I think it will happen, and I think it will be a great advance that improves well-being and saves lives. Indeed, as machine learning’s technical debt problems are addressed adequately in other domains, that will increase societal confidence in and motivation for applying it to medicine as well. Face recognition systems (e.g., on Facebook) and self-driving cars both come to mind here.

To summarize, machine learning has long had a lot of potential in medicine. But in order for it to be really utilized, we’ll need to develop much better data sets and algorithms that can overcome the problems of technical debt that would otherwise accumulate. Until then, I guess us med students will just have to keep on memorizing countless facts for our exams.


[1]: Though to be fair, since I’m in med school, I’m obviously biased on this point. I feel like I’m adjusting for this, but maybe I’m not adjusting enough.

[2]: Because if your ML algorithm can’t beat humans, then what’s the point?


Sculley D, et al. Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems 28 (NIPS 2015).

Willems JL, Abreu-Lima C, Arnaud P, et al. The diagnostic performance of computer programs for the interpretation of electrocardiograms. N Engl J Med. 1991;325(25):1767-73.

Yu VL, Fagan LM, Wraith SM, et al. Antimicrobial selection by a computer. A blinded evaluation by infectious diseases experts. JAMA. 1979;242(12):1279-82.

Single formaldehyde-fixed cells retain strong epigenomic signals

Attention conservation notice: Basically only interesting if you care about the preservation quality of cells and tissues following formaldehyde treatment.

This past week, Jin et al. published good evidence that the epigenome of thyroid follicular cells is maintained after formaldehyde fixation, at least in a subset of single cells.

As you can see in Fig 1b in the original paper, the periodic DNA hypersensitivity pattern is retained in single formaldehyde-fixed postmortem cells.

Just as a nerdy sidenote, as far as I understand, this periodicity is due to DNA wrapping around the histone core particle in nucleosomes.

Notably, this paper does not show that formaldehyde leaves the epigenome of cells entirely unaltered, but it does show that fixation does not radically alter the structure.

I personally am most interested in how these results translate to brain cells and tissue. Here are a couple of points on that front:

1) Formaldehyde-preserved tissue has already been shown to preserve epigenomic signals, such as those marked by anti-DNA-methylation antibodies.

2) However, single thyroid follicular cells and single brain cells might differ in the variability of their responses to formaldehyde, and that (to the best of my knowledge) has not yet been tested.

Anyway, this paper was interesting to me because the epigenome holds a lot of information about the function of brain cells, and can even specify some synapse information relevant to memory. The ability to preserve this via formaldehyde is an important point for designing research studies.

Clinical trial evidence related to calling Alzheimer’s “Type 3 Diabetes”

Attention conservation notice: These are just a few impressions from only 2-3 years of following the AD field — I’m certainly not an expert in any of this. Also, I am not a doctor.

The evidence behind calling Alzheimer’s “type 3 diabetes” is at least two-fold. First, insulin and insulin-like growth factor signaling is thought to be reduced in AD patients. Second, diet and vascular risk factors are strongly linked to AD diagnosis.

In this post, I’m trying to learn more about the idea and figure out how valuable it is as a framework for AD.

First, let’s look at some history. On PubMed, there are 48 results for “type 3 diabetes” OR “type III diabetes”, which, out of 83,658 total hits for “Alzheimer’s”, is not that many.

The first mention on PubMed is in 2000. Although I don’t have access to the full-text, I don’t think it’s about Alzheimer’s.

Instead, it seems that the first mention is in 2005, by Steen et al. [1] — and indeed, in their abstract they say that they are coining the term. Their argument is that insulin-related proteins, especially IGF-I and IGF-II, have reduced expression in postmortem human brain tissue from AD patients, suggesting a lack of sufficient insulin in the brain.

To get a sense of how it has seeped into the public consciousness, let’s look at how much people are searching for this term using Google trends:


searches for “type 3 diabetes” on Google trends

So I don’t think it is old news.

Yesterday, Gabrielle Strobel from AlzForum reported on two pieces of data from CTAD 2015 that seem relevant to the case for Alzheimer’s as type 3 diabetes:

  1. Metformin (a drug that increases insulin sensitivity and is used to treat type 2 diabetes) was ineffective in improving cerebral perfusion in 20 patients with mild-to-moderate AD.
  2. Intranasal detemir (a long-acting insulin analog) did not improve memory or MRI volume, but regular insulin did.

There are also a couple of other relevant pieces of data:

  1. Previously, numerous studies have shown that intranasal insulin leads to memory improvements in AD patients (e.g., here, here, and here).
  2. The SNIFF trial is a study of a year of twice-daily intranasal insulin with 240 enrollees nationwide. The results are due in February 2016, so we should hear about them soon.

Notably, intranasal insulin has also been shown to improve cognition in non-AD trials:

  1. It improves cognition in people with type 2 diabetes (e.g., here).
  2. In rats, it improves cognition in normal aging (here).
  3. It improves some measures of cognition in healthy subjects age 18-34 (n = 38; here; note: this study did not correct for multiple hypothesis tests).
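As a quick aside on that last caveat (my own sketch, not an analysis of the cited study): without correction, testing many cognitive measures at p < 0.05 makes a spurious “hit” quite likely, which is exactly what corrections like Bonferroni’s guard against.

```python
# Quick sketch (mine, not an analysis of the cited study): with m outcome
# measures and no true effect anywhere, the chance of at least one
# spurious p < 0.05 "hit" grows quickly with m.

def prob_any_false_positive(m, alpha=0.05):
    """P(at least one significant result among m independent null tests)."""
    return 1 - (1 - alpha) ** m

print(round(prob_any_false_positive(1), 2))   # 0.05
print(round(prob_any_false_positive(10), 2))  # 0.4

# The Bonferroni fix: test each of the m measures at alpha / m instead.
def bonferroni_threshold(m, alpha=0.05):
    return alpha / m

print(round(bonferroni_threshold(10), 3))  # 0.005
```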

Mixed vascular and Alzheimer’s dementia is probably the most common form of dementia (e.g., see here), and vascular dementia is often related to poor perfusion, due in part to insulin-related metabolic problems. So it makes sense that intranasal insulin could improve cerebral vascular function and thereby appear to help the memory of AD patients, when in reality it is helping the mixed vascular dementia component.

So although I definitely hope that the SNIFF trial shows positive results on cognition, even in that case I don’t think it would necessarily be fair to call it a “win” for the case for Alzheimer’s as type 3 diabetes.

For that to be the case, I’d want to see better data that not only is cognition improving following insulin treatment, but also that measures of AD pathology such as amyloid and tau are improving. This seems to be one of the major goals of the SNIFF trial, since they are also measuring amyloid and tau in the CSF of the enrollees.

So, in an attempt to quantify my actual beliefs, and knowing absolutely nothing about the SNIFF trial itself (i.e., this is based solely on public info, mostly listed above), I predict with 60% probability that the trial will show a significant improvement in cognition. I also predict that neither CSF Abeta nor Abeta/tau ratios will change significantly based on treatment, this time with 75% probability.
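As an aside, one nice thing about stating beliefs as probabilities is that they can be scored once the results are in. Here’s a small sketch (my own, with hypothetical outcomes) using the Brier score, i.e., the mean squared error between stated probabilities and 0/1 outcomes, where lower is better and always guessing 50% scores 0.25.

```python
# Sketch (mine, with hypothetical outcomes): scoring stated probabilities
# with the Brier score, the mean squared error between forecast
# probabilities and 0/1 outcomes. Lower is better; always saying
# 50% scores 0.25.

def brier(forecasts):
    """forecasts: list of (stated_probability, outcome_as_0_or_1)."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# Hypothetical SNIFF outcome: cognition improved (1), CSF markers did not
# change (0). My forecasts above: 60% for a cognition improvement, and
# 25% for the markers changing (i.e., 75% for no change).
print(round(brier([(0.60, 1), (0.25, 0)]), 3))  # 0.111
```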

Since the results are meant to be completed in February 2016, we should know the actual outcome by the end of 2016 at the latest.

Although the terminology, classification, and mechanisms around AD are extremely important for research priorities, the most important thing is to get better therapies for all types of dementia into clinical practice ASAP. And on that note, hopefully intranasal insulin will turn out to be a really valuable therapy for patients.