To what extent does whole genome sequencing add value above SNP arrays?

Attention conservation notice: I wrote this as the final essay for my course on personal genomics with Michael Linderman at Mt Sinai. The main question of the essay was: what is the point, in 2016, of getting your whole genome sequencing (WGS) data, if you already have your SNP data? Overall, I found analyzing my WGS data an interesting experience, but the vast majority of known genomic info is still at the SNP level, and there are some bugs in contemporary variant callers that make WGS calls more likely to be false-positives, as I experienced first-hand.

This fall, I was lucky enough to be a part of a course at ISMMS where we learned about genomics by analyzing our own whole genome sequencing results, which was graciously paid for by the school [1]. Amazingly, the cost of genome sequencing dropped from $5000 to $1500 (or even $1000) in just this past year, but it’s still a significant investment in our education by ISMMS and I appreciate it. According to the course director, Michael Linderman, there’s only on the order of around 1000 people with access to their own whole genome sequencing results and the ability to interpret them, which puts us in a pretty small, fortunate group. That said, it probably won’t be a small group for long, since just over the past few months, Veritas Genetics announced that it will offer WGS alongside analysis commercially for a new low-price of $999 [2].

I already had access to single-nucleotide polymorphism (SNP) array results from 23&Me, so a very basic question was what kind of data I could get from having my whole genome sequenced that I didn’t already have access to. First, some terminology: the difference between a SNP and a normal genetic variant is that the alternate allele of a SNP must be present in at least 1% of the population. Not surprisingly, most of the papers published about genomics on PubMed study the effect of SNPs, in large part because those are the variants for which there is sufficient power to address biomedical questions robustly. So I already had access to the majority of the well-studied variants through my SNP data. So from one perspective, going from the ~300,000 SNPs that I got from 23&Me to the ~3,000,000,000 base pair calls in the human genome seems like a classic case of the big data trap: collecting more data without any point. And I’ll freely admit that I’ve fallen victim to this tendency at least a few times in my life.

Upon a little bit more literature and soul searching about what I expected to learn, it became apparent that what whole genome sequencing is best at is detecting very private variants – that is, unsurprisingly, things that are present in less than 1% of the population. Any such rare variants that I would found might be present just in my immediate family a countable number of generations back, or they might even be found only in me. But these rare variants can add up to a fairly non-trivial number. As it turns out, the average person has about 100 heterozygous loss of function variants, which includes stop insertions, frameshift mutations, splicing mutations, and large deletions [3]. And since my dad was on the older side when I was born, and older male age is associated with more new genetic variants [4], I knew that I was liable to have an especially large burden of new variants.

On the big day when our sequences had been finally aligned and the variants had been called, the first thing I did was to filter those variants down to the 2000 or so ones that were most likely to be damaging. I scanned down the gene list meticulously, looking for gene names that I recognized. Since I had to memorize a fairly large number of disease-causing genes during my preclinical med school courses, I figured recognizing a gene name would in general be a bad sign. I was relieved and felt lucky to discover no major disease-causing mutations in genes that I knew would cause major disease, such as the cancer-promoting genes BRCA1/2 [5]. Overall this process was not very efficient, but it was pretty fun.

The next time that I sat down to analyze my genetic variants, I decided to filter for variants that were likely to have an effect on the way I think. So I intersected the genes in which I had predicted function-altering variants with another list from a study [6] that measured which genes have the highest RNA expression – a proxy for “are made the most” – in neurons. Here’s a plot of the results:

Screen Shot 2016-02-22 at 6.31.02 PM

The green dot represents the gene in which I have a predicted damaging mutation with the strongest expression in neurons, which is the gene SYN2. The protein that this gene codes for is thought to be selectively produced in synapses, where it probably plays a role in synaptic vesicle transport [7]. Synaptic vesicles, in turn, are what neurotransmitters are stored in before are they are released into the synaptic cleft to communicate with the postsynaptic neuron. You might think of them as the “cargo trucks” of the synapse, storing and carrying around the payload of neurotransmitters before they are sent to the next neuron. So naturally, I became curious about what the effect of that variant might be.

First, I took a look at what my actual predicted variant in the SYN2 gene was. Specifically, I was predicted to have a frameshift mutation, due to the deletion of a CGCGA sequence at chromosome 3, position 12,046,269. In general, frameshift mutations are pretty cool. DNA is made into proteins three nucleotides at a time, so mutations in multiples of three only alter a small number of amino acids. But if a frameshift mutation messes up this three nucleotide reading frame, then the whole rest of the protein is totally different. What was predicted to happen in my version of the SYN2 protein is that, 66 nucleotides later after the frameshift, a new stop signal was introduced. So I would have 22 amino acids in my version of SYN2 that are not found in most people, and then the protein was predicted to end. Although it’s fun to speculate that maybe those 22 amino acids could turn me into a mutant supergenius if I could just learn how to tap into its mythical synaptic powers, most likely my predicted mutant version of SYN2 would be simply degraded. And since I’m predicted to be heterozygous for the mutation, my non-mutated version of SYN2 could simply pick up the slack. That said, in the absence of compensation, I’d be expected to have ~50% less of this key synaptic protein than the average person.

Naturally, next I did a search for the functional role of a loss of function mutation in SYN2. The first paper I found [7] had the suddenly ominous title: “SYN2 is an autism predisposing gene: loss-of-function mutations alter synaptic vesicle cycling and axon outgrowth.” Specifically, this paper showed that two missense (amino-acid changing) and one frameshift mutation were found in male individuals with autism spectrum disorder, but none were found in male controls with autism spectrum disorder. They also showed that neurons lacking SYN2 have a lower number of synaptic vesicles ready to be released from their synapses, which is consistent with the predicted role of SYN2. I had some qualms about this paper, like the fact that they extrapolated from SYN2 homozygous knock-out mouse studies to humans that were heterozygous for a loss-of-function variant in SYN2, and indeed the mouse study that they built upon did not find a phenotype in SYN2 heterozygous knock-out mice [8]. But overall, this study was a sign that my predicted frameshift mutation might really be playing a significant functional role.

Given that I also had access to SNP data from both of my parents through 23&Me, my next step was to find out which of them I inherited the predicted SYN2 frameshift variant from, so that I could figure out which of my parents I would be able to subsequently blame for all of my problems. But this is where things took another unexpected turn. In order to discover which of my parents was the culprit, I had to analyze the raw reads in the Integrated Genome Viewer (IGV), to find another tagging SNP that I could also see in the data from 23&Me. But when I actually looked at the reads, what I discovered here instead was way more homozygous variation (seen via the single-colored vertical lines) relative to the reference genome than I expected:

Screen Shot 2016-02-22 at 6.30.38 PM

This homozygosity of the variants is surprising and makes us suspicious that maybe there’s something going on other than just the mutation – maybe there was a problem in aligning my reads to the reference genome. And indeed, for technical reasons that are beyond the scope of this essay, in class we aligned to the hg19 build of the reference genome, which as it turns out, happens to differ from the hg38 reference genome at this region pretty substantially. And when I aligned one of the individual sequencing reads against the hg38 reference at this location, what I detected was not a deletion, but rather an insertion of 12 base pairs. Since 3 divided by 12 is a whole number, 4, that means that this is an in-frame mutation, which is much less likely to have the serious loss-of-function effect that a frameshift mutation would. And indeed, looking at the DNA sequence that was inserted, it appears that the insertion is probably due to a tandem repeat, with one mismatch:

Screen Shot 2016-02-22 at 6.29.56 PM

So, to recapitulate, analyzing the raw reads using the updated reference genome, I found out that likely I do not have a frameshift mutation in SYN2 after all. That said, the potential presence of a tandem repeat expansion within the coding sequence of SYN2 – leading to four extra amino acids in that protein – is itself pretty interesting and could still have some sort of a biological effect. After all, this protein is likely a key component of the cargo truck for my neurotransmitters.

In summary, I think I can say that if you’ve had your SNP data analyzed, that’s going make up the lion’s share of digestible information. However, there are likely to be some interesting things for you to learn from having your WGS data analyzed as well. First, although I didn’t/haven’t yet found any rare variants in my genome that might significantly increase my risk of disease in a potentially actionable way, I certainly could have. You don’t bring a life jacket on a boat because you think you’re going to fall overboard – you bring it because you might. Second, it was enlightening to learn first-hand about the lack of adequate tools for analyzing genomes, especially at the variant calling and variant analysis steps. We really are in the Wild West era of genomics. This is both exciting and motivating. I now have a better idea of what it is like to have a likely false positive variant call like I had with SYN2.

Finally, getting your genome sequenced isn’t just about your own health – it’s also about your family’s health and the health of society at large. For example, I’m also in the process of donating my whole genome sequencing data to the Personal Genome Project (I’ve already put up my VCF file). If you have access to SNP data and/or you want to try to have your whole genome sequenced, and you are willing to make the data publically available, then you should consider joining too. I think that by pooling genome and phenotype data in an open way, we’re going to make some discoveries that will improve human health in a big way.


[1]: Linderman MD, Bashir A, Diaz GA, et al. Preparing the next generation of genomicists: a laboratory-style course in medical genomics. BMC Med Genomics. 2015;8:47.

[2]: whole-genome-barrier-300150585.html

[3]: Macarthur DG, Balasubramanian S, Frankish A, et al. A systematic survey of loss- of-function variants in human protein-coding genes. Science. 2012;335(6070):823- 8.

[4]: Kong A, Frigge ML, Masson G, et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature. 2012;488(7412):471-5.

[5]: You might be wondering, if you’re male, then why are you worried about a BRCA mutation? Well, although BRCA mutations are much more dangerous in women, they can also increase the risk of certain cancer types in men. For example, according to one study with 1000 participants, there is around a 5-fold increased risk for prostate cancer in men with a BRCA2 mutation. See: Kote-jarai Z, Leongamornlert D, Saunders E, et al. BRCA2 is a moderate penetrance gene contributing to young-onset prostate cancer: implications for genetic testing in prostate cancer patients. Br J Cancer. 2011;105(8):1230-4.

[6]: Zhang Y, Chen K, Sloan SA, et al. An RNA-sequencing transcriptome and splicing database of glia, neurons, and vascular cells of the cerebral cortex. J Neurosci. 2014;34(36):11929-47.

[7]: Corradi A, Fadda M, Piton A, et al. SYN2 is an autism predisposing gene: loss-of- function mutations alter synaptic vesicle cycling and axon outgrowth. Hum Mol Genet. 2014;23(1):90-103.

[8]: Greco B, Managò F, Tucci V, Kao HT, Valtorta F, Benfenati F. Autism-related behavioral abnormalities in synapsin knockout mice. Behav Brain Res. 2013;251:65- 74.


Genetic Associations with Neuroimaging Anatomic Variability

Gutman et al. recently published an interesting paper about the role of genetics in explaining surface variability in neuroimaging-defined structures. It’s common to consider the role of heritability in explaining facial structure — such as nose shape — but, perhaps since the brain is not so easily seen, it’s less common to imagine that heritability could explain variability in brain structures. And it raises the question, could this help explain variability in susceptibility to diseases such as Alzheimer’s as well?

Twelve Interesting Recent Papers

1) Wootla et al. discussing naturally occurring antibodies for treatment of CNS disorders. Naturally occurring antibodies are mainly IgM and bind to many different types of antigens with low affinity (that’s what happens when you don’t do any affinity maturation). One idea is that elderly people without AD (but with, say, risk factors such as APOE) may have more of these antibodies, that help clear amyloid, and that’s why they haven’t developed AD. In fact, one of the more promising current treatments for AD in trials, aducanumab, was originally derived from elderly donors without AD based on this hypothesis. A similar procedure is also being done in MS — e.g., the authors describe some antibodies that bind specifically to oligodendrocytes with the goal of promoting remyelination.

2) Cummings et al. describing good phase II trial results for dextromethorphan + quinidine for agitation in AD. Aside from being excited about a potential new treatment for an aspect of AD, I find this particularly interesting since I was previously involved in a project that evaluated the effects of recreational doses of DXM in the comments of YouTube videos. However, the recreational doses are much higher than the doses in this study (> 200 mg vs 30 mg, respectively), so the effects are probably radically different — as always, the dose makes the poison.

3) A couple of papers recently came out purporting to explain the role of the ApoE risk variant in AD, which is very important but still very much unknown. First, a really interesting paper from Zhu et al. shows that in APOE ɛ4 carriers, synj1 expression increases, which decreases the expression of phospholipids such as PIP2. This is similar to an ApoE-null phenotype, suggesting a loss of function phenotype. Second, Cudaback et al. show that ApoE allele status affects the astrocyte secretion of the microglial chemotaxis factor CCl3. Interestingly, the ɛ4 and ɛ2 alleles have a more similar effect than ɛ3 in their data.

4) Turner et al. present results from an RCT of resveratrol for AD, which finds some good effects in biomarkers, but is not a home run clinically. Although with only 119 participants, it is likely underpowered, and one of the four clinical measures had a p = 0.03 effect in the correct direction.

5) Tom Fagan at AlzForum does nice reporting on results from PET and neuropathology showing that, by both measures, around 25% of people clinically diagnosed with AD do not have high amyloid levels. This is higher in ApoE e4 non-carriers, which is what you’d expect based on conditional probability and clinicians not taking into account ApoE allele status into account when making their diagnosis. In the absence of amyloid, neurodegeneration appears to be fairly slow or absent.

6) Dale Bredesen continues his innovative work in AD, describing here case reports suggesting that there are three types of AD, one inflammatory, one metabolic (e.g., related to insulin resistance), and one related to zinc deficiency.

7) Moran et al. use ADNI data to show that Type 2 Diabetes is associated with CSF tau (explaining 15% of the T2DM-associated cortical thickness loss), but not CSF amyloid, suggesting that T2DM might be related to tau-only AD cases, and/or tau increases that are independent of amyloid.

8) Not AD, but still neurological, in frontotemporal dementia, Ahmed et al. report that fasting blood levels of agouti-related peptide (AgRP) are much higher in patients (~66.5 +/- 85) than in controls (~23 +/- 20). Furher, AgRP levels are correlated with BMI, suggesting that AgRP levels account for the increased eating behavior seen in some variants of FTD. Just interesting to see an example where the effect of hormones on eating behavior could be very strong.

9) Petrovski et al. used WGS data to define an interesting measure of “how tolerant a gene’s regulatory region has been to mutation across evolution.” Specifically, their measure (the “noncoding Residual Variation Intolerance Score”) measures how many common variants a gene has in its regulatory region compared to other genes with a similar mutation rate. They found that higher levels of this measure were significantly associated with genes that are annotated as haploinsufficient, meaning that this is a good way of describing how much cells care about what relative expression levels a gene has.

10) Zheng et al. also used WGS data and found that rare variants in the gene EN1 are significantly associated with the risk of bone fracture. To quantify the effects of rare variants (< 5% MAF) they also used an association test — SKAT — to measure associations of these variants with bone marrow density in windows of 30 bp’s, and found one significant gene with this procedure. Refreshingly they put their code for this analysis online, available here, I haven’t ran it but just want to say +1 to them for putting their wrapper code online. Interestingly, both this paper and the Petrovski paper use the GERP++ score for their evolutionary inference — that seems to be a common tool, check it out here.

11) In influenza news, Lakdawala et al. show that influenza A does a large amount (most?) of its replication in the soft palate, which is the fleshy, soft part in the back of your mouth. Total hindsight bias, but this “makes sense” to me when I think back to the times when I think I had the flu myself — that part of my mouth gets very irritated, and now this makes slightly more sense.

Generating new protein sequences with a character-level recurrent neural network

This past weekend, using Andrej Karpathy’s outrageously simple and helpful github repository [1], I trained a recurrent neural network on my laptop [2].

If you are reading this post in part because you want to do a similar thing, rest assured that by far the most time-consuming part was installing Torch7.

Well, don’t rest totally assured, because that process was actually pretty annoying for me [3].

Anyway, I wanted to train a character-level neural network because I was really impressed by Andrej’s generated Shakespearian sonnets, as well as Talcos’s generated MTG cards.

But, I wanted to train the RNN on something more biological. So as input data I used a 76 MB fasta file of the full set of human protein sequences, which is available for download via ftp here.

As with any fasta file, this is the format of the input data (I know this is boring, but it will become important later):

>gi|53828740|ref|NP_001005484.1| olfactory receptor 4F5 [Homo sapiens]


>gi|767901760|ref|XP_011542107.1| PREDICTED: uncharacterized protein LOC102725121 isoform X2 [Homo sapiens]


The lines with carets contain metadata about the following protein sequence.

The protein sequences themselves are made up of capital letters which refer to one of the twenty main amino acids that are coded for by the human genome.

After changing to the directory that I cloned Andrej’s github directory into, I trained the neural network using the following call:

th train.lua -data_dir data/sapiens_fasta/ -gpuid -1 -max_epochs 1 -eval_val_every 40

One epoch means that the neural network is run once through the input file, which almost certainly not enough times through the data for it to learn truly useful trends, but I was limited on CPU time, and running it through just once took almost 40 hours.

This RNN has no structural priors and certainly doesn’t know English, so it has to learn everything fresh from the data.

After 1/100 through the epoch, the RNN was struggling to learn basic aspects of the file format:




After 1/10 through the epoch, the RNN had the format mostly down, and had even learned that “XP” came before a predicted protein, while “NP” came from known (e.g., cDNA sequence) protein data.

>gi|670774209|ref|XP_011526108.1| PREDICTED: neinator uxtylcerin-1 isoform X1 [Homo sapiens]


>gi|115370400|ref|NP_001185001.1| oimh meeroribh CAE3 pomolphinase CA-ase popotecusution maclor domaiongbnating protein 2Hisoform 309 isoform X2 [Homo sapiens]


As you can see, it had also learned that (almost) all protein sequences start with an “M”, since methionine is coded for by the start codon, AUG, which the ribosome recognizes.

Finally, by the end of the epoch, the RNN has the format down and was predicting some protein names that are kind of in the “uncanny valley”:

>gi|38956652|ref|NP_001018931.1| hamine/transmerabulyryl-depenter protein 1 isoform 2 [Homo sapiens]


>gi|767959205|ref|XP_011517870.1| PREDICTED: D3 glutamate channel protein isoform X5 [Homo sapiens]


This RNN has a particular penchant for combining parts of names, and some of these actually make sense, like “receptorogen”, or “elongatase.”

I blasted ~10 of the protein sequences trained on the full epoch to see whether they had any evolutionary conservation, but none of them had any conservation above chance, suggesting that the RNN isn’t just repeating protein sequences.

I also did structure predictions on one of the generated protein sequences, and it is made up of one protein domain with a good template from the Protein Data Bank (PDB).

rnn pdb Here is what the protein is predicted to look like:


The arrows are a common way of referring to alpha helix secondary structure, which the generated protein has a reasonable amount of (31%; the average globular protein contains 30%).

It’s interesting to think about what applying RNNs on protein sequences or other sorts of biological data might accomplish.

For example, you could potentially feed the RNN a list of many anti-microbial proteins as training data, to try to generate new peptides that you could test as novel antibiotics.

[1]: See also this post where Andrej explains it in more detail and uses it to predict Shakespeare sonnets.

[2]:  A non-CUDA compatible MacBook Pro.

[3]: How to do this on a MacBook pro that is non-CUDA compatible: a) install/update homebrew, b) install/update lua, c) follow the command-line instructions at the torch7 website to install it, d) run source ~/.profile to load torch7 at the terminal (I ignored this part of the instructions and this is part of what made it take so long for me), e) get the necessary luarocks, f) fork/clone Andrej’s repo, and g) run the nn training and sampling commands with the -gpuid -1 option, since your machine is non-CUDA compatible (I also ignored this part of the instructions, to my vexation).


Morten Källberg, Haipeng Wang, Sheng Wang, Jian Peng, Zhiyong Wang, Hui Lu, and Jinbo Xu. Template-based protein structure modeling using the RaptorX web server. Nature Protocols 7, 1511-1522, 2012.

Kaparthy’s github repository: Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch. 2015.

An R package to retrieve evolutionary selection calculations for gene lists

Earlier this year I became interested in learning about which genes in the genome are more evolutionarily novel or selected on by evolution. This can be informative for learning about disease states.

For example, since Alzheimers dementia is present in humans, arguably some non-human primates (more on that in a subsequent post), and probably no other animals, genes coding proteins that are associated with its development for it are, all things equal, more likely to have had recent evolutionary selection. The APOE gene is a canonical example of this, since APOE alleles other than ε4 don’t exist in other animals. Of course, this is coarse, but it’s another layer of evidence in addition to many others.

A commonly used way of quantifying this is with the dn/ds ratio. This is computed via the quotient of the number of non-synonymous amino acid changes (thus, dN for Non-synonymous change) to the number of synonymous amino acid changes (thus, dS). As an example of what this ratio can tell you, any gene with dN/dS > 1 is usually considered to have undergone positive selection, an example being the FOXP2 gene, which has been associated with language development.

A common method for finding dn/ds values for particular genes in particular species is to use the estimates calculated by Ensembl, the methodology for which you can read about here, using the biomaRt query package. I’ve written some code to do this for a list of genes (just to be clear, in R terms, actually a character vector), which turned out to be mildly annoying, so I decided to make it available for others.

If you want to download this as a package, and you have the devtools packages installed, then the following R commands should do the trick:


This is still very much under development and was largely an exercise in me learning how to make my first package (for more, see Hilary Parker’s excellent tutorial here). Please let me know if you have any questions or problems!