Why Clinical Biomarkers for AD Matter

Frisoni et al. make an interesting point: in HIV and cancer, success in fighting the disease was heralded by biomarkers — CD4 counts and viral load for HIV, and a wide variety of tumor markers in cancer — that allowed researchers to iterate much faster in clinical trials. The problem is that regulators don’t have good reason to trust a candidate AD biomarker until a useful agent can be shown to affect that biomarker while also improving clinical symptoms. Once this has been shown — and the effect on that biomarker is fairly specific and/or biologically meaningful — then, beyond the obvious clinical use, it will speed up drug development for AD tremendously. This, to me, is a big part of why candidate biomarker research in AD is so important.

Twelve Interesting Recent Papers

1) Wootla et al. discussing naturally occurring antibodies for the treatment of CNS disorders. Naturally occurring antibodies are mainly IgM and bind to many different types of antigens with low affinity (that’s what happens when you don’t do any affinity maturation). One idea is that elderly people without AD (but with risk factors such as APOE ɛ4) may have more of these antibodies, which help clear amyloid, and that this is why they haven’t developed AD. In fact, one of the more promising current treatments for AD in trials, aducanumab, was originally derived from elderly donors without AD based on this hypothesis. A similar approach is also being taken in MS — e.g., the authors describe some antibodies that bind specifically to oligodendrocytes, with the goal of promoting remyelination.

2) Cummings et al. describing good phase II trial results for dextromethorphan + quinidine for agitation in AD. Aside from being excited about a potential new treatment for an aspect of AD, I find this particularly interesting because I was previously involved in a project that evaluated the effects of recreational doses of DXM as described in the comments of YouTube videos. However, recreational doses are much higher than the doses in this study (> 200 mg vs 30 mg, respectively), so the effects are probably radically different — as always, the dose makes the poison.

3) A couple of papers recently came out purporting to explain the role of the ApoE risk variant in AD, which is very important but still poorly understood. First, a really interesting paper from Zhu et al. shows that in APOE ɛ4 carriers, synj1 expression increases, which decreases levels of phospholipids such as PIP2. This resembles an ApoE-null phenotype, suggesting a loss-of-function mechanism. Second, Cudaback et al. show that ApoE allele status affects astrocyte secretion of the microglial chemotaxis factor CCL3. Interestingly, in their data the ɛ4 and ɛ2 alleles have effects more similar to each other than to ɛ3.

4) Turner et al. present results from an RCT of resveratrol for AD, which finds some encouraging effects on biomarkers but is not a home run clinically. That said, with only 119 participants it is likely underpowered, and one of the four clinical measures did show an effect in the expected direction (p = 0.03).

5) Tom Fagan at AlzForum does nice reporting on results from PET and neuropathology showing that, by both measures, around 25% of people clinically diagnosed with AD do not have high amyloid levels. This proportion is higher among ApoE ɛ4 non-carriers, which is what you’d expect from conditional probability, given that clinicians don’t take ApoE allele status into account when making the diagnosis. In the absence of amyloid, neurodegeneration appears to be fairly slow or absent.

6) Dale Bredesen continues his innovative work in AD, describing case reports here that suggest three subtypes of AD: one inflammatory, one metabolic (e.g., related to insulin resistance), and one related to zinc deficiency.

7) Moran et al. use ADNI data to show that type 2 diabetes (T2DM) is associated with CSF tau (explaining 15% of the T2DM-associated cortical thickness loss) but not with CSF amyloid, suggesting that T2DM might be related to tau-only AD cases and/or to tau increases that are independent of amyloid.

8) Not AD, but still neurological: in frontotemporal dementia, Ahmed et al. report that fasting blood levels of agouti-related peptide (AgRP) are much higher in patients (~66.5 +/- 85) than in controls (~23 +/- 20). Further, AgRP levels are correlated with BMI, suggesting that AgRP may account for the increased eating behavior seen in some variants of FTD. It is just interesting to see an example where the effect of hormones on eating behavior could be very strong.

9) Petrovski et al. used WGS data to define an interesting measure of “how tolerant a gene’s regulatory region has been to mutation across evolution.” Specifically, their measure (the “noncoding Residual Variation Intolerance Score”) quantifies how many common variants a gene has in its regulatory region compared to other genes with a similar mutation rate. They found that higher levels of this measure were significantly associated with genes annotated as haploinsufficient, suggesting that this is a good way of capturing how sensitive cells are to a gene’s relative expression level.
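To make the residual idea concrete, here is a rough Python sketch of the general approach as I understand it (this is my own illustration, not the authors’ code; the file and column names are hypothetical). The score is the residual from regressing common regulatory variation on total regulatory variation across genes:

import pandas as pd
import statsmodels.api as sm

# One row per gene: total variant count and common variant count in its
# regulatory region (hypothetical input file and column names)
genes = pd.read_csv("regulatory_variants_per_gene.csv")

# Regress common variation on total variation across genes; a gene with fewer
# common variants than expected for its overall mutational burden gets a
# negative residual, i.e., it looks relatively "intolerant" of variation
X = sm.add_constant(genes["total_variants"])
fit = sm.OLS(genes["common_variants"], X).fit()
genes["intolerance_score"] = fit.resid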

10) Zheng et al. also used WGS data and found that rare variants in the gene EN1 are significantly associated with the risk of bone fracture. To quantify the effects of rare variants (< 5% MAF), they also used an association test — SKAT — to measure associations of these variants with bone mineral density in windows of 30 bp, and found one significant gene with this procedure. Refreshingly, they put their code for this analysis online (available here); I haven’t run it, but +1 to them for sharing their wrapper code. Interestingly, both this paper and the Petrovski paper use the GERP++ score for their evolutionary inference — it seems to be a common tool; check it out here.

11) In influenza news, Lakdawala et al. show that influenza A does a large amount (most?) of its replication in the soft palate, which is the fleshy, soft part at the back of your mouth. Total hindsight bias, but this “makes sense” to me when I think back to the times when I think I had the flu myself — that part of my mouth got very irritated, which now makes slightly more sense.

How to version control Microsoft Word documents among collaborators

Abstract: Just a quick little tool I use to version control my drafts; highly, highly unlikely to be of use unless you a) use git and want to regularly back up the text of your Word document and b) are working with collaborators who do not use git.


In an attempt to write more, I’m trying to remove barriers. One barrier is the worry that if I delete or change something, I’ll want it back later, and that keeps me from making progress. Version control with git solves this nicely for code and simple text documents, but bulky Word documents do not play nicely with git.

Enter pandoc. This tool converts Word documents to Markdown, a plain-text formatting syntax. Installing it on Mac OSX 10.10 is pretty easy.

But what if your Word document is in one folder and your GitHub repo is in another? (Say, your Word doc is in Dropbox to share with collaborators or among computers easily, but you follow the recommendation not to use GitHub with Dropbox.) Enter this simple shell script for automating the push to GitHub:

#!/bin/bash
# Copy the latest draft into the repo, convert it to Markdown, and push.
cp Path_To_Paper/Paper.docx .
pandoc -s Paper.docx -t markdown -o Paper.md
git add .
git commit -m "$1"
git push origin master

Call the shell script from within your GitHub repo, passing your commit message as the first argument, e.g., sh backup_draft.sh "tightened the intro" (assuming you saved the script as backup_draft.sh; the name is up to you). Note that this is smoother if your git config file is set up such that GitHub won’t ask for your username and password every time you push.

Children with genetic risk of autosomal AD have higher plasma abeta levels

Cross-sectional measures of structural and functional MRI and plasma Aβ assays were assessed in 18 PSEN1 E280A carriers and 19 noncarriers aged 9 to 17 years from a Colombian kindred with ADAD… [M]utation-carrying children were distinguished from control individuals by significantly higher plasma Aβ1-42 levels (mean [SD]: carriers, 18.8 [5.1] pg/mL and noncarriers, 13.1 [3.2] pg/mL; P < .001) and Aβ1-42:Aβ1-40 ratios (mean [SD]: carriers, 0.32 [0.06] and noncarriers, 0.21 [0.03]; P < .001).

That’s an extract from the abstract of an interesting study from the June issue of JAMA Neurology.

Notes on segmenting oligodendrocytes in electron microscopy images using Python3 and OpenCV 3.0

Attention Conservation Notice: Not much new here; mostly just notes to myself for future reference. Reading this is unlikely to be a good use of your time, unless you are trying to install OpenCV 3.0 on Python3, in which case, I tremble with sympathy. (In all seriousness, installing it took me about an hour, but I made some trivial mistakes and it could be way quicker.)


Installing OpenCV for Python3

First of all, know that downloading and installing OpenCV (the CV stands for “Computer Vision”) for Python on a MacBook Pro can be pretty time consuming; one commenter I saw online called it “a rite of passage.” After failing to install it for Python 2.7 for about half an hour, I eventually decided to use this opportunity to upgrade to Python3.

Installing Python3 is easy:

brew install python3

I then followed Luis González’s useful tutorial for installing OpenCV as a module for Python3. I still ran into one problem, which is that I had a different version of Python than the tutorial assumed. My recommendation here is to manually cd to those directories in your terminal to make sure that the files exist where they are meant to, and then copy and paste the paths into your CMake GUI. Here is my successful configuration:

Cmake

Specifically, if you are getting the error

fatal error: 
 'Python.h' file not found
#include <Python.h>

(perhaps at 98% completion!), make sure to check that your $PYTHON3_INCLUDE_DIR is set properly; mine had a typo.
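Once the build finishes and the module is installed, a quick sanity check (a minimal sketch; your exact version string will depend on the build) is to import the module from Python3:

import cv2                # should import without error if the build and install succeeded
print(cv2.__version__)    # e.g., something like '3.0.0'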

How to identify oligodendrocytes in electron microscopy images 

Once I had OpenCV installed, I was able to begin actually analyzing some EM images. I’m interested in being able to distinguish oligodendrocytes from other brain cell types in electron microscopy images. It turns out this is a pretty difficult problem, and as far as I know, there is no large database of images from which a substantial training set (e.g., > ~200 images or so) could be built.

Instead, we can rely on published features of oligodendrocytes, which have been described previously (e.g., here, here, here, here, and here). As far as my novice understanding goes, these are some key features of oligodendrocytes:

  • Smooth, round-to-oval outline
  • Centrally placed, round-to-oval nuclei (with “nucleoli occasionally seen on the plane of section”)
  • Distinct and dark Golgi apparatus in the cytoplasm
  • Thin processes (when they are visible)

And here are some features distinguishing oligodendrocytes from other cell types:

  • Astrocytes: a) paler (cytoplasm and nuclei), b) glycogen granules, c) filament bundles (due to GFAP), and d) wider processes (if the processes can be identified)
  • Neurons: large and pale nuclei
  • Microglia: more likely to be irregularly shaped, and sometimes have dark, small inclusions
  • NG2 cells: elongated or bean-shaped nuclei, and can contain long endoplasmic reticulum

Here is a picture of an oligodendrocyte from Alan Peters’s helpful website:

Alan Peters’s oligodendrocyte EM

Using OpenCV to parse tissue slices 

Here is my code. I tried a few different strategies. My goal is to slice out a portion of the image that is recognizable as “the oligodendrocyte,” which can be seen by the human eye as the red portion here (also from Alan Peters’s website):

Peters Colors

1) Canny edge detection. This detects far too many edges to be useful.

Canny

2) Otsu’s thresholding, which reduces a greyscale image to a binary one. This is actually not bad; maybe it could be stacked with another method?

Otsu

3) Blob detection. This is somewhat promising, but unfortunately the blobs detected are pretty small (too small to be the oligodendrocyte, or even its nucleus), and when I increase the minimum blob size, no blobs are detected at all.

blobs
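For reference, here is roughly what these three strategies look like using OpenCV’s Python bindings (a minimal sketch; the file name and parameter values are placeholders rather than the exact settings I used):

import cv2

# Load the EM image as greyscale (file name is a placeholder)
img = cv2.imread("oligo_em.png", cv2.IMREAD_GRAYSCALE)

# 1) Canny edge detection; the two thresholds control edge sensitivity
edges = cv2.Canny(img, 100, 200)

# 2) Otsu's method: automatically chooses a threshold to binarize the image
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# 3) Blob detection; raising minArea is where the blobs stop being detected for me
params = cv2.SimpleBlobDetector_Params()
params.filterByArea = True
params.minArea = 500
detector = cv2.SimpleBlobDetector_create(params)
keypoints = detector.detect(img)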

So, this is still a work-in-progress, but I wanted to get some notes up.

Reference

The fine structure of the aging brain. Authors: Alan Peters and Claire Folger Sethares. Boston University School of Medicine, 72 East Newton Street, Boston, MA 02118. Website: www.bu.edu/agingbrain. Supported by the Institute on Aging of the National Institute of Health, grant number P 01-AG 000001.

Generating new protein sequences with a character-level recurrent neural network

This past weekend, using Andrej Karpathy’s outrageously simple and helpful github repository [1], I trained a recurrent neural network on my laptop [2].

If you are reading this post in part because you want to do a similar thing, rest assured that by far the most time-consuming part was installing Torch7.

Well, don’t rest totally assured, because that process was actually pretty annoying for me [3].

Anyway, I wanted to train a character-level neural network because I was really impressed by Andrej’s generated Shakespearian sonnets, as well as Talcos’s generated MTG cards.

But, I wanted to train the RNN on something more biological. So as input data I used a 76 MB fasta file of the full set of human protein sequences, which is available for download via ftp here.

As with any fasta file, this is the format of the input data (I know this is boring, but it will become important later):

>gi|53828740|ref|NP_001005484.1| olfactory receptor 4F5 [Homo sapiens]

MVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIVITVVSDSHLHSPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHFFGGSEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAVTWGIGFLHSVSQLAFAVHLLFCGPNEVDSFYCDLPRVIKLACTDTYRLDIMVIANSGVLTVCSFVLLIISYTIILMTIQHRPLDKSSKALSTLTAHITVVLLFFGPCVFIYAWPFPIKSLDKFLAVFYSVITPLLNPIIYTLRNKDMKTAIRQLRKWDAHSSVKF

>gi|767901760|ref|XP_011542107.1| PREDICTED: uncharacterized protein LOC102725121 isoform X2 [Homo sapiens]

MSDSINFSHNLGQLLSPPRCVVMPGMPFPSIRSPELQKTTADLDHTLVSVPSVAESLHHPEITFLTAFCLPSFTRSRPLPDRQLHHCLALCPSFALPAGDGVCHGPGLQGSCYKGETQESVESRVLPGPRHRH

The lines beginning with “>” are header lines that contain metadata about the protein sequence that follows.

The protein sequences themselves are made up of capital letters, each referring to one of the twenty main amino acids encoded by the human genome.
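If you want to poke at the file before training, a minimal Python sketch for splitting a FASTA file into header/sequence pairs might look like this (the file name is a placeholder):

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

# e.g., count the sequences in the input file
n_seqs = sum(1 for _ in read_fasta("human_proteins.fa"))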

After changing into the directory into which I cloned Andrej’s GitHub repository, I trained the neural network using the following call:

th train.lua -data_dir data/sapiens_fasta/ -gpuid -1 -max_epochs 1 -eval_val_every 40

One epoch means that the neural network is run once through the input file, which is almost certainly not enough passes through the data for it to learn truly useful trends, but I was limited on CPU time, and running it through just once took almost 40 hours.

This RNN has no structural priors and certainly doesn’t know English, so it has to learn everything fresh from the data.

After 1/100 of the epoch, the RNN was struggling to learn basic aspects of the file format:

ch692kennaR]iENIGCFGGPRGKFAQKNRPEKEALRLKPGTLRRPHRKGQIIEEGYPSCSKAKELDKAGVKRVQVLLLPNDGGLKQTVFCLLYELKPRDLSSPGLKHGHKQQTTFLLQFNTPFQLTEAGSVFEELNECLVDPVRY>LPGMLDGDSQCGLQVMEIIVTTRHIAEGIKMRMYHPRANAQTDGKSFIHTLAQMCFDVGLQETGMSTMIQGSHTEKTAHLMLNNLG

VYQVHELGKLGVGYMFRNLSHIPRTDEPYTHYSAAK

IEDEACNGEHGKRVCLYDGLLLSGENHGEGITRKRLEPLPQPRSPGESTLGVIVMATKQVRLEVH>DRgN|6P61|725902|Ur9r0||PT_f120145573o.r heCn r1aMthrffcenaR o rara-op2oninier1 oisceeo oo s dosrXs26mnccmagosf e]giM:mnpo1lnsrgoase poraom/oifoo rissl,]MHDTTEDVAAEYLDVVCLYMWYCVVSYQVQIDNLCKQCEDAGRKKHLNFALFDKDNSKAKVKHSVEAVGHNVVDASSAPVPYYAGSIDLQPVGVREACEQ

After 1/10 of the epoch, the RNN had the format mostly down, and had even learned that “XP” came before a predicted protein, while “NP” came from known (e.g., cDNA sequence) protein data.

>gi|670774209|ref|XP_011526108.1| PREDICTED: neinator uxtylcerin-1 isoform X1 [Homo sapiens]

MAAATGSKPRRRKRPRDRNDSLPPCKSQRGATSPSWPSLKCASVALKTKSTQLKTSDNSPQLPAIKLHIGLPREPFQETLAVELQGKLQPTPQQVLAYRDQ

>gi|115370400|ref|NP_001185001.1| oimh meeroribh CAE3 pomolphinase CA-ase popotecusution maclor domaiongbnating protein 2Hisoform 309 isoform X2 [Homo sapiens]

MAFPTVHWILSTCSAAHEAEAALEAEEFRTALYDGLNLSGFPHIFHKTRKRLFQKNRPRRPGFSTLGTIVMATIPTRLGVIADLRGWFRKDYQFKTCFYRRGPATVGLQVKQADLPPQKTARQQFSAVFKFVLSKIHKPHGCTAVFTFCRSERRLRPKTDARIVFIIRARPAVAGQVTDVDDLNGGNFKRVKDKITRFELLSRCLNTTKRETG

As you can see, it had also learned that (almost) all protein sequences start with an “M”, since methionine is coded for by the start codon, AUG, which the ribosome recognizes.
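As a quick sanity check on the training data itself, you could reuse the read_fasta sketch from above to confirm how often real human protein sequences begin with methionine (the file name is again a placeholder):

seqs = [seq for _, seq in read_fasta("human_proteins.fa")]
frac_M = sum(s.startswith("M") for s in seqs) / len(seqs)
print(frac_M)  # expected to be close to 1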

Finally, by the end of the epoch, the RNN had the format down and was generating some protein names that are kind of in the “uncanny valley”:

>gi|38956652|ref|NP_001018931.1| hamine/transmerabulyryl-depenter protein 1 isoform 2 [Homo sapiens]

MVTNTCPLDAKGLQKLNTSEREELESCIERLQISQDAQNRMGRWAIRDELNFRHGGEAGEPVQENFGGVRAYFFCSPEQDGIRNNVEEFVESAGWILNPSQADFRSILSKSTKISLVGLAGLYGFPQGARASFVAQHEDVSRVVVFPLQAVSYSEEKRHSGAIEDLLPLEFRPVGVGML

>gi|767959205|ref|XP_011517870.1| PREDICTED: D3 glutamate channel protein isoform X5 [Homo sapiens]

MESATDSSMSPATLYEEPSPCTPSRQKAKSPFQKQRRGSQQLNKHREGEEQQALLNEGLKQVEQAFSIVTKRKQGLLNREALKKKQAQKLASESNQLNVLLKDLGEIKDKISFLKNSFDSGTNVTGEKDSGEGFERCTPDPIDPTPDREMPRQGADVVMEMGETHRFLWAHADEVKLSYVGGGRIKVQSYKREIVALVVIEP

This RNN has a particular penchant for combining parts of names, and some of these actually make sense, like “receptorogen”, or “elongatase.”

I BLASTed ~10 of the protein sequences generated after the full epoch to see whether they had any evolutionary conservation, but none of them showed conservation above chance, suggesting that the RNN isn’t just repeating protein sequences it saw in training.

I also did structure prediction on one of the generated protein sequences, and it is predicted to consist of one protein domain with a good template in the Protein Data Bank (PDB).

rnn pdb

Here is what the protein is predicted to look like:

protein

The spiral ribbons in such cartoon diagrams represent alpha helix secondary structure, which the generated protein has a reasonable amount of (31%; the average globular protein contains about 30%).

It’s interesting to think about what applying RNNs to protein sequences or other sorts of biological data might accomplish.

For example, you could potentially feed the RNN a list of many anti-microbial proteins as training data, to try to generate new peptides that you could test as novel antibiotics.


[1]: See also this post where Andrej explains it in more detail and uses it to predict Shakespeare sonnets.

[2]:  A non-CUDA compatible MacBook Pro.

[3]: How to do this on a MacBook Pro that is non-CUDA compatible: a) install/update homebrew, b) install/update lua, c) follow the command-line instructions at the torch7 website to install it, d) run source ~/.profile to load torch7 at the terminal (I ignored this part of the instructions and this is part of what made it take so long for me), e) get the necessary luarocks, f) fork/clone Andrej’s repo, and g) run the RNN training and sampling commands with the -gpuid -1 option, since your machine is non-CUDA compatible (I also ignored this part of the instructions, to my vexation).


References

Morten Källberg, Haipeng Wang, Sheng Wang, Jian Peng, Zhiyong Wang, Hui Lu, and Jinbo Xu. Template-based protein structure modeling using the RaptorX web server. Nature Protocols 7, 1511-1522, 2012.

Karpathy’s GitHub repository: Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch. 2015.

Telomeres: something to fear, or merely a seer?

Attention Conservation Notice: 1200+ words, with an estimated read time of seven minutes, of exuberant extrapolation from a new study with which I was not associated in any way. I am not an expert in telomere biology, nor am I a doctor.


In case you’ve been busy obsessing over whether or not to post things to social media over the past decade (oh wait, that’s me), you may not know that much about telomeres.

Telomeres are repetitive sequences of DNA at the end of chromosomes that get shorter with each cell division. If they get too short, they can promote cellular senescence.

They also are responsible for some of the prettiest pictures of DNA, in which people stain for the telomere “cap” at the end of the chromosomes, such as this classic image:

from the US DoE HGP via Wikipedia User:Gustavocarra

Much of the extrapolation of telomeres to aging is built around “Hayflick limit” models of aging.

However, these models do not explain all aspects of aging, and in fact, in most organ systems, cells do not die in large numbers during normal aging (a key exception being the thymus) [1]. Instead, cell numbers usually remain constant, because most cells in the body are post-mitotic.

Nevertheless, average telomere length has emerged as a very promising biomarker for aging, and a study on 60,000+ Danish people came out a few weeks ago that gives us a lot of new data on the subject. As far as I can tell, it is the largest such study to date.

Associating telomere length with mortality 

This study measured average telomere lengths in white blood cells (leukocytes) from peripheral blood samples.

They then measured death rates in a 0-22-year follow-up period (with a median follow-up length of 7 years).

Adjusting for age only, individuals in the shortest decile of telomere length had a 1.54-fold increased risk of death (95% CI, 1.38 to 1.73).

And even after adjusting for age, sex, BMI, systolic BP, smoking status, tobacco consumption, alcohol consumption, physical activity, and cholesterol level, individuals in the shortest decile of telomere length still had a 1.40-fold increased risk of death (95% CI, 1.25 to 1.57).

This shows that telomeres are, at the very least, a biomarker for aspects of aging other than those seen by these traditional clinical parameters.

It’s also possible to go further and say, “we corrected for nearly everything and still found an effect of telomeres on mortality, so there must be something causal,” but that’s fraught with problems — the biomedical literature is littered with effects seen in correlational studies that didn’t replicate in interventional ones: ibuprofen for Alzheimer’s disease, estrogen replacement therapy, etc.

This doesn’t mean it’s wrong, of course, but that the reference class probability is not as high as you might think.

Luckily, the genetic data in their study offers a way to address this question directly.

Genetic variants of telomere length as a window into causality

The authors genotyped participants for three SNPs that affect telomere length, one associated with each of the following genes:

  • Telomerase reverse transcriptase (TERT)
  • Telomerase RNA Component (TERC)
  • A gene that helps replicate and cap telomeres (OBFC1)

Each of these SNPs has two alleles, which means that there are three possible states for each individual at each SNP: a) no telomere-shortening alleles, b) one telomere-shortening allele, or c) two telomere-shortening alleles.

Since there are three SNPs and two alleles at each, each individual could carry up to six telomere-shortening alleles in total.

When the authors built a linear “score” from those SNPs, they found that it had a very strong effect on telomere length. I made a visualization of their raw data to show this:

Telomere length versus the sum of telomere-shortening alleles in each individual (allele score), +/- the standard error; data from doi: 10.1093/jnci/djv074; code

Under the assumption that these alleles don’t influence the aging process in any other way, this allele score offers a tremendous “natural experiment” into the causal role of average telomere length in blood leukocytes in humans.

What the authors found using this natural experiment is that telomere-shortening alleles led to a decrease in the risk of cancer (OR of 0.95 +/- 0.04 per telomere-shortening allele), but had no effect on all-cause mortality (OR = 0.99 +/- 0.02 per telomere-shortening allele).
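To make the allele-score idea concrete, here is a rough Python sketch of how one might build such a score and test it against telomere length and an outcome. This is my own illustration, not the authors’ analysis; the file and column names are hypothetical, and the actual study’s statistical models were more involved:

import pandas as pd
import statsmodels.api as sm

# Hypothetical per-person data: 0/1/2 counts of the telomere-shortening allele
# at each SNP, measured telomere length, and whether the person died in follow-up
df = pd.read_csv("telomere_cohort.csv")
df["allele_score"] = df[["TERT_short", "TERC_short", "OBFC1_short"]].sum(axis=1)  # ranges 0-6

# The score should strongly predict telomere length...
ols = sm.OLS(df["telomere_length"], sm.add_constant(df["allele_score"])).fit()
print(ols.params)

# ...and the causal question is whether it also predicts mortality
logit = sm.Logit(df["died"], sm.add_constant(df["allele_score"])).fit()
print(logit.params)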

One suggested mechanism for the cancer effect is that shorter telomeres give potentially cancerous cells a shorter “fuse” before they become senescent. So, people with shorter telomeres may be less likely to develop an actual malignancy.

So, my conclusion is that, despite the correlation seen in the section above, average telomere length likely does not play a causal role in age-related mortality in humans [2]. A few possible critiques of this conclusion:

1) The study may not be well-powered enough to detect a difference in all-cause mortality. Some possible solutions here would be to a) increase the sample size or b) do studies to find more variants affecting telomere length and then re-do the analysis with increased association power. But needing a sample size > 60,000 to find an effect would mean that any effect you did find would be very small.

2) Maybe it is not the average telomere length but rather some other property of the telomere length distribution (such as the proportion of cells below or above some “critical threshold”) that is the relevant parameter. This is possible, but biology generally doesn’t operate on such threshold mechanisms.

3) Josh Mitteldorf, if I understand him correctly, suggests that these three SNPs may affect telomere length only later in life, not earlier. Although I don’t know the details of the argument, I suppose it’s possible that telomere lengths early in life play the major causal role in later aging rates, and that these SNPs don’t affect telomere length until later in life. One possible solution here would be a longitudinal study of telomere length, or a cross-sectional study in infancy/childhood. Still, I don’t find either aspect of this argument very likely.

Conclusions

Genetic evidence suggests that average telomere length (in blood leukocytes, in a Danish population) likely does not play a causal role, or at least does not play a strong causal role, in promoting aging.

That said, average telomere length in blood leukocytes does appear to be an excellent biomarker for aging, capturing large effects not accounted for by measuring traditional clinical factors that affect aging, such as smoking status or cholesterol levels.

This bolsters the evidence for Bojesen’s argument that telomere length should be used as a biomarker for aging and for assessing the risk of age-related diseases in routine clinical practice.

References

Rode L, Nordestgaard BG, Bojesen SE. Peripheral blood leukocyte telomere length and mortality among 64 637 individuals from the general population. J Natl Cancer Inst. 2015;107(6):djv074.

Mitteldorf J, “Large New Survey Tracks Telomere Length and Mortality”. 2015. http://www.webcitation.org/6YYKb3OWm

Bojesen SE. Telomeres and human health. J Intern Med. 2013;274(5):399-413.


[1]: What about the brain, do I hear you say? A key tenet of the Alzheimer’s disease model is that neuronal loss in aging is not normal and only occurs in Alzheimer’s disease.

[2]: Let me be quite clear and point out the obvious that I agree this is a bad thing. That is, it would be great if telomere length were causal for aging, because it would suggest an obvious intervention to decrease the risk of age-related diseases: therapies to extend telomere length. Unfortunately, this study suggests that such an intervention will probably not have a beneficial effect on age-related mortality.

How to run kallisto on NCBI SRA RNA-Seq data for differential expression using the mac terminal

Attention Conservation Notice: This post explains how to run kallisto, the exceptionally fast k-mer-based RNA-seq pseudoaligner from the Pachter lab, on data downloaded from NCBI’s Sequence Read Archive (SRA), and then analyze it for differential expression using voom/limma. As with everything in bioinformatics, this will likely be obsolete in months, if not weeks.


Kallisto is really fast and non-memory intensive, without sacrificing accuracy (at least, according to their paper), and therefore has the potential to make your life a lot easier when it comes to analyzing RNA-seq data.

As a test data set, I used the very useful DNAnexus SRA search to find a human transcriptome study with 2-3 biological replicates in 2 groups, so that I could achieve reasonably robust differential expression calls without having to download too much data.

I ended up using SRP006900, which is titled “RNA-seq of two paired normal and cancer tissues in two stage III colorectal cancer patients.” I actually wasn’t able to find a published differential expression analysis of it in a 3-5 minute search, but since it was made public almost four years ago, I’m sure it has been analyzed in that way. The data are single-end reads with an average of 65 base pairs per read.

In order to download this data set (which totals ~10 GB over its four files, in .fastq format) from the SRA via the command line, I used the SRA toolkit. You’ll want to make sure it is installed correctly by running a test command such as the following:

$PATH/sratoolkit.2.5.0-mac64/bin/fastq-dump -X 5 -Z SRR390728 

You’ll also want to download kallisto. In order to install it, I copied it to my /usr/local/bin/ folder to make it globally executable, as described on the kallisto website.

You also need to download the fasta file for your transcriptome of interest and create an index file that kallisto can work with. You can download the hg38 reference transcriptome from the kallisto website or from UCSC.

For pre-reqs, it is helpful to have homebrew installed as a package manager. If you do (or once you do), these three commands did the trick for me:

brew update
brew install cmake
brew install homebrew/science/hdf5

Next you’ll need to actually download the data from each run, convert it to .fastq, and then run kallisto on it. For extensibility, I did this in the following shell script.

First, a config file with the sample names and average read length (which, in my experience, kallisto needs if the data are single-end):

declare -a arr=("SRR222175" "SRR222176" "SRR222177" "SRR222178")
AVG_READ_LENGTH=65

(Update 5/13/15: The average read length should be average fragment length, which I’m working on figuring out the best way of estimating from single-end reads (if one is possible). In the meantime, it may be easier to choose a paired-end read sample from SRA instead, from which the average fragment length can be estimated by kallisto. Thanks to Hani Goodarzi for the helpful pointer that I was using the wrong parameter.)

Next, I use a shell script to execute the relevant commands on these sample names and build the abundance.txt files. Note that this data set has single-end reads, so you only pass one .fastq file, rather than the two in their example. I put each output in a directory called output_$RUNNAME that I can call from R for downstream analyses. Here’s the full shell script (also on github):

#path to where the SRA bin folder is located; could eliminate this by making the SRA toolkit globally executable
PATH_SRA_DIR="your_path_here/"

#go to where your kallisto test directory is found, if necessary
echo "cd your_path_here"

#load the config file
source test_kallisto/config_SRP006900.file

#download the data 
for i in "${arr[@]}"
do
   "$PATH_SRA_DIR"sratoolkit.2.5.0-mac64/bin/fastq-dump --accession $i --outdir test_kallisto
done

#create the transcripts index file
kallisto index -i transcripts.idx transcripts.fasta.gz

#create output directories, and then run kallisto for each of the runs
for i in "${arr[@]}"
do
	mkdir output_"$i"
	kallisto quant -i transcripts.idx -o output_"$i" -l "$AVG_READ_LENGTH" "$i"".fastq"
done

Once you have these files, you can analyze them in R using limma. Here’s how I went about it. First, read in the files and merge the transcripts per million (tpm) calls into one data frame:

library(limma)

#sample names of the runs downloaded from SRA
sample_list_n = c("SRR222175", "SRR222176", "SRR222177", "SRR222178")

for(i in 1:length(sample_list_n)){
	tmp = read.table(file = paste0("test_kallisto/output_", 
		sample_list_n[i], "/abundance.txt"), header = TRUE) 
	assign(sample_list_n[i], tmp)
}

sample_list = mget(sample_list_n)

#give the list unique names 
sample_list_uni = Map(function(x, i) setNames(x, ifelse(names(x) %in% "target_id",
      names(x), sprintf('%s.%d', names(x), i))), sample_list, seq_along(sample_list))

full_kalli = Reduce(function(...) merge(..., by = "target_id", all=T), sample_list_uni)

tpm_vals = full_kalli[, grep("tpm", names(full_kalli))]
rownames(tpm_vals) = full_kalli$target_id

Then make a contrasts matrix for the normal vs. cancer comparison:

groups = c("normal", "colon_cancer", "normal", "colon_cancer")
condition <- model.matrix(~0 + groups)
colnames(condition) = levels(factor(groups)) #i.e., "colon_cancer", "normal" (columns follow alphabetical factor order)
cont_matrix = makeContrasts(norm_canc = normal - colon_cancer, levels = condition)

Then compute differential expression using that contrast matrix:

#this spits out an EList object 
v = voom(counts = tpm_vals, design = condition)
fit = lmFit(v, condition)
fit = contrasts.fit(fit, cont_matrix)
fit = eBayes(fit)
top_table = topTable(fit, n = 10000, sort.by = "p")

And that’s it. Interestingly, the transcript called as the #1 most differentially expressed using this method, ATP5A1, has been previously implicated as a marker for colon cancer.

Update 5/13/15: Fixed two typos, switched one link, and switched “genome” to “transcriptome” (I was sloppy previously) per @pmelsted’s useful comment on twitter.

Update 8/10/15: As Seb Battaglia points out on twitter, “voom assumes you cut out out 0s and low intensity tpm. I’d add that step or data hist() will be skewed.” I don’t have the opportunity to re-run the analyses right now (!), but please take this into consideration.

Parkinson’s GWAS improves when subclassifying patients by pathology, instead of clinical traits

Simón-Sánchez and Gasser discuss this in a stimulating article in the most recent issue of Neurology:

Beecham et al [2015] argue that one of the reasons for this “missing heritability” may be the uncontrolled degree of heterogeneity of [Parkinson’s Disease (PD)] in studies relying on clinical diagnosis alone. To reduce heterogeneity, they performed a GWAS in which only PD cases with autopsy-confirmed Lewy body (LB) pathology, and controls with exclusion of PD neuropathology, were included. Despite the relatively small number of samples used, they found evidence suggesting that common variation in a small region of chromosome 1p32 is associated with LB PD with a p-value just below genome-wide significance level and an odds ratio of 0.64 for the protective allele, which is comparable with that found for SNCA and MAPT in other studies. This association peak lies within PARK10, which was originally identified in a set of Icelandic families.

This “splitting” approach is likely to be valuable in other neurodegenerative diseases, as well.

Age-related changes in white matter are likely not due to demyelination?

That’s the conclusion of an important new paper from Billiet et al that integrated a bunch of different imaging modalities.

First, they used DTI to measure fractional anisotropy, a non-specific measure of white matter microstructure that reflects axon density, axon diameter, and myelin content. They also used the myelin water fraction (MWF) to measure myelin content and the orientation dispersion index (ODI) to measure dendrite/axon dispersion. These were their findings:

  • Total white matter volume does not change in aging
  • Myelin content (via MWF) is not significantly altered in aging in most regions, and in the regions where there is a change, it increases
  • There is a widespread decrease in fractional anisotropy in aging that is especially strong in frontal regions
  • Changes in myelin content (MWF) do not correlate well with this decrease in fractional anisotropy
  • Changes in neurite orientation dispersion (ODI) do correlate with this decrease in fractional anisotropy

Thus, their conclusion is that age-related decreases in the “diffusibility” of white matter (or whatever fractional anisotropy is measuring) are due to changes in axons rather than to changes in myelin.

It’s all cross-sectional, n = 59, and they call this conclusion “highly speculative,” but if true it suggests that axonal changes are more early/causal/fundamental to aging than changes in myelin. We need more data on this, especially from older individuals (their study only went up to age 70) and from people with dementias, such as Alzheimer’s disease.


Billiet T, Vandenbulcke M, Mädler B, et al. Age-related microstructural differences quantified using myelin water imaging and advanced diffusion MRI. Neurobiol Aging. 2015