Twelve Interesting Recent Papers

1) Wootla et al. discussing naturally occurring antibodies for treatment of CNS disorders. Naturally occurring antibodies are mainly IgM and bind to many different types of antigens with low affinity (that’s what happens when you don’t do any affinity maturation). One idea is that elderly people without AD (but with, say, risk factors such as APOE) may have more of these antibodies, that help clear amyloid, and that’s why they haven’t developed AD. In fact, one of the more promising current treatments for AD in trials, aducanumab, was originally derived from elderly donors without AD based on this hypothesis. A similar procedure is also being done in MS — e.g., the authors describe some antibodies that bind specifically to oligodendrocytes with the goal of promoting remyelination.

2) Cummings et al. describing good phase II trial results for dextromethorphan + quinidine for agitation in AD. Aside from being excited about a potential new treatment for an aspect of AD, I find this particularly interesting since I was previously involved in a project that evaluated the effects of recreational doses of DXM in the comments of YouTube videos. However, the recreational doses are much higher than the doses in this study (> 200 mg vs 30 mg, respectively), so the effects are probably radically different — as always, the dose makes the poison.

3) A couple of papers recently came out purporting to explain the role of the ApoE risk variant in AD, which is very important but still very much unknown. First, a really interesting paper from Zhu et al. shows that in APOE ɛ4 carriers, synj1 expression increases, which decreases the levels of phospholipids such as PIP2. This is similar to an ApoE-null phenotype, suggesting a loss of function phenotype. Second, Cudaback et al. show that ApoE allele status affects the astrocyte secretion of the microglial chemotaxis factor CCL3. Interestingly, in their data the ɛ4 and ɛ2 alleles have effects that are more similar to each other than either is to ɛ3.

4) Turner et al. present results from an RCT of resveratrol for AD, which finds some good effects on biomarkers but is not a home run clinically. That said, with only 119 participants it is likely underpowered, and one of the four clinical measures had a p = 0.03 effect in the correct direction.

5) Tom Fagan at AlzForum does nice reporting on results from PET and neuropathology showing that, by both measures, around 25% of people clinically diagnosed with AD do not have high amyloid levels. This is higher in ApoE e4 non-carriers, which is what you’d expect based on conditional probability and clinicians not taking ApoE allele status into account when making their diagnosis. In the absence of amyloid, neurodegeneration appears to be fairly slow or absent.

6) Dale Bredesen continues his innovative work in AD, describing here case reports suggesting that there are three types of AD, one inflammatory, one metabolic (e.g., related to insulin resistance), and one related to zinc deficiency.

7) Moran et al. use ADNI data to show that Type 2 Diabetes is associated with CSF tau (explaining 15% of the T2DM-associated cortical thickness loss), but not CSF amyloid, suggesting that T2DM might be related to tau-only AD cases, and/or tau increases that are independent of amyloid.

8) Not AD, but still neurological: in frontotemporal dementia, Ahmed et al. report that fasting blood levels of agouti-related peptide (AgRP) are much higher in patients (~66.5 +/- 85) than in controls (~23 +/- 20). Further, AgRP levels are correlated with BMI, suggesting that AgRP levels account for the increased eating behavior seen in some variants of FTD. Just interesting to see an example where the effect of hormones on eating behavior could be very strong.

9) Petrovski et al. used WGS data to define an interesting measure of “how tolerant a gene’s regulatory region has been to mutation across evolution.” Specifically, their measure (the “noncoding Residual Variation Intolerance Score”) measures how many common variants a gene has in its regulatory region compared to other genes with a similar mutation rate. They found that higher levels of this measure were significantly associated with genes that are annotated as haploinsufficient, meaning that this is a good way of describing how much cells care about what relative expression levels a gene has.

10) Zheng et al. also used WGS data and found that rare variants in the gene EN1 are significantly associated with the risk of bone fracture. To quantify the effects of rare variants (< 5% MAF) they also used an association test, SKAT, to measure associations of these variants with bone mineral density in windows of 30 bp, and found one significant gene with this procedure. Refreshingly, they put their code for this analysis online, available here. I haven’t run it, but just want to say +1 to them for putting their wrapper code online. Interestingly, both this paper and the Petrovski paper use the GERP++ score for their evolutionary inference; that seems to be a common tool, check it out here.

11) In influenza news, Lakdawala et al. show that influenza A does a large amount (most?) of its replication in the soft palate, which is the fleshy, soft part in the back of your mouth. Total hindsight bias, but this “makes sense” to me when I think back to the times when I think I had the flu myself — that part of my mouth gets very irritated, and now this makes slightly more sense.

Generating new protein sequences with a character-level recurrent neural network

This past weekend, using Andrej Karpathy’s outrageously simple and helpful github repository [1], I trained a recurrent neural network on my laptop [2].

If you are reading this post in part because you want to do a similar thing, rest assured that by far the most time-consuming part was installing Torch7.

Well, don’t rest totally assured, because that process was actually pretty annoying for me [3].

Anyway, I wanted to train a character-level neural network because I was really impressed by Andrej’s generated Shakespearian sonnets, as well as Talcos’s generated MTG cards.

But, I wanted to train the RNN on something more biological. So as input data I used a 76 MB fasta file of the full set of human protein sequences, which is available for download via ftp here.

As with any fasta file, this is the format of the input data (I know this is boring, but it will become important later):

>gi|53828740|ref|NP_001005484.1| olfactory receptor 4F5 [Homo sapiens]

MVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIVITVVSDSHLHSPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHFFGGSEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAVTWGIGFLHSVSQLAFAVHLLFCGPNEVDSFYCDLPRVIKLACTDTYRLDIMVIANSGVLTVCSFVLLIISYTIILMTIQHRPLDKSSKALSTLTAHITVVLLFFGPCVFIYAWPFPIKSLDKFLAVFYSVITPLLNPIIYTLRNKDMKTAIRQLRKWDAHSSVKF

>gi|767901760|ref|XP_011542107.1| PREDICTED: uncharacterized protein LOC102725121 isoform X2 [Homo sapiens]

MSDSINFSHNLGQLLSPPRCVVMPGMPFPSIRSPELQKTTADLDHTLVSVPSVAESLHHPEITFLTAFCLPSFTRSRPLPDRQLHHCLALCPSFALPAGDGVCHGPGLQGSCYKGETQESVESRVLPGPRHRH

The lines beginning with a greater-than sign (>) contain metadata about the protein sequence that follows.

The protein sequences themselves are made up of capital letters which refer to one of the twenty main amino acids that are coded for by the human genome.
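
To make the format concrete, here is a minimal Python sketch (not part of the original workflow, which fed the raw file straight to the RNN) that splits FASTA text into header/sequence pairs. The function name read_fasta is my own, and the two example sequences are truncated copies of the records shown above.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequences may be wrapped over several lines; collect them all.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Sequences below are truncated for brevity.
example = """>gi|53828740|ref|NP_001005484.1| olfactory receptor 4F5 [Homo sapiens]
MVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIVITVVSDSHLHSPMYFLLANLSL
>gi|767901760|ref|XP_011542107.1| PREDICTED: uncharacterized protein LOC102725121 isoform X2 [Homo sapiens]
MSDSINFSHNLGQLLSPPRCVVMPGMPFPSIRSPELQKTTADLDHTLVSVPSVAESLHHPE
""".splitlines()

records = list(read_fasta(example))
for header, seq in records:
    accession = header.split("|")[3]  # fourth pipe-delimited field
    print(accession, len(seq), seq[:10])
```

The accession is pulled out by position in the pipe-delimited header, which only holds for headers laid out like the two above.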

After changing to the directory that I cloned Andrej’s github repository into, I trained the neural network using the following call:

th train.lua -data_dir data/sapiens_fasta/ -gpuid -1 -max_epochs 1 -eval_val_every 40

One epoch means that the neural network is run once through the input file, which is almost certainly not enough passes through the data for it to learn truly useful trends, but I was limited on CPU time, and running it through just once took almost 40 hours.

This RNN has no structural priors and certainly doesn’t know English, so it has to learn everything fresh from the data.

About 1/100 of the way through the epoch, the RNN was struggling to learn basic aspects of the file format:

ch692kennaR]iENIGCFGGPRGKFAQKNRPEKEALRLKPGTLRRPHRKGQIIEEGYPSCSKAKELDKAGVKRVQVLLLPNDGGLKQTVFCLLYELKPRDLSSPGLKHGHKQQTTFLLQFNTPFQLTEAGSVFEELNECLVDPVRY>LPGMLDGDSQCGLQVMEIIVTTRHIAEGIKMRMYHPRANAQTDGKSFIHTLAQMCFDVGLQETGMSTMIQGSHTEKTAHLMLNNLG

VYQVHELGKLGVGYMFRNLSHIPRTDEPYTHYSAAK

IEDEACNGEHGKRVCLYDGLLLSGENHGEGITRKRLEPLPQPRSPGESTLGVIVMATKQVRLEVH>DRgN|6P61|725902|Ur9r0||PT_f120145573o.r heCn r1aMthrffcenaR o rara-op2oninier1 oisceeo oo s dosrXs26mnccmagosf e]giM:mnpo1lnsrgoase poraom/oifoo rissl,]MHDTTEDVAAEYLDVVCLYMWYCVVSYQVQIDNLCKQCEDAGRKKHLNFALFDKDNSKAKVKHSVEAVGHNVVDASSAPVPYYAGSIDLQPVGVREACEQ

About 1/10 of the way through the epoch, the RNN had the format mostly down, and had even learned that “XP” came before a predicted protein, while “NP” came from known (e.g., cDNA sequence) protein data.

>gi|670774209|ref|XP_011526108.1| PREDICTED: neinator uxtylcerin-1 isoform X1 [Homo sapiens]

MAAATGSKPRRRKRPRDRNDSLPPCKSQRGATSPSWPSLKCASVALKTKSTQLKTSDNSPQLPAIKLHIGLPREPFQETLAVELQGKLQPTPQQVLAYRDQ

>gi|115370400|ref|NP_001185001.1| oimh meeroribh CAE3 pomolphinase CA-ase popotecusution maclor domaiongbnating protein 2Hisoform 309 isoform X2 [Homo sapiens]

MAFPTVHWILSTCSAAHEAEAALEAEEFRTALYDGLNLSGFPHIFHKTRKRLFQKNRPRRPGFSTLGTIVMATIPTRLGVIADLRGWFRKDYQFKTCFYRRGPATVGLQVKQADLPPQKTARQQFSAVFKFVLSKIHKPHGCTAVFTFCRSERRLRPKTDARIVFIIRARPAVAGQVTDVDDLNGGNFKRVKDKITRFELLSRCLNTTKRETG

As you can see, it had also learned that (almost) all protein sequences start with an “M”, since methionine is coded for by the start codon, AUG, which the ribosome recognizes.
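
One way to put numbers on "has the format down" is to run each sampled record through a few checks. The sketch below is my own after-the-fact illustration, not something I ran during training. The regular expression is an assumption based only on the two header styles shown in this post (real RefSeq headers have more variety), and check_record is a made-up helper name.

```python
import re

# The twenty standard amino acid letters.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Assumed header layout, inferred only from the examples in this post.
HEADER_RE = re.compile(
    r"^>gi\|\d+\|ref\|(NP|XP)_\d+\.\d+\| (.+) \[Homo sapiens\]$"
)


def check_record(header: str, sequence: str) -> dict:
    """Report which of the learned conventions a generated record follows."""
    match = HEADER_RE.match(header)
    prefix = match.group(1) if match else None
    predicted = bool(match) and match.group(2).startswith("PREDICTED:")
    return {
        "header_ok": match is not None,
        # "XP" accessions should be exactly the ones labeled PREDICTED.
        "xp_matches_predicted": match is not None
        and ((prefix == "XP") == predicted),
        "starts_with_M": sequence.startswith("M"),
        "valid_alphabet": set(sequence) <= AMINO_ACIDS,
    }


# A sample from 1/10 of the way through the epoch (sequence truncated).
good = check_record(
    ">gi|670774209|ref|XP_011526108.1| PREDICTED: neinator uxtylcerin-1 "
    "isoform X1 [Homo sapiens]",
    "MAAATGSKPRRRKRPRDRNDSLPPCKSQRGATSPSW",
)

# A fragment from 1/100 of the way through, treated as if it were a header.
bad = check_record("ch692kennaR]iENIGCFGGPRGKFAQKNRPEKEALRLKPG", "VYQVHELGKLG")

print(good)
print(bad)
```

Running checks like these over a few hundred samples at each checkpoint would give a rough learning curve for the file format itself, separate from the training loss.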

Finally, by the end of the epoch, the RNN had the format down and was predicting some protein names that are kind of in the “uncanny valley”:

>gi|38956652|ref|NP_001018931.1| hamine/transmerabulyryl-depenter protein 1 isoform 2 [Homo sapiens]

MVTNTCPLDAKGLQKLNTSEREELESCIERLQISQDAQNRMGRWAIRDELNFRHGGEAGEPVQENFGGVRAYFFCSPEQDGIRNNVEEFVESAGWILNPSQADFRSILSKSTKISLVGLAGLYGFPQGARASFVAQHEDVSRVVVFPLQAVSYSEEKRHSGAIEDLLPLEFRPVGVGML

>gi|767959205|ref|XP_011517870.1| PREDICTED: D3 glutamate channel protein isoform X5 [Homo sapiens]

MESATDSSMSPATLYEEPSPCTPSRQKAKSPFQKQRRGSQQLNKHREGEEQQALLNEGLKQVEQAFSIVTKRKQGLLNREALKKKQAQKLASESNQLNVLLKDLGEIKDKISFLKNSFDSGTNVTGEKDSGEGFERCTPDPIDPTPDREMPRQGADVVMEMGETHRFLWAHADEVKLSYVGGGRIKVQSYKREIVALVVIEP

This RNN has a particular penchant for combining parts of names, and some of these actually make sense, like “receptorogen”, or “elongatase.”

I BLASTed ~10 of the protein sequences generated after the full epoch of training to see whether they had any evolutionary conservation, but none of them had any conservation above chance, suggesting that the RNN isn’t just repeating protein sequences.

I also did structure predictions on one of the generated protein sequences, and it is made up of one protein domain with a good template from the Protein Data Bank (PDB).

[Image: rnn pdb]

Here is what the protein is predicted to look like:

[Image: predicted protein structure]

The arrows are a common way of referring to alpha helix secondary structure, which the generated protein has a reasonable amount of (31%; the average globular protein contains 30%).

It’s interesting to think about what applying RNNs to protein sequences or other sorts of biological data might accomplish.

For example, you could potentially feed the RNN a list of many anti-microbial proteins as training data, to try to generate new peptides that you could test as novel antibiotics.


[1]: See also this post where Andrej explains it in more detail and uses it to predict Shakespeare sonnets.

[2]: A non-CUDA compatible MacBook Pro.

[3]: How to do this on a MacBook Pro that is non-CUDA compatible: a) install/update homebrew, b) install/update lua, c) follow the command-line instructions at the torch7 website to install it, d) run source ~/.profile to load torch7 at the terminal (I ignored this part of the instructions and this is part of what made it take so long for me), e) get the necessary luarocks, f) fork/clone Andrej’s repo, and g) run the nn training and sampling commands with the -gpuid -1 option, since your machine is non-CUDA compatible (I also ignored this part of the instructions, to my vexation).


References

Morten Källberg, Haipeng Wang, Sheng Wang, Jian Peng, Zhiyong Wang, Hui Lu, and Jinbo Xu. Template-based protein structure modeling using the RaptorX web server. Nature Protocols 7, 1511-1522, 2012.

Karpathy’s github repository: Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch. 2015.

A Mock Case of Neonatal Meningitis

Attention Conservation Notice: 1138 words using stiflingly simple computations to go through an example of finding the cause of a made-up case of a medical condition that is unlikely to ever affect you personally. Also, I am not a doctor.

Imagine that a one-and-a-half-week-old girl comes in to the hospital with a four-day illness consisting of cyanosis, unresponsiveness to pain, neck tenderness, and temperature instability. You perform a lumbar puncture and the CSF has elevated neutrophils and decreased glucose, suggestive of bacterial invasion. Which organism is most likely to be causing the infection?

In order to calculate prior probabilities, let’s use this data set of empirical frequencies from recent years. The most likely cause is Group B Strep (GBS; p = 0.46), followed by E. coli (p = 0.15).

Below, the prior probabilities that each organism in the data set will cause the infection are shown in a bubble chart using ggplot2. The probability is proportional to the area (not the radius) of the circle. The x axis denotes the typical gram stain of that organism; Acinetobacter and Serratia are considered gram variable.

The textbook empirical regimen for neonatal meningitis is cefotaxime and ampicillin. Most of the possible bacterial causes of neonatal meningitis are susceptible to cefotaxime, while ampicillin is used to treat Listeria and Enterococcus.

It is desirable to reduce the antibiotic spectrum as much as possible, so as soon as you know that the agent causing your infection is susceptible to one of the two antibiotics, the administration of the other is typically stopped.

The y-axis, therefore, denotes the resistance of each of the bacteria to cefotaxime, calculated using this pdf data set. Note that Acinetobacter, Enterococcus, and Listeria are actually “off the charts” insofar as they don’t even have their susceptibility determined as a MIC. Their values are set to an arbitrarily high value for visualization purposes.

Prior Probabilities For Neonatal Meningitis

Gram Stain Test

In order to narrow down the possibilities, we will first perform a gram stain of the CSF. Imagine that this is the result (modified from here):

gram positive bugs under the microscope

In addition to the color indicating gram positive bacteria, this is informative because it allows us to evaluate the morphology of the bacteria. Among the gram positive organisms, S. pneumo, GBS, S. aureus, and Enterococcus are all cocci, while Listeria is a rod. Among the gram variable organisms, Acinetobacter are coccobacilli and Serratia are rods.

Theoretically, we could distinguish between S. pneumo, GBS, S. aureus, and Enterococcus on the basis of how the cocci are distributed within the slide (i.e., in pairs, chains, or clusters), but that type of information is slightly more challenging to put into quantitative form given our currently available data and technology, and it won’t be as informative.

Gram stain sensitivities and specificities (e.g., for S. pneumo) are each about 97.5%, which corresponds to a likelihood ratio of 39 for each of the positive cases and a likelihood ratio of 0.026 for the negative cases.
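
For anyone who wants to check the arithmetic, here is a small Python sketch of the standard likelihood ratio formulas applied to the 97.5% figure above. The function name is mine; the formulas are the usual LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity.

```python
def likelihood_ratios(sensitivity: float, specificity: float):
    """Return (LR+, LR-) for a test with the given sensitivity and specificity."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


lr_pos, lr_neg = likelihood_ratios(0.975, 0.975)
print(round(lr_pos, 1), round(lr_neg, 3))  # prints: 39.0 0.026
```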

So, multiplying the likelihood for each bug by its prior probability (and renormalizing so that the probabilities sum to one) gives us the posterior probability; these are plotted for each agent below, and probabilities less than 0.01 are not shown.

probabilities after staining shows gram positive cocci

In terms of case management, the probability of Listeria has dropped significantly, but we must still keep our hypothetical patient on ampicillin because of the possibility of Enterococcal infection, which is consistent with the gram stain and has therefore increased in probability.
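
As a rough sketch of the multiply-and-renormalize step for the gram stain (the real numbers live in the spreadsheet linked in the notes, and may have been computed slightly differently), here it is in Python. Only the GBS and E. coli priors come from this post. The "other" bucket, its 0.5 likelihood, and the choice to use 0.975 and 0.025 as the likelihoods of seeing gram positive cocci are all assumptions made for illustration.

```python
def bayes_update(priors: dict, likelihoods: dict) -> dict:
    """Multiply each prior by its likelihood, then renormalize to sum to one."""
    unnormalized = {bug: priors[bug] * likelihoods[bug] for bug in priors}
    total = sum(unnormalized.values())
    return {bug: value / total for bug, value in unnormalized.items()}


# GBS and E. coli priors are from the post; "other" is simply the remainder,
# lumping every other organism together (an illustrative simplification).
priors = {"GBS": 0.46, "E. coli": 0.15, "other": 0.39}

# Assumed probability of observing gram positive cocci given each cause:
# sensitivity for a gram positive coccus, 1 - specificity for a gram
# negative rod, and a pure placeholder for the mixed "other" bucket.
likelihoods = {"GBS": 0.975, "E. coli": 0.025, "other": 0.5}

posterior = bayes_update(priors, likelihoods)
for bug, p in sorted(posterior.items(), key=lambda item: -item[1]):
    print(f"{bug}: {p:.3f}")
```

With these made-up inputs, GBS gains probability mass and E. coli falls below the 0.01 plotting threshold, which matches the qualitative story but not necessarily the exact values in the plot.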

Hemolysis Test

The next test that we perform is to streak the bacteria on sheep blood agar and see whether they demonstrate beta hemolysis. Imagine that this is the result:

weak beta-hemolysis on sheep blood agar

Although it is slightly weak, the region in the middle of the plate from which bacteria have been removed demonstrates beta-hemolysis. About 33% of Enterococci, 99%+ of S. aureus, 99%+ of GBS, and 1% or less of S. pneumo would show beta-hemolysis. (S. pneumo classically shows alpha hemolysis.)

We can convert these proportions into likelihoods, use the posterior probability of the last test as the prior for this test, and multiply the prior times the likelihood to get the new posterior. The results are below:

probabilities following beta hemolysis on sheep blood agar

Although Enterococcus is now less likely to be the bug because a smaller proportion of its typical clinical isolates show beta-hemolysis, it is still very possible, and the patient should remain on the course of ampicillin.
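
Here is a hedged Python sketch of chaining tests, where each posterior becomes the prior for the next test. The beta-hemolysis proportions are the ones quoted above. The equal starting weights and the 0.975 gram stain likelihood for every coccus are placeholders of mine, so the output illustrates the mechanics rather than reproducing the plotted values.

```python
from functools import reduce


def bayes_update(prior: dict, likelihood: dict) -> dict:
    """One Bayesian step: multiply prior by likelihood and renormalize."""
    unnormalized = {bug: prior[bug] * likelihood[bug] for bug in prior}
    total = sum(unnormalized.values())
    return {bug: value / total for bug, value in unnormalized.items()}


# Placeholder starting point: equal weight on the four gram positive cocci.
start = {"Enterococcus": 0.25, "S. aureus": 0.25, "GBS": 0.25, "S. pneumo": 0.25}

# Assumed: each coccus is equally likely to stain as gram positive cocci.
gram_positive_cocci = {bug: 0.975 for bug in start}

# Proportion of isolates showing beta-hemolysis, as quoted in the post.
beta_hemolysis = {
    "Enterococcus": 0.33,
    "S. aureus": 0.99,
    "GBS": 0.99,
    "S. pneumo": 0.01,
}

# Fold the tests in order; the running posterior is the next test's prior.
posterior = reduce(bayes_update, [gram_positive_cocci, beta_hemolysis], start)
for bug, p in posterior.items():
    print(f"{bug}: {p:.3f}")
```

Because the update is just multiplication followed by normalization, the order of the tests does not matter under the independence assumption discussed in the limitations.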

Lancefield Test

At this point, we pull out one of our big diagnostic guns to really narrow down the possibilities: the Lancefield antigen test. Specifically, we’ll use the BBL Streptocard Acid Latex Test (pdf). Here is the result of that test, with Lancefield antigens A, B, and C in the first three positions of the upper row from left to right:

Lancefield test shows agglutination when group B antigen is added (#2)

The test has a sensitivity of 98% and a specificity of 99% (well they claim 100%, but it’s a small sample size), which yields the high likelihood ratio of 98.
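
The same update can be written in odds form, which is where a likelihood ratio is most natural: posterior odds = prior odds x LR. The Python sketch below uses the 98% and 99% figures from above; the 0.5 prior for GBS is a placeholder I picked for illustration, not the value from the previous step.

```python
def update_with_lr(prior_probability: float, likelihood_ratio: float) -> float:
    """Convert to odds, multiply by the likelihood ratio, convert back."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# LR+ = sensitivity / (1 - specificity), using the figures quoted above.
lr_positive = 0.98 / (1.0 - 0.99)

# Placeholder prior of 0.5 for GBS before the Lancefield test.
posterior = update_with_lr(0.5, lr_positive)
print(round(lr_positive, 1), round(posterior, 3))
```

Even from even odds, a likelihood ratio of 98 pushes the probability to about 0.99, which is why this test dominates the final picture.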

Because the test was positive for the group B antigen and negative for the other antigens, Group B Strep is much more likely to be the cause of meningitis in this patient; this is reflected in its dominance of the probability mass below.

probabilities after positive test for Lancefield antigen B

At this point, you could probably take the patient off of ampicillin, since GBS is susceptible to cefotaxime.

Limitations:

  • In reality, most of the error in each of the tests is likely due to contamination, and since this same contamination might affect all three of the tests in the same way, it is somewhat foolhardy to assume that the tests are independent, as we have. However, a) this type of correlation data between tests is very difficult if not impossible to find, and b) the positive result on the Lancefield antigen test in particular is unlikely to be confounded by contamination. Ways to mitigate this are to a) take multiple samples of CSF and b) be scrupulous in laboratory procedures.
  • The probability of a false negative is considered uniform for all alternative possibilities. In reality, some bugs are probably much more likely to falsely appear to have a different set of properties on a particular test. If there are any bugs that are “great imitators” on a few such tests, then small differences in probability would begin to aggregate, and this method might miss one of those bugs really badly.

Notes:

  • Raw data and computations can be found in this Google Spreadsheet.
  • R code for making the bubble plots is on GitHub.
  • Thanks to Mike Chary and Joe Lerman for help on this. All mistakes are mine.