Bayesian TSS Estimation

Abstract: A way to estimate transcription start sites (TSSs) in bacteria using an informative prior for a particular bacterial pathogen. Incomplete, but somebody might find it useful. First Published: 12/28/12. Last Updated: 12/28/12.

De novo promoter searches are more likely to find the true motif when the location of the transcription start site (TSS) is known more accurately, because this narrows the search space of possible motif locations and so increases the signal-to-noise ratio. So, it is useful to estimate TSSs accurately and to record how accurate each estimate is.

A TSS is typically assumed to have one “true” location per promoter at a particular nucleotide in the genome. Different RNA sequencing samples yield different estimates for where this location is.

So, we need a model that estimates the true location by combining information from all of the samples. If we assume that each sample’s estimate is the “true” location plus independent noise, a normal distribution centered on the true location is a natural choice for the likelihood.
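As a minimal sketch of that likelihood (not the project's actual code), the snippet below scores each candidate TSS position by the normal log-likelihood of the per-sample estimates. The function name, the coordinates, the candidate range, and sigma = 3.0 are all hypothetical values chosen for illustration.

```python
import math

def normal_log_likelihood(candidate, sample_estimates, sigma):
    """Log-likelihood of one candidate TSS position, assuming each sample's
    estimate is an independent normal draw centered on that position."""
    n = len(sample_estimates)
    squared_error = sum((x - candidate) ** 2 for x in sample_estimates)
    return (-0.5 * squared_error / sigma ** 2
            - n * math.log(sigma * math.sqrt(2.0 * math.pi)))

# Hypothetical per-sample TSS estimates (genome coordinates) for one promoter.
sample_estimates = [1052, 1049, 1055, 1048]

# Hypothetical range of candidate positions to score.
candidates = range(1030, 1071)

# sigma is the standard deviation of the normal likelihood (assumed value).
log_lik = {t: normal_log_likelihood(t, sample_estimates, sigma=3.0)
           for t in candidates}

# With no prior, the best candidate is the one closest to the sample mean.
ml_position = max(log_lik, key=log_lik.get)
print(ml_position)
```

On its own, this likelihood simply favors the candidate nearest the mean of the sample estimates; the prior is what lets knowledge about the organism shift or sharpen that answer.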

Then, we use an informative prior on the distance from the TSS to the translation start site. The prior is based on an empirical distribution of these distances, taken from independent data for the organism of interest in a previous paper, which we fit using a generalized lambda distribution. Finally, we compute the posterior probability using Bayes’ theorem.
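In symbols (my notation, not taken from the original figure or code), with t a candidate TSS position, x_1, ..., x_n the per-sample estimates, σ² the variance hyperparameter, and p(t) the prior evaluated at the distance from t to the translation start site, the posterior over a discrete set of candidates is:

$$p(t \mid x_1, \dots, x_n) = \frac{p(t) \prod_{i=1}^{n} \mathcal{N}(x_i \mid t, \sigma^2)}{\sum_{t'} p(t') \prod_{i=1}^{n} \mathcal{N}(x_i \mid t', \sigma^2)}$$

A rough sketch of this calculation follows. It assumes the Ramberg-Schmeiser parameterization of the generalized lambda distribution (the post does not say which parameterization was used), a forward-strand gene, and made-up lambda values, coordinates, and sigma. All function names are hypothetical.

```python
import math

def gld_quantile(u, lam1, lam2, lam3, lam4):
    """Quantile function of the generalized lambda distribution
    (Ramberg-Schmeiser parameterization, assumed here)."""
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2

def gld_pdf(x, lam1, lam2, lam3, lam4, tol=1e-10):
    """Density at x, found by inverting the quantile function with bisection.
    Assumes lam2, lam3, lam4 > 0, so the quantile function is increasing
    and the support is bounded."""
    lo, hi = 0.0, 1.0
    if (x <= gld_quantile(lo, lam1, lam2, lam3, lam4)
            or x >= gld_quantile(hi, lam1, lam2, lam3, lam4)):
        return 0.0  # x lies outside the support
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gld_quantile(mid, lam1, lam2, lam3, lam4) < x:
            lo = mid
        else:
            hi = mid
    u = 0.5 * (lo + hi)
    # Density is the reciprocal of the quantile function's derivative at u.
    return lam2 / (lam3 * u ** (lam3 - 1.0) + lam4 * (1.0 - u) ** (lam4 - 1.0))

def tss_posterior(candidates, sample_estimates, sigma, translation_start, gld_params):
    """Normalized posterior over candidate TSS positions (forward strand assumed)."""
    log_post = []
    for t in candidates:
        prior = gld_pdf(translation_start - t, *gld_params)
        if prior <= 0.0:
            log_post.append(-math.inf)
            continue
        squared_error = sum((x - t) ** 2 for x in sample_estimates)
        # Constant terms of the normal density cancel on normalization.
        log_post.append(math.log(prior) - 0.5 * squared_error / sigma ** 2)
    peak = max(log_post)
    if peak == -math.inf:
        raise ValueError("prior is zero at every candidate position")
    # Subtract the peak before exponentiating to avoid underflow.
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical inputs: lambda values are illustrative, not fitted to any data.
gld_params = (50.0, 0.02, 0.3, 0.8)   # support is distances 0 to 100 nt
candidates = list(range(1030, 1071))
sample_estimates = [1052, 1049, 1055, 1048]
posterior = tss_posterior(candidates, sample_estimates, sigma=3.0,
                          translation_start=1100, gld_params=gld_params)
map_position = candidates[posterior.index(max(posterior))]
print(map_position, max(posterior))
```

Besides the most probable position, the posterior gives a direct measure of how confident the estimate is: a flat posterior signals a poorly determined TSS, which is exactly the kind of accuracy information worth storing alongside the estimate.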

Here are some specific predictions that I made for one organism:

[Figure: Bayesian TSS]

Here is the code. Note that this project is not complete. For example, I am currently setting the variance hyperparameter of the normal distribution somewhat arbitrarily. If you have any questions, please email me.