How to run kallisto on NCBI SRA RNA-seq data for differential expression using the Mac terminal

Attention Conservation Notice: This post explains how to run the exceptionally fast RNA-seq k-mer aligner kallisto from the Pachter lab on data you download from NCBI’s Sequence Read Archive, and then analyze it for differential expression using voom/limma. As with everything in bioinformatics, this will likely be obsolete in months, if not weeks.


Kallisto is really fast and light on memory without sacrificing accuracy (at least, according to its paper), and therefore has the potential to make your life a lot easier when it comes to analyzing RNA-seq data.

As a test data set, I used the very useful DNAnexus SRA search tool to search the SRA database for a human transcriptome study with 2-3 biological replicates in each of 2 groups, so that I could make reasonably robust differential expression calls without having to download too much data.

I ended up using SRP006900, which is titled “RNA-seq of two paired normal and cancer tissues in two stage III colorectal cancer patients.” I wasn’t able to find a published differential expression analysis of it in a quick 3-5 minute search, but since it has been public for almost four years, I’m sure that it has been analyzed in such a manner. The data are single-end reads with an average of 65 base pairs per read.

In order to download this data set (which totals ~10 GB across its four runs, in .fastq format) from the SRA via the command line, I used the SRA toolkit. You’ll want to make sure it downloaded correctly by running a test command such as the following:

$PATH/sratoolkit.2.5.0-mac64/bin/fastq-dump -X 5 -Z SRR390728 

You’ll also want to download kallisto. In order to install it, I copied the binary to my /usr/local/bin/ folder to make it globally executable, as described on the kallisto website.
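For the copy step, here’s a minimal sketch; the unpack path below is a hypothetical placeholder, so adjust it to wherever you actually saved the download:

```shell
# Hypothetical path -- adjust to wherever you unpacked the kallisto download.
KALLISTO_BIN="$HOME/Downloads/kallisto/kallisto"
if [ -x "$KALLISTO_BIN" ]; then
    cp "$KALLISTO_BIN" /usr/local/bin/
    kallisto version   # sanity check: should print the version string
else
    echo "kallisto binary not found at $KALLISTO_BIN"
fi
```

Depending on your permissions, the copy into /usr/local/bin/ may require sudo.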

You also need to download the fasta file for your transcriptome of interest and create an index file that kallisto can work with. You can download the hg38 reference transcriptome from the kallisto website or from UCSC.

As prerequisites, it is helpful to have Homebrew installed as a package manager. If you do (or once you do), these three commands did the trick for me:

brew update
brew install cmake
brew install homebrew/science/hdf5

Next you’ll need to actually download the data from each run, convert it to .fastq, and then run kallisto on it. For extensibility, I did this with a short config file and shell script.

First, a config file with the sample names and average read length (in my experience, kallisto needs the latter if the data is single-end):

declare -a arr=("SRR222175" "SRR222176" "SRR222177" "SRR222178")
AVG_READ_LENGTH=65

(Update 5/13/15: The average read length should be average fragment length, which I’m working on figuring out the best way of estimating from single-end reads (if one is possible). In the meantime, it may be easier to choose a paired-end read sample from SRA instead, from which the average fragment length can be estimated by kallisto. Thanks to Hani Goodarzi for the helpful pointer that I was using the wrong parameter.)

Next, a shell script executes the relevant commands on these sample names to build the abundance.txt files. Note that this data set has single-end reads, so you pass only one .fastq file, rather than the two in the kallisto example. I put each output in a directory called output_$RUNNAME that I can read from R for downstream analyses. Here’s the full shell script (also on GitHub):

#path to where the SRA toolkit bin folder is located; could be eliminated by making the toolkit globally executable 
PATH_SRA_DIR="your_path_here/"

#go to where your kallisto test directory is found, if necessary
cd your_path_here

#load the config file
source test_kallisto/config_SRP006900.file

#download the data 
for i in "${arr[@]}"
do
   "$PATH_SRA_DIR"sratoolkit.2.5.0-mac64/bin/fastq-dump --accession $i --outdir test_kallisto
done

#create the transcripts index file 
kallisto index -i transcripts.idx transcripts.fasta.gz

#create output directories under test_kallisto (where the R code expects them),
#then run kallisto on each run's .fastq that fastq-dump wrote there
for i in "${arr[@]}"
do
	mkdir test_kallisto/output_"$i"
	kallisto quant -i transcripts.idx -o test_kallisto/output_"$i" -l "$AVG_READ_LENGTH" test_kallisto/"$i".fastq
done
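Before moving to R, it’s worth a quick sanity check that each run produced a non-empty abundance.txt. This sketch assumes the output directories sit under test_kallisto/, which is where the R code below looks for them; the accession list is repeated so the snippet stands alone:

```shell
# Sample accessions repeated here so this check is self-contained.
for i in SRR222175 SRR222176 SRR222177 SRR222178
do
    if [ -s "test_kallisto/output_${i}/abundance.txt" ]; then
        echo "ok: ${i}"
    else
        echo "missing or empty: ${i}"
    fi
done
```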

Once you have these files, you can analyze them in R using limma. Here’s how I went about it. First, read in the files and merge the transcripts per million (tpm) calls into one data frame:

library(limma)

#sample names of the runs downloaded from SRA 
sample_list_n = c("SRR222175", "SRR222176", "SRR222177", "SRR222178")

for(i in 1:length(sample_list_n)){
	tmp = read.table(file = paste0("test_kallisto/output_", 
		sample_list_n[i], "/abundance.txt"), header = TRUE) 
	assign(sample_list_n[i], tmp)
}

sample_list = mget(sample_list_n)

#give the list unique names 
sample_list_uni = Map(function(x, i) setNames(x, ifelse(names(x) %in% "target_id",
      names(x), sprintf('%s.%d', names(x), i))), sample_list, seq_along(sample_list))

full_kalli = Reduce(function(...) merge(..., by = "target_id", all=T), sample_list_uni)

tpm_vals = full_kalli[, grep("tpm", names(full_kalli))]
rownames(tpm_vals) = full_kalli$target_id

Then make a contrasts matrix for the normal vs. cancer comparison:

groups = c("normal", "colon_cancer", "normal", "colon_cancer")
condition <- model.matrix(~0 + groups)
#model.matrix sorts the factor levels alphabetically, so colon_cancer is column 1
colnames(condition) = c("colon_cancer", "normal")
cont_matrix = makeContrasts(norm_canc = normal - colon_cancer, levels = condition)

Then compute differential expression using that contrast matrix:

#this spits out an EList object 
v = voom(counts = tpm_vals, design = condition)
fit = lmFit(v, condition)
fit = contrasts.fit(fit, cont_matrix)
fit = eBayes(fit)
top_table = topTable(fit, n = 10000, sort.by = "p")

And that’s it. Interestingly, the transcript called as the #1 most differentially expressed using this method, ATP5A1, has been previously implicated as a marker for colon cancer.

Update 5/13/15: Fixed two typos, switched one link, and switched “genome” to “transcriptome” (I was sloppy previously) per @pmelsted’s useful comment on twitter.

Update 8/10/15: As Seb Battaglia points out on twitter, “voom assumes you cut out out 0s and low intensity tpm. I’d add that step or data hist() will be skewed.” I don’t have the opportunity to re-run the analyses right now (!), but please take this into consideration.
