Aligning Short Reads to References with Bowtie, Tophat, and Cufflinks


Provided are directions for how to convert the sequence output of next-generation sequencers, especially those of Illumina, to measures of differential expression, and for how to discover splice forms, using the suite of tools beginning with Bowtie, Tophat, and Cufflinks.   These program are available on lewis and one needs an account on that system to use these tools.   We provide scripts for how to submit jobs to lewis, and only what is needed in the way of details of these three packages.   For those details go to the pertinent web sites.

1.  Create the indexed reference with Bowtie


If there is not already a pre-built reference index for your organism (see the list on the Bowtie site) you will need to build it yourself using the bowtie-build command.   If there is a reference already made copy the reference into a place under your data directory on lewis then skip to step 2 below.

Bowtie is designed to align short DNA sequences such as Illumina and SOLiD reads to a reference genome.   Here we use Bowtie to provide indexed references for use by Tophat and Cufflinks.
Three bowtie commands are available from the command line of lewis:

bowtie   performs the alignment of query sequence with the indexed reference
bowtie-build   indexes the reference sequences
bowtie-inspect   recreates reference strings from Bowtie index
The options for each of these can be displayed by typing in the command and hitting return.

Indexing the reference on Lewis

We will use maize to illustrate how to create an index of the reference sequences against which to align your Illumina reads.   Create a directory ("maize") where you store the fasta-format reference MaizeGDNA.fa and the database.    Write a script ("CreateGDNAdb") like this one:
#BSUB -J CreateGDNAdb
#BSUB -q multi
#BSUB -o CreateGDNAdb.o%J
#BSUB -e CreateGDNAdb.e%J
#BSUB -n 4
#BSUB -R "span[hosts=1];rusage[mem=5500]"

bowtie-build MaizeGDNA.fa MaizeGDNA
MaizeGDNA.fa is the input and MaizeGDNA is the prefix used in the output files.  To submit this job type
bsub < CreateGDNAdb
A 2.1 GB file of genomic DNA took ~5.5 hrs to finish.  
The bowtie-build program produces 6 files each with suffix of "ebwt".   Test that the index is installed correctly with a sequence known to be in the reference genome:
bowtie -c MaizeGDNA ATGACGGCGGCCGAGAAGTTGGATTTCTTGGTAAACCAGGTCACATCCTTGACTGATCTGGGGACGAAATTGT
If installed properly, this command should print a single alignment and then exit.  If you have trouble building the index consult the manual for some possible options and details.

2.   Align RNA-Seq reads to the reference with Tophat


Tophat aligns RNA-Seq reads to a genome and then identifies splice junctions between the exons found.  Fifteen different Tophat programs are available and at present are all to be found under the path /share/apps/tophat-1.0.13/bin/.  The options for each of these can be displayed by typing in the command followed by --help and hitting return.  Note that TopHat requires a fasta file (.fa) for your reference and will create the file from the Bowtie index if it is not already present in the same directory as the index.  Creation of this fasta file may take an hour for larger genomes so save time by including the fasta file along with the index.  The read sequence files may be either fasta for fastq (not a mix of the two).   Fastq is preferred.

Run Tophat

The command to run Tophat looks like this:
tophat [options] <bowtie_index> <reads1[,reads2,...,readsN]> [reads1[,reads2,...,readsN]]

Note that there are two sets of read files grouped by either "<>" or "[]".   The first set of reads are obtained from a single end run, or, are the first members of a paired-end run.  The second set of reads are the second member of the paired-end run.  For a paired end run the first and second members must be in the same order.  Note the use of commas between the read files within each set.   Without them the program will assume the first and second files are members of a read pair.

The following script, RunTophatOnMaizeGenome, could be used to submit the job to LSF on lewis:
#BSUB -J RunTophatOnMaizeGenome
#BSUB -o RunTophatOnMaizeGenome.o%J
#BSUB -e RunTophatOnMaizeGenome.e%J
#BSUB -n 4
#BSUB -R "span[hosts=1];rusage[mem=5500]"

tophat -p 4 --solexa1.3-quals MaizeGDNA s7.reads,s8.reads
Note that 4 cores are indicated by both the "-p 4" option and the "#BSUB -n 4" statement.   The "--solexa1.3-quals" flag tells it to use the Solexa pipeline version 1.3 scale for quality scores in the fastq file.

This particular run took about 5.5 hrs using two single end Illumina fastq files of ~2.4 GB each (16 million 42-base reads in each).



3.   Cufflinks to test for differential expression


Cufflinks "assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples."   The input to cufflinks is a SAM-formatted text file.   This file must be sorted by the reference position (chromosome, then start position in that chromosome).   If you have used Tophat to create this file it is already sorted correctly.   Otherwise see the Cufflinks site for directions on how to sort it.

Here is a script that will execute cufflinks:

#BSUB -J RunCufflinksOnMaizeRNASeq
#BSUB -o RunCufflinksOnMaizeRNASeq.o%J
#BSUB -e RunCufflinksOnMaizeRNASeq.e%J
#BSUB -n 4
#BSUB -R "span[hosts=1];rusage[mem=5500]"

cufflinks -p 4 aligned_reads.sam
where the aligned_reads.sam file is one of the outputs of tophat that is in the SAM format.   Note that the -p flag and the #BSUB -n 4 line are again referring to the number of cores to use and that the maximum is 4.

Cufflinks produces three output files, transcripts.gtf, transcripts.expr, and genes.expr.   The cuffcompare program takes multiple .gtf files as input to track changes in transcripts across experiments, and can be used to compare your transcripts to a reference genome.