RNA-seq Data Mapping with Bowtie: Setup and Best Practices

Getting the computer setup • Follow directions on handout to login to server. • Type “qsub -I” to get a compute node. • The data you will be using is stored in ../shared/

Mapping RNA-seq data Matthew Young Alicia Oshlack Bernie Pope

Illumina Sequencing Technology 3’ 5’ G C T A C G T A C T A C C 5’ C C G A T A A A C G T T T A T G G G C 1 2 3 4 5 6 7 8 9 Base calling TGCTACGAT … DNA(0.1-1.0 ug) Sample preparation Single molecule array Cluster growth Sequencing Image acquisition Slide courtesy of G Schroth, Illumina

Raw data • Short sequence reads • Quality scores = -10log10(p) or similar… @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 Handout Reference: Page 2

Quality checks • Base composition • Quality score • PCR Artifacts

Sequencing transcripts not the genome Reads Gene CDS CDS CDS CDS CDS CDS transcript CDS CDS Difficulty here is that reads spanning a exon-exon junction may not get mapped when mapping to the genome. One strategy: Supplement the reference genome with sequences that span all known or possible junctions.

CDS CDS CDS CDS Exons Coding Sequence Introns Splice Junctions • Aggregate reads based on: • exons • Exons + junctions • All reads start to end of transcript • De novo methods

Mapping tools

Familiarizing yourself with bowtie • The minimum information bowtie needs is your reads (i.e., the fastq file from the machine) and a reference. • The reference is the genome or transcriptome we are trying to map to, transformed using a Burrows Wheeler Transform to allow fast searching. • Many optional parameters to tweak alignment. Handout Reference: Pages 2-7

How do you map 10^9, 76bp sequences, to a 10^9 bp reference • Ideally we’d test every position on the genome for its suitability as a match and assign it a score based on # mismatches, indels etc. • However, this is computationally impossible, so we have to come up with something else. Handout Reference: Pages 2-7

Lots of aligners, one general strategy • Quick “heuristic” is performed to cut down the number of candidate alignment regions for each read. • More precise algorithm is employed to decide which of these candidates is a valid alignment. Handout Reference: Pages 2-7

A fragment as seen by an aligner • Heuristic acts only on the seed. Putative mapping locations identified. • A more precise algorithm extends each seed to the full read and ranks them. • The fragment is not sequenced directly, it’s sequence is inferred based on mapping of the read. ; Seed Burrows Wheeler acts on this Smith-Waterman acts on this Read Fragment Handout Reference: Pages 2-7

Our test data set • 76bp single end reads. • Sequenced using the Illumina Genome Analyzer • RNA taken from Mouse myoblast cell line. • We only look at a subset of one lane, full data available here http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2084

A strict mapping bowtie -t -p 1 -n 0 -e 1 --sam --best --un Strict_Unmapped.fastq ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastqStrict_Mapped.sam • -n sets the number of mismatches in the seed • The sum of quality scores at ALL mismatches (not just the seed) must be less than -e. • --un saves the unmapped reads to the file specified • -t prints timing information • -p sets the number of simultaneous threads • --sam makes bowtie output in SAM format • --best ensures the best map is returned Handout Reference: Page 8

Using the defaults bowtie -t -p 1 --best --sam ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastqtest.sam • -t prints timing information • -p sets the number of simultaneous threads • --sam makes bowtie output in SAM format • --best ensures the best map is returned • --un saves the unmapped reads to the file specified Handout Reference: Page 8

Allowing some mismatches Bowtie -t -p 1 -n 3 -e 200 --sam –best --un Loose_Unmapped.fastq ../shared/BowtieIndexes/mm9Strict_Unmapped.fastqLoose_Mapped.sam • Set the number of seed mismatches to the maximum. • Set -e to a value more appropriate for our read length. • How many reads do you map? Handout Reference: Pages 8-9

A closer look at our data set: fastQC Handout Reference: Pages 9-10

Handout Reference: Pages 9-10

Trimming reads • Bowtie allows you to trim reads before it attempts to map them. • You can trim from the left (5’) end of the read with the --trim5 option. • You can trim from the right (3’) end of the read with the --trim3 option. Handout Reference: Page 10

Sequencing transcripts not the genome Reads Gene CDS CDS CDS CDS CDS CDS transcript CDS CDS A library containing these bits of sequence (which do not appear in the genome) can help map junction reads. This is called an exon-junction library. A reference built from this library is in BowtieIndexes, called mm9.UCSC.knownGene.junctions (named for the annotation it was built from). Handout Reference: Pages 10-12

Options to map more reads… • Trim some bases from the end of the reads using --trim5 and/or --trim3. • Map to the junction library instead of the mouse genome using the mm9.UCSC.knownGene.junctions index. Handout Reference: Pages 9-12

Trim [myoung@bionode11 Starting]$ bowtie -t -p 1 -n 2 -e 70 --trim5 15 --trim3 25 --sam --best --un Trimmed_Unmapped.fastq ../shared/BowtieIndexes/mm9 Loose_Unmapped.fastqTrimmed_Mapped.sam Time loading forward index: 00:00:02 Time loading mirror index: 00:00:02 Seeded quality full-index search: 00:00:47 # reads processed: 531414 # reads with at least one reported alignment: 164256 (30.91%) # reads that failed to align: 367158 (69.09%) Reported 164256 alignments to 1 output stream(s) Time searching: 00:00:53 Overall time: 00:00:54 Handout Reference: Page 10

Then map to junctions [myoung@bionode11 Starting]$ bowtie -t -p 1 -n 3 -e 200 --sam --best --un Junctions_Unampped.fastq ../shared/BowtieIndexes/mm9.UCSC.knownGene.junctions Trimmed_Unmapped.fastqJunctions_Mapped.sam Time loading forward index: 00:00:35 Time loading mirror index: 00:00:34 Seeded quality full-index search: 00:00:24 # reads processed: 367158 # reads with at least one reported alignment: 128756 (35.07%) # reads that failed to align: 238402 (64.93%) Reported 128756 alignments to 1 output stream(s) Time searching: 00:01:43 Overall time: 00:01:45 Handout Reference: Page 11

Number of mapped reads Handout Reference: Page 12

Further options Handout Reference: Pages 12-13

RNA-seq Data Mapping with Bowtie: Setup and Best Practices

RNA-seq Data Mapping with Bowtie: Setup and Best Practices

Presentation Transcript

Computer Basics 1: Getting Started

Getting the Map into the Computer

Getting To Know Today’s Computer !!

The Setup

Getting Setup to use the ORGN

Getting the Map into the Computer

Getting Started with Computer Graphics

Getting Started on the Computer

Computer/Projector/Document Camera Setup

Best Computer Setup Service Provider:Easytechy

Computer Protection | Norton Com Setup | Setup Norton

New computer setup Manhasset

Choose the www.office.com/setup on Computer

NORTON.COM/SETUP NORTON ANTIVIRUS FOR COMPUTER

How to Activate Norton Setup In Computer at norton.com/setup

Norton Setup - Computer Service

How to Activate Norton Setup In Computer at norton.com/setup

Norton.com/setup - Install Norton setup on your computer

The Best Recommendation On Getting A Computer

The Best Advice On Getting A Desktop Computer

Mcafee.com setup- Best computer antivirus software

Office.com/setup - Install Office Setup in Windows/Mac Computer