270 likes | 340 Views
Getting the computer setup. Follow directions on handout to login to server. Type “ qsub - I ” to get a compute node. The data you will be using is stored in ../shared/. Mapping RNA-seq data. Matthew Young Alicia Oshlack Bernie Pope. Illumina Sequencing Technology. 3’. 5’. G. C. T.
E N D
Getting the computer setup • Follow directions on handout to login to server. • Type “qsub -I” to get a compute node. • The data you will be using is stored in ../shared/
Mapping RNA-seq data Matthew Young Alicia Oshlack Bernie Pope
Illumina Sequencing Technology 3’ 5’ G C T A C G T A C T A C C 5’ C C G A T A A A C G T T T A T G G G C 1 2 3 4 5 6 7 8 9 Base calling TGCTACGAT … DNA(0.1-1.0 ug) Sample preparation Single molecule array Cluster growth Sequencing Image acquisition Slide courtesy of G Schroth, Illumina
Raw data • Short sequence reads • Quality scores = -10log10(p) or similar… @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 Handout Reference: Page 2
Quality checks • Base composition • Quality score • PCR Artifacts
Sequencing transcripts not the genome Reads Gene CDS CDS CDS CDS CDS CDS transcript CDS CDS Difficulty here is that reads spanning a exon-exon junction may not get mapped when mapping to the genome. One strategy: Supplement the reference genome with sequences that span all known or possible junctions.
CDS CDS CDS CDS Exons Coding Sequence Introns Splice Junctions • Aggregate reads based on: • exons • Exons + junctions • All reads start to end of transcript • De novo methods
Getting the computer setup • Follow directions on handout to login to server. • Type “qsub -I” to get a compute node. • The data you will be using is stored in ../shared/
Familiarizing yourself with bowtie • The minimum information bowtie needs is your reads (i.e., the fastq file from the machine) and a reference. • The reference is the genome or transcriptome we are trying to map to, transformed using a Burrows Wheeler Transform to allow fast searching. • Many optional parameters to tweak alignment. Handout Reference: Pages 2-7
How do you map 10^9, 76bp sequences, to a 10^9 bp reference • Ideally we’d test every position on the genome for its suitability as a match and assign it a score based on # mismatches, indels etc. • However, this is computationally impossible, so we have to come up with something else. Handout Reference: Pages 2-7
Lots of aligners, one general strategy • Quick “heuristic” is performed to cut down the number of candidate alignment regions for each read. • More precise algorithm is employed to decide which of these candidates is a valid alignment. Handout Reference: Pages 2-7
A fragment as seen by an aligner • Heuristic acts only on the seed. Putative mapping locations identified. • A more precise algorithm extends each seed to the full read and ranks them. • The fragment is not sequenced directly, it’s sequence is inferred based on mapping of the read. ; Seed Burrows Wheeler acts on this Smith-Waterman acts on this Read Fragment Handout Reference: Pages 2-7
Our test data set • 76bp single end reads. • Sequenced using the Illumina Genome Analyzer • RNA taken from Mouse myoblast cell line. • We only look at a subset of one lane, full data available here http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2084
Getting the computer setup • Follow directions on handout to login to server. • Type “qsub -I” to get a compute node. • The data you will be using is stored in ../shared/
A strict mapping bowtie -t -p 1 -n 0 -e 1 --sam --best --un Strict_Unmapped.fastq ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastqStrict_Mapped.sam • -n sets the number of mismatches in the seed • The sum of quality scores at ALL mismatches (not just the seed) must be less than -e. • --un saves the unmapped reads to the file specified • -t prints timing information • -p sets the number of simultaneous threads • --sam makes bowtie output in SAM format • --best ensures the best map is returned Handout Reference: Page 8
Using the defaults bowtie -t -p 1 --best --sam ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastqtest.sam • -t prints timing information • -p sets the number of simultaneous threads • --sam makes bowtie output in SAM format • --best ensures the best map is returned • --un saves the unmapped reads to the file specified Handout Reference: Page 8
Allowing some mismatches Bowtie -t -p 1 -n 3 -e 200 --sam –best --un Loose_Unmapped.fastq ../shared/BowtieIndexes/mm9Strict_Unmapped.fastqLoose_Mapped.sam • Set the number of seed mismatches to the maximum. • Set -e to a value more appropriate for our read length. • How many reads do you map? Handout Reference: Pages 8-9
A closer look at our data set: fastQC Handout Reference: Pages 9-10
Trimming reads • Bowtie allows you to trim reads before it attempts to map them. • You can trim from the left (5’) end of the read with the --trim5 option. • You can trim from the right (3’) end of the read with the --trim3 option. Handout Reference: Page 10
Sequencing transcripts not the genome Reads Gene CDS CDS CDS CDS CDS CDS transcript CDS CDS A library containing these bits of sequence (which do not appear in the genome) can help map junction reads. This is called an exon-junction library. A reference built from this library is in BowtieIndexes, called mm9.UCSC.knownGene.junctions (named for the annotation it was built from). Handout Reference: Pages 10-12
Options to map more reads… • Trim some bases from the end of the reads using --trim5 and/or --trim3. • Map to the junction library instead of the mouse genome using the mm9.UCSC.knownGene.junctions index. Handout Reference: Pages 9-12
Trim [myoung@bionode11 Starting]$ bowtie -t -p 1 -n 2 -e 70 --trim5 15 --trim3 25 --sam --best --un Trimmed_Unmapped.fastq ../shared/BowtieIndexes/mm9 Loose_Unmapped.fastqTrimmed_Mapped.sam Time loading forward index: 00:00:02 Time loading mirror index: 00:00:02 Seeded quality full-index search: 00:00:47 # reads processed: 531414 # reads with at least one reported alignment: 164256 (30.91%) # reads that failed to align: 367158 (69.09%) Reported 164256 alignments to 1 output stream(s) Time searching: 00:00:53 Overall time: 00:00:54 Handout Reference: Page 10
Then map to junctions [myoung@bionode11 Starting]$ bowtie -t -p 1 -n 3 -e 200 --sam --best --un Junctions_Unampped.fastq ../shared/BowtieIndexes/mm9.UCSC.knownGene.junctions Trimmed_Unmapped.fastqJunctions_Mapped.sam Time loading forward index: 00:00:35 Time loading mirror index: 00:00:34 Seeded quality full-index search: 00:00:24 # reads processed: 367158 # reads with at least one reported alignment: 128756 (35.07%) # reads that failed to align: 238402 (64.93%) Reported 128756 alignments to 1 output stream(s) Time searching: 00:01:43 Overall time: 00:01:45 Handout Reference: Page 11
Number of mapped reads Handout Reference: Page 12
Further options Handout Reference: Pages 12-13