Please DO NOT switch on your computers – yet. RNA-seq Analysis Graham Etherington Sainsbury Laboratory Training Course http://tsltraining.tsl.ac.uk/
Today's topics • The basics – What is RNA-seq, paired-end reads, alternative splicing • Considerations before sequencing • Library prep • Which ‘contaminant’ RNA (rRNA, abundant transcripts) to remove, and how • Sequencing • Quality control • Assembly techniques • Reference-based alignment • De-novo assembly • Combined assembly (Align-then-assemble vs Assemble-then-align) • Choosing a strategy and a program • Expression analysis
Today's topics • Tutorials • Reference-based transcript assembly and expression analysis without annotation using Galaxy • TopHat – Cufflinks - Cuffmerge - Cuffdiff • De-novo assembly using Trinity
What is RNA-seq? • Genome • Genes • Extract mRNA (expressed genes) • Sequence mRNA • Assemble into transcripts
RNA-seq basics - Paired-end reads • Sequences can be paired-end • Sequences occur as ‘pairs’, with one left-hand (forward) read and one right-hand (reverse) read. • There is a given distance (insert size) between the start and end of each pair. [Figure: paired ends. Left (forward) read, 76 nucleotides; right (reverse) read, 76 nucleotides; 500 nt DNA fragment with a ~350 nt gap between the reads; ~500 nt ‘insert size’]
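The numbers on the paired-end slide can be checked with a little arithmetic. The sketch below is illustrative only: the function name `inner_gap` is made up here, and it assumes the ‘insert size’ refers to the whole fragment, as the slide uses it.

```python
def inner_gap(fragment_len: int, read_len: int) -> int:
    """Number of unsequenced bases between the two reads of a pair.

    Assumes both reads have the same length and that the fragment
    length is what the slide calls the 'insert size'.
    """
    gap = fragment_len - 2 * read_len
    # If the fragment is shorter than two reads, the reads overlap
    # and there is no unsequenced gap.
    return max(gap, 0)


# Slide example: 500 nt fragment, two 76 nt reads.
print(inner_gap(500, 76))  # 348, i.e. the "~350 nt gap" on the slide
```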
RNA-seq – the basics • Genome of interest. • How many genes (mRNAs) are there? • Are some novel? • Are there alternatively spliced isoforms? • Which genes are expressed under different environmental conditions (cf microarrays)? • Are some expressed more than others?
Pre-sequencing • Library prep. • Multiple insert sizes capture both short and long transcripts plus alternatively spliced isoforms • longer insert sizes offer long-range exon connectivity • Which RNA to select • poly-A tail RNA • misses ncRNA + rare mRNAs without poly-A tail • leave all RNAs in then remove rRNA by ‘hybridisation-based depletion methods’ • biases quantification of highly abundant transcripts • Strand-specific protocols • Aid assembly and quantification of overlapping transcripts from opposite strands
Post-sequencing • Quality control • LOTS of data – don’t worry about throwing a lot of it away • remove reads that are too short or too long • remove reads with Ns • remove PCR duplicates • remove/trim low-quality reads/regions • remove low-copy k-mers
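A minimal sketch of the kind of read filtering listed on the quality-control slide. Everything specific here is an assumption made for illustration: the function names, the length and quality thresholds, the Phred+33 quality encoding, and the crude duplicate removal by exact sequence match (real duplicate removal usually works on mapped positions). It does not trim reads or filter k-mers.

```python
def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def passes_qc(seq, qual, min_len=30, max_len=150, min_mean_q=20):
    """Apply the length, N-content and quality filters to one read."""
    if not (min_len <= len(seq) <= max_len):
        return False  # too short or too long
    if "N" in seq.upper():
        return False  # contains ambiguous bases
    return mean_quality(qual) >= min_mean_q


def filter_reads(reads, **thresholds):
    """Keep reads that pass QC, dropping exact-sequence duplicates."""
    seen = set()
    kept = []
    for seq, qual in reads:
        if not passes_qc(seq, qual, **thresholds):
            continue
        if seq in seen:
            continue  # crude stand-in for PCR duplicate removal
        seen.add(seq)
        kept.append((seq, qual))
    return kept


reads = [
    ("ACGTACGT", "IIIIIIII"),  # good read
    ("ACGTACGT", "IIIIIIII"),  # exact duplicate
    ("ACNTACGT", "IIIIIIII"),  # contains an N
    ("ACG", "III"),            # too short
    ("ACGTTTTT", "!!!!!!!!"),  # low quality
]
print(filter_reads(reads, min_len=4, max_len=10))
```

Only the first read survives in this toy example; the thresholds would need tuning for real data.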
Reference-based Alignment • Use when a closely-related reference is available. • 3 steps • Use a splice-aware aligner (e.g. BLAT, TopHat). • Cluster reads from each locus to build isoform graphs. • Traverse graph to resolve isoforms (e.g. Cufflinks, Scripture)
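Splice-aware aligners • Two types: seed-and-extend and Burrows-Wheeler transform (BWT) • Seed-and-extend • SEED: part of the read is matched to the reference • EXTEND: the alignment is extended from the seed along the rest of the read [Figure: seed GGACG matched within the reference ATGGACGTCATGTTC]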
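The seed-and-extend idea can be sketched in a few lines using the seed and reference from the slide. This is a simplified illustration, not how BLAT or any real aligner is implemented: the function names are made up, the seed is always the first k bases of the read, the extension is ungapped, and no splicing is handled.

```python
def build_seed_index(reference, k):
    """Map every k-mer in the reference to its start positions."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def seed_and_extend(read, reference, k=5, max_mismatches=1):
    """Return (reference_start, mismatches) for each acceptable hit.

    SEED: look up the first k bases of the read in the index.
    EXTEND: compare the rest of the read base by base (no gaps).
    """
    index = build_seed_index(reference, k)
    hits = []
    for pos in index.get(read[:k], []):
        if pos + len(read) > len(reference):
            continue  # read would run off the end of the reference
        ref_tail = reference[pos + k:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read[k:], ref_tail) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ATGGACGTCATGTTC"
print(seed_and_extend("GGACGTCATG", reference))  # [(2, 0)]
print(seed_and_extend("GGACGTCTTG", reference))  # [(2, 1)]
```

Positions are 0-based, so the seed GGACG is found starting at the third base of the reference.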
Splice-aware aligners • Burrows-Wheeler transform (BWT) • Creates a compressed ‘index’ of the genome. • Stretches of sequence can be ‘looked up’ • Narrows down the search space • Speeds up alignment • Requires less memory
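To make the Burrows-Wheeler transform less abstract, here is a naive sketch of the transform and its inverse. It is for illustration only: the function names are made up, and real aligners build the index with suffix arrays and far more efficient data structures rather than sorting every rotation.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Rebuild the original text by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row that ends with the end marker is the original text.
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


genome = "ATGGACGTCATGTTC"
transformed = bwt(genome)
print(transformed)
print(inverse_bwt(transformed) == genome)  # True
```

The transform tends to group identical characters together, which is what makes the index compressible, and it is fully reversible, so no sequence information is lost.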
Reference-based Alignment • Applications: • Microbes and lower eukaryotic organisms. • Few introns and little alternative splicing • Use with strand-specific sequencing to identify overlapping genes.
Reference-based Alignment • Advantages: • Contamination not a great problem – won’t align. • Less memory use • Align low-abundance transcripts • Identify transcripts undiscovered in annotated reference
Reference-based Alignment • Disadvantages: • Relies on the accuracy of the reference sequence • May contain errors, deletions, misassemblies. • Can miss divergent transcripts • Reads often align to multiple regions • Excluding multi-mapped reads leaves gaps • Randomly assigning multi-mapped reads creates false transcripts • Can’t easily assemble trans-spliced genes
Reference-based Alignment • Summary • Preferable where a high-quality reference exists. • Can assemble full-length transcripts at a depth of 10x. • Can include longer reads (e.g. 454) to capture connectivity between more exons.
De-novo assembly • Doesn’t use a reference sequence. • Finds overlaps between reads and assembles them into contigs/transcripts. • Constructs a De Bruijn graph: reads are broken into k-mers, and overlapping k-mers are connected as nodes.
De Bruijn graphs • All substrings of length k (k-mers) are generated from each read. • The De Bruijn graph is created by connecting k-mers that overlap by k–1. • Single-nucleotide differences cause ‘bubbles’ of length k in the De Bruijn graph. • Insertions or deletions introduce a shorter path in the graph. • Collapse adjacent nodes. • Calculate paths through the graph. • Isoforms.
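The construction described on the De Bruijn slide can be sketched as follows. This is a toy illustration, not how Trinity or any other assembler is implemented: the function names and the two example reads are made up, and k=4 is chosen only to keep the output small.

```python
from collections import defaultdict


def kmers(read, k):
    """All substrings of length k, in order along the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Map each k-mer to the k-mers that follow it (overlap of k-1)."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


# Two reads that differ by a single nucleotide (A vs T at position 5).
reads = ["ATGGACGTCA", "ATGGTCGTCA"]
graph = de_bruijn(reads, k=4)

# A node with more than one successor marks where a bubble opens.
branching = [node for node, successors in graph.items() if len(successors) > 1]
print(branching)              # ['ATGG']
print(sorted(graph["ATGG"]))  # ['TGGA', 'TGGT']
```

The two branches each pass through four k-mers (k = 4) before rejoining at CGTC, which is the bubble of length k that a single-nucleotide difference produces.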
De-novo Assembly • Applications: • Microbes and lower eukaryotic organisms. • Yeast transcriptomes can be assembled with >30x coverage. • Overlapping genes from opposite strands can be detected by not allowing reverse complements in De Bruijn graph and using odd k-mers. • Higher eukaryotes more challenging due to larger datasets and difficulties in identifying alternative splice sites.
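One commonly cited reason for using odd k-mers is that an odd-length k-mer can never be its own reverse complement, because the middle base would have to pair with itself. That keeps the two strands distinguishable in the graph. The short check below illustrates this; the function names are made up for the example.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def is_self_reverse_complement(kmer):
    return kmer == reverse_complement(kmer)


def count_self_reverse_complements(k):
    """Count k-mers that are identical to their own reverse complement."""
    return sum(
        is_self_reverse_complement("".join(bases))
        for bases in product("ACGT", repeat=k)
    )


print(is_self_reverse_complement("ACGT"))  # True: an even-length example
print(count_self_reverse_complements(3))   # 0
print(count_self_reverse_complements(4))   # 16
```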
De-novo Assembly • Advantages • Doesn’t need a reference sequence. • Sometimes better than reference-based assembly when: • the reference is of low quality (e.g. missing bits). • unknown exogenous transcripts need to be detected. • long introns are expected. • Doesn’t depend on the correct alignment of reads to splice sites.
De-novo Assembly • Disadvantages: • With higher eukaryotic datasets needs lots of RAM • Requires higher sequencing depth than reference-based assembly (30x cf 10x). • Highly similar transcripts are likely to be assembled into single transcripts. • Sensitive to read-errors. Hard to tell errors from low-abundance transcripts.
Combined strategy • Use both de-novo assembly and reference-based alignment methods to get the best results. • Two techniques: • Align-then-assemble • Assemble-then-align • Make use of sensitivity of reference-based aligners and use de-novo assembly for novel sequences.
Combined strategy • Align-then-assemble • Most intuitive. • Align reads to a reference. • What doesn’t align – de-novo assemble.
Combined strategy • Assemble-then-align • When quality of reference genome is suspect. • When reference genome is from distantly related species. • De-novo assemble into contigs first. • Then use reference to extend contigs into longer transcripts. • Small errors in the reference genome don’t get propagated into the new assembly.
Choosing a strategy • Factors to consider • Reference genome available? • Good quality? • Closely-related species? • Aim of project • Annotation • Identify novel transcripts • Expression analysis
Expression analysis • The more abundant an RNA, the more times it will be randomly selected for sequencing. [Figure: Gene 1 under Condition A and Condition B; expressed mRNA is sequenced to give reads]
Expression analysis • Use the number of mapped reads as an indicator of expression. • Map reads back to the genome. [Figure: reads mapped to Gene 1 under Condition A and Condition B]
Expression analysis • Need some way to normalise the expression data. • Fragments Per Kilobase of exon per Million fragments mapped (FPKM). • Some controversy over this approach – bias for longer transcripts.
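The FPKM normalisation can be written as fragments × 10^9 / (exon length in bp × total mapped fragments). The sketch below works through one example; the function name and the counts are made up for illustration, and tools such as Cufflinks apply further statistical modelling on top of this basic formula.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of exon per Million fragments mapped."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("lengths and totals must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Hypothetical numbers: 10 million mapped fragments in the library.
total = 10_000_000
print(fpkm(500, 2000, total))  # 25.0 for a 2 kb transcript
print(fpkm(500, 1000, total))  # 50.0 for a 1 kb transcript with the same count
```

The two calls show why length normalisation is needed: the same raw fragment count gives a different expression estimate once transcript length is taken into account.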
Tutorials • Switch on your computers and boot into Windows. • Log-in using the yellow username on your machine. • Go through the tutorial sheet. • There are two tasks, both using Galaxy: • Reference-based transcript assembly and expression analysis without annotation using Galaxy • TopHat – Cufflinks - Cuffmerge - Cuffdiff • De-novo transcript assembly using Trinity. • Take your time during the tutorials and make sure you understand what you are doing. • Please delete your Galaxy analysis when finished.
Tutorials • Logging on to your computers: • Use the name given on the yellow sticker on your machine. • Password: Learning26 • Logging into Galaxy • Go to http://galaxy.tsl.ac.uk • machine_name@nbi.ac.uk (e.g. b26stu10@nbi.ac.uk) • Password: Learning26