Finding genes de novo with RNA-seq

Finding genes de novo with RNA-seq Graham Etherington Graham.Etherington@tsl.ac.uk

Today's topics • The basics – What is RNA-seq, alternative splicing. • Assembly techniques • Reference-based alignment • De-novo assembly • Expression analysis

Today's topics • Tutorials in Galaxy • Finding genes through transcript assembly • TopHat – Cufflinks • Expression analysis • Cuffcompare – Cuffdiff

RNA-seq – the basics • Genome of interest. • How many genes are there? • Are some novel? • Are there alternatively spliced isoforms? • Are some transcripts more abundant than others? • Which genes are expressed under different environmental or biological conditions (e.g. lack of a nutrient, pathogen infection, etc.)?

What is RNA-seq? • Genome • Genes • Extract mRNA (transcribed genes) • Sequence

RNA-seq basics - Alternative splicing

Reference-based Alignment • Use when a closely-related reference is available. • 3 steps • Use a splice-aware aligner (e.g. BLAT, TopHat) to align reads to a reference genome. • Cluster reads from each locus to build isoform De Bruijn graphs. • Traverse graph to resolve isoforms. Each different path through graph represents a potentially different isoform.

Alignment • Seed-and-extend alignment (e.g. BLAST) • Query: ATCGCGTTACGATCCGTAA • Find all occurrences of the seed ‘ATCGCG’ in the target • Target: ATCGCGGTCGTTAATCGCGCGTTCGATCGCGTTACGATCCGTAACGCACCATCGCGTTGC

Alignment • Seed-and-extend alignment (e.g. BLAST) • Query: ATCGCGTTACGATCCGTAA • Extend alignments from each seed • Genome: ATCGCGTTAGTTAATCGCGTTACCGATCGCGTTACGATCCGTAACGCACCATCGCGTTAA
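The seed-and-extend idea on the two slides above can be sketched in a few lines of Python. This is a minimal, ungapped illustration only, not how BLAST is actually implemented: the function names `find_seeds` and `extend`, and the choice of taking the seed from the first six bases of the query, are assumptions made for the example. The query and genome strings are the ones shown on the slide.

```python
def find_seeds(query: str, target: str, seed_len: int = 6) -> list[int]:
    """Return every 0-based target position where the query's first seed_len bases occur."""
    seed = query[:seed_len]
    hits = []
    pos = target.find(seed)
    while pos != -1:
        hits.append(pos)
        pos = target.find(seed, pos + 1)  # step by 1 so overlapping hits are kept
    return hits


def extend(query: str, target: str, start: int) -> int:
    """Lay the query on the target at `start` and count matching bases until the first mismatch."""
    matched = 0
    for q_base, t_base in zip(query, target[start:]):
        if q_base != t_base:
            break
        matched += 1
    return matched


query = "ATCGCGTTACGATCCGTAA"
genome = "ATCGCGTTAGTTAATCGCGTTACCGATCGCGTTACGATCCGTAACGCACCATCGCGTTAA"

hits = find_seeds(query, genome)
extensions = {pos: extend(query, genome, pos) for pos in hits}
best_pos = max(extensions, key=extensions.get)
print(hits, extensions, best_pos)
```

Every seed hit is a candidate location, but only one of them extends across the whole query; the others stop at the first mismatch after the shared prefix.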

Alignment • Burrows-Wheeler Transform (BWT) • Used by the BWA, SOAP, Bowtie (and TopHat) aligners. • Creates a compressed index of the genome. • Index is a sorted range of substrings from the genome that can be quickly searched. • Stretches of sequence can be looked up. • Like the index of a book: words (sequences) can be looked up in the index, which then points you to the pages (genomic locations) where that word (sequence) is found. • Narrows down the search space (searches the index instead of the genome). • Speeds up alignment and requires less memory than older alignment algorithms.
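To make the "index of a book" analogy concrete, here is a toy sketch of building a BWT and counting pattern occurrences with backward search. It is an assumption-laden illustration: sorting all rotations and recounting characters on every step is only practical for tiny strings, whereas real aligners use compact FM-index structures. The function names `bwt` and `count_occurrences` are hypothetical.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler Transform via sorted rotations; '$' marks the end of the text."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text using only its BWT (backward search)."""
    # first[c] = index of the first row (in sorted order) that starts with character c
    first = {}
    for i, c in enumerate(sorted(bwt_str)):
        first.setdefault(c, i)

    def occ(c: str, i: int) -> int:
        # number of times c appears in bwt_str[:i] (naive; real indexes precompute this)
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # current range of sorted rows matching the suffix seen so far
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


genome = "ATCGCGTTAGTTAATCGCGTTACCGATCGCGTTACGATCCGTAACGCACCATCGCGTTAA"
index = bwt(genome)
seed_count = count_occurrences(index, "ATCGCG")
print(index, seed_count)
```

The search never scans the genome itself: each pattern character only narrows a range of rows in the sorted index, which is why the lookup time depends on the pattern length rather than the genome length.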

Creating and Traversing Graphs • Aligned reads • Create graph that represents alternative splicing • Traverse graph to find all possible paths • All possible splice variants from graph
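The traversal step can be sketched as a depth-first enumeration of paths. The exon names, the edges, and the function name `enumerate_paths` below are all hypothetical, and the sketch assumes the graph has no cycles; it is not the algorithm used by any particular assembler.

```python
def enumerate_paths(graph: dict, node: str, sink: str, path=None) -> list:
    """Return every path from `node` to `sink` in an acyclic graph (depth-first)."""
    path = (path or []) + [node]
    if node == sink:
        return [path]
    paths = []
    for successor in graph.get(node, []):
        paths.extend(enumerate_paths(graph, successor, sink, path))
    return paths


# Hypothetical splice graph: each edge is a junction supported by aligned reads.
splice_graph = {
    "exon1": ["exon2", "exon3"],
    "exon2": ["exon3", "exon4"],
    "exon3": ["exon4"],
}

isoforms = enumerate_paths(splice_graph, "exon1", "exon4")
for isoform in isoforms:
    print(" - ".join(isoform))
```

Each path returned is a candidate isoform; in practice, read coverage is then used to decide which candidates are actually supported.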

Reference-based Alignment • Preferable where a high-quality reference exists. • Can assemble full-length transcripts at depth of 10x. • Advantages: • Contamination not a great problem – won’t align. • Less memory use than de novo assembly • Detection of low-abundance transcripts • Identify transcripts undiscovered in annotated reference

Reference-based Alignment • Disadvantages: • Relies on the accuracy of the reference sequence • May contain errors, deletions, misassemblies. • Can miss divergent transcripts • Reads often align to multiple regions • Excluding multi-mapped reads – leaves gaps • Randomly assigning multi-mapped reads – false transcripts • Can’t easily assemble trans-spliced genes (2 pre-mRNAs spliced together to form 1 mature mRNA)

De-novo assembly • Doesn’t use a reference sequence. • Constructs De Bruijn graph by breaking reads into k-mers and connecting overlapping nodes. • Graph is traversed to identify paths through it. • Each path represents a unique sequence.

De Bruijn graphs • All substrings of length k (k-mers) are generated from each read. • 5-mers in this example
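As a small illustration of the k-mer step, the sketch below slides a window of length k along a read. The function name `kmers` and the example read are assumptions made for the example.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return all substrings of length k, in order; a read of length L yields L - k + 1 of them."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ATCGCGTTAC"  # hypothetical 10-base read
five_mers = kmers(read, 5)
print(five_mers)
```

Consecutive k-mers overlap by k - 1 bases, which is the property the graph construction relies on.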

De Bruijn graphs • Overlapping k-mers used to create nodes in graph. • Chains of adjacent nodes in graph are collapsed into a single node • Alternative paths through graph are identified. • Isoforms identified
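A minimal sketch of the graph step follows, under several stated assumptions: nodes are k-mers, an edge links k-mers that are adjacent in a read, the graph is acyclic, and the two "reads" are hypothetical error-free, full-length isoforms (one with a skipped middle exon). Chain collapsing is omitted for brevity, and the names `build_graph` and `spell_paths` are made up for the example.

```python
from collections import defaultdict


def build_graph(reads: list, k: int):
    """Return (nodes, graph): nodes are k-mers, graph maps each k-mer to its successors."""
    nodes = set()
    graph = defaultdict(set)
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(read_kmers)
        for left, right in zip(read_kmers, read_kmers[1:]):
            graph[left].add(right)  # adjacent k-mers overlap by k - 1 bases
    return nodes, graph


def spell_paths(nodes: set, graph: dict) -> list:
    """Walk every source-to-sink path and spell out the sequence it represents."""
    with_incoming = {succ for succs in graph.values() for succ in succs}
    sources = sorted(nodes - with_incoming)
    contigs = []

    def walk(node: str, sequence: str) -> None:
        successors = sorted(graph.get(node, ()))
        if not successors:          # sink reached: one complete path
            contigs.append(sequence)
            return
        for successor in successors:
            walk(successor, sequence + successor[-1])

    for source in sources:
        walk(source, source)
    return contigs


isoform_full = "ATGCCGTAGGACTTC"   # hypothetical: exon1 + exon2 + exon3
isoform_skip = "ATGCCACTTC"        # hypothetical: exon1 + exon3
nodes, graph = build_graph([isoform_full, isoform_skip], k=4)
contigs = spell_paths(nodes, graph)
print(contigs)
```

The graph branches where the two isoforms diverge and merges where they rejoin, so each path through it spells one isoform. With real data, repeats and sequencing errors add cycles and spurious branches that this sketch does not handle.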

De-novo Assembly • Advantages • Doesn’t need a reference sequence. • Sometimes better than reference-based assembly when: • the reference is of low quality (e.g. missing regions). • unknown exogenous transcripts need to be detected. • long introns are expected. • Doesn’t depend on the correct alignment of reads to splice sites.

De-novo Assembly • Disadvantages: • Lots of data requires lots of RAM. • Requires greater sequencing depth than reference-based assembly (30x cf. 10x). • Highly similar transcripts are likely to be assembled into single transcripts. • Sensitive to read errors. Hard to tell errors from low-abundance transcripts.

Expression analysis • The more abundant an RNA, the more times it will be randomly selected for sequencing. • The Cufflinks tool suite assembles transcripts and calculates their abundance. • [Figure: expressed mRNA and sequencing reads for Gene A in Sample 1 (control) and Sample 2 (infected)]

Expression analysis • Use number of mapped reads as an indicator of expression. • Map reads back to genome. • [Figure: reads mapped to Gene A in Sample 1 (control) and Sample 2 (infected), showing differential expression]

Normalisation • Two sequencing libraries can produce different volumes of data. • Suppose transcript A is present at the same abundance in library X and library Y, • but library X produces 3 times more reads than library Y. • Transcript A will then appear 3 times more abundant in library X. • Need some way to normalise the expression data. • Fragments Per Kilobase of exon, per Million fragments mapped (FPKM). • Accounts for the number of reads in the experiment, the length of the transcript and the number of reads aligning to it. • Allows comparison between two datasets when there is considerably more data in one dataset than the other.
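The library X / library Y scenario above can be worked through with the basic FPKM formula, fragments × 10^9 / (transcript length in bp × total mapped fragments). The counts, the transcript length and the function name `fpkm` are invented for illustration, and Cufflinks itself estimates abundance with a more involved statistical model than this direct calculation.

```python
def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments Per Kilobase of exon per Million fragments mapped."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical numbers: transcript A is 2,000 bp long and equally abundant in both
# samples, but library X was sequenced three times more deeply than library Y.
length_a = 2_000
fragments_x, total_x = 600, 30_000_000
fragments_y, total_y = 200, 10_000_000

raw_ratio = fragments_x / fragments_y            # raw counts suggest a 3-fold difference
fpkm_x = fpkm(fragments_x, length_a, total_x)
fpkm_y = fpkm(fragments_y, length_a, total_y)
print(raw_ratio, fpkm_x, fpkm_y)
```

After normalisation the two libraries give the same value for transcript A, so the apparent 3-fold difference in raw counts is shown to come from sequencing depth alone.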

Tutorials • Go through the tutorial sheet. • The task: • Reference-based RNA-seq assembly using TopHat and Cufflinks in Galaxy. • RNA-seq expression analysis using Cuffcompare and Cuffdiff in Galaxy. • http://galaxy.tsl.ac.uk
